miolini · harryschaefer93 · Mar 27, 2026
diff --git a/.gitignore b/.gitignore
@@ -18,3 +18,12 @@ AGENTS.md
 
 # Experimental code/artifacts
 dev/
+
+# Training artifacts (generated per-run)
+run.log
+run_*.log
+results.tsv
+model.pt
+
+# Drafts
+linkedin_post.md
diff --git a/MODEL_CARD.md b/MODEL_CARD.md
@@ -0,0 +1,153 @@
+---
+language:
+- en
+license: mit
+tags:
+- financial
+- sec-filings
+- 10-K
+- small-language-model
+- slm
+- gpt
+- domain-specific
+metrics:
+- bits-per-byte
+pipeline_tag: text-generation
+model-index:
+- name: 10k-financial-slm
+  results:
+  - task:
+      type: text-generation
+    metrics:
+    - name: val_bpb (financial text)
+      type: bits-per-byte
+      value: 1.645
+    - name: val_bpb (general text baseline)
+      type: bits-per-byte
+      value: 2.146
+---
+
+# 10-K Financial SLM (11.5M params)
+
+A tiny GPT language model trained exclusively on SEC 10-K filings from financial companies. Development took ~20 experiments of 5 minutes each (~2 hours of total GPU time) on a MacBook Air with Apple Silicon (MPS).
+
+## Model Details
+
+| Property | Value |
+|----------|-------|
+| Parameters | 11.5M |
+| Architecture | GPT (decoder-only transformer) |
+| Layers | 4 |
+| Hidden dim | 256 |
+| Attention heads | 2 |
+| Context length | 2,048 tokens |
+| Vocab size | 8,192 (BPE) |
+| Training data | 1,131 SEC 10-K filings (financial companies, SIC 6000-6411) |
+| Training time | ~2 hours total (~20 x 5-min experiments on Apple M-series MPS) |
+
+## Performance
+
+### Compression Quality (bits-per-byte)
+
+| Model | val_bpb | Domain |
+|-------|---------|--------|
+| **This model (specialized)** | **1.645** | Financial 10-K text |
+| Same architecture (general) | 2.146 | General web text (ClimbMix) |
+
+**23.3% better compression** on financial text (lower val_bpb is better) compared to the same architecture trained on general data: (2.146 - 1.645) / 2.146 ≈ 23.3%.
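+
+For reference, bits-per-byte can be derived from a model's summed cross-entropy loss. The sketch below is illustrative only: `bits_per_byte` is a hypothetical helper name, not a function from this repository, and it assumes the loss is summed in nats over the evaluated text.
+
+```python
+import math
+
+
+def bits_per_byte(total_nats: float, total_bytes: int) -> float:
+    """Convert summed cross-entropy (in nats) into bits per byte of raw text."""
+    if total_bytes <= 0:
+        raise ValueError("total_bytes must be positive")
+    # Dividing by ln(2) converts nats to bits; dividing by bytes normalizes
+    # by text length, so the metric does not depend on the tokenizer.
+    return total_nats / (math.log(2) * total_bytes)
+
+
+# Sanity check: exactly one bit of loss per byte gives 1.0 bpb.
+print(bits_per_byte(total_nats=math.log(2) * 1000, total_bytes=1000))
+```
+
+Normalizing by bytes rather than tokens is what makes the two rows in the table comparable even when the tokenizers differ.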
+
+### Inference Speed (MacBook Air, MPS)
+
+| Metric | Value |
+|--------|-------|
+| Single sequence latency | 27ms (2,048 tokens) |
+| Batched throughput | 75,000+ tokens/sec |
+| Time per 10-K filing | ~1 second |
+| All ~80K 10-K filings on SEC EDGAR (2015-2025) | ~22 hours |
+
+### Cost Comparison (processing ~80K filings, ~8.4B tokens)
+
+| Approach | Cost |
+|----------|------|
+| GPT-4o API ($2.50/1M tokens) | ~$21,000 |
+| Claude Sonnet 4.6 API ($3.00/1M tokens) | ~$25,000 |
+| Claude Haiku 4.5 API ($1.00/1M tokens) | ~$8,400 |
+| GPT-4o-mini API ($0.15/1M tokens) | ~$1,260 |
+| **This model (local)** | **$0** |
+
+*Prices as of March 2026. Input tokens only.*
+
+## Training Details
+
+Built using [Karpathy's autoresearch](https://github.com/karpathy/autoresearch) framework through the [autoresearch-macos](https://github.com/miolini/autoresearch-macos) port. The framework enables autonomous hyperparameter experimentation: an AI agent (Claude) iteratively modified the training configuration, ran 5-minute training sessions, and kept the changes that improved val_bpb.
+
+### Key hyperparameters (after optimization)
+
+- Learning rates: 1.5x default (Embedding: 0.9, Matrix/Muon: 0.06)
+- Warmdown ratio: 0.05 (LR stays at peak for 95% of training)
+- Optimizer: MuonAdamW (Muon for matrix params, AdamW for embeddings)
+- Batch size: 65,536 tokens per step
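+
+A minimal sketch of the schedule these settings imply, assuming no warmup and a linear decay to zero over the final warmdown fraction. The function name `lr_multiplier` and the linear-to-zero shape are assumptions made for illustration; the actual schedule lives in `train.py` and may differ.
+
+```python
+def lr_multiplier(progress: float, warmdown_ratio: float = 0.05) -> float:
+    """Scale factor applied to the peak LR at `progress` in [0, 1] of training."""
+    if not 0.0 <= progress <= 1.0:
+        raise ValueError("progress must be between 0 and 1")
+    warmdown_start = 1.0 - warmdown_ratio
+    if progress <= warmdown_start:
+        return 1.0                                # hold the peak LR
+    return (1.0 - progress) / warmdown_ratio      # assumed linear decay to zero
+
+
+peak_embedding_lr, peak_matrix_lr = 0.9, 0.06     # values from the list above
+for progress in (0.0, 0.5, 0.95, 0.975, 1.0):
+    scale = lr_multiplier(progress)
+    print(f"{progress:.3f}  embedding={peak_embedding_lr * scale:.3f}  matrix={peak_matrix_lr * scale:.4f}")
+```
+
+With a warmdown ratio of 0.05, only the last 5% of the run trains below the peak rate.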
+
+### Data pipeline
+
+1. Downloaded 10-K filing index from SEC EDGAR (2015-2025)
+2. Filtered to financial companies (SIC codes 6000-6411): banks, insurance, investment firms
+3. Sampled 1,500 filings, downloaded full text from EDGAR
+4. Cleaned HTML/XBRL markup, removed filings that were too short or too numeric
+5. Chunked into 2,048-token sequences, split 90/10 train/val
+6. Trained a BPE tokenizer (8,192 vocab) on the financial text
+
+## Intended Use
+
+This model is a research artifact demonstrating domain-specific SLM training. Potential applications:
+
+- **Document embeddings**: Fast similarity search over financial filings
+- **Anomaly detection**: Flag filings with unusual language patterns
+- **Pre-filtering**: Cheap triage before sending documents to expensive API models
+- **Privacy-preserving analysis**: All processing stays on-device
+- **Foundation for fine-tuning**: Starting point for downstream financial NLP tasks
+
+## Limitations
+
+- **Not a chatbot**: This is a base language model. It predicts next tokens; it does not answer questions.
+- **Tiny model**: 11.5M parameters means limited capacity. It captures patterns and statistics of financial language, not deep reasoning.
+- **Narrow training data**: Only financial company 10-K filings. Performance on other financial documents (earnings calls, proxy statements) is untested.
+- **No safety training**: No RLHF, no content filtering. Not suitable for user-facing generation.
+
+## How to Use
+
+```python
+import torch
+from train import GPT, GPTConfig
+
+# Load checkpoint
+ckpt = torch.load("model.pt", map_location="cpu")
+config = GPTConfig(**ckpt["config"])
+
+model = GPT(config)
+model.load_state_dict(ckpt["model_state_dict"])
+model.eval()
+
+# Run inference on a (batch, sequence) tensor of token IDs
+tokens = torch.tensor([[1, 2, 3]], dtype=torch.long)  # placeholder IDs; replace with your tokenized input
+with torch.no_grad():
+    logits = model(tokens)
+```
+
+## Citation
+
+If you use this model in your work, please cite:
+
+```
+@misc{10k-financial-slm-2026,
+  title={10-K Financial SLM: A Domain-Specific Small Language Model for SEC Filings},
+  year={2026},
+  url={https://github.com/harryschaefer93/autoresearch-10k-macos}
+}
+```
+
+## Acknowledgments
+
+- [Andrej Karpathy](https://github.com/karpathy) / [autoresearch](https://github.com/karpathy/autoresearch) for the training framework
+- [miolini/autoresearch-macos](https://github.com/miolini/autoresearch-macos) for the macOS/MPS port
+- [Claude Code](https://claude.ai/claude-code) for autonomous experiment orchestration
+- SEC EDGAR for the public filing data
diff --git a/README.md b/README.md
@@ -1,4 +1,208 @@
-# autoresearch-macos
+# 10-K Financial SLM
+
+A tiny (11.5M parameter) GPT language model trained exclusively on SEC 10-K filings from financial companies. Built using [Karpathy's autoresearch](https://github.com/karpathy/autoresearch) framework with an AI agent (Claude) autonomously running experiments on a MacBook Air. ~20 experiments at 5 minutes each, ~2 hours of total GPU time.
+
+**[Model on HuggingFace](https://huggingface.co/HarryS64/10k-financial-slm)** · **[GitHub Repo](https://github.com/harryschaefer93/autoresearch-10k-macos)**
+
+<p align="center">
+  <img src="images/training-monitor.jpeg" width="48%" alt="Live training monitor showing benchmark race, loss curve, and thermal status" />
+  <img src="images/cooling-setup.jpeg" width="48%" alt="MacBook Air on a dehumidifier for passive cooling during training" />
+</p>
+<p align="center"><em>Left: Live training dashboard tracking the optimization journey. Right: Our "cooling solution" — a MacBook Air on a dehumidifier. It worked.</em></p>
+
+## Results
+
+| Metric | Value |
+|--------|-------|
+| Compression (val_bpb) | **1.645** (vs 2.146 general baseline) |
+| Improvement over general model | **23.3%** |
+| Inference speed | 75,000+ tok/sec on MacBook Air |
+| Time per 10-K filing | ~1 second |
+| Cost to process all ~80K 10-K filings | **$0** in API fees (vs ~$21K via the GPT-4o API) |
+
+## The Question
+
+Can a tiny model trained exclusively on financial text outperform a general-purpose model of the same size at understanding financial documents? And can we use autonomous AI-driven experimentation to optimize it?
+
+## What We Did
+
+### Step 1: Built a financial data pipeline
+
+We wrote `prepare_10k.py` to pull 10-K filings directly from SEC EDGAR:
+
+1. Downloaded the quarterly master index (2015-2025) — found ~80,000 10-K filings total
+2. Filtered to financial companies only (SIC codes 6000-6411: banks, insurance, investment firms) — 18,538 filings
+3. Sampled 1,500, downloaded full text, cleaned HTML/XBRL markup
+4. Kept 1,131 high-quality filings after filtering out too-short or too-numeric documents
+5. Chunked into 60,095 training sequences (2,048 tokens each) + 6,677 validation sequences
+6. Trained a domain-specific BPE tokenizer (8,192 vocab) on the financial text
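+
+The sequence counts in step 5 can be sanity-checked with a few lines of arithmetic. This is a sketch, not the logic of `prepare_10k.py`: the helper name `split_sequences` is hypothetical, and it assumes the corpus is cut into fixed 2,048-token chunks with 10% of the sequences held out, using the average filing length reported in the cost methodology further down.
+
+```python
+SEQ_LEN = 2_048
+VAL_FRACTION = 0.10
+
+
+def split_sequences(total_tokens: int, seq_len: int = SEQ_LEN,
+                    val_fraction: float = VAL_FRACTION) -> tuple[int, int]:
+    """Return (train, val) sequence counts for a corpus of `total_tokens` tokens."""
+    n_sequences = total_tokens // seq_len      # drop the trailing partial chunk
+    n_val = int(n_sequences * val_fraction)    # hold out ~10% for validation
+    return n_sequences - n_val, n_val
+
+
+# 1,131 filings at an average of 120,910 tokens each (figures from this README)
+n_train, n_val = split_sequences(1_131 * 120_910)
+print(n_train, n_val)  # 60095 6677
+```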
+
+### Step 2: Established baselines
+
+We trained the same 11.5M parameter GPT architecture on two datasets:
+
+- **ClimbMix** (general web text): val_bpb = 2.146
+- **10-K financial text**: val_bpb = 1.711
+
+Just swapping the training data — same model, same hyperparameters — gave a **20% improvement** on financial text compression. Domain specialization works.
+
+### Step 3: Autonomous hyperparameter optimization
+
+This is where [autoresearch](https://github.com/karpathy/autoresearch) comes in. We pointed Claude at the training script and let it run experiments autonomously. Each experiment:
+
+1. Modify `train.py` (change learning rates, schedules, architecture, etc.)
+2. Train for exactly 5 minutes
+3. Check if val_bpb improved
+4. Keep the change or revert, log results, repeat
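+
+The loop above amounts to a greedy keep-or-revert search. A minimal sketch follows, assuming a hypothetical `run_experiment` callable that applies one change on top of the current best config, trains for the fixed budget, and returns val_bpb (or `None` when a run fails, for example with an out-of-memory error). None of these names come from the autoresearch code.
+
+```python
+from typing import Callable, Iterable, Optional
+
+
+def keep_or_revert(
+    baseline_bpb: float,
+    candidates: Iterable[str],
+    run_experiment: Callable[[str], Optional[float]],
+) -> tuple[float, list[str]]:
+    """Greedy search: keep a change only if it lowers val_bpb."""
+    best_bpb = baseline_bpb
+    kept: list[str] = []
+    for change in candidates:
+        val_bpb = run_experiment(change)
+        if val_bpb is not None and val_bpb < best_bpb:
+            best_bpb = val_bpb       # improvement: keep the change
+            kept.append(change)
+        # otherwise revert: the current best config stays untouched
+    return best_bpb, kept
+
+
+# Replay using a few val_bpb values logged in the results table of this README
+logged = {"lr_1.5x": 1.677, "lr_2x": 1.700, "warmdown_0.3": 1.658, "depth_6": None}
+best, kept = keep_or_revert(1.711, logged, logged.get)
+print(best, kept)  # 1.658 ['lr_1.5x', 'warmdown_0.3']
+```
+
+Because each run is compared only against the current best, a noisy measurement can lock in a spurious "improvement", which is why stable step times matter for this kind of search.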
+
+We ran ~15 experiments. Here's what happened:
+
+| Experiment | Change | val_bpb | Outcome |
+|------------|--------|---------|---------|
+| Baseline | Default config | 1.711 | Starting point |
+| 1.5x learning rates | LR 0.6->0.9, 0.04->0.06 | 1.677 | **Kept** |
+| 2x learning rates | Too aggressive | 1.700 | Reverted |
+| Warmdown 0.5->0.3 | Keep LR high longer | 1.658 | **Kept** |
+| Warmdown 0.3->0.15 | Even longer | 1.646 | **Kept** |
+| Warmdown 0.15->0.05 | Nearly no cooldown | **1.645** | **Kept (best)** |
+| Add 5% warmup | Ramp LR slowly | 1.749 | Reverted |
+| Depth 4->6 | More layers | OOM | Reverted |
+| Depth 4->5 | Slightly more layers | Too slow | Reverted |
+| Batch 16->32 | Less grad accumulation | OOM | Reverted |
+| 4 heads (head_dim 64) | More attention heads | 2.042 | Reverted |
+| Half batch size | More steps, noisier | 1.707 | Reverted |
+| SSSL window pattern | Sliding window attention | 1.819 | Reverted |
+
+**Final best: 1.645 val_bpb** (3.9% improvement from tuning on top of the 20% from specialization).
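+
+The percentages quoted here follow directly from the three val_bpb figures. A short check, with `pct_reduction` as an illustrative helper name:
+
+```python
+def pct_reduction(before: float, after: float) -> float:
+    """Percentage reduction in val_bpb going from `before` to `after`."""
+    return 100.0 * (before - after) / before
+
+
+general, financial_baseline, financial_tuned = 2.146, 1.711, 1.645
+
+print(f"{pct_reduction(general, financial_baseline):.1f}")          # 20.3 (specialization)
+print(f"{pct_reduction(financial_baseline, financial_tuned):.1f}")  # 3.9  (tuning)
+print(f"{pct_reduction(general, financial_tuned):.1f}")             # 23.3 (combined)
+```
+
+The two gains compound multiplicatively rather than adding up, which is why 20.3% and 3.9% combine to 23.3% and not 24.2%.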
+
+### Step 4: What we learned the hard way
+
+**Thermal throttling was the biggest confound.** On a MacBook Air (no fan), the M-series chip throttles aggressively under sustained load. Our step times swung from 3.6s to 148s mid-run, making experiments unreliable. We wasted several rounds before realizing the "improvements" were just thermal noise.
+
+The fix was embarrassingly simple: **put the laptop on a dehumidifier** (see photo above). After that, step times stabilized at ~3.6s and throughput went from erratic to a consistent ~18,000 tok/sec. This alone increased our steps-per-run from ~68 to ~91 — a bigger improvement than most hyperparameter changes.
+
+**What worked:**
+- Higher learning rates (1.5x default) — the 5-minute budget means the model needs to learn fast
+- Minimal warmdown — with so few steps, spending half the budget cooling down the LR wastes training time
+- Keeping the model small — deeper/wider models couldn't converge in 5 minutes even if they had more capacity
+
+**What didn't work:**
+- Architecture changes (more layers, different attention patterns) — not enough training time to benefit
+- Smaller batch sizes — more steps but noisier gradients, net negative
+- Warmup — the model needs high LR from step 0 with random weights
+- float16/bfloat16 autocast on MPS — no speedup on Apple Silicon (no tensor cores)
+- torch.compile on MPS — not supported in PyTorch 2.6
+
+## Benchmarks
+
+### 1. Compression Quality
+
+| Model | val_bpb | Notes |
+|-------|---------|-------|
+| General model (ClimbMix) | 2.146 | Same architecture, general web text |
+| **10-K specialized model** | **1.645** | Same architecture, financial text |
+
+The 23.3% reduction in val_bpb (lower is better) indicates that the specialized model predicts financial text considerably better than the general one.
+
+### 2. Inference Speed (MacBook Air M2, MPS)
+
+| Mode | Latency | Throughput |
+|------|---------|------------|
+| Single sequence (2,048 tokens) | 27ms | 75,000 tok/sec |
+| Batched (16 x 2,048 tokens) | ~0.4s | 75,000+ tok/sec |
+| One full 10-K filing (~75K tokens) | ~1 second | - |
+| All ~80K 10-K filings on SEC EDGAR (2015-2025) | ~22 hours | - |
+
+### 3. Cost to Process All 10-K Filings on SEC EDGAR (~80K filings, ~8.4B tokens)
+
+*Methodology: 1,131 filings averaged 120,910 tokens each (our 8K-vocab tokenizer). Converted to GPT-equivalent tokens at 0.875x ratio (accounting for vocabulary efficiency difference). Extrapolated across all 79,513 10-K filings in SEC EDGAR (2015-2025).*
+
+| Approach | Price/1M input tokens | Cost (8.4B tokens) |
+|----------|----------------------|---------------------|
+| GPT-4o API | $2.50 | ~$21,000 |
+| Claude Sonnet 4.6 API | $3.00 | ~$25,000 |
+| Claude Haiku 4.5 API | $1.00 | ~$8,400 |
+| GPT-4o-mini API | $0.15 | ~$1,260 |
+| **This model (local)** | **$0** | **$0** |
+
+*Prices as of March 2026. Input tokens only (processing/embedding), no output generation. Batch API discounts (50% off) would roughly halve these costs.*
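+
+The extrapolation in the methodology note can be reproduced in a few lines. The constants come from this section; the function names `total_tokens` and `api_cost` are illustrative, not part of `benchmark.py`.
+
+```python
+AVG_TOKENS_PER_FILING = 120_910   # measured with the 8K-vocab tokenizer
+GPT_TOKEN_RATIO = 0.875           # conversion to GPT-equivalent tokens
+NUM_FILINGS = 79_513              # 10-K filings on SEC EDGAR, 2015-2025
+
+PRICE_PER_MILLION = {             # USD per 1M input tokens, from the table above
+    "GPT-4o": 2.50,
+    "Claude Sonnet 4.6": 3.00,
+    "Claude Haiku 4.5": 1.00,
+    "GPT-4o-mini": 0.15,
+}
+
+
+def total_tokens() -> float:
+    """GPT-equivalent tokens across all filings."""
+    return AVG_TOKENS_PER_FILING * GPT_TOKEN_RATIO * NUM_FILINGS
+
+
+def api_cost(price_per_million: float, tokens: float) -> float:
+    """Input-token cost in USD at the given price per 1M tokens."""
+    return tokens / 1_000_000 * price_per_million
+
+
+tokens = total_tokens()
+print(f"{tokens / 1e9:.1f}B tokens")  # 8.4B tokens
+for name, price in PRICE_PER_MILLION.items():
+    print(f"{name}: ${api_cost(price, tokens):,.0f}")
+```
+
+The unrounded results come out slightly above the table (about $21,030 for GPT-4o), because the table rounds the token count to 8.4B before multiplying.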
+
+## Potential Uses
+
+This model won't replace GPT-4 for deep financial analysis. It's a **specialized tool** for specific use cases where speed, cost, and privacy matter:
+
+- **Document embeddings** — fast similarity search across thousands of filings
+- **Anomaly detection** — flag filings with unusual language patterns
+- **Pre-filtering** — cheap triage before sending to an expensive API
+- **Privacy-preserving analysis** — data never leaves the device
+- **Edge deployment** — small enough to run on a phone
+- **Fine-tuning foundation** — starting point for downstream financial NLP tasks
+
+## What We'd Try Next
+
+- **Longer training** (hours not minutes) on a machine with proper cooling
+- **Scale to 50-100M parameters** while staying edge-deployable
+- **Downstream tasks** — sector classification, sentiment analysis, NER on financial text
+- **Broader corpus** — earnings calls, proxy statements, analyst reports
+- **Quantization** — INT8/INT4 for even faster inference on mobile
+
+## Quick Start
+
+```bash
+# Install uv (if you don't have it)
+curl -LsSf https://astral.sh/uv/install.sh | sh
+
+# Install dependencies
+uv sync
+
+# Download and prepare 10-K data (~5 min)
+AUTORESEARCH_CACHE=~/.cache/autoresearch-10k uv run prepare_10k.py
+
+# Train the model (5 min on Apple Silicon)
+AUTORESEARCH_CACHE=~/.cache/autoresearch-10k uv run train.py
+
+# Run benchmarks
+AUTORESEARCH_CACHE=~/.cache/autoresearch-10k uv run benchmark.py
+
+# Live training dashboard (run in a separate terminal)
+uv run monitor.py
+```
+
+## Project Structure
+
+| File | Purpose |
+|------|---------|
+| `train.py` | Model + training loop (the file autoresearch modifies) |
+| `prepare.py` | Original ClimbMix data pipeline |
+| `prepare_10k.py` | SEC EDGAR 10-K data pipeline |
+| `benchmark.py` | Perplexity, speed, and cost benchmarks |
+| `monitor.py` | Live terminal dashboard with loss curves + thermal monitoring |
+| `program.md` | Instructions for the AI agent |
+| `MODEL_CARD.md` | Full model card for HuggingFace |
+| `benchmark_results.json` | Machine-readable benchmark results |
+
+## Requirements
+
+- macOS with Apple Silicon (M1/M2/M3/M4) or NVIDIA GPU
+- Python 3.10+
+- [uv](https://astral.sh/uv) package manager
+
+## Acknowledgments
+
+- [Andrej Karpathy](https://karpathy.ai) / [autoresearch](https://github.com/karpathy/autoresearch) for the training framework
+- [miolini/autoresearch-macos](https://github.com/miolini/autoresearch-macos) for the macOS/MPS port
+- [Claude Code](https://claude.ai/claude-code) for autonomous experiment orchestration
+- SEC EDGAR for public filing data
+
+## License
+
+MIT
+
+---
+
+*This project is a fork of [autoresearch-macos](https://github.com/miolini/autoresearch-macos). Original README below.*
+
+---
 
 ![teaser](progress.png)