9 changes: 9 additions & 0 deletions .gitignore
@@ -18,3 +18,12 @@ AGENTS.md

# Experimental code/artifacts
dev/

# Training artifacts (generated per-run)
run.log
run_*.log
results.tsv
model.pt

# Drafts
linkedin_post.md
153 changes: 153 additions & 0 deletions MODEL_CARD.md
@@ -0,0 +1,153 @@
---
language:
- en
license: mit
tags:
- financial
- sec-filings
- 10-K
- small-language-model
- slm
- gpt
- domain-specific
metrics:
- bits-per-byte
pipeline_tag: text-generation
model-index:
- name: 10k-financial-slm
results:
- task:
type: text-generation
metrics:
- name: val_bpb (financial text)
type: bits-per-byte
value: 1.645
- name: val_bpb (general text baseline)
type: bits-per-byte
value: 2.146
---

# 10-K Financial SLM (11.5M params)

A tiny GPT language model trained exclusively on SEC 10-K filings from financial companies. Training consisted of ~20 experiments of 5 minutes each (~2 hours total GPU time) on a MacBook Air with Apple Silicon (MPS).

## Model Details

| Property | Value |
|----------|-------|
| Parameters | 11.5M |
| Architecture | GPT (decoder-only transformer) |
| Layers | 4 |
| Hidden dim | 256 |
| Attention heads | 2 |
| Context length | 2,048 tokens |
| Vocab size | 8,192 (BPE) |
| Training data | 1,131 SEC 10-K filings (financial companies, SIC 6000-6411) |
| Training time | ~2 hours total (~20 x 5-min experiments on Apple M-series MPS) |

## Performance

### Compression Quality (bits-per-byte)

| Model | val_bpb | Domain |
|-------|---------|--------|
| **This model (specialized)** | **1.645** | Financial 10-K text |
| Same architecture (general) | 2.146 | General web text (ClimbMix) |

**23.3% better compression** on financial text compared to the same architecture trained on general data.
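
For readers unfamiliar with the metric, here is a minimal sketch of how bits-per-byte is commonly derived from a language model's loss, and how the 23.3% figure follows from the two table values. The function names are illustrative and are not taken from `train.py`; the sketch assumes the loss is summed in nats over all predicted tokens and normalized by the UTF-8 byte count of the evaluated text.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed next-token loss (in nats) to bits per byte of raw text.

    Assumption: total_nll_nats is the cross-entropy summed over every predicted
    token, and total_bytes is the UTF-8 length of the same text.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


def relative_improvement(baseline_bpb: float, new_bpb: float) -> float:
    """Fractional reduction in bits-per-byte relative to the baseline."""
    return (baseline_bpb - new_bpb) / baseline_bpb


# Values from the table above.
improvement = relative_improvement(2.146, 1.645)
print(f"{improvement:.1%}")  # 23.3%
```

Normalizing by bytes rather than tokens keeps the number comparable across tokenizers with different vocabularies.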

### Inference Speed (MacBook Air, MPS)

| Metric | Value |
|--------|-------|
| Single sequence latency | 27ms (2,048 tokens) |
| Batched throughput | 75,000+ tokens/sec |
| Time per 10-K filing | ~1 second |
| Full SEC EDGAR database | ~22 hours |
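
The table entries are related by simple arithmetic. The sketch below reproduces the throughput and the full-database estimate from the latency and per-filing numbers above; it assumes filings are processed sequentially at a constant ~1 second each, which is a simplification.

```python
SEQ_TOKENS = 2_048         # tokens in one sequence
LATENCY_S = 0.027          # measured single-sequence latency
FILINGS = 80_000           # approximate number of 10-K filings
SECONDS_PER_FILING = 1.0   # approximate time per filing from the table

# Throughput implied by the single-sequence latency.
tokens_per_sec = SEQ_TOKENS / LATENCY_S

# Wall-clock estimate for every filing, assuming sequential processing.
total_hours = FILINGS * SECONDS_PER_FILING / 3600

print(f"{tokens_per_sec:,.0f} tok/sec")
print(f"{total_hours:.1f} hours")
```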

### Cost Comparison (processing 80K filings)

| Approach | Cost |
|----------|------|
| GPT-4o API ($2.50/1M tokens) | ~$15,000 |
| Claude Sonnet 4.6 API ($3.00/1M tokens) | ~$18,000 |
| Claude Haiku 4.5 API ($1.00/1M tokens) | ~$6,000 |
| GPT-4o-mini API ($0.15/1M tokens) | ~$900 |
| **This model (local)** | **$0** |

*Prices as of March 2026. Input tokens only. The figures correspond to roughly 6B input tokens (80K filings at ~75K tokens each).*

## Training Details

Built using the [autoresearch-macos](https://github.com/miolini/autoresearch-macos) port of [Karpathy's autoresearch](https://github.com/karpathy/autoresearch), a framework for autonomous hyperparameter experimentation. An AI agent (Claude) iteratively modified the training configuration, ran 5-minute training sessions, and kept the changes that improved val_bpb.

### Key hyperparameters (after optimization)

- Learning rates: 1.5x default (Embedding: 0.9, Matrix/Muon: 0.06)
- Warmdown ratio: 0.05 (LR stays at peak for 95% of training)
- Optimizer: MuonAdamW (Muon for matrix params, AdamW for embeddings)
- Batch size: 65,536 tokens per step
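
To make the warmdown ratio concrete, here is a minimal sketch of a schedule with no warmup and a short final decay. It assumes the decay is linear to zero and that progress is measured as the fraction of the training budget consumed; the actual shape used in `train.py` may differ, and `lr_multiplier` is a hypothetical name.

```python
def lr_multiplier(progress: float, warmdown_ratio: float = 0.05) -> float:
    """Scale factor applied to the peak learning rate.

    progress: fraction of the training budget consumed, in [0, 1].
    warmdown_ratio: fraction of the budget spent decaying at the end.
    Assumption: linear decay to zero during the warmdown phase.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    if warmdown_ratio <= 0.0:
        return 1.0
    if progress < 1.0 - warmdown_ratio:
        return 1.0  # hold the peak learning rate
    # Clamp to guard against floating-point overshoot at the boundary.
    return min(1.0, (1.0 - progress) / warmdown_ratio)


PEAK_LR = {"embedding": 0.9, "matrix": 0.06}  # values listed above

for p in (0.0, 0.5, 0.975, 1.0):
    scaled = {name: lr * lr_multiplier(p) for name, lr in PEAK_LR.items()}
    print(p, scaled)
```

With a ratio of 0.05 the multiplier stays at 1.0 until 95% of the budget is used, which matches the description in the list above.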

### Data pipeline

1. Downloaded 10-K filing index from SEC EDGAR (2015-2025)
2. Filtered to financial companies (SIC codes 6000-6411): banks, insurance, investment firms
3. Sampled 1,500 filings, downloaded full text from EDGAR
4. Cleaned HTML/XBRL markup, removed filings that were too short or too numeric
5. Chunked into 2,048-token sequences, split 90/10 train/val
6. Trained a BPE tokenizer (8,192 vocab) on the financial text
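
Step 5 can be sketched as follows. This is an illustration, not the code in `prepare_10k.py`: it assumes non-overlapping chunks, drops the trailing remainder, and splits at the sequence level after shuffling. The real pipeline may split by filing instead.

```python
import random


def chunk_tokens(token_ids: list[int], seq_len: int = 2048) -> list[list[int]]:
    """Split a token stream into non-overlapping sequences of seq_len tokens.

    Assumption: a trailing remainder shorter than seq_len is dropped.
    """
    n_chunks = len(token_ids) // seq_len
    return [token_ids[i * seq_len:(i + 1) * seq_len] for i in range(n_chunks)]


def split_train_val(sequences, val_fraction: float = 0.1, seed: int = 0):
    """Shuffle sequences deterministically and hold out val_fraction of them."""
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Toy token stream: ten full sequences plus five leftover tokens.
stream = list(range(2048 * 10 + 5))
sequences = chunk_tokens(stream)
train, val = split_train_val(sequences)
print(len(sequences), len(train), len(val))  # 10 9 1
```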

## Intended Use

This model is a research artifact demonstrating domain-specific SLM training. Potential applications:

- **Document embeddings**: Fast similarity search over financial filings
- **Anomaly detection**: Flag filings with unusual language patterns
- **Pre-filtering**: Cheap triage before sending documents to expensive API models
- **Privacy-preserving analysis**: All processing stays on-device
- **Foundation for fine-tuning**: Starting point for downstream financial NLP tasks
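
As one illustration of the anomaly-detection idea, the sketch below flags filings whose bits-per-byte under the model is unusually high compared with the rest of a batch. The scoring step is assumed to have been done already; `flag_anomalies`, the z-score rule, and the threshold are all hypothetical choices, and the scores shown are made-up placeholders rather than measurements.

```python
from statistics import mean, stdev


def flag_anomalies(bpb_by_filing: dict[str, float], z_threshold: float = 2.0) -> list[str]:
    """Return filing IDs whose bits-per-byte is far above the batch mean.

    Assumption: a high bpb means the model found the text surprising, so a
    simple one-sided z-score is used as the anomaly signal.
    """
    values = list(bpb_by_filing.values())
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return sorted(
        filing_id
        for filing_id, bpb in bpb_by_filing.items()
        if (bpb - mu) / sigma > z_threshold
    )


# Made-up scores for illustration only.
scores = {f"filing-{i}": 1.60 + 0.01 * (i % 3) for i in range(20)}
scores["filing-odd"] = 2.50
flagged = flag_anomalies(scores)
print(flagged)
```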

## Limitations

- **Not a chatbot**: This is a base language model. It predicts next tokens; it doesn't answer questions.
- **Tiny model**: 11.5M parameters means limited capacity. It captures patterns and statistics of financial language, not deep reasoning.
- **Narrow training data**: Only financial company 10-K filings. Performance on other financial documents (earnings calls, proxy statements) is untested.
- **No safety training**: No RLHF, no content filtering. Not suitable for user-facing generation.

## How to Use

```python
import torch
from train import GPT, GPTConfig

# Load checkpoint
ckpt = torch.load("model.pt", map_location="cpu")
config = GPTConfig(**ckpt["config"])

model = GPT(config)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

# Run inference (placeholder token IDs; replace with your own tokenized input)
tokens = torch.tensor([[1, 2, 3]])
with torch.no_grad():
    logits = model(tokens)
```

## Citation

If you use this model in your work, please cite:

```
@misc{10k-financial-slm-2026,
  title={10-K Financial SLM: A Domain-Specific Small Language Model for SEC Filings},
  year={2026},
  url={https://github.com/harryschaefer93/autoresearch-10k-macos}
}
```

## Acknowledgments

- [Andrej Karpathy](https://github.com/karpathy) / [autoresearch-macos](https://github.com/miolini/autoresearch-macos) for the training framework
- [Claude Code](https://claude.ai/claude-code) for autonomous experiment orchestration
- SEC EDGAR for the public filing data
206 changes: 205 additions & 1 deletion README.md
@@ -1,4 +1,208 @@
# autoresearch-macos
# 10-K Financial SLM

A tiny (11.5M parameter) GPT language model trained exclusively on SEC 10-K filings from financial companies. Built using [Karpathy's autoresearch](https://github.com/karpathy/autoresearch) framework with an AI agent (Claude) autonomously running experiments on a MacBook Air. ~20 experiments at 5 minutes each, ~2 hours of total GPU time.

**[Model on HuggingFace](https://huggingface.co/HarryS64/10k-financial-slm)** · **[GitHub Repo](https://github.com/harryschaefer93/autoresearch-10k-macos)**

<p align="center">
<img src="images/training-monitor.jpeg" width="48%" alt="Live training monitor showing benchmark race, loss curve, and thermal status" />
<img src="images/cooling-setup.jpeg" width="48%" alt="MacBook Air on a dehumidifier for passive cooling during training" />
</p>
<p align="center"><em>Left: Live training dashboard tracking the optimization journey. Right: Our "cooling solution" — a MacBook Air on a dehumidifier. It worked.</em></p>

## Results

| Metric | Value |
|--------|-------|
| Compression (val_bpb) | **1.645** (vs 2.146 general baseline) |
| Improvement over general model | **23.3%** |
| Inference speed | 75,000+ tok/sec on MacBook Air |
| Time per 10-K filing | ~1 second |
| Cost to process all 80K SEC filings | **$0** (vs ~$1,260 to ~$25,000 via API, depending on model) |

## The Question

Can a tiny model trained exclusively on financial text outperform a general-purpose model of the same size at understanding financial documents? And can we use autonomous AI-driven experimentation to optimize it?

## What We Did

### Step 1: Built a financial data pipeline

We wrote `prepare_10k.py` to pull 10-K filings directly from SEC EDGAR:

1. Downloaded the quarterly master index (2015-2025) — found ~80,000 10-K filings total
2. Filtered to financial companies only (SIC codes 6000-6411: banks, insurance, investment firms) — 18,538 filings
3. Sampled 1,500, downloaded full text, cleaned HTML/XBRL markup
4. Kept 1,131 high-quality filings after filtering out too-short or too-numeric documents
5. Chunked into 60,095 training sequences (2,048 tokens each) + 6,677 validation sequences
6. Trained a domain-specific BPE tokenizer (8,192 vocab) on the financial text
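
The first two steps can be sketched as below. EDGAR's quarterly `master.idx` files are pipe-delimited with the columns `CIK|Company Name|Form Type|Date Filed|Filename`. SIC codes are not part of that index, so the sketch assumes they come from a separate lookup. The function names and sample rows are made up for illustration and are not taken from `prepare_10k.py`.

```python
def parse_master_index(text: str) -> list[dict]:
    """Parse the data rows of an EDGAR master index, skipping header lines."""
    records = []
    for line in text.splitlines():
        parts = line.split("|")
        if len(parts) != 5 or not parts[0].isdigit():
            continue  # preamble, column header, or separator line
        cik, company, form_type, date_filed, filename = parts
        records.append({
            "cik": int(cik),
            "company": company,
            "form_type": form_type,
            "date_filed": date_filed,
            "filename": filename,
        })
    return records


def select_10k(records: list[dict]) -> list[dict]:
    """Keep only annual reports filed as form type 10-K."""
    return [r for r in records if r["form_type"] == "10-K"]


def is_financial_sic(sic: int) -> bool:
    """True for the SIC range used in this project (6000-6411)."""
    return 6000 <= sic <= 6411


# Made-up sample rows in the master.idx layout.
SAMPLE = """CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------
1000001|Example Bank Corp|10-K|2020-02-28|edgar/data/1000001/0001000001-20-000001.txt
1000002|Example Software Inc|10-Q|2020-05-01|edgar/data/1000002/0001000002-20-000002.txt
1000003|Example Insurance Co|10-K|2020-03-02|edgar/data/1000003/0001000003-20-000003.txt
"""

annual_reports = select_10k(parse_master_index(SAMPLE))
print([r["company"] for r in annual_reports])
```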

### Step 2: Established baselines

We trained the same 11.5M parameter GPT architecture on two datasets:

- **ClimbMix** (general web text): val_bpb = 2.146
- **10-K financial text**: val_bpb = 1.711

Just swapping the training data — same model, same hyperparameters — gave a **20% improvement** on financial text compression. Domain specialization works.

### Step 3: Autonomous hyperparameter optimization

This is where [autoresearch](https://github.com/karpathy/autoresearch) comes in. We pointed Claude at the training script and let it run experiments autonomously. Each experiment:

1. Modify `train.py` (change learning rates, schedules, architecture, etc.)
2. Train for exactly 5 minutes
3. Check if val_bpb improved
4. Keep the change or revert, log results, repeat
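
The loop above amounts to a greedy keep-or-revert search. Here is a minimal sketch of that control flow. It is not the agent's actual code: `run_experiments` is a hypothetical helper, and `fake_train` stands in for a real 5-minute training run by replaying a few logged val_bpb values and simulating an out-of-memory failure for deeper models.

```python
def run_experiments(baseline, proposals, evaluate):
    """Greedy search: apply each proposed change, keep it only if val_bpb drops."""
    best_config = dict(baseline)
    best_bpb = evaluate(best_config)
    log = [("baseline", best_bpb, "start")]
    for name, change in proposals:
        candidate = {**best_config, **change}
        try:
            bpb = evaluate(candidate)
        except MemoryError:
            log.append((name, None, "reverted (OOM)"))
            continue
        if bpb < best_bpb:
            best_config, best_bpb = candidate, bpb
            log.append((name, bpb, "kept"))
        else:
            log.append((name, bpb, "reverted"))
    return best_config, best_bpb, log


# Stand-in for training: logged val_bpb keyed by (lr_scale, warmdown).
LOGGED_BPB = {(1.0, 0.5): 1.711, (1.5, 0.5): 1.677, (2.0, 0.5): 1.700, (1.5, 0.3): 1.658}


def fake_train(config):
    if config["depth"] > 4:
        raise MemoryError("simulated out-of-memory")
    return LOGGED_BPB[(config["lr_scale"], config["warmdown"])]


baseline = {"lr_scale": 1.0, "warmdown": 0.5, "depth": 4}
proposals = [
    ("1.5x learning rates", {"lr_scale": 1.5}),
    ("2x learning rates", {"lr_scale": 2.0}),
    ("warmdown 0.5->0.3", {"warmdown": 0.3}),
    ("depth 4->6", {"depth": 6}),
]
best_config, best_bpb, log = run_experiments(baseline, proposals, fake_train)
for entry in log:
    print(entry)
```

Because each change is judged against the current best, the outcome depends on the order of the proposals. That is a known limitation of greedy search.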

We ran ~15 experiments. Here's what happened:

| Experiment | Change | val_bpb | Outcome |
|------------|--------|---------|---------|
| Baseline | Default config | 1.711 | Starting point |
| 1.5x learning rates | LR 0.6->0.9, 0.04->0.06 | 1.677 | **Kept** |
| 2x learning rates | Too aggressive | 1.700 | Reverted |
| Warmdown 0.5->0.3 | Keep LR high longer | 1.658 | **Kept** |
| Warmdown 0.3->0.15 | Even longer | 1.646 | **Kept** |
| Warmdown 0.15->0.05 | Nearly no cooldown | **1.645** | **Kept (best)** |
| Add 5% warmup | Ramp LR slowly | 1.749 | Reverted |
| Depth 4->6 | More layers | OOM | Reverted |
| Depth 4->5 | Slightly more layers | Too slow | Reverted |
| Batch 16->32 | Less grad accumulation | OOM | Reverted |
| 4 heads (head_dim 64) | More attention heads | 2.042 | Reverted |
| Half batch size | More steps, noisier | 1.707 | Reverted |
| SSSL window pattern | Sliding window attention | 1.819 | Reverted |

**Final best: 1.645 val_bpb** (3.9% improvement from tuning on top of the 20% from specialization).
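
The two gains compound multiplicatively rather than adding up. This short check uses only the val_bpb values reported above.

```python
GENERAL = 1.0 * 2.146   # ClimbMix baseline
DOMAIN = 1.711          # same config, 10-K data
TUNED = 1.645           # after hyperparameter search

specialization = (GENERAL - DOMAIN) / GENERAL   # gain from swapping data
tuning = (DOMAIN - TUNED) / DOMAIN              # gain from tuning
overall = (GENERAL - TUNED) / GENERAL           # end-to-end gain

# The remaining fractions multiply: (1 - a) * (1 - b) == 1 - overall.
compounded = 1 - (1 - specialization) * (1 - tuning)

print(f"specialization {specialization:.1%}, tuning {tuning:.1%}, overall {overall:.1%}")
```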

### Step 4: What we learned the hard way

**Thermal throttling was the biggest confound.** On a MacBook Air (no fan), the M-series chip throttles aggressively under sustained load. Our step times swung from 3.6s to 148s mid-run, making experiments unreliable. We wasted several rounds before realizing the "improvements" were just thermal noise.

The fix was embarrassingly simple: **put the laptop on a dehumidifier** (see photo above). After that, step times stabilized at ~3.6s and throughput went from erratic to a consistent ~18,000 tok/sec. This alone increased our steps-per-run from ~68 to ~91 — a bigger improvement than most hyperparameter changes.
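
The throughput figure follows from the step time. The sketch below assumes the 65,536-token batch size listed in the model card and shows how much a throttled step costs.

```python
TOKENS_PER_STEP = 65_536  # batch size from the model card


def training_throughput(step_time_s: float) -> float:
    """Tokens processed per second for a given optimizer step time."""
    return TOKENS_PER_STEP / step_time_s


stable = training_throughput(3.6)       # after the cooling fix
throttled = training_throughput(148.0)  # worst observed step time

print(f"stable: {stable:,.0f} tok/sec, throttled: {throttled:,.0f} tok/sec")
```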

**What worked:**
- Higher learning rates (1.5x default) — the 5-minute budget means the model needs to learn fast
- Minimal warmdown — with so few steps, spending half the budget cooling down the LR wastes training time
- Keeping the model small — deeper/wider models couldn't converge in 5 minutes even if they had more capacity

**What didn't work:**
- Architecture changes (more layers, different attention patterns) — not enough training time to benefit
- Smaller batch sizes — more steps but noisier gradients, net negative
- Warmup — the model needs high LR from step 0 with random weights
- float16/bfloat16 autocast on MPS — no speedup on Apple Silicon (no tensor cores)
- torch.compile on MPS — not supported in PyTorch 2.6

## Benchmarks

### 1. Compression Quality

| Model | val_bpb | Notes |
|-------|---------|-------|
| General model (ClimbMix) | 2.146 | Same architecture, general web text |
| **10-K specialized model** | **1.645** | Same architecture, financial text |

A 23.3% lower val_bpb indicates that the specialized model captures financial language patterns substantially better.

### 2. Inference Speed (MacBook Air M2, MPS)

| Mode | Latency | Throughput |
|------|---------|------------|
| Single sequence (2,048 tokens) | 27ms | 75,000 tok/sec |
| Batched (16 x 2,048 tokens) | ~0.4s | 75,000+ tok/sec |
| One full 10-K filing (~75K tokens) | ~1 second | - |
| All 80K SEC EDGAR filings | ~22 hours | - |

### 3. Cost to Process Full SEC Database (~80K filings, ~8.4B tokens)

*Methodology: 1,131 filings averaged 120,910 tokens each (our 8K-vocab tokenizer). Converted to GPT-equivalent tokens at 0.875x ratio (accounting for vocabulary efficiency difference). Extrapolated across all 79,513 10-K filings in SEC EDGAR (2015-2025).*

| Approach | Price/1M input tokens | Cost (8.4B tokens) |
|----------|----------------------|---------------------|
| GPT-4o API | $2.50 | ~$21,000 |
| Claude Sonnet 4.6 API | $3.00 | ~$25,000 |
| Claude Haiku 4.5 API | $1.00 | ~$8,400 |
| GPT-4o-mini API | $0.15 | ~$1,260 |
| **This model (local)** | **$0** | **$0** |

*Prices as of March 2026. Input tokens only (processing/embedding), no output generation. Batch API discounts (50% off) would roughly halve these costs.*
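
The methodology can be reproduced in a few lines. The inputs are the figures stated above. The results differ slightly from the table because the table rounds the token count to 8.4B before multiplying.

```python
FILINGS = 79_513               # 10-K filings in EDGAR, 2015-2025
AVG_TOKENS = 120_910           # mean tokens per filing (8K-vocab tokenizer)
GPT_EQUIVALENT_RATIO = 0.875   # conversion to GPT-equivalent tokens

PRICE_PER_MILLION = {          # USD per 1M input tokens, from the table
    "GPT-4o": 2.50,
    "Claude Sonnet 4.6": 3.00,
    "Claude Haiku 4.5": 1.00,
    "GPT-4o-mini": 0.15,
}

total_tokens = FILINGS * AVG_TOKENS * GPT_EQUIVALENT_RATIO
costs = {name: total_tokens / 1e6 * price for name, price in PRICE_PER_MILLION.items()}

print(f"{total_tokens / 1e9:.2f}B tokens")
for name, cost in costs.items():
    print(f"{name}: ${cost:,.0f}")
```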

## Potential Uses

This model won't replace GPT-4 for deep financial analysis. It's a **specialized tool** for specific use cases where speed, cost, and privacy matter:

- **Document embeddings** — fast similarity search across thousands of filings
- **Anomaly detection** — flag filings with unusual language patterns
- **Pre-filtering** — cheap triage before sending to an expensive API
- **Privacy-preserving analysis** — data never leaves the device
- **Edge deployment** — small enough to run on a phone
- **Fine-tuning foundation** — starting point for downstream financial NLP tasks

## What We'd Try Next

- **Longer training** (hours not minutes) on a machine with proper cooling
- **Scale to 50-100M parameters** while staying edge-deployable
- **Downstream tasks** — sector classification, sentiment analysis, NER on financial text
- **Broader corpus** — earnings calls, proxy statements, analyst reports
- **Quantization** — INT8/INT4 for even faster inference on mobile

## Quick Start

```bash
# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv sync

# Download and prepare 10-K data (~5 min)
AUTORESEARCH_CACHE=~/.cache/autoresearch-10k uv run prepare_10k.py

# Train the model (5 min on Apple Silicon)
AUTORESEARCH_CACHE=~/.cache/autoresearch-10k uv run train.py

# Run benchmarks
AUTORESEARCH_CACHE=~/.cache/autoresearch-10k uv run benchmark.py

# Live training dashboard (run in a separate terminal)
uv run monitor.py
```

## Project Structure

| File | Purpose |
|------|---------|
| `train.py` | Model + training loop (the file autoresearch modifies) |
| `prepare.py` | Original ClimbMix data pipeline |
| `prepare_10k.py` | SEC EDGAR 10-K data pipeline |
| `benchmark.py` | Perplexity, speed, and cost benchmarks |
| `monitor.py` | Live terminal dashboard with loss curves + thermal monitoring |
| `program.md` | Instructions for the AI agent |
| `MODEL_CARD.md` | Full model card for HuggingFace |
| `benchmark_results.json` | Machine-readable benchmark results |

## Requirements

- macOS with Apple Silicon (M1/M2/M3/M4) or NVIDIA GPU
- Python 3.10+
- [uv](https://astral.sh/uv) package manager

## Acknowledgments

- [Andrej Karpathy](https://karpathy.ai) / [autoresearch](https://github.com/karpathy/autoresearch) for the training framework
- [miolini/autoresearch-macos](https://github.com/miolini/autoresearch-macos) for the macOS/MPS port
- [Claude Code](https://claude.ai/claude-code) for autonomous experiment orchestration
- SEC EDGAR for public filing data

## License

MIT

---

*This project is a fork of [autoresearch-macos](https://github.com/miolini/autoresearch-macos). Original README below.*

---

![teaser](progress.png)
