This project reproduces the core empirical results from the NeurIPS 2024 paper “Not All Tokens Are What You Need for Pretraining” using TinyLlama-1.1B on a single consumer GPU.
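We evaluate four training setups:
- Baseline CLM (causal language modeling) training
- Top-k SLM (Selective Language Modeling)
- Random SLM
- Stochastic SLM

The SLM variants are run at selection ratios r = 0.5 and r = 0.3. We also report: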
- Token bucket movement (H→L, L→H)
- Validation perplexity
- Cosine similarity between LoRA update directions
This produces a compact, faithful replication of the paper’s key findings.
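As a rough illustration of how the three SLM variants differ, the sketch below builds a keep-mask over per-token scores for each strategy. It is not the project's training code. The function names, the use of raw per-token loss as the score, the round-up rule for the kept count, and the Gumbel top-k sampler with a `temperature` parameter for the stochastic variant are all assumptions made for this example.

```python
import math
import random


def n_keep(n_tokens, ratio):
    """Number of tokens kept for selection ratio r (rounding up is an assumption)."""
    return max(1, math.ceil(ratio * n_tokens))


def top_k_mask(scores, ratio):
    """Hard selection: keep exactly the highest-scoring tokens."""
    k = n_keep(len(scores), ratio)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(scores))]


def random_mask(scores, ratio, rng):
    """Uniform selection: the scores are ignored entirely."""
    k = n_keep(len(scores), ratio)
    keep = set(rng.sample(range(len(scores)), k))
    return [i in keep for i in range(len(scores))]


def stochastic_mask(scores, ratio, rng, temperature=1.0):
    """Soft selection: sample k tokens without replacement, favoring high scores.

    Uses the Gumbel top-k trick (an assumed sampler, not taken from the source):
    perturb score / temperature with Gumbel noise, then keep the top k.
    """
    perturbed = []
    for s in scores:
        u = rng.uniform(1e-12, 1.0 - 1e-12)  # stay away from 0 and 1 to avoid log(0)
        perturbed.append(s / temperature - math.log(-math.log(u)))
    return top_k_mask(perturbed, ratio)


token_losses = [0.1, 2.5, 0.3, 1.7, 0.9, 3.2]  # made-up per-token losses
rng = random.Random(0)
hard = top_k_mask(token_losses, 0.5)
uniform = random_mask(token_losses, 0.5, rng)
soft = stochastic_mask(token_losses, 0.5, rng)
```

In a training loop, a mask like this would typically multiply the per-token loss before averaging, so that only the kept tokens contribute gradients.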
The baseline CLM run serves as the reference model (100% token usage).
Final validation perplexity: ≈ 9.89
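For reference, validation perplexity is conventionally the exponential of the mean per-token cross-entropy (in nats). A minimal sketch, assuming the per-token losses have already been collected over the validation set; `perplexity` is a hypothetical helper name, not something from this repository:

```python
import math


def perplexity(token_losses):
    """exp(mean negative log-likelihood), with losses measured in nats."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))


# Made-up example: a model that assigns probability 1/4 to every token.
example_ppl = perplexity([math.log(4.0)] * 8)
```

Under this definition, a perplexity of 9.89 corresponds to a mean loss of about 2.29 nats per token.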
Observations (Top-k SLM):
- Validation ppl ≈ 11.40 (about 15% above baseline)
- Worst performer
- Highly unstable
- Consistent with the paper: hard selection harms performance
Observations (Stochastic SLM):
- Validation ppl ≈ 10.04
- Closest to baseline
- Soft preference for high-loss tokens → stable
Observations (Random SLM):
- Validation ppl ≈ 10.05
- Nearly identical to stochastic
- Strong evidence of token redundancy
Observations:
- More noise (fewer tokens)
- Still stable and effective
- Shows SLM holds up even at 30% token usage
Observations:
- Noisy but structured
- Worse than stochastic at same ratio
- Still confirms difficulty-targeting behavior
Tokens are bucketed as high-loss (H) or low-loss (L), using the 70% quantile of the baseline model's per-token loss as the threshold.
- Top-k:
  - Highest L→H regressions (bad)
  - Confirms overfitting to difficult tokens
- Random:
  - Moves tokens without pattern
  - Neutral but stable
- Stochastic:
  - Highest H→L improvements
  - Lowest L→H regressions
  - Best stability
  - Matches paper’s motivation for soft selection
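The bucket-movement counts above can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the threshold is fixed at the baseline's 70% loss quantile, a token counts as H when its loss is at or above that threshold, and both models are scored on the same validation tokens. The helper names and the linear-interpolation quantile are choices made for this example.

```python
import math


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, for 0 <= q <= 1."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return xs[lo] * (1.0 - frac) + xs[hi] * frac


def bucket_movement(baseline_losses, method_losses, q=0.7):
    """Count tokens that cross the baseline loss threshold in each direction."""
    threshold = quantile(baseline_losses, q)
    h_to_l = 0  # high-loss under baseline, low-loss under the method (improvement)
    l_to_h = 0  # low-loss under baseline, high-loss under the method (regression)
    for base, new in zip(baseline_losses, method_losses):
        was_high = base >= threshold
        is_high = new >= threshold
        if was_high and not is_high:
            h_to_l += 1
        elif is_high and not was_high:
            l_to_h += 1
    return {"threshold": threshold, "H->L": h_to_l, "L->H": l_to_h}


# Made-up per-token losses for ten validation tokens.
baseline = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
method = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.5, 7.0, 9.0, 10.0]
moves = bucket_movement(baseline, method)
```

In this toy data the threshold is about 7.3, one token improves (8.0 → 7.0) and one regresses (7.0 → 7.5).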
Cosine similarity measures how closely each method's LoRA update direction aligns with that of baseline CLM training.
Higher values mean the method behaves more like CLM training.
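A minimal sketch of this comparison, assuming each run's "update direction" is the flattened change in its LoRA parameters relative to initialization (the source does not specify how the vectors are built). The dictionary layout, parameter names, and helper functions here are hypothetical.

```python
import math


def flatten_update(update):
    """Concatenate per-parameter update values into one vector.

    Keys are sorted so both runs are flattened in the same order.
    """
    return [x for name in sorted(update) for x in update[name]]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up updates: the second run moves in the same direction at twice the scale.
baseline_update = {"layer0.lora": [0.2, -0.1], "layer1.lora": [0.05, 0.3]}
method_update = {"layer0.lora": [0.4, -0.2], "layer1.lora": [0.1, 0.6]}
sim = cosine_similarity(flatten_update(baseline_update), flatten_update(method_update))
```

Because cosine similarity ignores magnitude, the two toy updates score approximately 1.0 even though one is twice as large; only the direction matters.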
| Comparison | Cosine |
|---|---|
| Baseline ↔ Top-k | 0.9116 |
| Baseline ↔ Random | 0.9317 |
| Baseline ↔ Stochastic | 0.9304 |
➡ Random & Stochastic remain closest to baseline.
➡ Top-k diverges the most, which is consistent with its poor validation perplexity.
| Comparison | Cosine |
|---|---|
| Top-k ↔ Random | 0.8763 |
| Top-k ↔ Stochastic | 0.9020 |
| Random ↔ Stochastic | 0.9109 |
Random ↔ Stochastic is highest → both behave as “soft CLM”.
Using 50% of tokens increases validation perplexity by only ~1.5% (9.89 → 10.04).
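The percentage can be checked directly from the validation perplexities reported earlier. The snippet below is only an arithmetic check; `relative_increase` is a hypothetical helper.

```python
def relative_increase(baseline_ppl, method_ppl):
    """Fractional increase in perplexity relative to the baseline."""
    return (method_ppl - baseline_ppl) / baseline_ppl


baseline_ppl = 9.89
reported_ppls = [10.04, 10.05, 11.40]  # validation perplexities quoted in this README
increases = [relative_increase(baseline_ppl, p) * 100 for p in reported_ppls]
for ppl, pct in zip(reported_ppls, increases):
    print(f"{ppl:.2f}: +{pct:.1f}%")
# 10.04: +1.5%
# 10.05: +1.6%
# 11.40: +15.3%
```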
Closest to baseline in:
- Perplexity
- Token bucket behavior
- Cosine similarity
Deterministic hard selection hurts quality and stability.
LoRA cosine similarities confirm close alignment with CLM.
All trends match the behaviors described in the SLM/RHO-1 paper.






