Context Rot Needle-in-Haystack Experiment

Replicates the needle-in-a-haystack (NIAH) methodology from Chroma's Context Rot study. It runs classic NIAH over Paul Graham (PG) essays and arXiv text, plus a Futurama-perturbed copy of the Django cache codebase (BenderCache, nibbler, scruffy_limit) that reproduces the lost-in-the-middle effect.

Architecture

config.yaml + .env  -->  Haystack (PG + arXiv)  -->  Classic NIAH  -->  Providers (OpenAI, Anthropic, Google)
                                                                              |
                                                                              v
                                                                    LLM Judge (GPT-4o)
                                                                              |
                                                                              v
                                                                    JSON / CSV / HTML

Haystacks: PG essays from paulgraham.com, arXiv abstracts from cs.IR/cs.CL, OSS code (requests, flask)
Perturbed codebase (code_perturbed): The Django cache module is copied and systematically renamed by scripts/perturb_django.py so that the model cannot answer from parametric knowledge. Identifiers such as BaseCache, max_entries, and timeout become BenderCache, scruffy_limit, and nibbler; default values (e.g. 300, 3) are replaced with unfamiliar ones (e.g. 719, 417, 13). This setup reproduces the lost-in-the-middle effect: accuracy is high at short context (e.g. 100% at 200 tokens) and drops sharply at longer lengths (e.g. 6.7% at 64K).
Experiment: Insert needle at 0%, 50%, 100% of context; vary context length (1K–128K)
Judge: GPT-4o evaluates model output vs expected needle
Output: results/ with JSON, CSV, and HTML report (Chart.js)
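The renaming step for the perturbed codebase can be illustrated with a minimal sketch. It uses only the three identifier pairs named above; the `RENAMES` map and `perturb_identifiers` function are hypothetical, and the real scripts/perturb_django.py may cover more names and work differently. Perturbing default values is left out because the exact old-to-new pairing is not given here.

```python
import re

# Hypothetical rename map built from the identifier pairs named in this README.
RENAMES = {
    "BaseCache": "BenderCache",
    "max_entries": "scruffy_limit",
    "timeout": "nibbler",
}


def perturb_identifiers(source: str, renames: dict[str, str]) -> str:
    """Rename whole-word identifiers; longer names that merely contain one are untouched."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, renames)) + r")\b")
    return pattern.sub(lambda match: renames[match.group(1)], source)


sample = (
    "class BaseCache:\n"
    "    def __init__(self, timeout=300, max_entries=300):\n"
    "        self.default_timeout = timeout\n"
)
print(perturb_identifiers(sample, RENAMES))
```

Because `\b` treats the underscore as a word character, `default_timeout` is not rewritten in this sketch; a fuller perturbation would need an explicit entry for each such compound name.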

Setup

make install

Set API keys in .env or parent all/.env:

OPENAI_API_KEY
ANTHROPIC_API_KEY
GOOGLE_API_KEY

Override env path: CONTEXT_ROT_ENV=/path/to/.env

make fetch-haystacks   # Download PG essays + arXiv papers (cached in data/)

Run

make run          # Full experiment (3 models, pg+arxiv variants, 6 lengths, 3 positions)
make run-quick    # 1 model, 6 calls
make run-code     # Code variant only (requests/flask haystack, coding needles)
make perturb-django       # Build django_perturbed (required before run-code-perturbed)
make run-code-perturbed   # Perturbed variant (BenderCache needles, lost-in-the-middle)
make run-code-perturbed-sweep  # Full sweep: bender_cache_unguided, all context lengths
make report       # Regenerate HTML from existing results
make export-chart # Export accuracy-vs-length chart to assets/context-rot-chart-updated.png (requires --extra chart)
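The size of a run follows from the dimensions listed for make run. The sketch below enumerates that grid, assuming one model call per combination (judge calls not counted). The model names are placeholders, and `build_grid` and its `max_calls` truncation are hypothetical stand-ins for whatever the runner actually does.

```python
from itertools import product

# Placeholder model names; the real IDs come from config.yaml.
models = ["openai:model-a", "anthropic:model-b", "google:model-c"]
variants = ["pg", "arxiv"]
context_lengths = [1000, 4000, 16000, 32000, 64000, 128000]
needle_positions = [0, 0.5, 1.0]


def build_grid(models, variants, lengths, positions, max_calls=None):
    """Return one dict per (model, variant, length, position) combination."""
    grid = [
        {"model": m, "variant": v, "context_length": n, "needle_position": p}
        for m, v, n, p in product(models, variants, lengths, positions)
    ]
    # Assumed behaviour for a call cap: keep only the first max_calls entries.
    return grid if max_calls is None else grid[:max_calls]


full_grid = build_grid(models, variants, context_lengths, needle_positions)
print(len(full_grid))  # 108 = 3 models * 2 variants * 6 lengths * 3 positions
```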

CLI Options

Option	Description
`--dry-run`	Log without API calls
`--max-calls N`	Limit API calls
`--workers-per-provider N`	Max concurrent requests per provider (default: 3). Anthropic automatically uses 1 worker when context > 30K tokens to avoid HTTP 429 (rate limit) errors.
`--models openai:gpt-5.1,anthropic:claude-sonnet-4-6`	Filter providers/models
`--variant pg,arxiv,code`	Variants to run (default: pg,arxiv). Use `code` for codebase haystack.
`--report-only`	Generate HTML from `results_latest.json`
`--results-dir PATH`	Output directory

Config (`config.yaml`)

Key	Description
`models`	Model IDs per provider (openai, anthropic, google)
`judge_model`	Model for correctness evaluation
`needle_positions`	[0, 0.5, 1.0] = start, middle, end
`context_lengths`	Token counts, e.g. [1000, 4000, 16000, 32000, 64000, 128000]
`needles`	Per-variant `question` and `answer` (the needle)
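How `needle_positions` and `context_lengths` combine can be sketched as follows. This is an illustration under stated assumptions, not the project's haystack builder: `insert_needle` is a hypothetical helper, and whitespace-split words stand in for real tokenizer tokens.

```python
def insert_needle(haystack_tokens, needle_tokens, position, context_length):
    """Trim the haystack to leave room for the needle, then splice the needle in.

    position is a fraction of the trimmed haystack: 0 = start, 0.5 = middle, 1.0 = end.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0 and 1")
    budget = context_length - len(needle_tokens)
    if budget < 0:
        raise ValueError("context_length is smaller than the needle")
    body = haystack_tokens[:budget]
    index = int(position * len(body))
    return body[:index] + needle_tokens + body[index:]


# Toy data: whitespace-split words approximate tokens in this sketch.
haystack = ("lorem ipsum dolor sit amet " * 50).split()
needle = "the magic number is 4821".split()

for pos in (0, 0.5, 1.0):
    context = insert_needle(haystack, needle, pos, context_length=100)
    print(pos, len(context), context.index("4821"))
```

A real builder would likely count tokens with the provider's tokenizer and might snap the insertion point to a sentence boundary; neither is attempted here.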

Rate Limits

Anthropic enforces a limit of 30K input tokens per minute on some tiers. For contexts above 30K tokens, the runner uses a single Anthropic worker and retries HTTP 429 responses after a 75-second backoff. To skip Anthropic: --models openai,google.
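The policy described above can be sketched in a few lines. The names `RateLimitError`, `workers_for`, and `call_with_backoff` are hypothetical, as is the retry cap of 3; only the 30K threshold, the single worker, the default of 3 workers, and the 75-second wait come from this README.

```python
import time


class RateLimitError(Exception):
    """Stand-in for a provider's HTTP 429 error (hypothetical)."""


def workers_for(provider, context_tokens, default_workers=3):
    """Use a single worker for Anthropic once the context exceeds 30K tokens."""
    if provider == "anthropic" and context_tokens > 30_000:
        return 1
    return default_workers


def call_with_backoff(send, max_retries=3, backoff_seconds=75.0, sleep=time.sleep):
    """Call send(); on a rate-limit error, wait a fixed interval and try again."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except RateLimitError:
            if attempt == max_retries:
                raise
            sleep(backoff_seconds)


# Demo with a fake request that fails twice; waits are recorded instead of slept.
waits = []
attempts = {"count": 0}


def flaky_send():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("429")
    return "ok"


result = call_with_backoff(flaky_send, sleep=waits.append)
print(result, waits)
```

Injecting `sleep` keeps the sketch testable without waiting; in real use the default `time.sleep` applies.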

Testing

make test              # Unit tests (no API calls)
make test-integration  # Integration tests (require API keys, real API calls)

Unit tests cover config, haystack builder, PG/arXiv parsing, metrics, output writers, and classic NIAH (with mock provider). Integration tests verify each provider's API usage with a minimal prompt.

Project Structure

context-rot/
├── run.py              # CLI entry
├── config.yaml
├── assets/             # Chart (context-rot-chart-updated.png)
├── src/
│   ├── config.py
│   ├── providers/      # OpenAI, Anthropic, Google
│   ├── haystack/       # PG essays, arXiv, builder
│   ├── experiments/    # classic NIAH
│   ├── evaluation/     # judge, metrics
│   └── output/         # JSON, CSV, HTML
├── data/               # Cached haystacks (gitignored)
├── results/            # JSON, CSV, report.html
└── tests/
