Rigorous evaluation of contextual retrieval techniques for financial documents: comparing 5 embedders × 4 chunking strategies with bootstrapped confidence intervals on FinMTEB and FinanceBench.
🚧 Work in progress. This repository implements a paper-quality evaluation suite for financial document retrieval, comparing state-of-the-art embedding models and chunking strategies on the FinanceBench benchmark (Patronus AI, 2023) and FinMTEB (2025).
Domain-specific fine-tuning combined with modern chunking strategies (Anthropic's Contextual Retrieval, Late Chunking) can outperform general-purpose commercial embedders on financial document QA, even when those embedders use Matryoshka representations or 3072-dim outputs.
This project focuses on Layer 1 of the AI Engineer stack: ML/embeddings/retrieval. RAG generation, agents, and cloud deployment are out of scope for this repository — they are addressed in follow-up projects.
New to finance or to retrieval evaluation? Before diving into the technical content, this repo includes a plain-English glossary that explains every acronym and concept used in the project: SEC, 10-K filings, GICS sectors, what PatronusAI built, what an evidence passage is, and how Recall@k / MRR / NDCG / MAP actually work — with concrete examples grounded in our dataset.
👉 Read the full guide: docs/CONTEXT.md
💡 Recommended for: anyone reviewing the repo without a finance background, or anyone who wants to understand what we're evaluating, against what, and why it matters before reading the code.
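For readers who prefer code to prose, here is a minimal sketch of the four ranking metrics for a single query with binary relevance. The function names and the toy passage IDs are illustrative assumptions, not the repository's evaluation code; MRR and MAP are the means of `reciprocal_rank` and `average_precision` over all queries.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant passages that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant passage, or 0 if none is retrieved."""
    for rank, pid in enumerate(ranked, start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision@rank taken at each rank where a relevant passage occurs."""
    hits, total = 0, 0.0
    for rank, pid in enumerate(ranked, start=1):
        if pid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-gain DCG of the top k, normalized by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, pid in enumerate(ranked[:k], start=1)
        if pid in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0


# Hypothetical query: two evidence passages, retrieved at ranks 2 and 4.
ranked = ["p7", "p2", "p9", "p4", "p1"]
relevant = {"p2", "p4"}

print(recall_at_k(ranked, relevant, k=3))   # 0.5 (one of two found in the top 3)
print(reciprocal_rank(ranked, relevant))    # 0.5 (first hit at rank 2)
print(average_precision(ranked, relevant))  # 0.5 ((1/2 + 2/4) / 2)
print(round(ndcg_at_k(ranked, relevant, k=5), 3))
```

The sketch assumes each passage ID appears at most once in `ranked`.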
| Component | Plan |
|---|---|
| Datasets | FinanceBench (150 QA pairs, public) + FinMTEB (academic finance benchmark) |
| Embedders | OpenAI text-embedding-3-large · Voyage finance-2 · BGE-M3 · Jina v5 / Qwen3 · BGE-M3 fine-tuned (custom) |
| Chunking strategies | Naive fixed-size · Semantic · Anthropic Contextual Retrieval · Late Chunking |
| Reranking | Cohere Rerank v3.5 · BGE Reranker v2 |
| Metrics | Recall@k, MRR, NDCG@10, MAP — all with bootstrap confidence intervals |
| Evaluation | 3 retrieval modes (dense, hybrid, hybrid+rerank) on both benchmarks |
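The bootstrap confidence intervals in the Metrics row can be illustrated with a percentile bootstrap over per-query scores. This is a minimal sketch under stated assumptions: the function name `bootstrap_ci`, the resample count, and the seed are hypothetical choices, and the repository's pipeline may use a different bootstrap variant.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-query scores.

    Returns (mean, lower, upper). Queries are resampled with replacement,
    so the interval reflects uncertainty from the finite query set.
    """
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.fmean(scores), lower, upper


# Hypothetical per-query Recall@10 outcomes for 150 queries (1.0 = evidence found).
scores = [1.0] * 90 + [0.0] * 60
mean, lower, upper = bootstrap_ci(scores)
print(f"Recall@10 = {mean:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

With only 150 queries, intervals of this kind tend to be wide, which is why comparing point estimates alone can be misleading.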
├── data/ # FinanceBench corpus (gitignored)
├── notebooks/ # Exploratory and tutorial notebooks
├── src/
│ ├── embeddings/ # Embedder wrappers
│ ├── chunking/ # Chunking strategies
│ ├── eval/ # Evaluation pipeline + bootstrap
│ └── utils/ # Shared helpers
├── results/ # Per-experiment metrics
├── docs/ # Methodology & decision logs
├── scripts/ # CLI entry points
└── tests/ # Unit tests
# 1. Install Python dependencies (Python 3.12 + uv required)
uv sync
# 2. Download the FinanceBench source PDFs (~84 files, ~165 MB)
uv run python scripts/download_pdfs.py
📦 Why is the data not in the repo? The 165 MB of source PDFs from FinanceBench (CC-BY-NC-4.0) are excluded from version control for three reasons: (1) Git doesn't scale well with large binary files, (2) the dataset is third-party content, and (3) the data is already hosted upstream by patronus-ai/financebench, so committing it here would only duplicate it. The `scripts/download_pdfs.py` script is the canonical recipe: it reads the unique `doc_name` values from the HuggingFace dataset and fetches each PDF directly from the upstream raw GitHub URLs into `data/raw/pdfs/`. It is idempotent (skips files already present), runs in parallel (8 workers), and takes ~10 seconds on a normal connection.
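The idempotent, parallel download pattern described above could look roughly like the following. This is a sketch, not the contents of `scripts/download_pdfs.py`: the `BASE_URL` layout, the function names, and the `.part` temporary-file convention are all assumptions made for illustration, and the list of `doc_name` values is taken as an input rather than loaded from the dataset.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlopen

# Assumed URL layout for the upstream PDFs; check the real script before relying on it.
BASE_URL = "https://raw.githubusercontent.com/patronus-ai/financebench/main/pdfs"


def fetch_bytes(url):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=60) as resp:
        return resp.read()


def download_one(doc_name, out_dir, fetch=fetch_bytes):
    """Fetch a single PDF unless it already exists. Returns True if downloaded."""
    target = Path(out_dir) / f"{doc_name}.pdf"
    if target.exists():
        return False  # idempotent: skip files already present
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted run leaves no partial PDF.
    tmp = target.with_name(target.name + ".part")
    tmp.write_bytes(fetch(f"{BASE_URL}/{doc_name}.pdf"))
    tmp.replace(target)
    return True


def download_all(doc_names, out_dir="data/raw/pdfs", fetch=fetch_bytes, workers=8):
    """Download each unique document in parallel. Returns the number fetched."""
    unique = sorted(set(doc_names))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda name: download_one(name, out_dir, fetch), unique))
    return sum(results)
```

Passing `fetch` as a parameter keeps the logic testable without network access; a second call over the same names returns 0 because every file is already present.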
🚧 TBD. Results will be published here once the evaluation pipeline lands. Each experiment will report Recall@k, MRR, NDCG@10, and MAP with bootstrap 95% confidence intervals across both benchmarks.
Per-experiment artifacts (configs, raw metrics, plots) will live under results/ organized by strategy.
Every result in this repository is reproducible. See docs/reproducibility.md for exact commands, seeds, and dataset revisions.
MIT — see LICENSE.