PDF → Mistral OCR → Deterministic AST Chunker → Contextual RAG
A production-ready pipeline that turns PDFs into retrieval-optimized chunks. Uses Mistral OCR for accurate text extraction, a deterministic AST-based chunker that respects document structure, and optionally enriches each chunk with a context summary following Anthropic's Contextual Retrieval method — reducing retrieval failures by up to 49%.
- Mistral OCR 3 as primary OCR provider; Gemini Vision as automatic fallback
- Deterministic AST chunker powered by remark/mdast — no LLM required for chunking, fast and reproducible
- Anthropic Contextual Retrieval — optional batch or per-chunk context enrichment via Gemini
- Heading normalization — two-phase Gemini pipeline fixes inconsistent heading levels from OCR
- Large PDF auto-split — PDFs over 50 MB are automatically split into 40 MB batches and merged seamlessly
- OCR cache — results cached locally (7-day TTL) so the same PDF is never re-processed
- CLI + programmatic API — use as a command-line tool or import into your own pipeline
npm install @msbayindir/rag-chunker

@google/genai is a required peer dependency and is installed automatically. If you plan to use OpenAI embeddings:
npm install openai

You can process hundreds of documents without spending a cent.
- Go to console.mistral.ai → Sign Up → API Keys
- New accounts receive $5 free credit (no credit card required on sign-up)
- Mistral OCR pricing: ~$1 per 1,000 pages → $5 gets you ~5,000 pages
- More than enough to evaluate and prototype
If you only have a Mistral key, OCR works. Context enrichment and heading normalization require a Gemini key.
- Go to aistudio.google.com → Get API Key
- Completely free on the Google AI Studio free tier — no credit card required
- Free tier limits:
  - `gemini-2.0-flash` (context default): 1,500 requests/day
  - `gemini-2.5-flash` (context alternative): 500 requests/day
  - `gemini-2.5-pro` (heading normalization phase 1): 50 requests/day
For most documents, context enrichment uses ~20–30 batch requests. You can process 50+ documents per day on the free tier.
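The free-tier arithmetic above can be sketched as a quick back-of-envelope calculation. The 300-chunk figure is an illustrative assumption (a fairly large document), not a measured value; the batch size and daily limit are the defaults documented here:

```typescript
// Back-of-envelope free-tier math (illustrative assumptions, not guarantees)
const chunksPerDoc = 300       // assumed: a large document
const contextBatchSize = 10    // default --context-batch-size
const dailyRequestLimit = 1500 // gemini-2.0-flash free-tier requests/day

const batchCallsPerDoc = Math.ceil(chunksPerDoc / contextBatchSize) // 30 calls
const docsPerDay = Math.floor(dailyRequestLimit / batchCallsPerDoc) // 50 documents
```

Smaller documents need fewer batch calls, so real throughput is usually higher.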
# Basic — OCR + chunk, no context enrichment
npx rag-chunker process document.pdf -m YOUR_MISTRAL_KEY -o ./output
# With Anthropic-style context enrichment (recommended for RAG)
npx rag-chunker process document.pdf \
-m YOUR_MISTRAL_KEY \
-k YOUR_GEMINI_KEY \
--context-mode batch \
-o ./output
# With heading normalization (useful when OCR produces inconsistent heading levels)
npx rag-chunker process document.pdf \
-m YOUR_MISTRAL_KEY \
-k YOUR_GEMINI_KEY \
--heading-normalization \
-o ./output
# Using environment variables (recommended)
export MISTRAL_API_KEY=your_key
export GEMINI_API_KEY=your_key
npx rag-chunker process document.pdf -o ./output

After processing, ./output/ will contain:
output/
chunks.jsonl ← one chunk per line, ready for embedding
document.md ← full markdown with page markers
structure.json ← headings, tables, page count
manifest.json ← processing stats and metadata
Full pipeline: OCR → chunk → optional context enrichment → save.
rag-chunker process <pdf> [options]

| Option | Default | Description |
|---|---|---|
| `-o, --output <dir>` | — | Output directory. If omitted, results are not saved to disk. |
| `-m, --mistral-api-key <key>` | `MISTRAL_API_KEY` env | Mistral API key — primary OCR provider |
| `-k, --gemini-api-key <key>` | `GEMINI_API_KEY` env | Gemini API key — context enrichment, heading fix, fallback OCR |
| `--context-mode <mode>` | `none` | `none` \| `batch` \| `per-chunk` |
| `--context-model <model>` | `gemini-2.0-flash` | Gemini model for context summaries |
| `--context-batch-size <n>` | `10` | Chunks per batch in batch mode |
| `--max-chunk-tokens <n>` | `512` | Max tokens per chunk |
| `--min-chunk-tokens <n>` | `50` | Min tokens for a chunk to be emitted |
| `--overlap-tokens <n>` | `0` | Tokens prepended from the previous chunk |
| `--no-preserve-tables` | — | Do not keep tables in their own chunk |
| `--no-preserve-code` | — | Do not keep code blocks in their own chunk |
| `--heading-normalization` | — | Fix inconsistent heading levels (requires `--gemini-api-key`) |
| `--ocr-cache-path <path>` | `~/.rag-chunker/ocr-cache.json` | Custom OCR cache file path |
| `--ocr-cache-ttl <days>` | `7` | OCR cache TTL in days |
| `--no-ocr-cache` | — | Disable OCR caching |
| `--warn-large-chunk <n>` | `2000` | Warn when a table/code chunk exceeds N tokens |
| `--verbose` | — | Show verbose pipeline logs |
Large PDF handling: PDFs over 50 MB automatically trigger a confirmation prompt. If confirmed, the file is split into 40 MB batches, each batch is OCR'd sequentially, and results are merged — page numbers and heading hierarchy are preserved across the split.
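The split decision can be sketched as follows. The 50 MB threshold and 40 MB batch size come from the description above, but the function itself is an illustration, not the package's actual implementation:

```typescript
// Illustrative sketch of the auto-split decision (not the package's code).
// Thresholds are the documented defaults: prompt above 50 MB, 40 MB batches.
const MB = 1024 * 1024
const SPLIT_THRESHOLD = 50 * MB
const BATCH_SIZE = 40 * MB

function planBatches(pdfBytes: number): number {
  if (pdfBytes <= SPLIT_THRESHOLD) return 1 // processed as a single file
  return Math.ceil(pdfBytes / BATCH_SIZE)   // number of sequential OCR batches
}

planBatches(30 * MB)  // 1 — no split
planBatches(120 * MB) // 3 batches of at most 40 MB each
```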
Debug command. Runs OCR and prints each page's markdown to stdout.
rag-chunker ocr document.pdf -m YOUR_MISTRAL_KEY

| Option | Description |
|---|---|
| `-m, --mistral-api-key <key>` | Mistral API key |
| `-k, --gemini-api-key <key>` | Gemini API key (for Vision fallback) |
Debug command. Runs the AST chunker on a .md file and prints chunk boundaries. No API key needed.
rag-chunker chunk document.md --max-tokens 512

| Option | Default | Description |
|---|---|---|
| `--max-tokens <n>` | `512` | Max tokens per chunk |
| `--min-tokens <n>` | `50` | Min tokens per chunk |
| `--overlap-tokens <n>` | `0` | Overlap tokens from previous chunk |
| `--no-preserve-tables` | — | Do not isolate tables |
| `--no-preserve-code` | — | Do not isolate code blocks |
Reads an output directory and prints a summary of its manifest and structure.
rag-chunker inspect ./output

Manage the local OCR cache.
rag-chunker cache list
rag-chunker cache clear --expired # remove entries older than TTL
rag-chunker cache clear --all       # wipe entire cache

| Option | Description |
|---|---|
| `--cache-path <path>` | Custom cache file path |
| `--ttl <days>` | TTL for expired check (default: 7) |
Full pipeline. Returns a ProcessResult with chunks, markdown, structure, manifest, and a save() method.
import { process as processPdf } from '@msbayindir/rag-chunker'
// aliased so the import does not shadow Node's global `process` (process.env below)
const result = await processPdf('document.pdf', {
mistralApiKey: process.env.MISTRAL_API_KEY,
geminiApiKey: process.env.GEMINI_API_KEY,
contextMode: 'batch',
maxChunkTokens: 512,
})
// Save all output files
await result.save('./output')
// Or work with chunks directly
for (const chunk of result.chunks) {
console.log(chunk.content) // embed this
console.log(chunk.sectionPath) // breadcrumb path
console.log(chunk.pageNumber)
}

Convenience wrapper. Forces contextMode: 'none' and returns only Chunk[].
import { chunk } from '@msbayindir/rag-chunker'
const chunks = await chunk('document.pdf', {
mistralApiKey: process.env.MISTRAL_API_KEY,
maxChunkTokens: 512,
})

| Field | Type | Default | Description |
|---|---|---|---|
| `mistralApiKey` | `string` | — | Mistral API key — primary OCR |
| `geminiApiKey` | `string` | — | Gemini API key — context, heading fix, fallback OCR |
| `contextMode` | `'none' \| 'batch' \| 'per-chunk'` | `'none'` | Context enrichment mode |
| `contextModel` | `string` | `'gemini-2.0-flash'` | Gemini model for context summaries |
| `contextBatchSize` | `number` | `10` | Chunks per Gemini batch call |
| `contextConcurrency` | `number` | `2` | Max concurrent calls in per-chunk mode |
| `maxChunkTokens` | `number` | `512` | Max tokens per chunk |
| `minChunkTokens` | `number` | `50` | Min tokens per chunk |
| `overlapTokens` | `number` | `0` | Overlap tokens prepended from previous chunk |
| `preserveTables` | `boolean` | `true` | Keep tables in their own chunk |
| `preserveCodeBlocks` | `boolean` | `true` | Keep code blocks in their own chunk |
| `ocrCachePath` | `string \| false` | `~/.rag-chunker/ocr-cache.json` | Cache path. `false` to disable. |
| `ocrCacheTtlDays` | `number` | `7` | OCR cache TTL in days |
| `headingNormalization` | `boolean` | `false` | Fix OCR heading levels via Gemini |
| `headingFixPhase1Model` | `string` | `'gemini-2.5-pro'` | Model for structure discovery phase |
| `headingFixPhase2Model` | `string` | `'gemini-2.5-flash-preview-05-20'` | Model for correction phase |
| `warnLargeChunkTokens` | `number` | `2000` | Warn threshold for oversized preserved chunks |
| `embeddingProvider` | `IEmbeddingProvider` | — | Embedding provider (optional) |
| `logger` | `ILogger` | pino at INFO | Custom logger |
Built-in providers can optionally generate embeddings inline during the pipeline.
import {
createGeminiEmbeddingProvider,
createOpenAiEmbeddingProvider,
createNullEmbeddingProvider,
} from '@msbayindir/rag-chunker'
// Gemini — 1536 dimensions
const geminiProvider = createGeminiEmbeddingProvider({
apiKey: process.env.GEMINI_API_KEY!,
})
// OpenAI text-embedding-3-large — 3072 dimensions (requires: npm install openai)
const openaiProvider = createOpenAiEmbeddingProvider({
apiKey: process.env.OPENAI_API_KEY!,
})
// Null — returns empty vectors, useful for testing
const nullProvider = createNullEmbeddingProvider()
// `process` aliased to processPdf on import, so process.env resolves to Node's global
const result = await processPdf('document.pdf', {
mistralApiKey: process.env.MISTRAL_API_KEY,
embeddingProvider: openaiProvider,
})
// result.chunks[0].embedding → number[]

Note: Embedding is non-fatal. If the provider throws, the chunk is still returned with embedding: [].
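The non-fatal contract can be illustrated with a generic wrapper. This is a sketch, not the package's internal code, and `embed` stands in for whatever your provider exposes:

```typescript
// Sketch: a provider failure degrades to an empty vector instead of aborting
// the pipeline. `embed` is a hypothetical provider function.
async function embedSafely(
  embed: (text: string) => Promise<number[]>,
  text: string
): Promise<number[]> {
  try {
    return await embed(text)
  } catch {
    return [] // chunk is still emitted, just without an embedding
  }
}
```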
One JSON object per line. Each line is a Chunk:
Full document markdown with page markers:
<!-- page 1 -->
# Chapter 1
Content...
<!-- page 2 -->
## Section 1.1

Document structure extracted from markdown:
{
"headings": [
{ "level": 1, "text": "Chapter 1", "pageNumber": 1, "markdownLine": 3 }
],
"tables": [
{ "index": 0, "caption": "Table 1. Electrolyte values", "pageNumber": 5, "rowCount": 8, "columnCount": 3 }
],
"tableCount": 12,
"codeBlockCount": 0,
"pageCount": 192,
"totalTokens": 68432
}

Processing metadata:
{
"version": "3.0",
"processedAt": "2026-03-16T00:20:51.118Z",
"pdfHash": "0f2860b4...",
"ocrModel": "mistral-ocr-latest",
"contextModel": "gemini-2.0-flash",
"contextMode": "batch",
"chunkStats": {
"total": 183, "avgTokens": 378,
"minTokens": 50, "maxTokens": 2662,
"tableChunks": 25, "codeChunks": 0, "textChunks": 158, "mixedChunks": 0
},
"durationMs": 146689,
"ocrCacheHit": true,
"headingFix": null, // populated if --heading-normalization was used
"contextEnrichment": {
"model": "gemini-2.0-flash",
"chunksEnriched": 176,
"chunksSkipped": 7, // low-quality OCR artifacts skipped
"batchCalls": 19,
"durationMs": 146005,
"cacheUsed": true // whether Gemini CachedContent was used
}
}

When you split a document into chunks and embed them independently, each chunk loses its context. A chunk containing "It was founded in 1987 and has since expanded to 42 countries" provides no signal for a query about a specific organization — the surrounding context that identifies the subject is gone.
This is the core retrieval failure in most RAG systems.
In September 2024, Anthropic published Contextual Retrieval, showing that prepending a short, situating summary to each chunk before embedding significantly improves retrieval:
| Method | Retrieval Failure Reduction |
|---|---|
| Basic semantic embedding | baseline |
| Contextual embedding | 49% fewer failures |
| Contextual embedding + BM25 hybrid | 67% fewer failures |
The approach: for each chunk, generate a 1–2 sentence summary that situates it within the document — what section it belongs to, what topic it covers. Prepend this to the chunk content before embedding.
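The assembly step is simple enough to sketch directly. This is illustrative TypeScript, not this package's internals; field names mirror the Chunk output format, and the summary would come from an LLM call:

```typescript
// Sketch of the contextual-embedding assembly step (illustrative only)
interface ContextualChunk {
  rawContent: string     // pure markdown, no context
  contextSummary: string // 1–2 sentence situating summary from an LLM
  content: string        // what actually gets embedded
}

function withContext(rawContent: string, summary: string): ContextualChunk {
  return {
    rawContent,
    contextSummary: summary,
    // Anthropic's format: summary prepended before embedding
    content: `Context: ${summary}\n\n${rawContent}`,
  }
}

const enriched = withContext(
  '## Sodium Metabolism\n\nSodium is...',
  'This section covers sodium metabolism in the electrolyte chapter.'
)
```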
When --context-mode batch (or per-chunk) is used:
- The full document markdown is sent to Gemini as context
- For each chunk, a 1–2 sentence summary is generated in the same language as the document
- The contextSummary field is stored separately
- content is assembled as: "Context: <summary>\n\n<rawContent>"
This matches Anthropic's exact format.
| Your setup | Embed this field |
|---|---|
| `--context-mode batch` or `per-chunk` (recommended) | `content` — includes the context summary |
| `--context-mode none` (default) | `content` or `rawContent` — they are identical |
| Hybrid search: vector + BM25/keyword | content for vector index, rawContent for keyword index |
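For the hybrid row, one common way to combine the two ranked result lists is reciprocal rank fusion (RRF). This is not part of rag-chunker — a minimal sketch, assuming you already have chunkIds ranked by each index:

```typescript
// Merge a vector-ranked list and a BM25-ranked list with reciprocal rank
// fusion. k = 60 is the damping constant commonly used in the literature.
function rrfMerge(vectorRanked: string[], bm25Ranked: string[], k = 60): string[] {
  const scores = new Map<string, number>()
  const add = (ids: string[]) =>
    ids.forEach((id, rank) =>
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank + 1))
    )
  add(vectorRanked)
  add(bm25Ranked)
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id)
}
```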
// Reading chunks.jsonl and embedding
import { createReadStream } from 'fs'
import { createInterface } from 'readline'
const rl = createInterface({ input: createReadStream('./output/chunks.jsonl') })
for await (const line of rl) {
const chunk = JSON.parse(line)
const textToEmbed = chunk.content // always use content
// → send to your vector database
}

Modern embedding models (OpenAI text-embedding-3-large, Gemini embedding-001, Cohere embed-v3) are trained on web-scale data that includes markdown. They handle #, **, | and similar syntax gracefully — stripping is generally not necessary.
Research findings:
- Nussbaum et al. (2024) — Nomic Embed: Training a Reproducible Long Context Text Embedder — shows retrieval quality is robust to markdown formatting in general-purpose embedders
- Muennighoff et al. (2022) — MTEB: Massive Text Embedding Benchmark — the benchmark includes mixed-format text; top models perform well without preprocessing
- For table-heavy content, stripping `|` separators can marginally improve semantic similarity scores; for prose, the effect is negligible
This package does not strip markdown. If your downstream embedding model or use case requires clean text, strip it yourself before embedding — this is intentionally left to the caller:
// Optional: strip markdown before embedding (your responsibility)
function stripMarkdown(text: string): string {
return text
.replace(/#{1,6}\s+/g, '') // headings
.replace(/\*\*([^*]+)\*\*/g, '$1') // bold
.replace(/_([^_]+)_/g, '$1') // italic
.replace(/`([^`]+)`/g, '$1') // inline code
.replace(/\|/g, ' ') // table separators
.replace(/\s+/g, ' ').trim()
}
const textToEmbed = stripMarkdown(chunk.content)

Replace the default pino logger with your own:
import { process as processPdf } from '@msbayindir/rag-chunker'
// aliased so the import does not shadow Node's global `process`
const result = await processPdf('document.pdf', {
mistralApiKey: process.env.MISTRAL_API_KEY,
logger: {
debug: (msg, meta) => console.debug('[rag]', msg, meta),
info: (msg, meta) => console.info('[rag]', msg, meta),
warn: (msg, meta) => console.warn('[rag]', msg, meta),
error: (msg, meta) => console.error('[rag]', msg, meta),
}
})

// Disable caching entirely
const result = await processPdf('document.pdf', { // processPdf: aliased `process` import
mistralApiKey: process.env.MISTRAL_API_KEY,
ocrCachePath: false,
})
// Use a project-local cache instead of ~/.rag-chunker/
const result = await processPdf('document.pdf', {
mistralApiKey: process.env.MISTRAL_API_KEY,
ocrCachePath: './.rag-cache/ocr.json',
ocrCacheTtlDays: 30,
})

OCR pipelines often produce inconsistent heading levels — a section title becomes # Title on one page and ## Title on another, or numbered sections (1.1, 1.2) get assigned the wrong level.
The two-phase normalization process:
- Phase 1 (Gemini Pro): Sends the full document to discover structure — document type, main sections, numbering patterns
- Phase 2 (Gemini Flash): Sends the discovered structure + heading list (not the full document) and corrects each heading's level
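A deterministic toy version of the idea behind phase 2 — NOT the package's actual prompt-driven logic — is that once phase 1 has discovered a numbering pattern like 1.2.3, the heading level follows from the numbering depth:

```typescript
// Hypothetical illustration: derive a heading level from a numbering prefix.
// Real normalization is model-driven; this only shows the kind of rule involved.
function levelFromNumbering(headingText: string, baseLevel = 1): number | null {
  const m = headingText.match(/^(\d+(?:\.\d+)*)\s/)
  if (!m) return null // no numbering pattern — leave it to the model
  // depth of "2.3.1" is 3; offset by the level reserved for the document title
  return baseLevel + m[1].split('.').length
}

levelFromNumbering('2.3.1 Renal handling of sodium') // depth 3 → level 4
levelFromNumbering('Introduction')                   // null — no pattern
```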
Enable it when:
- The source PDF has complex, multi-level structure (textbooks, technical reports)
- OCR produces mismatched heading levels
- You need accurate sectionPath breadcrumbs in your chunks
rag-chunker process document.pdf \
-m YOUR_MISTRAL_KEY \
-k YOUR_GEMINI_KEY \
--heading-normalization \
-o ./output

Phase 1 uses gemini-2.5-pro (50 requests/day on free tier). For a 200-page document, this is 1 request. You can process up to 50 documents per day with heading normalization on the free tier.
import { readFileSync } from 'fs'
import { process as processPdf } from '@msbayindir/rag-chunker'
const pdfBuffer = readFileSync('document.pdf')
const result = await processPdf(pdfBuffer, {
mistralApiKey: process.env.MISTRAL_API_KEY,
})

| Setup | Approx. cost per 200-page document |
|---|---|
| OCR only (Mistral) | ~$0.05 |
| OCR + context enrichment (batch) | ~$0.15–0.22 |
| OCR + heading normalization | ~$0.06 |
| OCR + context + heading normalization | ~$0.23 |
| v2 equivalent (LLM-based OCR + chunking) | ~$0.95 |
- Node.js ≥ 20
- At least one of: MISTRAL_API_KEY or GEMINI_API_KEY
MIT
Chunk schema — each line of chunks.jsonl is one of these objects:

{
  "chunkId": "a3f7c2d1...",      // SHA-256(rawContent), first 32 hex chars — deterministic
  "index": 0,                    // 0-based position in chunk array
  "content": "Context: This section covers sodium metabolism in the electrolyte chapter.\n\n## Sodium Metabolism\n\nSodium is...",
  "rawContent": "## Sodium Metabolism\n\nSodium is...",  // pure markdown, no context
  "contextSummary": "This section covers sodium metabolism in the electrolyte chapter.",
  "tokenCount": 412,
  "contentType": "text",         // "text" | "table" | "code" | "mixed"
  "sectionPath": ["Electrolyte Disorders", "Sodium Metabolism"],
  "pageNumber": 14,              // 1-based, from OCR page markers
  "prevChunkId": "9f1a...",
  "nextChunkId": "c82b...",
  "mustPreserve": false,         // true for table/code chunks that can't be split
  "embedding": []                // number[] if embeddingProvider was configured
}