Releases: NameetP/pdfmux
v1.6.4 — pdfmux audit CLI + OSS→cloud funnel line
Additive release. Two new things, no breaking changes, no defaults change. This is the OSS half of the Verified Extraction Manifest (VEM) standard — the audit command produces the comparison artifact that VEM 1.0 standardizes.
New: pdfmux audit
Diff your current extractor's output against pdfmux on the same PDFs:
pdfmux audit --against your_extractor_output.csv --on /path/to/pdfs
- Reads `--against` as CSV (`filename,text` or `file,content`) or JSON (`{filename: text}`)
- Computes per-document word-set Jaccard overlap between the two extractors
- Flags documents below `--overlap-threshold` (default `0.70`) OR `--confidence-threshold` (default `0.50`)
- Writes a 7-column CSV: `filename, my_extractor_chars, pdfmux_chars, jaccard_overlap, pdfmux_confidence, recommendation, error`
- Exit codes: `0` clean, `2` usage error, `3` anything flagged
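For readers unfamiliar with the metric, here is a minimal sketch of word-set Jaccard overlap and the flagging rule described above. The tokenization (lowercased `\w+` matches), the empty-text convention, and the function names are assumptions for illustration, not pdfmux's actual implementation.

```python
import re


def word_set(text: str) -> set:
    """Lowercase the text and return its set of distinct word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard_overlap(a: str, b: str) -> float:
    """Word-set Jaccard: |A & B| / |A | B|.

    Assumption: two empty texts are treated as identical (overlap 1.0).
    """
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def is_flagged(overlap: float, confidence: float,
               overlap_threshold: float = 0.70,
               confidence_threshold: float = 0.50) -> bool:
    """Flag a document when either value falls below its threshold."""
    return overlap < overlap_threshold or confidence < confidence_threshold
```

With this sketch, `jaccard_overlap("a b c", "b c d")` shares two of four distinct words, so a document with that overlap would be flagged at the default `0.70` threshold.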
The pitch: "diff our output against your current extractor on 100 of your own PDFs — if we agree on every document, you don't need us. If we disagree on more than 2%, those are the silent failures already in your pipeline."
New: OSS → cloud funnel line
A single dim line prints after a successful convert, pointing to the free tier (1,000 pages/mo) and the open VEM spec. Suppress with PDFMUX_NO_UPSELL=1. Skipped when stdout isn't a TTY (so piped output stays clean) and when writing to stdout via --output -.
Tests
tests/test_audit_cli.py — 8 new tests. Total: 678 passing (up from 670).
Install
pip install --upgrade pdfmux
Full changelog: https://github.com/NameetP/pdfmux/blob/main/CHANGELOG.md
v1.6.3 — audit correctness fix
Correctness patch
No defaults change. Every existing flag and CLI invocation behaves identically. Confidence numbers are now correct on documents where they were previously inflated.
The bug
`audit.compute_document_confidence` was returning 1.0 on documents with empty extractions. The function took a content-weighted average of the per-page `confidence` value the extractor wrote at yield time, which was always `1.0` with the comment "audit will reassess". The reassessment never happened, so:
- Blank pages → confidence 1.0
- HTML files renamed to `.pdf` → confidence 1.0
- Single-character bodies → confidence 1.0
- Image-only pages with no OCR → confidence 1.0
This is the bug behind the silent failures in our 433-PDF batch retro. `--strict --min-confidence 0.20` shipped in 1.6.1 could not catch the eleven silent failures because the audit didn't know the pages were empty.
The fix (5 lines in `audit.py`)
- Re-score every page with `score_page(p.text, p.image_count)` before averaging.
- Stop flooring per-page weight at `1`. Blank pages now register zero weight in the denominator.
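The shape of the fix can be sketched as follows. This is an illustration under stated assumptions, not the code in `audit.py`: the `Page` type, the stand-in `score_page` heuristic (200 characters counts as a full page), and the choice to return `0.0` when every page is blank are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    text: str
    image_count: int = 0


def score_page(text: str, image_count: int) -> float:
    """Hypothetical stand-in scorer: empty text scores 0.0, longer text more.

    The real score_page is not shown in the release notes; image_count is
    accepted only to mirror the call signature and is unused here.
    """
    chars = len(text.strip())
    return min(1.0, chars / 200)


def document_confidence(pages: List[Page]) -> float:
    """Content-weighted average of freshly re-scored pages."""
    # Weight is the page's character count with no floor at 1,
    # so blank pages contribute nothing to the denominator.
    weights = [len(p.text.strip()) for p in pages]
    total = sum(weights)
    if total == 0:
        return 0.0  # assumption: an all-blank document gets zero confidence
    weighted = sum(score_page(p.text, p.image_count) * w
                   for p, w in zip(pages, weights))
    return weighted / total
```

Under this sketch a document made only of blank pages scores `0.0` instead of `1.0`, which is what lets a `--min-confidence` gate see it.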
Also added: `eval/` calibration harness
The instrument that surfaced the bug. Self-contained, three composable steps:
```bash
python eval/build_fixtures.py # generates 50 labeled PDFs
python eval/run_eval.py # runs pdfmux on each
python eval/calibrate.py # ROC + threshold recommendations
```
The first calibration run produced precision flat at 0.683 across every threshold from 0.0 to 0.95 — the smoking gun. After the fix, threshold `0.75` produces precision `1.00`, recall `0.71` on the 50-fixture set.
Calibration headline
| Threshold | Precision | Recall | F1 |
|---|---|---|---|
| 0.50 | 0.821 | 0.821 | 0.821 |
| 0.75 | 1.000 | 0.714 | 0.833 |
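The F1 column is the harmonic mean of precision and recall, so the table can be checked by hand. A small sketch (the function name is ours, not part of the `eval/` harness):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows from the calibration table above: (threshold, precision, recall)
rows = [(0.50, 0.821, 0.821), (0.75, 1.000, 0.714)]
for threshold, precision, recall in rows:
    print(f"{threshold:.2f}: F1 = {f1_score(precision, recall):.3f}")
# prints:
# 0.50: F1 = 0.821
# 0.75: F1 = 0.833
```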
`0.75` is the recommended default for `--min-confidence` when 1.7 ships breaking-default-strict on `pdfmux convert` (target: 2026-05-08).
Tests
670 passing, 3 skipped. No regressions.
Install: `pip install -U pdfmux`
v1.6.2 — regression guards for real-world failure modes
Regression-guard release
No behavior changes. Adds 11 behavioral-contract tests for the real-world failure modes that prompted the 1.6.1 work — pinning correct behavior so it can't silently regress.
Test categories
- Truncated PDF streams: the four `pypdf: Stream has ended unexpectedly` cases from the original batch run. pdfmux must either recover (PyMuPDF's xref repair) or raise, never silently return empty.
- Non-ASCII filenames: CJK + full-width punctuation (`Coolsoft test reports(原版).pdf`). Both `extract_text` and `batch_extract` must accept these without shell-quoting issues.
- Arabic-only PDFs: the BiDi post-processor must not crash on RTL text.
- 0-byte files: must raise a named `PdfmuxError`, never silently return empty.
- HTML files renamed to `.pdf`: common when 'view as PDF' saves the page source. Must error cleanly OR return text without HTML markup.
- Missing files: must raise `FileError`, not a bare `FileNotFoundError`.
- Batch isolation: a bad file in `batch_extract` must yield an exception for that file without poisoning the rest of the batch.
Numbers
- 670 passing (was 659 in 1.6.1)
- 0 behavior changes
- 0 source code changes — tests-only release
Install: pip install -U pdfmux
v1.6.1 — strict mode, batch manifest, batch_extract API
Field-driven patch release
Triggered by a real-world 433-PDF batch run where the first invocation silently dropped 16 documents — the exact failure mode pdfmux's brand promises to prevent. Every change here is additive; no breaking defaults.
Surface signals that already exist
- `pdfmux convert --strict --min-confidence FLOAT`: exits with code 3 if any document confidence falls below the threshold. Use it in CI to fail loud instead of silent.
- stderr WARNING line for every document with confidence < 0.50, regardless of `--strict`. Visible in CI logs.
- `manifest.json` written at the end of every `convert <dir>` run. Per-document confidence, extractor used, OCR pages, cost, warnings, plus a summary breakdown.
- `pdfmux.batch_extract(paths, **kwargs)`: public Python API. Use this instead of shelling out to the CLI in a loop.
- `pdfmux doctor --check <dir>`: samples PDFs, classifies them, recommends missing extras. Catches "23% of your batch is scanned, install pdfmux[ocr]" before the batch.
- RapidOCR warnings translated to pdfmux-namespaced messages with file + page context.
Removed
- The ML heading classifier (`models/heading_classifier.pkl` + `ml_headings.py`). It needed sklearn (not a base dep), printed `Failed to load ML heading model` 24+ times per batch, and produced no measurable lift over the heuristic fallback. -250 LOC, no behavior change on real PDFs.
Fixed
- `pdfmux.__version__` was stale at `1.5.1`; now matches `pyproject`.
Docs
- README leads Python users with `batch_extract` for batch use cases.
- `pdfmux[ocr]` promoted from "optional extra" to recommended-default.
- New note: don't wrap pdfmux with your own pypdf fallback. PyMuPDF tolerates malformed PDFs that pypdf rejects.
Exit codes (now documented)
- `0`: success
- `1`: extraction or runtime error
- `2`: usage error (bad arguments, file not found)
- `3`: strict gate failed (at least one document below `--min-confidence`)
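A CI wrapper can branch on these codes. The sketch below is one way to do it: the flags and code meanings come from these notes, while the helper names, the argument order, and the `0.75` default are assumptions.

```python
import subprocess

# Exit-code meanings as documented in this release.
EXIT_MEANINGS = {
    0: "success",
    1: "extraction or runtime error",
    2: "usage error (bad arguments, file not found)",
    3: "strict gate failed (at least one document below --min-confidence)",
}


def describe_exit(code: int) -> str:
    """Translate a pdfmux exit code into a human-readable message."""
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")


def run_strict(path: str, min_confidence: float = 0.75) -> int:
    """Run a strict convert and return the exit code.

    Assumes the pdfmux CLI is on PATH; the argument order is illustrative.
    """
    cmd = ["pdfmux", "convert", path,
           "--strict", "--min-confidence", str(min_confidence)]
    result = subprocess.run(cmd, check=False)
    print(f"pdfmux exited {result.returncode}: {describe_exit(result.returncode)}")
    return result.returncode
```

Keeping `check=False` lets the caller distinguish a strict-gate failure (3) from a crash (1) instead of treating every non-zero code as the same exception.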
Tests: 659 passed, 3 skipped.
Install: pip install -U pdfmux
v1.6.0 — Gemma 4, smart cache, streaming, profiles, watch, estimate, diff
Highlights
Extraction backends
- Mistral OCR ($0.002/page) and Marker neural extractor as optional backends
- Gemma 4 27B IT as a vision provider via Gemini API (reuses `GEMINI_API_KEY`) with native Arabic OCR
Arabic / RTL
- BiDi post-processing (markdown-aware: preserves headings, lists, code fences)
- Arabic detection in the classifier; `arabic` page type wired into the routing matrix
Caching & streaming
- Smart result cache keyed by file hash + (quality, format, schema). 30d TTL, 1GB LRU.
- `pdfmux stream` and `extract_streaming` MCP tool: NDJSON events as pages complete
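To illustrate what "keyed by file hash + (quality, format, schema)" can look like, here is a minimal sketch. The use of SHA-256, the JSON serialization of the options, and the function name are assumptions; the release notes do not say how pdfmux builds its key.

```python
import hashlib
import json
from typing import Optional


def cache_key(pdf_bytes: bytes, quality: str, fmt: str,
              schema: Optional[dict] = None) -> str:
    """Derive a cache key from the file content plus extraction options."""
    # Hash the file content so a renamed or moved file still hits the cache.
    file_hash = hashlib.sha256(pdf_bytes).hexdigest()
    # Serialize options deterministically so dict ordering cannot change the key.
    options = json.dumps(
        {"quality": quality, "format": fmt, "schema": schema},
        sort_keys=True,
    )
    return hashlib.sha256(f"{file_hash}:{options}".encode("utf-8")).hexdigest()
```

Including the options in the key means the same PDF converted to a different format or against a different schema is stored as a separate entry rather than returning a stale result.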
DX
- `pdfmux profiles list/show/save/delete` and `--profile` flag (built-ins: invoices, receipts, papers, contracts, bulk-rag)
- `pdfmux watch <dir>`: auto-convert on change
- `pdfmux estimate`: predict cost before running
- `pdfmux diff a.pdf b.pdf`: extraction comparison
- Better error messages: `.user_message`, `.suggestion`, `.reproduce_cmd`
- `@with_retry` (exponential backoff, honors `Retry-After`) on every LLM provider
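As an illustration of the retry behavior named in the last bullet, here is a generic sketch of a decorator that backs off exponentially and prefers a server-supplied `Retry-After` delay. It is not pdfmux's `with_retry`: the decorator name, the `RateLimitError` type, and all parameters are hypothetical.

```python
import functools
import time
from typing import Optional


class RateLimitError(Exception):
    """Hypothetical error carrying the server's Retry-After value in seconds."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def retry_with_backoff(max_attempts: int = 4, base_delay: float = 0.5,
                       sleep=time.sleep):
    """Retry on RateLimitError, doubling the delay after each failed attempt."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except RateLimitError as exc:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the error
                    backoff = base_delay * (2 ** attempt)
                    # Honor Retry-After when the server provided one.
                    delay = exc.retry_after if exc.retry_after is not None else backoff
                    sleep(delay)
        return wrapper
    return decorator
```

The `sleep` parameter is injectable so the decorator can be exercised in tests without real waiting.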
Tests: 659 passing (up from 481).
Install: pip install -U pdfmux
v1.5.1
Metadata-only release to unblock publishing to the Official MCP Registry.
Added
- MCP Registry ownership marker (`mcp-name: io.github.NameetP/pdfmux`) in README so the PyPI package description matches the `server.json` declared name. Unblocks aggregator ingest across PulseMCP and other downstream MCP directories.
Notes
- No runtime, API, or benchmark changes vs. 1.5.0.
- PyPI publish is handled automatically by the trusted-publisher workflow on release.
Next step (founder)
After this release lands on PyPI (~2 min), run mcp-publisher publish against the committed server.json to register pdfmux on https://registry.modelcontextprotocol.io/.
v1.5.0 — Benchmark Score 0.905
What's New in v1.5.0
Benchmark Results
- 0.905 overall benchmark score on opendataloader-bench (200 docs)
- Up from 0.867 (v1.3.0) — a +4.4% improvement
- 100% confidence score across all documents
- 98 docs improved, only 3 regressed
Key Improvements
Image Table OCR (TEDS: 0.887 → 0.911, +2.7%)
- Integrated RapidOCR for tables embedded as images
- Smart filtering: 50% fill rate + 30% numeric cell thresholds to avoid false positives on charts
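The filtering rule can be pictured with a short sketch. The release notes only give the two thresholds, so the direction of the comparisons, the numeric-cell pattern, and measuring the numeric share over filled cells are assumptions made for illustration.

```python
import re
from typing import List

# Assumed notion of a numeric cell: optional sign, digits with commas,
# optional decimal part, optional trailing percent sign.
NUMERIC_CELL = re.compile(r"[-+]?[\d,]*\.?\d+%?")


def looks_like_table(rows: List[List[str]],
                     min_fill: float = 0.50,
                     min_numeric: float = 0.30) -> bool:
    """Accept an OCR'd grid only if it is dense enough and numeric enough."""
    cells = [cell.strip() for row in rows for cell in row]
    if not cells:
        return False
    filled = [cell for cell in cells if cell]
    if not filled:
        return False
    fill_rate = len(filled) / len(cells)
    numeric_rate = sum(
        1 for cell in filled if NUMERIC_CELL.fullmatch(cell)
    ) / len(filled)
    return fill_rate >= min_fill and numeric_rate >= min_numeric
```

A chart usually OCRs into a sparse grid of axis labels, which fails the fill-rate check, while a real data table passes both.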
ML Heading Classifier (MHS: 0.844 → 0.852, +0.9%)
- ML-based fallback for heading detection when heuristics fail
- Improved heading cleanup for cleaner document structure
Column-Aware Reading Order (NID: 0.910 → 0.920)
- A/B column reordering: detects multi-column pages, compares both orderings, picks the better one
- Safe by design — worst case is no-op (original text preserved)
- Conservative detection (200pt gap threshold) to avoid false positives
Install
pip install pdfmux==1.5.0
Full Changelog: v1.3.0...v1.5.0
v1.2.0 — Heading Detection + Benchmark #4
What's New in v1.2.0
Heading Detection (benchmark +7.7%)
Font-size-based heading detection that analyzes PyMuPDF font metadata to identify and inject markdown heading markers.
- Compares span font sizes to body text, maps to `#`/`##`/`###`
- Detects bold-at-same-size headings common in academic PDFs
- Promotes short bold-only lines to `###` as fallback
- Early exit when pymupdf4llm already detected headings
- Zero new dependencies, ~220 lines of pure Python
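The idea behind the list above can be sketched in a few lines. This is not the `headings` module: the span representation, the size-ratio cutoffs (1.6, 1.3, 1.1), and the 80-character limit for bold lines are all assumed values chosen for illustration.

```python
from collections import Counter
from typing import List, Tuple

Span = Tuple[str, float, bool]  # (text, font size in points, is bold)


def body_font_size(spans: List[Span]) -> float:
    """Take the font size that covers the most characters as body text."""
    chars_by_size = Counter()
    for text, size, _bold in spans:
        chars_by_size[round(size, 1)] += len(text)
    return chars_by_size.most_common(1)[0][0]


def heading_prefix(span: Span, body_size: float) -> str:
    """Map a span to a markdown heading marker, or '' for body text."""
    text, size, bold = span
    ratio = size / body_size
    if ratio >= 1.6:
        return "# "
    if ratio >= 1.3:
        return "## "
    if ratio >= 1.1:
        return "### "
    # Fallback: short bold line at body size, common in academic PDFs.
    if bold and len(text.strip()) <= 80:
        return "### "
    return ""
```

For example, on a page whose body text is 12pt, a 24pt span maps to `# ` and a short bold 12pt line maps to `### `.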
Borderless Table Fallback
Whitespace column detection for tables missed by find_tables():
- Detects consistent column positions across 3+ consecutive lines
- Validates: numeric column required, minimum 3 rows
- Returns `ExtractedTable` objects matching existing API
Benchmark Results
Tested on opendataloader-bench (200 real-world PDFs):
| Metric | v1.1.0 | v1.2.0 | Delta |
|---|---|---|---|
| Overall | 0.792 | 0.853 | +0.061 |
| MHS (headings) | 0.500 | 0.740 | +0.240 |
| NID (reading) | 0.911 | 0.911 | — |
| TEDS (tables) | 0.704 | 0.704 | — |
Leaderboard: #6 → #4 — ahead of opendataloader local (0.844) and mineru (0.831).
Developer
- New modules: `headings`, `table_fallback`
- 246 tests passing (21 new)
- Zero new dependencies
v1.1.0 — Structured Extraction
What's New in v1.1.0
Structured Extraction
- JSON output with `--format json`: tables extracted as structured data (headers + rows), key-value pairs auto-detected from `Label: Value` patterns common in bank statements, invoices, and forms
- Date normalization → ISO 8601 output. Handles `"28 Feb 2026"`, `"February 28, 2026"`, `DD/MM/YYYY`, and other common formats
- Amount normalization → parsed floats with currency detection (`AED`, `USD`, `EUR`, etc.) and debit/credit direction
- Rate normalization → percentage value + period (`monthly`/`annual`)
- Schema-guided extraction with `--schema`: fuzzy-match extracted data to your JSON schema, zero LLM cost
- New MCP tool: `extract_structured` for AI agent integration
- `convert_pdf` MCP tool now supports JSON format
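To make the normalization bullets concrete, here is a small sketch covering the three date formats named above and a simple amount parser. It is not the `normalize` module: the function names, the format list, and the three-letter currency pattern are assumptions, debit/credit direction is not handled, and month-name parsing relies on an English locale.

```python
import re
from datetime import datetime
from typing import Optional, Tuple

# The three example formats from the notes: "28 Feb 2026",
# "February 28, 2026", and DD/MM/YYYY.
DATE_FORMATS = ("%d %b %Y", "%B %d, %Y", "%d/%m/%Y")


def normalize_date(raw: str) -> Optional[str]:
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_amount(raw: str) -> Optional[Tuple[float, Optional[str]]]:
    """Return (value, currency code) from strings like 'AED 1,250.00'."""
    match = re.search(r"\b([A-Z]{3})\b", raw)
    currency = match.group(1) if match else None
    digits = re.sub(r"[^\d.\-]", "", raw)  # drop commas, codes, spaces
    try:
        return float(digits), currency
    except ValueError:
        return None


print(normalize_date("28 Feb 2026"))      # 2026-02-28
print(normalize_amount("AED 1,250.00"))   # (1250.0, 'AED')
```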
Bug Fixes
- Fixed `--stdout` JSON output: Rich console was word-wrapping long lines, breaking JSON validity for downstream consumers
- Control character sanitization: PDFs with embedded control characters no longer produce invalid output
Developer
- New modules: `kv_extract`, `normalize`, `schema`
- New types: `ExtractedTable`, `KeyValuePair`
- JSON schema version bumped to `1.1.0`: includes `tables`, `key_values`, and `structured` fields
- 225 tests passing, zero new dependencies
Usage
pip install --upgrade pdfmux
# Structured JSON with tables and key-values
pdfmux convert statement.pdf -f json
# Schema-guided extraction
pdfmux convert statement.pdf --schema bank-statement.schema.json
Full Changelog: v1.0.1...v1.1.0
v0.4.0 — Public Python API + Section-Aware Chunking
What's New
Public Python API
pdfmux is now a proper Python library. Three importable functions — no CLI required:
import pdfmux
text = pdfmux.extract_text("report.pdf")
data = pdfmux.extract_json("report.pdf")
chunks = pdfmux.load_llm_context("report.pdf")
Section-Aware Chunking
New load_llm_context() returns LLM-ready chunks split at heading boundaries, with per-chunk page tracking and token estimates. Designed for RAG pipelines and context windows.
chunks = pdfmux.load_llm_context("report.pdf")
for c in chunks:
    print(f"{c['title']}: {c['tokens']} tokens (pages {c['page_start']}-{c['page_end']})")
LLM Output Format
New --format llm CLI option outputs chunked JSON with {title, text, page_start, page_end, tokens, confidence} per section.
pdfmux report.pdf -f llm
Locked JSON Schema
JSON output now includes schema_version: "0.4.0" and ocr_pages field for downstream stability.
pdfmux analyze
Per-page extraction breakdown showing page type, quality, char count, confidence, and extractor used.
pdfmux analyze report.pdf
Install
pip install pdfmux
Full changelog: https://github.com/NameetP/pdfmux/blob/main/CHANGELOG.md