Releases: NameetP/pdfmux

v1.6.4 — pdfmux audit CLI + OSS→cloud funnel line

05 May 11:12

Additive release. Two new things, no breaking changes, no defaults change. This is the OSS half of the Verified Extraction Manifest (VEM) standard — the audit command produces the comparison artifact that VEM 1.0 standardizes.

New: pdfmux audit

Diff your current extractor's output against pdfmux on the same PDFs:

```bash
pdfmux audit --against your_extractor_output.csv --on /path/to/pdfs
```

  • Reads --against as CSV (filename,text or file,content) or JSON ({filename: text})
  • Computes per-document word-set Jaccard overlap between the two extractors
  • Flags documents below --overlap-threshold (default 0.70) OR --confidence-threshold (default 0.50)
  • Writes a 7-column CSV: filename, my_extractor_chars, pdfmux_chars, jaccard_overlap, pdfmux_confidence, recommendation, error
  • Exit codes: 0 clean, 2 usage error, 3 anything flagged
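
The overlap check in the list above can be sketched as follows. This is a minimal illustration, not pdfmux's implementation: the lowercase whitespace tokenization, the treatment of two empty extractions, and the names `word_jaccard` and `is_flagged` are all assumptions.

```python
def word_jaccard(text_a: str, text_b: str) -> float:
    """Word-set Jaccard overlap: |A intersect B| / |A union B|."""
    # Assumption: words are lowercase, whitespace-separated tokens.
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # assumption: two empty extractions count as agreeing
    return len(words_a & words_b) / len(words_a | words_b)


def is_flagged(overlap: float, confidence: float,
               overlap_threshold: float = 0.70,
               confidence_threshold: float = 0.50) -> bool:
    """Flag a document when either value falls below its threshold."""
    return overlap < overlap_threshold or confidence < confidence_threshold


overlap = word_jaccard("total due 100 AED", "total due 100 USD")
print(overlap, is_flagged(overlap, confidence=0.9))
```

The defaults mirror the documented `--overlap-threshold` and `--confidence-threshold` values; a document is flagged if it fails either check.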

The pitch: "diff our output against your current extractor on 100 of your own PDFs — if we agree on every document, you don't need us. If we disagree on more than 2%, those are the silent failures already in your pipeline."

New: OSS → cloud funnel line

A single dim line prints after a successful convert, pointing to the free tier (1,000 pages/mo) and the open VEM spec. Suppress with PDFMUX_NO_UPSELL=1. Skipped when stdout isn't a TTY (so piped output stays clean) and when writing to stdout via --output -.
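
The three suppression conditions above combine roughly as in this sketch. The function name and its parameters are hypothetical; only the `PDFMUX_NO_UPSELL` variable and the `--output -` convention come from the release notes.

```python
import os
import sys


def should_print_funnel_line(output_target, stream=sys.stdout, env=os.environ):
    """Return True only when none of the documented suppressions apply."""
    if env.get("PDFMUX_NO_UPSELL") == "1":
        return False  # explicit opt-out
    if output_target == "-":
        return False  # converted text is going to stdout
    return stream.isatty()  # skip when stdout is piped or redirected


print(should_print_funnel_line("out.md"))
```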

Tests

tests/test_audit_cli.py — 8 new tests. Total: 678 passing (up from 670).

Install

```bash
pip install --upgrade pdfmux
```

Full changelog: https://github.com/NameetP/pdfmux/blob/main/CHANGELOG.md

v1.6.3 — audit correctness fix

02 May 06:49

Correctness patch

No defaults change. Every existing flag and CLI invocation behaves identically. Confidence numbers are now correct on documents where they were previously inflated.

The bug

`audit.compute_document_confidence` was returning 1.0 on documents with empty extractions. The function took a content-weighted average of the per-page `confidence` value the extractor wrote at yield time, which was always `1.0` and carried the comment "audit will reassess". The reassessment never happened, so:

  • Blank pages → confidence 1.0
  • HTML files renamed to `.pdf` → confidence 1.0
  • Single-character bodies → confidence 1.0
  • Image-only pages with no OCR → confidence 1.0

This is the bug behind the silent failures in our 433-PDF batch retro. The `--strict --min-confidence 0.20` gate shipped in 1.6.1 could not catch the eleven silent failures because the audit didn't know the pages were empty.

The fix (5 lines in `audit.py`)

  • Re-score every page with `score_page(p.text, p.image_count)` before averaging.
  • Stop flooring per-page weight at `1`. Blank pages now register zero weight in the denominator.
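
A rough sketch of the corrected averaging, under stated assumptions: the `Page` dataclass, the stand-in `score_page` heuristic, the character-count weighting, and the 0.0 result for an all-blank document are illustrative and not taken from `audit.py`.

```python
from dataclasses import dataclass


@dataclass
class Page:
    text: str
    image_count: int = 0


def score_page(text: str, image_count: int) -> float:
    """Stand-in scorer (assumption): near-empty text scores 0.0, else 1.0."""
    return 0.0 if len(text.strip()) < 2 else 1.0


def compute_document_confidence(pages: list) -> float:
    """Content-weighted average of freshly re-scored page confidences."""
    weighted_sum = 0.0
    total_weight = 0
    for p in pages:
        # Re-score instead of trusting the 1.0 written at yield time.
        confidence = score_page(p.text, p.image_count)
        # No max(1, ...) floor: a blank page contributes zero weight.
        weight = len(p.text.strip())
        weighted_sum += confidence * weight
        total_weight += weight
    if total_weight == 0:
        return 0.0  # assumption: a document with no content is not confident
    return weighted_sum / total_weight


print(compute_document_confidence([Page(""), Page("")]))
```

With the old behavior, both blank pages would have carried confidence 1.0 and weight 1, so the document would have scored 1.0.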

Also added: `eval/` calibration harness

The instrument that surfaced the bug. Self-contained, three composable steps:

```bash
python eval/build_fixtures.py # generates 50 labeled PDFs
python eval/run_eval.py # runs pdfmux on each
python eval/calibrate.py # ROC + threshold recommendations
```

The first calibration run produced precision flat at 0.683 across every threshold from 0.0 to 0.95 — the smoking gun. After the fix, threshold `0.75` produces precision `1.00`, recall `0.71` on the 50-fixture set.

Calibration headline

| Threshold | Precision | Recall | F1 |
|---|---|---|---|
| 0.50 | 0.821 | 0.821 | 0.821 |
| 0.75 | 1.000 | 0.714 | 0.833 |
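
The F1 column follows from the standard definition, F1 = 2PR / (P + R). As a quick arithmetic check of the two rows (the function name is illustrative):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for threshold, precision, recall in [(0.50, 0.821, 0.821), (0.75, 1.000, 0.714)]:
    print(f"{threshold:.2f} -> F1 {f1_score(precision, recall):.3f}")
```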

`0.75` is the recommended default for `--min-confidence` when 1.7 ships breaking-default-strict on `pdfmux convert` (target: 2026-05-08).

Tests

670 passing, 3 skipped. No regressions.

Install: `pip install -U pdfmux`

v1.6.2 — regression guards for real-world failure modes

01 May 17:40

Regression-guard release

No behavior changes. Adds 11 behavioral-contract tests for the real-world failure modes that prompted the 1.6.1 work — pinning correct behavior so it can't silently regress.

Test categories

  • Truncated PDF streams: the four "pypdf: Stream has ended unexpectedly" cases from the original batch run. pdfmux must either recover (PyMuPDF's xref repair) or raise, and must never silently return empty.
  • Non-ASCII filenames — CJK + full-width punctuation (Coolsoft test reports(原版).pdf). Both extract_text and batch_extract must accept these without shell-quoting issues.
  • Arabic-only PDFs — the BiDi post-processor must not crash on RTL text.
  • 0-byte files — must raise a named PdfmuxError, never silently return empty.
  • HTML files renamed to .pdf — common when 'view as PDF' saves the page source. Must error cleanly OR return text without HTML markup.
  • Missing files — must raise FileError, not a bare FileNotFoundError.
  • Batch isolation — a bad file in batch_extract must yield an exception for that file without poisoning the rest of the batch.
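
The batch-isolation contract in the last bullet can be illustrated with a generic sketch. This is not `batch_extract` itself: `extract_isolated` and the injected `extract` callable are hypothetical stand-ins so the example runs without pdfmux installed.

```python
from typing import Callable, Dict, Iterable, Union


def extract_isolated(paths: Iterable[str],
                     extract: Callable[[str], str]) -> Dict[str, Union[str, Exception]]:
    """Run extract() per file; a failure is recorded for that file only."""
    results: Dict[str, Union[str, Exception]] = {}
    for path in paths:
        try:
            results[path] = extract(path)
        except Exception as exc:  # keep going; one bad file must not poison the batch
            results[path] = exc
    return results


def fake_extract(path: str) -> str:
    # Hypothetical extractor used only for this demonstration.
    if path == "empty.pdf":
        raise ValueError("0-byte file")
    return f"text of {path}"


results = extract_isolated(["a.pdf", "empty.pdf", "b.pdf"], fake_extract)
for path, outcome in results.items():
    print(path, "->", repr(outcome))
```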

Numbers

  • 670 passing (was 659 in 1.6.1)
  • 0 behavior changes
  • 0 source code changes — tests-only release

Install: pip install -U pdfmux

v1.6.1 — strict mode, batch manifest, batch_extract API

01 May 17:27

Field-driven patch release

Triggered by a real-world 433-PDF batch run where the first invocation silently dropped 16 documents — the exact failure mode pdfmux's brand promises to prevent. Every change here is additive; no breaking defaults.

Surface signals that already exist

  • pdfmux convert --strict --min-confidence FLOAT — exits with code 3 if any document confidence falls below the threshold. Use it in CI to fail loud instead of silent.
  • stderr WARNING line for every document with confidence < 0.50, regardless of --strict. Visible in CI logs.
  • manifest.json written at the end of every convert <dir> run. Per-document confidence, extractor used, OCR pages, cost, warnings, plus a summary breakdown.
  • pdfmux.batch_extract(paths, **kwargs) — public Python API. Use this instead of shelling out to the CLI in a loop.
  • pdfmux doctor --check <dir> — samples PDFs, classifies them, recommends missing extras. Catches "23% of your batch is scanned, install pdfmux[ocr]" before the batch.
  • RapidOCR warnings translated to pdfmux-namespaced messages with file + page context.

Removed

  • The ML heading classifier (models/heading_classifier.pkl + ml_headings.py). It needed sklearn (not a base dep), printed "Failed to load ML heading model" 24+ times per batch, and produced no measurable lift over the heuristic fallback. -250 LOC, no behavior change on real PDFs.

Fixed

  • pdfmux.__version__ was stale at 1.5.1; now matches pyproject.

Docs

  • README leads Python users with batch_extract for batch use cases.
  • pdfmux[ocr] promoted from "optional extra" to recommended-default.
  • New note: don't wrap pdfmux with your own pypdf fallback — PyMuPDF tolerates malformed PDFs that pypdf rejects.

Exit codes (now documented)

  • 0 — success
  • 1 — extraction or runtime error
  • 2 — usage error (bad arguments, file not found)
  • 3 — strict gate failed (at least one document below --min-confidence)
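
A CI wrapper around the strict gate might look like the sketch below. The exit-code meanings and the flags come from these notes; the helper names and the position of the target directory on the command line are assumptions.

```python
import subprocess

# Meanings as documented in the exit-code list above.
EXIT_MEANINGS = {
    0: "success",
    1: "extraction or runtime error",
    2: "usage error",
    3: "strict gate failed",
}


def build_strict_command(target: str, min_confidence: float) -> list:
    # Assumption: the target path follows the flags.
    return ["pdfmux", "convert", "--strict",
            "--min-confidence", str(min_confidence), target]


def describe_exit(code: int) -> str:
    return EXIT_MEANINGS.get(code, f"unknown exit code {code}")


def run_strict(target: str, min_confidence: float = 0.5) -> str:
    """Run the gate and translate the exit code (requires pdfmux on PATH)."""
    completed = subprocess.run(build_strict_command(target, min_confidence))
    return describe_exit(completed.returncode)


print(build_strict_command("docs/", 0.75))
```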

Tests: 659 passed, 3 skipped.

Install: pip install -U pdfmux

v1.6.0 — Gemma 4, smart cache, streaming, profiles, watch, estimate, diff

30 Apr 18:54

Highlights

Extraction backends

  • Mistral OCR ($0.002/page) and Marker neural extractor as optional backends
  • Gemma 4 27B IT as a vision provider via GeminiAPI (reuses GEMINI_API_KEY) with native Arabic OCR

Arabic / RTL

  • BiDi post-processing (markdown-aware: preserves headings, lists, code fences)
  • Arabic detection in the classifier; arabic page type wired into the routing matrix

Caching & streaming

  • Smart result cache keyed by file hash + (quality, format, schema). 30d TTL, 1GB LRU.
  • pdfmux stream and extract_streaming MCP tool — NDJSON events as pages complete
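
One way to build a cache key from a file hash plus (quality, format, schema), as the smart-cache bullet describes. The choice of SHA-256, the JSON encoding of the parameters, and the function name are assumptions; the notes do not say how pdfmux derives its keys.

```python
import hashlib
import json
from typing import Optional


def cache_key(file_bytes: bytes, quality: str, fmt: str,
              schema: Optional[str] = None) -> str:
    """Derive a stable key from file content and extraction parameters."""
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    # sort_keys keeps the parameter encoding deterministic.
    params = json.dumps({"quality": quality, "format": fmt, "schema": schema},
                        sort_keys=True)
    return hashlib.sha256(f"{file_hash}:{params}".encode("utf-8")).hexdigest()


key_md = cache_key(b"%PDF-1.7 example", "standard", "markdown")
key_json = cache_key(b"%PDF-1.7 example", "standard", "json")
print(key_md != key_json)
```

Keying on content rather than path means a renamed file still hits the cache, while any change to the extraction parameters misses it.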

DX

  • pdfmux profiles list/show/save/delete and --profile flag (built-ins: invoices, receipts, papers, contracts, bulk-rag)
  • pdfmux watch <dir> — auto-convert on change
  • pdfmux estimate — predict cost before running
  • pdfmux diff a.pdf b.pdf — extraction comparison
  • Better error messages: .user_message, .suggestion, .reproduce_cmd
  • @with_retry (exponential backoff, honors Retry-After) on every LLM provider
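
A minimal sketch of a retry decorator with exponential backoff that honors a server-supplied Retry-After value. It reuses the `with_retry` name from the bullet above but is not the pdfmux implementation: `RetryableError`, the parameters, and the defaults are hypothetical.

```python
import functools
import time


class RetryableError(Exception):
    """Hypothetical error type carrying an optional Retry-After (seconds)."""

    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after


def with_retry(max_attempts=4, base_delay=0.5, sleep=time.sleep):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except RetryableError as exc:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the error
                    if exc.retry_after is not None:
                        delay = exc.retry_after  # the server's hint wins
                    else:
                        delay = base_delay * 2 ** attempt  # 0.5, 1.0, 2.0, ...
                    sleep(delay)
        return wrapper
    return decorator
```

Injecting `sleep` keeps the decorator testable without real waiting.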

Tests: 659 passing (up from 481).

Install: pip install -U pdfmux

v1.5.1

16 Apr 19:37

Metadata-only release to unblock publishing to the Official MCP Registry.

Added

  • MCP Registry ownership marker (mcp-name: io.github.NameetP/pdfmux) in README so the PyPI package description matches the server.json declared name. Unblocks aggregator ingest across PulseMCP and other downstream MCP directories.

Notes

  • No runtime, API, or benchmark changes vs. 1.5.0.
  • PyPI publish is handled automatically by the trusted-publisher workflow on release.

Next step (founder)

After this release lands on PyPI (~2 min), run mcp-publisher publish against the committed server.json to register pdfmux on https://registry.modelcontextprotocol.io/.

v1.5.0 — Benchmark Score 0.905

06 Apr 18:37

What's New in v1.5.0

Benchmark Results

  • 0.905 overall benchmark score on opendataloader-bench (200 docs)
  • Up from 0.867 (v1.3.0) — a +4.4% improvement
  • 100% confidence score across all documents
  • 98 docs improved, only 3 regressed

Key Improvements

Image Table OCR (TEDS: 0.887 → 0.911, +2.7%)

  • Integrated RapidOCR for tables embedded as images
  • Smart filtering: 50% fill rate + 30% numeric cell thresholds to avoid false positives on charts
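
The fill-rate and numeric-cell filter could be approximated as below. This is a sketch under assumptions: both rates are computed over all cells, the comparisons are "at least", and the numeric test is a simple `float()` parse. The notes give only the two percentages.

```python
def _is_numeric(cell: str) -> bool:
    # Assumption: strip thousands separators and percent signs, then parse.
    cleaned = cell.replace(",", "").replace("%", "").strip()
    try:
        float(cleaned)
        return True
    except ValueError:
        return False


def looks_like_table(rows, min_fill=0.50, min_numeric=0.30) -> bool:
    """Accept an OCR'd grid only if enough cells are filled and numeric."""
    cells = [cell for row in rows for cell in row]
    if not cells:
        return False
    filled = [c for c in cells if c and c.strip()]
    fill_rate = len(filled) / len(cells)
    numeric_rate = sum(1 for c in filled if _is_numeric(c)) / len(cells)
    return fill_rate >= min_fill and numeric_rate >= min_numeric


table = [["Item", "Qty", "Price"], ["A", "2", "10.50"], ["B", "1", "1,200"]]
chart = [["Sales", "", ""], ["", "", "Q1"], ["", "", ""]]
print(looks_like_table(table), looks_like_table(chart))
```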

ML Heading Classifier (MHS: 0.844 → 0.852, +0.9%)

  • ML-based fallback for heading detection when heuristics fail
  • Improved heading cleanup for cleaner document structure

Column-Aware Reading Order (NID: 0.910 → 0.920)

  • A/B column reordering: detects multi-column pages, compares both orderings, picks the better one
  • Safe by design — worst case is no-op (original text preserved)
  • Conservative detection (200pt gap threshold) to avoid false positives

Install

```bash
pip install pdfmux==1.5.0
```

Full Changelog: v1.3.0...v1.5.0

v1.2.0 — Heading Detection + Benchmark #4

18 Mar 15:33

What's New in v1.2.0

Heading Detection (benchmark +7.7%)

Font-size-based heading detection that analyzes PyMuPDF font metadata to identify and inject markdown heading markers.

  • Compares span font sizes to body text — maps to #/##/###
  • Detects bold-at-same-size headings common in academic PDFs
  • Promotes short bold-only lines to ### as fallback
  • Early exit when pymupdf4llm already detected headings
  • Zero new dependencies, ~220 lines of pure Python
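
The font-size mapping might be sketched like this. The ratio cut-offs (1.6, 1.3, 1.1), the "most common size is body text" rule, and both function names are assumptions for illustration; the notes say only that span sizes are compared to body text and mapped to #/##/###.

```python
from collections import Counter


def body_font_size(span_sizes):
    """Assume the most frequent span size on the page is body text."""
    return Counter(span_sizes).most_common(1)[0][0]


def heading_prefix(span_size, body_size, is_bold=False):
    """Map a span's size relative to body text to a markdown heading marker."""
    ratio = span_size / body_size
    if ratio >= 1.6:
        return "# "
    if ratio >= 1.3:
        return "## "
    if ratio >= 1.1:
        return "### "
    if is_bold:
        return "### "  # bold at body size: treated as a low-level heading
    return ""


sizes = [11.0, 11.0, 11.0, 18.0, 14.0]
body = body_font_size(sizes)
print(repr(heading_prefix(18.0, body)), repr(heading_prefix(11.0, body)))
```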

Borderless Table Fallback

Whitespace column detection for tables missed by find_tables():

  • Detects consistent column positions across 3+ consecutive lines
  • Validates: numeric column required, minimum 3 rows
  • Returns ExtractedTable objects matching existing API

Benchmark Results

Tested on opendataloader-bench (200 real-world PDFs):

| Metric | v1.1.0 | v1.2.0 | Delta |
|---|---|---|---|
| Overall | 0.792 | 0.853 | +0.061 |
| MHS (headings) | 0.500 | 0.740 | +0.240 |
| NID (reading) | 0.911 | 0.911 | |
| TEDS (tables) | 0.704 | 0.704 | |

Leaderboard: #6 → #4, ahead of opendataloader local (0.844) and mineru (0.831).

Developer

  • New modules: headings, table_fallback
  • 246 tests passing (21 new)
  • Zero new dependencies

v1.1.0 — Structured Extraction

12 Mar 12:36

What's New in v1.1.0

Structured Extraction

  • JSON output with --format json — tables extracted as structured data (headers + rows), key-value pairs auto-detected from Label: Value patterns common in bank statements, invoices, and forms
  • Date normalization → ISO 8601 output. Handles "28 Feb 2026", "February 28, 2026", DD/MM/YYYY, and other common formats
  • Amount normalization → parsed floats with currency detection (AED, USD, EUR, etc.) and debit/credit direction
  • Rate normalization → percentage value + period (monthly/annual)
  • Schema-guided extraction with --schema — fuzzy-match extracted data to your JSON schema, zero LLM cost
  • New MCP tool: extract_structured for AI agent integration
  • convert_pdf MCP tool now supports JSON format
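
Date normalization to ISO 8601 for the three formats named above can be sketched with the standard library. The format list and the function name are illustrative; pdfmux's `normalize` module may handle more cases and work differently.

```python
from datetime import datetime
from typing import Optional

# Assumption: only the three formats cited in the notes are tried, in order.
_FORMATS = ("%d %b %Y", "%B %d, %Y", "%d/%m/%Y")


def normalize_date(raw: str) -> Optional[str]:
    """Return an ISO 8601 date string, or None if no format matches."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


for raw in ("28 Feb 2026", "February 28, 2026", "28/02/2026"):
    print(raw, "->", normalize_date(raw))
```

Treating slash-separated dates as DD/MM/YYYY matches the notes; a US-style MM/DD/YYYY input would be misread or rejected by this sketch.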

Bug Fixes

  • Fixed --stdout JSON output — Rich console was word-wrapping long lines, breaking JSON validity for downstream consumers
  • Control character sanitization — PDFs with embedded control characters no longer produce invalid output

Developer

  • New modules: kv_extract, normalize, schema
  • New types: ExtractedTable, KeyValuePair
  • JSON schema version bumped to 1.1.0 — includes tables, key_values, and structured fields
  • 225 tests passing, zero new dependencies

Usage

```bash
pip install --upgrade pdfmux

# Structured JSON with tables and key-values
pdfmux convert statement.pdf -f json

# Schema-guided extraction
pdfmux convert statement.pdf --schema bank-statement.schema.json
```

Full Changelog: v1.0.1...v1.1.0

v0.4.0 — Public Python API + Section-Aware Chunking

04 Mar 11:16

What's New

Public Python API

pdfmux is now a proper Python library. Three importable functions — no CLI required:

```python
import pdfmux

text = pdfmux.extract_text("report.pdf")
data = pdfmux.extract_json("report.pdf")
chunks = pdfmux.load_llm_context("report.pdf")
```

Section-Aware Chunking

New load_llm_context() returns LLM-ready chunks split at heading boundaries, with per-chunk page tracking and token estimates. Designed for RAG pipelines and context windows.

```python
chunks = pdfmux.load_llm_context("report.pdf")
for c in chunks:
    print(f"{c['title']}: {c['tokens']} tokens (pages {c['page_start']}-{c['page_end']})")
```
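
For a fixed context window, the per-chunk token estimates make it easy to take sections in order until a budget is reached. A small sketch using plain dicts with the same keys as above; `pack_chunks` is a hypothetical helper, not part of the pdfmux API.

```python
def pack_chunks(chunks, token_budget):
    """Take chunks in document order until the token budget would be exceeded."""
    selected, used = [], 0
    for chunk in chunks:
        if used + chunk["tokens"] > token_budget:
            break
        selected.append(chunk)
        used += chunk["tokens"]
    return selected


# Hand-written sample data in the documented chunk shape.
sample = [
    {"title": "Summary", "tokens": 300, "page_start": 1, "page_end": 1},
    {"title": "Methods", "tokens": 900, "page_start": 2, "page_end": 4},
    {"title": "Results", "tokens": 1200, "page_start": 5, "page_end": 9},
]
picked = pack_chunks(sample, token_budget=1500)
print([c["title"] for c in picked])
```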

LLM Output Format

New --format llm CLI option outputs chunked JSON with {title, text, page_start, page_end, tokens, confidence} per section.

```bash
pdfmux report.pdf -f llm
```

Locked JSON Schema

JSON output now includes schema_version: "0.4.0" and ocr_pages field for downstream stability.

pdfmux analyze

Per-page extraction breakdown showing page type, quality, char count, confidence, and extractor used.

```bash
pdfmux analyze report.pdf
```

Install

```bash
pip install pdfmux
```

Full changelog: https://github.com/NameetP/pdfmux/blob/main/CHANGELOG.md