Releases · NameetP/pdfmux

05 May 11:12

NameetP

v1.6.4

a160a1d

v1.6.4 — pdfmux audit CLI + OSS→cloud funnel line Latest


Additive release. Two new things, no breaking changes, no defaults change. This is the OSS half of the Verified Extraction Manifest (VEM) standard — the audit command produces the comparison artifact that VEM 1.0 standardizes.

New: `pdfmux audit`

Diff your current extractor's output against pdfmux on the same PDFs:

pdfmux audit --against your_extractor_output.csv --on /path/to/pdfs

Reads --against as CSV (filename,text or file,content) or JSON ({filename: text})
Computes per-document word-set Jaccard overlap between the two extractors
Flags documents below --overlap-threshold (default 0.70) OR --confidence-threshold (default 0.50)
Writes a 7-column CSV: filename, my_extractor_chars, pdfmux_chars, jaccard_overlap, pdfmux_confidence, recommendation, error
Exit codes: 0 clean, 2 usage error, 3 anything flagged
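
The overlap metric and flagging rule in the list above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and lowercasing; `word_set_jaccard` and `is_flagged` are hypothetical helper names, not pdfmux's actual functions, and the real tokenizer may differ.

```python
def word_set_jaccard(text_a: str, text_b: str) -> float:
    """Jaccard overlap of the two texts' word sets: |A & B| / |A | B|."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    if not union:
        # Assumption: two empty extractions count as full agreement.
        return 1.0
    return len(words_a & words_b) / len(union)


def is_flagged(overlap: float, confidence: float,
               overlap_threshold: float = 0.70,
               confidence_threshold: float = 0.50) -> bool:
    """Flag a document when either score falls below its threshold.

    The default thresholds are the ones stated in the release notes.
    """
    return overlap < overlap_threshold or confidence < confidence_threshold
```

Because the comparison works on word sets, it ignores word order and repetition, so it measures vocabulary agreement between two extractions rather than layout fidelity.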

The pitch: "diff our output against your current extractor on 100 of your own PDFs — if we agree on every document, you don't need us. If we disagree on more than 2%, those are the silent failures already in your pipeline."

New: OSS → cloud funnel line

A single dim line prints after a successful convert, pointing to the free tier (1,000 pages/mo) and the open VEM spec. Suppress with PDFMUX_NO_UPSELL=1. Skipped when stdout isn't a TTY (so piped output stays clean) and when writing to stdout via --output -.
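
The three skip conditions described above can be expressed as one predicate. This is a sketch under stated assumptions: `should_print_funnel_line` is a hypothetical name, and treating only the exact value `1` as the opt-out is an assumption, since the notes only show `PDFMUX_NO_UPSELL=1`.

```python
import os
import sys
from typing import Mapping, Optional


def should_print_funnel_line(
    output_target: Optional[str],
    env: Optional[Mapping[str, str]] = None,
    stdout_is_tty: Optional[bool] = None,
) -> bool:
    """Return True only when none of the documented skip conditions apply."""
    env = os.environ if env is None else env
    if stdout_is_tty is None:
        stdout_is_tty = sys.stdout.isatty()
    if env.get("PDFMUX_NO_UPSELL") == "1":
        return False  # explicit opt-out
    if not stdout_is_tty:
        return False  # piped output stays clean
    if output_target == "-":
        return False  # document text is being written to stdout
    return True
```

Passing `env` and `stdout_is_tty` explicitly keeps the predicate testable without touching the real environment or terminal.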

Tests

tests/test_audit_cli.py — 8 new tests. Total: 678 passing (up from 670).

Install

pip install --upgrade pdfmux

Full changelog: https://github.com/NameetP/pdfmux/blob/main/CHANGELOG.md


02 May 06:49

NameetP

v1.6.3

e1d5cb0

v1.6.3 — audit correctness fix

Correctness patch

No defaults change. Every existing flag and CLI invocation behaves identically. Confidence numbers are now correct on documents where they were previously inflated.

The bug

`audit.compute_document_confidence` was returning 1.0 on documents with empty extractions. The function computed a content-weighted average of the per-page `confidence` value the extractor wrote at yield time, which was always `1.0`, left there with the comment "audit will reassess". The reassessment never happened, so:

Blank pages → confidence 1.0
HTML files renamed to `.pdf` → confidence 1.0
Single-character bodies → confidence 1.0
Image-only pages with no OCR → confidence 1.0

This is the bug behind the silent failures in our 433-PDF batch retro. `--strict --min-confidence 0.20`, shipped in 1.6.1, could not catch the eleven silent failures because the audit didn't know the pages were empty.

The fix (5 lines in `audit.py`)

Re-score every page with `score_page(p.text, p.image_count)` before averaging.
Stop flooring per-page weight at `1`. Blank pages now register zero weight in the denominator.
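
To make the two changes concrete, here is a minimal sketch of a content-weighted average that re-scores each page and lets blank pages carry zero weight. It is not the code in `audit.py`: the `Page` class, the stand-in `score_page` body, the use of character count as the weight, and the `0.0` result for an entirely empty document are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    text: str
    image_count: int
    confidence: float = 1.0  # placeholder written at yield time


def score_page(text: str, image_count: int) -> float:
    """Hypothetical stand-in scorer: empty or single-character pages score 0.0."""
    return 0.0 if len(text.strip()) <= 1 else 1.0


def document_confidence(pages: List[Page]) -> float:
    weighted_sum = 0.0
    total_weight = 0
    for page in pages:
        # Re-score instead of trusting the placeholder confidence.
        score = score_page(page.text, page.image_count)
        # Assumed weight: character count, with no floor at 1.
        weight = len(page.text.strip())
        weighted_sum += score * weight
        total_weight += weight
    if total_weight == 0:
        return 0.0  # assumption: a document with no content is not trusted
    return weighted_sum / total_weight
```

With the placeholder ignored and no weight floor, a document made only of blank pages can no longer average out to 1.0.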

Also added: `eval/` calibration harness

The instrument that surfaced the bug. Self-contained, three composable steps:

```bash
python eval/build_fixtures.py # generates 50 labeled PDFs
python eval/run_eval.py # runs pdfmux on each
python eval/calibrate.py # ROC + threshold recommendations
```

The first calibration run produced precision flat at 0.683 across every threshold from 0.0 to 0.95 — the smoking gun. After the fix, threshold `0.75` produces precision `1.00`, recall `0.71` on the 50-fixture set.

Calibration headline

Threshold	Precision	Recall	F1
0.50	0.821	0.821	0.821
0.75	1.000	0.714	0.833

`0.75` is the recommended default for `--min-confidence` when 1.7 ships breaking-default-strict on `pdfmux convert` (target: 2026-05-08).
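
The F1 column in the calibration table is the harmonic mean of precision and recall. A short sketch for checking the published rows (`f1_score` is an illustrative helper, not part of the `eval/` harness):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows from the calibration table: (threshold, precision, recall).
rows = [(0.50, 0.821, 0.821), (0.75, 1.000, 0.714)]
for threshold, precision, recall in rows:
    print(f"{threshold:.2f}: F1 = {f1_score(precision, recall):.3f}")
```

Raising the threshold from 0.50 to 0.75 trades recall for precision, and the F1 values show that the trade is roughly neutral on this fixture set while removing all false positives.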

Tests

670 passing, 3 skipped. No regressions.

Install: `pip install -U pdfmux`


01 May 17:40

NameetP

v1.6.2

35de0c6

v1.6.2 — regression guards for real-world failure modes

Regression-guard release

No behavior changes. Adds 11 behavioral-contract tests for the real-world failure modes that prompted the 1.6.1 work — pinning correct behavior so it can't silently regress.

Test categories

Truncated PDF streams: the four `pypdf: Stream has ended unexpectedly` cases from the original batch run. pdfmux must either recover (PyMuPDF's xref repair) or raise, and must never silently return empty.
Non-ASCII filenames: CJK + full-width punctuation (`Coolsoft test reports（原版）.pdf`). Both `extract_text` and `batch_extract` must accept these without shell-quoting issues.
Arabic-only PDFs: the BiDi post-processor must not crash on RTL text.
0-byte files: must raise a named `PdfmuxError`, never silently return empty.
HTML files renamed to `.pdf`: common when 'view as PDF' saves the page source. Must error cleanly OR return text without HTML markup.
Missing files: must raise `FileError`, not a bare `FileNotFoundError`.
Batch isolation: a bad file in `batch_extract` must yield an exception for that file without poisoning the rest of the batch.
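
The batch-isolation contract in the last item can be illustrated with a generic pattern. This is a sketch of the idea only, not pdfmux's `batch_extract` implementation: `isolated_batch` and `_fake_extract` are hypothetical names, and the real return shape may differ.

```python
from typing import Callable, Iterable, Iterator, Tuple, Union


def isolated_batch(
    paths: Iterable[str],
    extract: Callable[[str], str],
) -> Iterator[Tuple[str, Union[str, Exception]]]:
    """Yield (path, text) on success and (path, exception) on failure."""
    for path in paths:
        try:
            yield path, extract(path)
        except Exception as exc:
            # The failure is reported for this file only; the loop continues.
            yield path, exc


def _fake_extract(path: str) -> str:
    # Stand-in extractor for the demo: fails on one specific name.
    if path == "bad.pdf":
        raise ValueError("simulated corrupt file")
    return f"text of {path}"


results = dict(isolated_batch(["a.pdf", "bad.pdf", "c.pdf"], _fake_extract))
```

Returning the exception as a value keeps the failure attached to its file, so the caller can report it without losing the results that came before or after.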

Numbers

670 passing (was 659 in 1.6.1)
0 behavior changes
0 source code changes — tests-only release

Install: pip install -U pdfmux


01 May 17:27

NameetP

v1.6.1

90f8795

v1.6.1 — strict mode, batch manifest, batch_extract API

Field-driven patch release

Triggered by a real-world 433-PDF batch run where the first invocation silently dropped 16 documents — the exact failure mode pdfmux's brand promises to prevent. Every change here is additive; no breaking defaults.

Surface signals that already exist

pdfmux convert --strict --min-confidence FLOAT — exits with code 3 if any document confidence falls below the threshold. Use it in CI to fail loud instead of silent.
stderr WARNING line for every document with confidence < 0.50, regardless of --strict. Visible in CI logs.
manifest.json written at the end of every convert <dir> run. Per-document confidence, extractor used, OCR pages, cost, warnings, plus a summary breakdown.
pdfmux.batch_extract(paths, **kwargs) — public Python API. Use this instead of shelling out to the CLI in a loop.
pdfmux doctor --check <dir> — samples PDFs, classifies them, recommends missing extras. Catches "23% of your batch is scanned, install pdfmux[ocr]" before the batch.
RapidOCR warnings translated to pdfmux-namespaced messages with file + page context.

Removed

The ML heading classifier (`models/heading_classifier.pkl` + `ml_headings.py`). It needed sklearn (not a base dep), printed `Failed to load ML heading model` 24+ times per batch, and produced no measurable lift over the heuristic fallback. -250 LOC, no behavior change on real PDFs.

Fixed

pdfmux.__version__ was stale at 1.5.1; now matches pyproject.

Docs

README leads Python users with batch_extract for batch use cases.
pdfmux[ocr] promoted from "optional extra" to recommended-default.
New note: don't wrap pdfmux with your own pypdf fallback — PyMuPDF tolerates malformed PDFs that pypdf rejects.

Exit codes (now documented)

0 — success
1 — extraction or runtime error
2 — usage error (bad arguments, file not found)
3 — strict gate failed (at least one document below --min-confidence)
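
A CI job can branch on these codes instead of parsing log output. The sketch below is illustrative: `describe_exit` and `run_strict_convert` are hypothetical helpers, and passing the directory positionally before the flags is an assumption about argument order that the notes do not spell out.

```python
import subprocess

# Exit codes as documented in the list above.
EXIT_MEANINGS = {
    0: "success",
    1: "extraction or runtime error",
    2: "usage error (bad arguments, file not found)",
    3: "strict gate failed (at least one document below --min-confidence)",
}


def describe_exit(code: int) -> str:
    """Translate a pdfmux exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, f"undocumented exit code {code}")


def run_strict_convert(pdf_dir: str, min_confidence: float) -> int:
    """Run the strict gate and return the exit code without raising."""
    # Assumption: the directory is passed positionally before the flags.
    cmd = ["pdfmux", "convert", pdf_dir,
           "--strict", "--min-confidence", str(min_confidence)]
    return subprocess.run(cmd, check=False).returncode
```

Keeping code 3 distinct from code 1 lets a pipeline treat "the tool crashed" and "the tool ran but some documents look unreliable" as different failures.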

Tests: 659 passed, 3 skipped.

Install: pip install -U pdfmux


30 Apr 18:54

NameetP

v1.6.0

b634712

v1.6.0 — Gemma 4, smart cache, streaming, profiles, watch, estimate, diff

Highlights

Extraction backends

Mistral OCR ($0.002/page) and Marker neural extractor as optional backends
Gemma 4 27B IT as a vision provider via the Gemini API (reuses GEMINI_API_KEY) with native Arabic OCR

Arabic / RTL

BiDi post-processing (markdown-aware: preserves headings, lists, code fences)
Arabic detection in the classifier; arabic page type wired into the routing matrix

Caching & streaming

Smart result cache keyed by file hash + (quality, format, schema). 30d TTL, 1GB LRU.
pdfmux stream and extract_streaming MCP tool — NDJSON events as pages complete
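
One way to build a cache key from a file hash plus (quality, format, schema) is sketched below. The hashing scheme, the function names, and the JSON canonicalization are assumptions for illustration; the notes state only which inputs the key depends on. The TTL and LRU eviction are not shown.

```python
import hashlib
import json
from typing import Optional


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash file contents in chunks so large PDFs are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_key(file_hash: str, quality: str, fmt: str,
              schema: Optional[dict] = None) -> str:
    """Combine the content hash with the options that change the output."""
    payload = json.dumps(
        {"file": file_hash, "quality": quality, "format": fmt, "schema": schema},
        sort_keys=True,  # key order in the schema must not change the cache key
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Keying on content rather than path means a renamed or copied file still hits the cache, while any change to the output options produces a different entry.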

pdfmux profiles list/show/save/delete and --profile flag (built-ins: invoices, receipts, papers, contracts, bulk-rag)
pdfmux watch <dir> — auto-convert on change
pdfmux estimate — predict cost before running
pdfmux diff a.pdf b.pdf — extraction comparison
Better error messages: .user_message, .suggestion, .reproduce_cmd
@with_retry (exponential backoff, honors Retry-After) on every LLM provider
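
The retry behavior in the last item (exponential backoff that defers to `Retry-After`) can be sketched as a decorator. This is not pdfmux's `@with_retry`: the parameters, the `RetryableError` class, and the injectable `sleep` are hypothetical choices made so the example is self-contained.

```python
import functools
import time


class RetryableError(Exception):
    """Hypothetical error carrying an optional Retry-After value in seconds."""

    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after


def with_retry(max_attempts=4, base_delay=0.5, sleep=time.sleep):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except RetryableError as exc:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the last error
                    backoff = base_delay * (2 ** attempt)
                    # Honor the server's Retry-After hint when it is present.
                    delay = exc.retry_after if exc.retry_after is not None else backoff
                    sleep(delay)
        return wrapper
    return decorator
```

Production implementations usually add random jitter and a maximum delay; both are omitted here to keep the control flow visible.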

Tests: 659 passing (up from 481).

Install: pip install -U pdfmux


16 Apr 19:37

NameetP

v1.5.1

266a657

v1.5.1

Metadata-only release to unblock publishing to the Official MCP Registry.

Added

MCP Registry ownership marker (mcp-name: io.github.NameetP/pdfmux) in README so the PyPI package description matches the server.json declared name. Unblocks aggregator ingest across PulseMCP and other downstream MCP directories.

Notes

No runtime, API, or benchmark changes vs. 1.5.0.
PyPI publish is handled automatically by the trusted-publisher workflow on release.

Next step (founder)

After this release lands on PyPI (~2 min), run mcp-publisher publish against the committed server.json to register pdfmux on https://registry.modelcontextprotocol.io/.


06 Apr 18:37

NameetP

v1.5.0

38a6449

v1.5.0 — Benchmark Score 0.905

What's New in v1.5.0

Benchmark Results

0.905 overall benchmark score on opendataloader-bench (200 docs)
Up from 0.867 (v1.3.0) — a +4.4% improvement
100% confidence score across all documents
98 docs improved, only 3 regressed

Key Improvements

Image Table OCR (TEDS: 0.887 → 0.911, +2.7%)

Integrated RapidOCR for tables embedded as images
Smart filtering: 50% fill rate + 30% numeric cell thresholds to avoid false positives on charts
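
A filter of this shape might look like the sketch below. How pdfmux actually measures fill rate and numeric share is not stated, so the definitions here (fill rate over all cells, numeric share over non-empty cells, and the regular expression for "numeric") are assumptions, as is the name `looks_like_table`.

```python
import re
from typing import List

# Assumed notion of a numeric cell: optional sign, digits, thousands
# separators, optional decimals, optional percent sign.
_NUMERIC = re.compile(r"[-+]?\d[\d,]*(\.\d+)?%?")


def looks_like_table(cells: List[List[str]],
                     min_fill: float = 0.50,
                     min_numeric: float = 0.30) -> bool:
    """Accept an OCR grid only if it is dense enough and numeric enough."""
    flat = [cell.strip() for row in cells for cell in row]
    if not flat:
        return False
    filled = [cell for cell in flat if cell]
    if len(filled) / len(flat) < min_fill:
        return False  # sparse grids are more likely chart labels
    numeric = sum(1 for cell in filled if _NUMERIC.fullmatch(cell))
    return numeric / len(filled) >= min_numeric
```

For example, a grid of axis labels scattered across mostly empty cells fails the fill check, while a dense grid of item names and prices passes both.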

ML Heading Classifier (MHS: 0.844 → 0.852, +0.9%)

ML-based fallback for heading detection when heuristics fail
Improved heading cleanup for cleaner document structure

Column-Aware Reading Order (NID: 0.910 → 0.920)

A/B column reordering: detects multi-column pages, compares both orderings, picks the better one
Safe by design — worst case is no-op (original text preserved)
Conservative detection (200pt gap threshold) to avoid false positives

Install

pip install pdfmux==1.5.0

Full Changelog: v1.3.0...v1.5.0


18 Mar 15:33

NameetP

v1.2.0

33ec2eb

v1.2.0 — Heading Detection + Benchmark #4

What's New in v1.2.0

Heading Detection (benchmark +7.7%)

Font-size-based heading detection that analyzes PyMuPDF font metadata to identify and inject markdown heading markers.

Compares span font sizes to body text — maps to #/##/###
Detects bold-at-same-size headings common in academic PDFs
Promotes short bold-only lines to ### as fallback
Early exit when pymupdf4llm already detected headings
Zero new dependencies, ~220 lines of pure Python
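
The rules in the list above can be sketched in a few lines. Everything specific here is an assumption: the ratio cut-offs, the use of the median as the body size, the eight-word limit for "short" lines, and the function names. The real module reads span metadata from PyMuPDF; this sketch takes pre-extracted `(text, font_size, is_bold)` tuples instead.

```python
from statistics import median
from typing import List, Optional, Tuple

Line = Tuple[str, float, bool]  # (text, font_size_pt, is_bold)


def heading_level(text: str, size: float, is_bold: bool,
                  body_size: float) -> Optional[int]:
    ratio = size / body_size
    # Hypothetical cut-offs; the release notes do not state the real ones.
    if ratio >= 1.6:
        return 1
    if ratio >= 1.3:
        return 2
    if ratio >= 1.1:
        return 3
    # Bold at body size: promote short lines to ### as a fallback.
    if is_bold and len(text.split()) <= 8:
        return 3
    return None


def inject_headings(lines: List[Line]) -> str:
    if not lines:
        return ""
    # Early exit when heading markers are already present.
    if any(text.lstrip().startswith("#") for text, _, _ in lines):
        return "\n".join(text for text, _, _ in lines)
    body_size = median(size for _, size, _ in lines)
    out = []
    for text, size, is_bold in lines:
        level = heading_level(text, size, is_bold, body_size)
        out.append(f"{'#' * level} {text}" if level else text)
    return "\n".join(out)
```

Comparing against a body-size baseline rather than absolute point sizes is what lets the same rules work across documents set in different fonts.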

Borderless Table Fallback

Whitespace column detection for tables missed by find_tables():

Detects consistent column positions across 3+ consecutive lines
Validates: numeric column required, minimum 3 rows
Returns ExtractedTable objects matching existing API

Benchmark Results

Tested on opendataloader-bench (200 real-world PDFs):

Metric	v1.1.0	v1.2.0	Delta
Overall	0.792	0.853	+0.061
MHS (headings)	0.500	0.740	+0.240
NID (reading)	0.911	0.911	—
TEDS (tables)	0.704	0.704	—

Leaderboard: #6 → #4 — ahead of opendataloader local (0.844) and mineru (0.831).

Developer

New modules: headings, table_fallback
246 tests passing (21 new)
Zero new dependencies


12 Mar 12:36

NameetP

v1.1.0

4af8e83

v1.1.0 — Structured Extraction

What's New in v1.1.0

Structured Extraction

JSON output with --format json — tables extracted as structured data (headers + rows), key-value pairs auto-detected from Label: Value patterns common in bank statements, invoices, and forms
Date normalization → ISO 8601 output. Handles "28 Feb 2026", "February 28, 2026", DD/MM/YYYY, and other common formats
Amount normalization → parsed floats with currency detection (AED, USD, EUR, etc.) and debit/credit direction
Rate normalization → percentage value + period (monthly/annual)
Schema-guided extraction with --schema — fuzzy-match extracted data to your JSON schema, zero LLM cost
New MCP tool: extract_structured for AI agent integration
convert_pdf MCP tool now supports JSON format
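
Date normalization of the kind described in the list above can be sketched with the standard library. This is not the `normalize` module's code: `normalize_date` and its format list are assumptions that cover only the three formats named in the notes, and month names are parsed in the process's current locale (English in the default C locale).

```python
from datetime import datetime
from typing import Optional

# Formats named in the release notes: "28 Feb 2026",
# "February 28, 2026", and DD/MM/YYYY.
_DATE_FORMATS = ("%d %b %Y", "%B %d, %Y", "%d/%m/%Y")


def normalize_date(raw: str) -> Optional[str]:
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if unrecognized."""
    candidate = raw.strip()
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(candidate, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    return None
```

Slash-separated dates are read day-first here, matching the DD/MM/YYYY format named above; a US-style MM/DD/YYYY input would be misread or rejected.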

Bug Fixes

Fixed --stdout JSON output — Rich console was word-wrapping long lines, breaking JSON validity for downstream consumers
Control character sanitization — PDFs with embedded control characters no longer produce invalid output

Developer

New modules: kv_extract, normalize, schema
New types: ExtractedTable, KeyValuePair
JSON schema version bumped to 1.1.0 — includes tables, key_values, and structured fields
225 tests passing, zero new dependencies

Usage

pip install --upgrade pdfmux

# Structured JSON with tables and key-values
pdfmux convert statement.pdf -f json

# Schema-guided extraction
pdfmux convert statement.pdf --schema bank-statement.schema.json

Full Changelog: v1.0.1...v1.1.0


04 Mar 11:16

NameetP

v0.4.0

602eaae

v0.4.0 — Public Python API + Section-Aware Chunking

What's New

Public Python API

pdfmux is now a proper Python library. Three importable functions — no CLI required:

import pdfmux

text = pdfmux.extract_text("report.pdf")
data = pdfmux.extract_json("report.pdf")
chunks = pdfmux.load_llm_context("report.pdf")

Section-Aware Chunking

New load_llm_context() returns LLM-ready chunks split at heading boundaries, with per-chunk page tracking and token estimates. Designed for RAG pipelines and context windows.

chunks = pdfmux.load_llm_context("report.pdf")
for c in chunks:
    print(f"{c['title']}: {c['tokens']} tokens (pages {c['page_start']}-{c['page_end']})")
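
For readers curious about what splitting at heading boundaries involves, here is a minimal sketch that operates on markdown text. It is not how `load_llm_context()` is implemented: `chunk_by_headings` is a hypothetical name, the four-characters-per-token estimate is an assumption, and page tracking is left out.

```python
import re
from typing import Dict, List

_HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown: str) -> List[Dict[str, object]]:
    """Split markdown into one chunk per heading-delimited section."""
    sections = [["", []]]  # [title, body_lines]; first entry is the preamble
    for line in markdown.splitlines():
        match = _HEADING.match(line)
        if match:
            sections.append([match.group(2).strip(), []])
        else:
            sections[-1][1].append(line)

    chunks = []
    for title, body_lines in sections:
        text = "\n".join(body_lines).strip()
        if title or text:
            # Assumed estimate: roughly four characters per token.
            chunks.append({"title": title, "text": text, "tokens": len(text) // 4})
    return chunks
```

Splitting at headings keeps each chunk topically coherent, which tends to matter more for retrieval quality than hitting an exact chunk size.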

LLM Output Format

New --format llm CLI option outputs chunked JSON with {title, text, page_start, page_end, tokens, confidence} per section.

pdfmux report.pdf -f llm

Locked JSON Schema

JSON output now includes schema_version: "0.4.0" and ocr_pages field for downstream stability.

`pdfmux analyze`

Per-page extraction breakdown showing page type, quality, char count, confidence, and extractor used.

pdfmux analyze report.pdf

Install

pip install pdfmux

Full changelog: https://github.com/NameetP/pdfmux/blob/main/CHANGELOG.md
