- Accept
None
formax_workers
- Update documentation with a meme for the younger generation
- Allow returning indirect object references from worker processes
- Remove excessive debug logging
- Add rendering matrix to
GlyphObject
- Fix ToUnicode CMaps for CID fonts
- Optimize text extraction
- Support slices and lists in
PageList.__getitem__
- Remove remaining dangerous
cast
usage - Make text extraction less Lazy so that we get graphics state correct (slightly breaking change)
- Correct the handling of marked content sections\
- Be robust to junk before the header
- Deliberately break the CLI (ZeroVer FTW YOLO ROTFL)
- Correct some type annotations (these were not really bugs)
- Handle more CMap and ToUnicode corner cases
- Add parallel operations
- Deprecate "eager" API
- Correct some problems on Windows/MacOS
- Fix various bugs in the lazy API
- Add specialized
__len__
methods to ContentObject classes - Clarify iteration over ContentObject
- Add specialized
- Fix installation of playa-pdf[crypto]
- Fix attribute classes in structure tree elements
- Deprecate "user" device space to avoid confusion with user space
- Parse arbitrary Encoding CMaps
- Update
pdfplumber
support - Add parser for object streams and iterator over all indirect objects in a document
- fix more embarrassing bugs largely regarding the creation of empty ContentObjects
- these are not actually all fixed because (surprise!) sometimes we neglect to map the characters in fonts correctly
- oh and also lots and lots of robustness fixes thanks to the pdf.js testsuite of pdf horrors
- release early and often
- fix some embarrassing bugs, again:
- CMap parser did not recognize bfrange correctly (regression)
- corner cases of inline images caused endless woe
- documentation said document.structtree exists but nope it didn't
- make everything quite a lot faster (25-35% faster than pdfminer now)
- fix some more pdfminer.six bugs and verify others already fixed
- really make sure not to return text objects with no text
- fix serious bug on malformed stream_length
- report actual bounding box for rotated glyphs
- eager API is no longer faster than pdfminer :( but it is more correct
- expose form XObjects on Page to allow getting only their contents
- expose form XObject IDs in LayoutDict
- make TextState conform to PDF spec (leading and line matrix) and document it
- expose more of TextState in LayoutDict (render mode in particular - OCRmyPDF)
- do not try to map characters with no ToUnicode and no Encoding (OCRmyPDF)
- properly support Pattern color space (uncolored tiling patterns) the way pdfplumber expects it to work
- support marked content points as ContentObjects
- document ContentObjects
- make a proper schema for LayoutDict, document it, and communicate it to Polars
- separate color values and patterns in LayoutDict