Issue with PDF type detection #798

Closed
pavel-denisov-fraunhofer opened this issue Jan 24, 2025 · 0 comments · Fixed by #827
Labels: bug (Something isn't working), mimetype

pavel-denisov-fraunhofer commented Jan 24, 2025

Bug

The type of some PDF files cannot be identified by filetype. As a result, such files are parsed as text or XML, and Docling crashes because the content is not valid UTF-8:

...
  File ".../lib/python3.12/site-packages/docling/datamodel/document.py", line 303, in _guess_format
    return _DocumentConversionInput._guess_from_content(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.12/site-packages/docling/datamodel/document.py", line 315, in _guess_from_content
    content_str = content.decode("utf-8")
                  ^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 194: invalid start byte
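
The failure mode in the traceback can be reproduced in isolation. The sketch below is an illustration only: decode_or_none is a hypothetical helper, not Docling's code and not necessarily how the issue was fixed, and the sample bytes are made up rather than taken from the reported file.

```python
from typing import Optional


def decode_or_none(content: bytes) -> Optional[str]:
    """Return the UTF-8 text, or None when the bytes are not valid UTF-8."""
    try:
        return content.decode("utf-8")
    except UnicodeDecodeError:
        # Binary data (for example a compressed PDF stream) ends up here
        # instead of crashing the caller.
        return None


# 0x9c is a UTF-8 continuation byte, so it cannot start a character.
binary_sample = b"x\x9c\x01"
text_sample = b"<html></html>"

print(decode_or_none(binary_sample))
print(decode_or_none(text_sample))
```

Guarding the decode would let format guessing report an unknown type for binary input instead of raising.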

The file WMOOS6EVZZXGYSVQPP7UORUTNEUEBO3Q.pdf was extracted from Common Crawl; there are more cases like this.

python-magic identifies such PDF files correctly, but I assume it is not used because of problems with distributing it on Windows.

filetype issue: h2non/filetype.py#192.

Possible solutions:

  • Complicated: add python-magic as an optional dependency and use it as a fallback when filetype returns None and python-magic is available.
  • Simple: add the PDF extension to _mime_from_extension(ext), so that the user can force the type to PDF via the file extension when the type is already known.
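
Both proposals can be sketched together as a chain of detectors with an extension fallback. This is a rough sketch under assumptions, not Docling's implementation: guess_mime, mime_from_extension, optional_magic_detector, and primary_returns_none are hypothetical names, the extension table contains only the PDF entry from the proposal, and the primary detector is a stand-in for a filetype call that returns None. The only third-party call used is python-magic's magic.from_buffer(data, mime=True), and it is imported lazily so the code runs without it.

```python
import importlib
from typing import Callable, List, Optional

Detector = Callable[[bytes], Optional[str]]

# Hypothetical extension table; only the PDF entry comes from the proposal.
_EXTENSION_MIMES = {"pdf": "application/pdf"}


def mime_from_extension(ext: str) -> Optional[str]:
    """Map a file extension (with or without the dot) to a MIME type."""
    return _EXTENSION_MIMES.get(ext.lower().lstrip("."))


def optional_magic_detector() -> Optional[Detector]:
    """Return a python-magic based detector, or None if it is unavailable."""
    try:
        magic = importlib.import_module("magic")
    except ImportError:  # package or libmagic missing
        return None
    return lambda data: magic.from_buffer(data, mime=True)


def guess_mime(content: bytes, ext: str, detectors: List[Detector]) -> Optional[str]:
    """Try each content detector in order, then fall back to the extension."""
    for detect in detectors:
        mime = detect(content)
        if mime is not None:
            return mime
    return mime_from_extension(ext)


def primary_returns_none(data: bytes) -> Optional[str]:
    """Stand-in for a primary detector that fails on the problematic files."""
    return None


print(guess_mime(b"x\x9c\x01", ".pdf", [primary_returns_none]))  # application/pdf
```

To enable the optional fallback, append the result of optional_magic_detector() to the detector list when it is not None. Placing the extension last means content-based detection still wins whenever it succeeds.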

Steps to reproduce

docling WMOOS6EVZZXGYSVQPP7UORUTNEUEBO3Q.pdf

Docling version

Docling version: 2.15.1
Docling Core version: 2.14.0
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

Python version

Python 3.12.8