The type of some PDF files cannot be identified by filetype. As a result, such files are parsed as text or XML, and Docling crashes because of invalid UTF-8:
...
  File ".../lib/python3.12/site-packages/docling/datamodel/document.py", line 303, in _guess_format
    return _DocumentConversionInput._guess_from_content(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.12/site-packages/docling/datamodel/document.py", line 315, in _guess_from_content
    content_str = content.decode("utf-8")
                  ^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 194: invalid start byte
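The decode failure in the traceback can be illustrated without the original file. The snippet below is only a sketch: the byte string is a made-up stand-in for the start of a binary PDF (a header line followed by non-UTF-8 bytes), not the content of the file from this report, and the byte offset therefore differs from the one in the traceback.

```python
# Hypothetical stand-in for the first bytes of a binary PDF: a valid
# "%PDF-" header followed by a comment line with non-UTF-8 bytes.
content = b"%PDF-1.4\n%\x9c\xe2\xe3\n"

error = None
try:
    # Same call as in _guess_from_content; it assumes the input is text.
    content.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(f"decode failed at byte {exc.start}: {exc.reason}")
```

Byte 0x9c is a UTF-8 continuation byte, so it can never start a character, which is why the codec reports an invalid start byte.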
Bug
The file WMOOS6EVZZXGYSVQPP7UORUTNEUEBO3Q.pdf is extracted from Common Crawl; there are more cases like this.
python-magic identifies such PDF files, but I assume it is not used because of the problems with Windows distribution.
filetype issue: h2non/filetype.py#192.
Possible solutions:
Add python-magic as an optional dependency and use it as a fallback when filetype returns None and python-magic is available.
docling/docling/datamodel/document.py, line 345 in 8543c22
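A minimal sketch of the fallback idea, not Docling's actual code: the helper names guess_mime_with_fallback and load_detectors are hypothetical, and the sketch assumes that the magic module is the python-magic package (other packages share that import name but expose a different API).

```python
from typing import Callable, Optional, Tuple

# A detector takes raw bytes and returns a MIME type, or None if unknown.
Detector = Callable[[bytes], Optional[str]]


def guess_mime_with_fallback(
    data: bytes,
    primary: Detector,
    fallback: Optional[Detector] = None,
) -> Optional[str]:
    """Ask the primary detector first; use the fallback only on None."""
    mime = primary(data)
    if mime is None and fallback is not None:
        mime = fallback(data)
    return mime


def load_detectors() -> Tuple[Detector, Optional[Detector]]:
    """Hypothetical wiring: filetype is required, python-magic is optional."""
    import filetype

    try:
        # Optional dependency; the import fails if the package or the
        # underlying libmagic library is missing (e.g. on some Windows setups).
        import magic
    except ImportError:
        return filetype.guess_mime, None

    # Assumes python-magic, whose from_buffer accepts mime=True.
    return filetype.guess_mime, lambda data: magic.from_buffer(data, mime=True)
```

With this shape, primary, fallback = load_detectors() followed by guess_mime_with_fallback(data, primary, fallback) behaves exactly as before when python-magic is not installed, so the extra dependency stays optional.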
Steps to reproduce
Docling version
Python version