Issue with PDF type detection #798

pavel-denisov-fraunhofer · 2025-01-24T09:56:21Z

Bug

The type of some PDF files cannot be identified by filetype. As a result, such files are parsed as text or XML, and Docling crashes on the invalid UTF-8:

...
  File ".../lib/python3.12/site-packages/docling/datamodel/document.py", line 303, in _guess_format
    return _DocumentConversionInput._guess_from_content(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../lib/python3.12/site-packages/docling/datamodel/document.py", line 315, in _guess_from_content
    content_str = content.decode("utf-8")
                  ^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 194: invalid start byte

The file WMOOS6EVZZXGYSVQPP7UORUTNEUEBO3Q.pdf was extracted from Common Crawl; there are more cases like this.
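
To illustrate the crash, here is a minimal sketch of what happens when binary PDF content reaches a strict UTF-8 decode, plus one possible guard. The sample bytes and the `try_decode` helper are hypothetical and only for illustration; this is not Docling's actual code or fix.

```python
from typing import Optional

# Hypothetical sample: 0x9c is a UTF-8 continuation byte, so it cannot
# start a character and a strict decode rejects it.
SAMPLE = b"%PDF-1.4\n\x9c\x8d"


def try_decode(content: bytes) -> Optional[str]:
    """Return the decoded text, or None when the content is not valid UTF-8."""
    try:
        return content.decode("utf-8")
    except UnicodeDecodeError:
        # Binary content: the caller should not treat it as text or XML.
        return None
```

With a guard like this, a caller could treat a `None` result as "not a text format" instead of propagating the exception.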

python-magic identifies such PDF files, but I assume it is not used because of its distribution problems on Windows.

filetype issue: h2non/filetype.py#192.

Possible solutions:

Complicated: add python-magic as an optional dependency and use it as a fallback when filetype returns None and python-magic is available.
Simple: add the PDF extension here:

docling/docling/datamodel/document.py

Line 345 in 8543c22

def _mime_from_extension(ext):

and allow the user to force the type as PDF via the file extension, in case the type is already known.
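
Both proposals can be sketched as a single detection chain: try content-based detectors first, then fall back to the file extension. All names below (`EXTENSION_TO_MIME`, `header_detector`, `magic_detector`, `guess_mime`) are hypothetical and do not reflect Docling's internals. `header_detector` is only a stand-in for a content sniffer, and `magic_detector` assumes the python-magic package and its `magic.from_buffer(..., mime=True)` call.

```python
from pathlib import Path
from typing import Callable, Optional, Sequence

# Hypothetical extension table; only the PDF entry matters for this issue.
EXTENSION_TO_MIME = {".pdf": "application/pdf"}

Detector = Callable[[bytes], Optional[str]]


def header_detector(content: bytes) -> Optional[str]:
    """Stand-in content sniffer: recognises a PDF only by its leading header."""
    return "application/pdf" if content.startswith(b"%PDF-") else None


def magic_detector(content: bytes) -> Optional[str]:
    """Optional fallback: use python-magic only if it can be imported."""
    try:
        import magic  # provided by the python-magic package
    except ImportError:
        return None
    return magic.from_buffer(content, mime=True)


def guess_mime(
    content: bytes, filename: str, detectors: Sequence[Detector]
) -> Optional[str]:
    """Try each content detector in order, then fall back to the extension."""
    for detect in detectors:
        mime = detect(content)
        if mime is not None:
            return mime
    # No detector recognised the content: trust the file extension, if known.
    return EXTENSION_TO_MIME.get(Path(filename).suffix.lower())
```

For example, `guess_mime(data, "scan.pdf", [header_detector])` returns `"application/pdf"` even when the header check fails, because the extension is consulted last. Adding `magic_detector` to the list gives the "complicated" variant without making python-magic a hard dependency.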

Steps to reproduce

docling WMOOS6EVZZXGYSVQPP7UORUTNEUEBO3Q.pdf

Docling version

Docling version: 2.15.1
Docling Core version: 2.14.0
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

Python version

Python 3.12.8


pavel-denisov-fraunhofer added the bug label Jan 24, 2025

ceberam self-assigned this Jan 27, 2025

PeterStaar-IBM added the mimetype label Jan 28, 2025

ceberam mentioned this issue Jan 28, 2025 in: fix: use file extension if filetype fails with PDF #827 (merged, 3 tasks)

dolfim-ibm closed this as completed in #827 Jan 28, 2025
