Bug
For some PDFs (see attached samples), the BaseOcrModel.get_ocr_rects() method returns wrong bounding boxes. When I look at the bounding box, it looks like a vertical strip crossing the text in the middle of the line (see attached screenshot from EasyOcrModel.__call__(), taken with high_res_image.show()).
As a result (on exo_pg1.pdf), the markdown output is garbage. On the other hand, when I created a very similar slide in Google Docs and exported it to PDF (see exo_synth.pdf), the document is parsed normally.
...
exo_pg1.pdf
exo_synth.pdf
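To flag this strip shape without eyeballing the image, a small heuristic along the following lines might help. This is only a sketch: Rect, is_vertical_strip, suspicious_rects, and both thresholds are hypothetical and not part of docling, and the objects actually returned by get_ocr_rects() would first need to be mapped onto this simple rectangle type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box in page coordinates (hypothetical stand-in, not docling's type)."""
    l: float
    t: float
    r: float
    b: float

    @property
    def width(self) -> float:
        return abs(self.r - self.l)

    @property
    def height(self) -> float:
        return abs(self.b - self.t)


def is_vertical_strip(rect: Rect, page_width: float,
                      max_width_frac: float = 0.05,
                      min_aspect: float = 3.0) -> bool:
    """Return True if rect is narrow relative to the page and much taller than wide.

    Both thresholds are arbitrary guesses and would need tuning on real pages.
    """
    if rect.width == 0:
        return True  # degenerate box with no horizontal extent
    narrow = rect.width <= max_width_frac * page_width
    tall = rect.height / rect.width >= min_aspect
    return narrow and tall


def suspicious_rects(rects: list[Rect], page_width: float) -> list[Rect]:
    """Filter the rects that look like vertical strips."""
    return [r for r in rects if is_vertical_strip(r, page_width)]
```

Running suspicious_rects over the rects of exo_pg1.pdf and exo_synth.pdf should make the difference between the two files visible, assuming the strip really is much narrower than a normal text region.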
Steps to reproduce
Docling version
docling==2.15.1
docling-core==2.15.1
docling-ibm-models==3.2.1
docling-parse==3.1.1
Python version
Python 3.12.8