HTML-like tags in PDFs should be escaped in Markdown (and HTML!) output #764

Closed
dhdaines opened this issue Jan 16, 2025 · 0 comments · Fixed by DS4SD/docling-core#143
Labels: bug, docling-document

Comments

@dhdaines

When the text of a PDF contains <things> like <this>, they are passed through unescaped in the Markdown output ... and also in the HTML output! If they happen to be actual HTML tags, you get the actual HTML tags, which might not be what you want. If they are not, you get... something.

This can also cause odd problems in corner cases like the one in the attached document, where <snip> (not an HTML tag) is split across a line break and thus becomes <s nip>. An HTML parser reads that as an opening <s> tag with an attribute named nip, so the rest of the document is rendered in strikethrough. (The example here is contrived, but I have a real document that does this.)

testpdf.pdf

To reproduce, run:

docling testpdf.pdf
docling --to html testpdf.pdf
open testpdf.html

You will see:

[Image: screenshot of the rendered output]

I would expect the tags to come through as they appear in the original document, since it was not HTML... and of course no strikethrough :)
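
For illustration, here is a minimal sketch of the kind of escaping described above, using only Python's standard `html.escape`. The helper name `escape_text_for_export` is hypothetical and is not part of docling's API; the actual fix is in DS4SD/docling-core#143 and may work differently.

```python
import html


def escape_text_for_export(text: str) -> str:
    """Escape &, < and > so PDF text is never parsed as markup.

    Hypothetical helper for illustration only; not docling's API.
    """
    # quote=False leaves " and ' untouched, which is sufficient for
    # text content (as opposed to attribute values).
    return html.escape(text, quote=False)


# "<snip>" split across a line break, as in the attached PDF.
raw = "some text <s\nnip> more text"
print(escape_text_for_export(raw))
```

With the angle brackets turned into `&lt;` and `&gt;`, the split token can no longer be read as an opening `<s>` tag, so a browser shows it as literal text.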

Docling version

Docling version: 2.15.1
Docling Core version: 2.14.0
Docling IBM Models version: 3.1.2
Docling Parse version: 3.0.0

Python version

3.10.12

dhdaines added the bug label on Jan 16, 2025
dolfim-ibm self-assigned this on Jan 30, 2025
vagenas added and then removed the enhancement label on Jan 30, 2025