Conversion from Markdown to JSON fails with no clue #797

Closed
dmartinol opened this issue Jan 24, 2025 · 5 comments
Labels: bug (Something isn't working), markdown (issue related to markdown backend)

Comments

@dmartinol

Bug

Converting a Markdown file to JSON makes the command fail without any clear indication of the root cause.

Steps to reproduce

docling --from md --to json anatomy1.md -vv
...TRUNCATED...
DEBUG:docling.backend.md_backend:Some other element: <BlankLine children=[]>
DEBUG:docling.backend.md_backend: - Heading level 2, content: Clinical significance
DEBUG:docling.backend.md_backend:Some other element: <BlankLine children=[]>
DEBUG:docling.backend.md_backend: - Image with alt: Gross pathology of fresh hypertrophic tonsil. Top left: Surface facing the into the aerodigestive tract. Top right: Opposite surface (cauterized). Bottom: Cut sections., url: Gross_pathology_of_tonsil.jpg
INFO:docling.document_converter:Finished converting document anatomy1.md in 0.17 sec.
WARNING:docling.cli.main:Document /var/folders/3v/tdbf44sx53d590tl6ys7h2z40000gn/T/tmpbcdsxkl6/anatomy1.md failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
INFO:docling.cli.main:All documents were converted in 0.17 seconds.
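
Since the log gives no reason for the failure, one way to narrow it down is to look at the last backend DEBUG line emitted before the "failed to convert" warning. Below is a minimal sketch that does this, assuming the `LEVEL:logger:message` layout shown in the output above and that the verbose output has been captured into a string. The function name `last_debug_before_failure` is hypothetical and not part of docling.

```python
import re

# Patterns follow the log layout seen in the output above (an assumption
# about the format, not a documented docling interface).
FAILED = re.compile(
    r"^WARNING:docling\.cli\.main:Document (?P<path>.+) failed to convert\.$"
)
BACKEND_DEBUG = re.compile(
    r"^DEBUG:(?P<logger>docling\.backend\.[\w.]+):(?P<msg>.*)$"
)


def last_debug_before_failure(log_text):
    """Return (failed_path, last backend DEBUG message or None) pairs."""
    results = []
    last_debug = None
    for line in log_text.splitlines():
        debug = BACKEND_DEBUG.match(line)
        if debug:
            # Remember the most recent element the backend reported.
            last_debug = debug.group("msg").strip()
            continue
        failed = FAILED.match(line)
        if failed:
            results.append((failed.group("path"), last_debug))
            last_debug = None
    return results
```

Run on the output above, this points at the `Image with alt` line, which is the last element the Markdown backend logged before the failure. That is only a hint about where to look, not a confirmed cause.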

Docling version

% docling --version
Docling version: 2.15.1
Docling Core version: 2.15.1
Docling IBM Models version: 3.2.1
Docling Parse version: 3.1.1

I also got the same error with earlier versions, such as 2.8.3.

Python version

% python --version
Python 3.11.9
@dmartinol dmartinol added the bug Something isn't working label Jan 24, 2025
@ceberam ceberam self-assigned this Jan 27, 2025
@ceberam
Contributor

ceberam commented Jan 27, 2025

Thanks @dmartinol for sharing this issue.
Unfortunately, I haven't been able to reproduce it with the docling configuration and the file reported above.
Could you please share your operating system and its version?

In any case, inspecting the document anatomy1.md shows that it is in HTML rather than Markdown format, and this may confuse the backend converter.

For now I would recommend:

  • Update docling to the latest version
  • Rename the document from anatomy1.md to anatomy1.html
  • Run the conversion with docling --from html --to json anatomy1.html -vv

Please let us know whether the conversion finishes successfully.

@dmartinol
Author

BTW: I don't think it's HTML, as you said, but rather Markdown with an embedded HTML table.

Anyway, the same issue occurs after upgrading docling and renaming the file to anatomy1.html:

% docling --from html --to json anatomy1.html -vv

DEBUG:docling.backend.html_backend:About to init HTML backend...
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document anatomy1.html
DEBUG:docling.backend.html_backend:Trying to convert HTML...
INFO:docling.document_converter:Finished converting document anatomy1.html in 0.00 sec.
WARNING:docling.cli.main:Document /var/folders/3v/tdbf44sx53d590tl6ys7h2z40000gn/T/tmp9m3y05ag/anatomy1.html failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
INFO:docling.cli.main:All documents were converted in 0.01 seconds.

However, after upgrading docling from 2.15.1 to 2.16, the original issue seems to be fixed:

% docling --from md --to json anatomy1.md -v
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document anatomy1.md
INFO:docling.document_converter:Finished converting document anatomy1.md in 0.07 sec.
INFO:docling.cli.main:writing JSON output to anatomy1.json
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 0.07 seconds.

If I go back to 2.15.1, it fails as described before.

My proposal is not to close the issue, because there still seems to be an unresolved problem in communicating the root cause properly (e.g., in the HTML case, which still fails, what is the problem with the file?). WDYT?

@PeterStaar-IBM PeterStaar-IBM added the markdown issue related to markdown backend label Jan 28, 2025
@ceberam
Contributor

ceberam commented Jan 28, 2025

OK, I see the misunderstanding now. In the issue description, you linked anatomy1.md, and that link points to the GitHub-rendered HTML page of the file.
I guess you were referring to the raw file, which is indeed in Markdown format.
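
To avoid this mix-up when sharing a file, the rendered page URL can be turned into its raw counterpart. Here is a minimal sketch, assuming the usual `github.com/<owner>/<repo>/blob/<ref>/<path>` layout for rendered pages and `raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>` for raw content. The function name and the `OWNER/REPO` URL in the usage comment are placeholders for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url):
    """Map a GitHub rendered-file ("blob") URL to the raw file URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Expected shape: owner / repo / "blob" / ref / path...
    if (
        parts.netloc != "github.com"
        or len(segments) < 5
        or segments[2] != "blob"
    ):
        raise ValueError(f"not a GitHub blob URL: {url}")
    owner, repo, _blob, *ref_and_path = segments
    raw_path = "/" + "/".join([owner, repo, *ref_and_path])
    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, "", ""))


# Usage (placeholder repository):
# github_blob_to_raw("https://github.com/OWNER/REPO/blob/main/docs/file.md")
```

Downloading the result with `wget` or `curl` then yields the Markdown source instead of the HTML page.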

Using the raw file, I was able to reproduce the issue with version 2.15.1. Docling 2.15.1 fails on this file because of a bug in the handling of image elements, which was fixed in version 2.16.0, specifically in commit d5b2c07.

This should work:

$ pip install docling
$ wget https://raw.githubusercontent.com/luke-inglis/il-anatomy-knowledge/refs/heads/main/anatomy1.md
$ docling --from md --to json anatomy1.md -vv
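
To check up front whether an installed docling already contains the image fix from 2.16.0, the version can be compared programmatically. This is a minimal sketch using only the standard library. The helper names `parse_release` and `has_image_fix` are hypothetical, and the parsing deliberately ignores pre-release suffixes.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# First release containing the image-element fix, per the comment above.
FIXED_IN = (2, 16, 0)


def parse_release(text):
    """Turn '2.15.1' or '2.16' into a 3-tuple of ints (suffixes ignored)."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def has_image_fix(installed):
    """True if the given docling version string is 2.16.0 or newer."""
    return parse_release(installed) >= FIXED_IN


if __name__ == "__main__":
    try:
        installed = version("docling")
    except PackageNotFoundError:
        print("docling is not installed in this environment")
    else:
        state = "includes" if has_image_fix(installed) else "predates"
        print(f"docling {installed} {state} the 2.16.0 image fix")
```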

Please confirm that this issue can be closed.

Remark: docling supports tables in Markdown files written in the extended table syntax. Some lightweight markup languages, such as GitHub Flavored Markdown, also allow tables as HTML blocks. This is the case for the example file anatomy1.md in this issue. Currently, docling does not parse tables given as HTML blocks, but we plan to support them in upcoming releases.
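
Until then, a possible workaround is to rewrite simple HTML tables in the Markdown source before conversion. The sketch below is not part of docling. It assumes that the extended syntax meant here is the pipe-table form, that every `<tr>`, `<td>`, and `<th>` has an explicit closing tag, and that the table has no `colspan`, `rowspan`, or nested tables. The names `_TableCollector` and `html_table_to_pipe` are hypothetical.

```python
from html.parser import HTMLParser


class _TableCollector(HTMLParser):
    """Collect the cell text of one simple HTML table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace and escape pipes so cells stay on one line.
            text = " ".join("".join(self._cell).split())
            self._row.append(text.replace("|", "\\|"))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_pipe(html):
    """Render a simple HTML table as a pipe table; first row is the header."""
    collector = _TableCollector()
    collector.feed(html)
    collector.close()
    rows = [row for row in collector.rows if row]
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

Anything beyond flat tables (merged cells, inline markup that must be preserved) would need a real HTML library rather than this sketch.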

@dmartinol
Author

Sorry for sharing the wrong URL, my fault.
I understand your point now. If you want, we can close the issue, as I also confirmed it's fixed in the latest docling version.
A minor issue remains, because the error message was really unclear: don't you think we should also try to improve the diagnostics here?

@ceberam
Contributor

ceberam commented Jan 28, 2025

> Sorry for sharing the wrong URL, my fault.

No worries, the GitHub rendering was misleading 🙂

> A minor issue remains, because the error message was really unclear: don't you think we should also try to improve the diagnostics here?

Absolutely, this is an area we plan to improve.

@ceberam ceberam closed this as completed Jan 28, 2025