Conversion from Markdown to JSON fails with no clue #797
Comments
Thanks @dmartinol for sharing this issue. Inspecting the document anatomy1.md, one can clearly see that it is not in Markdown format but in HTML, and this may confuse the backend converter. For now I would recommend:
Please let us know if the conversion finished successfully.
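One way to check the observation above before running a conversion is to sniff the file's content instead of trusting its extension. The sketch below is a heuristic built only on the Python standard library; `looks_like_html_page` and `sniff_file` are hypothetical helper names and are not part of docling.

```python
import re

# Heuristic only: a complete HTML page normally starts with a doctype
# declaration or an <html> tag (optionally preceded by whitespace).
_HTML_PAGE_RE = re.compile(r"^\s*(<!doctype\s+html|<html[\s>])", re.IGNORECASE)


def looks_like_html_page(text: str) -> bool:
    """Return True if the text appears to be a full HTML page (hypothetical helper)."""
    return bool(_HTML_PAGE_RE.match(text))


def sniff_file(path: str) -> bool:
    """Apply the heuristic to the first few KB of a file on disk."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return looks_like_html_page(fh.read(4096))
```

A Markdown file that merely embeds an HTML table does not start with a doctype or `<html>` tag, so this check tells the two cases apart.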
BTW: I don't think it's HTML as you said, but rather Markdown with an embedded HTML table. Anyway, the same issue happens after upgrading docling and renaming the file to html:
% docling --from html --to json anatomy1.html -vv
DEBUG:docling.backend.html_backend:About to init HTML backend...
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document anatomy1.html
DEBUG:docling.backend.html_backend:Trying to convert HTML...
INFO:docling.document_converter:Finished converting document anatomy1.html in 0.00 sec.
WARNING:docling.cli.main:Document /var/folders/3v/tdbf44sx53d590tl6ys7h2z40000gn/T/tmp9m3y05ag/anatomy1.html failed to convert.
INFO:docling.cli.main:Processed 1 docs, of which 1 failed
INFO:docling.cli.main:All documents were converted in 0.01 seconds.
Anyway, after upgrading the docling from
% docling --from md --to json anatomy1.md -v
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document anatomy1.md
INFO:docling.document_converter:Finished converting document anatomy1.md in 0.07 sec.
INFO:docling.cli.main:writing JSON output to anatomy1.json
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 0.07 seconds.
If I get back to
My proposal is not to close the issue, because there still seems to be an unresolved problem in properly communicating the root cause (e.g., in the HTML case, which still fails: what is the problem with the file?). WDYT?
OK, I see the misunderstanding now. In the issue description, you added the link anatomy1.md, which is the GitHub rendered HTML page of the file. Using the raw file I was able to reproduce the issue with version
This should work:
$ pip install docling
$ wget https://raw.githubusercontent.com/luke-inglis/il-anatomy-knowledge/refs/heads/main/anatomy1.md
$ docling --from md --to json anatomy1.md -vv
Please confirm that this issue can be closed.
Remark: docling supports tables in Markdown files that follow the extended table syntax. Some lightweight markup languages, like GitHub Flavored Markdown, also allow tables as HTML blocks. This is the case for the example file anatomy1.md you brought up in this issue. Currently, docling does not parse tables as HTML blocks, but we plan to do so in upcoming releases.
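To see ahead of time whether a Markdown file contains tables written as HTML blocks, as described in the remark above, a small scan of the text can help. This is only a rough sketch using the standard library: `count_html_table_blocks` is a hypothetical name, and the pattern does not try to skip fenced code blocks or follow the full HTML block rules.

```python
import re

# An opening <table> tag at the start of a line (up to three spaces of
# indentation), which is how an HTML block usually begins in Markdown.
_TABLE_OPEN_RE = re.compile(r"^[ ]{0,3}<table[\s>]", re.IGNORECASE | re.MULTILINE)


def count_html_table_blocks(markdown_text: str) -> int:
    """Count lines that open an HTML <table> block (hypothetical helper)."""
    return len(_TABLE_OPEN_RE.findall(markdown_text))
```

A non-zero count suggests that some tables in the file may not appear in the converted output until HTML blocks are supported.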
Sorry for sharing the wrong URL, my fault.
No worries, the GitHub rendering was misleading 🙂
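Since the confusion came from linking the rendered page rather than the raw file, a small helper can turn one URL form into the other. The sketch below assumes the common `https://github.com/<owner>/<repo>/blob/<ref>/<path>` layout for rendered pages; `github_blob_to_raw` is a hypothetical name, and refs that themselves contain slashes are simply passed through unchanged.

```python
from urllib.parse import urlsplit


def github_blob_to_raw(url: str) -> str:
    """Map a GitHub 'blob' page URL to its raw-content URL (hypothetical helper)."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected shape: <owner>/<repo>/blob/<ref>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a GitHub blob URL: {url}")
    owner, repo, _blob, *rest = segments
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])
```

Downloading the resulting URL gives the Markdown source itself, not the HTML page that GitHub renders around it.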
Absolutely, this is an area we plan to improve.
Bug
Converting a md to json results in a command failure without a clear indication of the root cause.
Steps to reproduce
Docling version
I got the same error also with previous versions, like 2.8.3.
Python version