-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Numbered headings in Word documents appear as list items #612
Comments
I think there is also a related issue where sometimes the first item of a list that is within a numbered heading section will go missing. If useful I can create a failing test for that too? |
@mattmalcher If you can provide us with failing tests that would be very helpful for checking, thanks. |
I have added two failing tests, with ground truths in a branch in a fork here: https://github.com/mattmalcher/docling/tree/issue_612_docx_numbered_headings For the issue with text going missing where numbered headings are involved: Actual (Markdown) |
I'm also running into this problem. It seems like Docling is not directly extracting the header data from word. Expected (doctags)
Actual
|
First off, thank you for docling! <3
A standard representation, maintaining context and hierarchy, for content across multiple formats, with an MIT licence is just super! Fan of features like the hybrid text chunker.
Bug
Lots of long technical documents use multilevel lists in word to have numbered sections.
These documents sometimes also include numbered paragraphs.
At the moment, in the word backend, docling checks to see if an item is a list item and handles that case separately, before checking to see if it is a heading.
see:
docling/docling/backend/msword_backend.py
Lines 244 to 297 in 3bb3bf5
So paras/tags which are both a list item and a heading just get treated as a list item. It would probably be more useful to treat them as a heading, and convert the list index into plaintext.
I have had a go at adding a failing unit test, by adding a modified copy of
unit_test_headers.docx
and the expected ground truths for this case in a fork here: a544360Have also attached the same example to this issue: unit_test_headers_numbered.docx
Current output:
Expected output:
Steps to reproduce
Parse a word document with numbered headings like: unit_test_headers_numbered.docx
Docling version
Python version
Python 3.12.3
The text was updated successfully, but these errors were encountered: