Numbered headings in Word documents appear as list items #612

Open
mattmalcher opened this issue Dec 16, 2024 · 5 comments

mattmalcher commented Dec 16, 2024

First off, thank you for docling! <3

A standard representation that maintains context and hierarchy for content across multiple formats, under an MIT licence, is just super! I'm a fan of features like the hybrid text chunker.

Bug

Lots of long technical documents use multilevel lists in Word to produce numbered sections.

These documents sometimes also include numbered paragraphs.

At the moment, in the Word backend, docling checks whether an item is a list item and handles that case separately, before checking whether it is a heading.

see:

    if numid is not None and ilevel is not None:
        self.add_listitem(
            element,
            docx_obj,
            doc,
            p_style_id,
            p_level,
            numid,
            ilevel,
            text,
            is_numbered,
        )
        self.update_history(p_style_id, p_level, numid, ilevel)
        return
    elif numid is None and self.prev_numid() is not None:  # Close list
        for key, val in self.parents.items():
            if key >= self.level_at_new_list:
                self.parents[key] = None
        self.level = self.level_at_new_list - 1
        self.level_at_new_list = None
    if p_style_id in ["Title"]:
        for key, val in self.parents.items():
            self.parents[key] = None
        self.parents[0] = doc.add_text(
            parent=None, label=DocItemLabel.TITLE, text=text
        )
    elif "Heading" in p_style_id:
        self.add_header(element, docx_obj, doc, p_style_id, p_level, text)
    elif p_style_id in [
        "Paragraph",
        "Normal",
        "Subtitle",
        "Author",
        "DefaultText",
        "ListParagraph",
        "ListBullet",
        "Quote",
    ]:
        level = self.get_level()
        doc.add_text(
            label=DocItemLabel.PARAGRAPH, parent=self.parents[level - 1], text=text
        )
    else:
        # Text style names can, and will have, not only default values but user values too
        # hence we treat all other labels as pure text
        level = self.get_level()
        doc.add_text(
            label=DocItemLabel.PARAGRAPH, parent=self.parents[level - 1], text=text
        )
    self.update_history(p_style_id, p_level, numid, ilevel)
    return

So paragraphs that are both a list item and a heading are treated only as a list item, because the list-item branch returns before the heading check is reached. It would probably be more useful to treat them as headings and convert the list index into plain text.
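To make the suggestion concrete, here is a minimal standalone sketch of the reordered check. It does not use docling's classes: `Para`, `HeadingNumberer` and `to_markdown` are hypothetical names invented for illustration, and the numbering format ("1." at the top level, "1.2" below it) is an assumption. A real fix would need to read the format and start values from the document's numbering definitions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Para:
    """Hypothetical stand-in for the fields the backend reads per paragraph."""
    style_id: str
    text: str
    numid: Optional[int] = None   # numbering definition id, None if unnumbered
    ilevel: Optional[int] = None  # zero-based numbering level


class HeadingNumberer:
    """Keeps one counter per level and renders labels such as '1.' or '1.2'."""

    def __init__(self) -> None:
        self.counters: List[int] = []

    def next_label(self, ilevel: int) -> str:
        while len(self.counters) <= ilevel:
            self.counters.append(0)
        self.counters[ilevel] += 1
        del self.counters[ilevel + 1:]  # deeper levels restart
        label = ".".join(str(c) for c in self.counters)
        # Assumed format: trailing dot only on top-level labels.
        return label + "." if ilevel == 0 else label


def to_markdown(paras: List[Para]) -> str:
    numberer = HeadingNumberer()
    out: List[str] = []
    for p in paras:
        numbered = p.numid is not None and p.ilevel is not None
        if p.style_id == "Title":
            out.append(f"# {p.text}")
        elif "Heading" in p.style_id:
            # Heading check comes BEFORE the list-item check.
            prefix = (numberer.next_label(p.ilevel) + " ") if numbered else ""
            # "Heading1" -> depth 1 -> "##" (the title takes "#").
            depth = int(p.style_id.replace("Heading", "") or 1)
            out.append("#" * (depth + 1) + f" {prefix}{p.text}")
        elif numbered:
            out.append(f"- {p.text}")
        else:
            out.append(p.text)
    return "\n\n".join(out)
```

With a title, a numbered `Heading1` paragraph and two normal paragraphs, this produces `## 1. Section 1` for the heading rather than a list item.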

I have had a go at adding a failing unit test, by adding a modified copy of unit_test_headers.docx and the expected ground truths for this case in a fork here: a544360

Have also attached the same example to this issue: unit_test_headers_numbered.docx

Current output:

# Test Document

- Section 1

Paragraph 1.1

Paragraph 1.2

Expected output:

# Test Document
## 1. Section 1

Paragraph 1.1

Paragraph 1.2

Steps to reproduce

Parse a word document with numbered headings like: unit_test_headers_numbered.docx
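To check whether a given file actually exercises this case, one option is to look for paragraphs that carry both a heading style and direct numbering properties. The sketch below uses only the standard library. `numbered_headings` is a hypothetical helper, not part of docling, and it assumes the numbering is set on the paragraph itself (`w:numPr` inside `w:pPr`). Numbering inherited from the style definition in `styles.xml` would not be detected.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List, Tuple

# WordprocessingML main namespace used in word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def numbered_headings(docx_bytes: bytes) -> List[Tuple[str, str]]:
    """Return (style id, text) for paragraphs that are both heading and numbered."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    found = []
    for p in root.iter(f"{{{W}}}p"):
        style = p.find("w:pPr/w:pStyle", NS)
        num_pr = p.find("w:pPr/w:numPr", NS)
        if style is None or num_pr is None:
            continue  # not styled, or no direct numbering on this paragraph
        style_id = style.get(f"{{{W}}}val", "")
        if "Heading" in style_id:
            text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
            found.append((style_id, text))
    return found
```

For a file on disk, something like `numbered_headings(open("unit_test_headers_numbered.docx", "rb").read())` should list the headings affected.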

Docling version

Docling version: 2.12.0
Docling Core version: 2.9.0
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

Python version

Python 3.12.3

@mattmalcher mattmalcher added the bug Something isn't working label Dec 16, 2024
@mattmalcher
Author

I think there is also a related issue where sometimes the first item of a list that is within a numbered heading section will go missing.

If useful I can create a failing test for that too?

@cau-git
Contributor

cau-git commented Dec 18, 2024

@mattmalcher If you can provide us with failing tests that would be very helpful for checking, thanks.

@mattmalcher
Author

mattmalcher commented Dec 19, 2024

I have added two failing tests, with ground truths in a branch in a fork here: https://github.com/mattmalcher/docling/tree/issue_612_docx_numbered_headings

For the issue with text going missing where numbered headings are involved:

Original Document
image

Expected (Markdown)
image

Actual (Markdown)
Note that heading 1.2 here has gone altogether!
image
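A regression test for dropped text can be as simple as checking that every source paragraph still appears in the exported Markdown. The helper below is a hypothetical sketch (`missing_texts` is not a docling function) and uses plain substring matching, so it can miss a dropped "Section 1" when "Section 1.1" is still present.

```python
from typing import List


def missing_texts(source_texts: List[str], exported: str) -> List[str]:
    """Return the non-empty source paragraphs that do not occur in the export."""
    return [
        t.strip()
        for t in source_texts
        if t.strip() and t.strip() not in exported
    ]
```

A test can then assert that `missing_texts(...)` returns an empty list for the converted document.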

@asvintheguy

I'm also running into this problem. It seems like Docling is not extracting the heading data directly from Word.

Original Document
Image

Expected (doctags)

<section_header>1. Introduction</section_header>
<section_header>1.1 A Regulated Environment by Ensuring Good Laboratory Practices (GLP) Studies in Toxicology, Pathology, and Drug Development Against a Backdrop of Heterogeneous Technologies.</section_header>
<paragraph>Digital Toxicologic Histopathology has become a crucial aspect of the process used to establish drug safety [13], offering advanced methods to evaluate potential drug-induced toxicity, and enhancing patient safety. Deep Learning (DL) as part of Artificial intelligence (AI) applied to digital pathology is causing a revolution in the field, enabling automated analysis (AI/DL) and improved diagnostic accuracy, thus improving patient safety.  Moreover, Microsoft usurped a common term from aeronautics, a Co-Pilot, which is really an autonomous agent that performs background work to simplify tasks and improve insights when applied to information.  In Digital Histopathology we are both using AI/DL to score slides to match what a pathologist would do, but also to leverage Co-Pilot functions to guide them to slides of interest where abnormalities might form.  Here are some of the many possible uses of AI/DL in Toxicologic Pathology:</paragraph>
<paragraph>AI in Toxicologic Pathology: AI/DL has significantly contributed to advancing the implementation of toxicological pathology, which focuses on evaluating drug safety. These technologies have shown promise in automating toxicological assessments, potentially leading to more personalized medicine approaches [1]. While the focus of this paper is on toxicologic pathology in drug safety and development, the system and novel approaches are being designed to span broader sets of use cases.</paragraph>

Actual

<list_item>Introduction</list_item>
<paragraph></paragraph>
<list_item>A Regulated Environment by Ensuring Good Laboratory Practices (GLP) Studies in Toxicology, Pathology, and Drug Development Against a Backdrop of Heterogeneous Technologies.</list_item>
<paragraph>Digital Toxicologic Histopathology has become a crucial aspect of the process used to establish drug safety [13], offering advanced methods to evaluate potential drug-induced toxicity, and enhancing patient safety. Deep Learning (DL) as part of Artificial intelligence (AI) applied to digital pathology is causing a revolution in the field, enabling automated analysis (AI/DL) and improved diagnostic accuracy, thus improving patient safety.  Moreover, Microsoft usurped a common term from aeronautics, a Co-Pilot, which is really an autonomous agent that performs background work to simplify tasks and improve insights when applied to information.  In Digital Histopathology we are both using AI/DL to score slides to match what a pathologist would do, but also to leverage Co-Pilot functions to guide them to slides of interest where abnormalities might form.  Here are some of the many possible uses of AI/DL in Toxicologic Pathology:</paragraph>
<paragraph>AI in Toxicologic Pathology: AI/DL has significantly contributed to advancing the implementation of toxicological pathology, which focuses on evaluating drug safety. These technologies have shown promise in automating toxicological assessments, potentially leading to more personalized medicine approaches [1]. While the focus of this paper is on toxicologic pathology in drug safety and development, the system and novel approaches are being designed to span broader sets of use cases.</paragraph>

@MiguelAngelTorres

MiguelAngelTorres commented Jan 27, 2025

#795 is probably the same issue.

Edit: It is the same issue. As explained in #795, debugging shows that the label parsing produces a whitespace that breaks the logic. A possible solution is described there.
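The parsing code in question is not shown in this thread, so the following is only a hypothetical illustration of whitespace-tolerant label parsing, not the fix proposed in #795. `split_style_label` is an invented name. It splits a style label such as "Heading 1" or "Heading1" into a name and an optional level, ignoring stray spaces.

```python
import re
from typing import Optional, Tuple

# Letters, optional whitespace, optional digits; surrounding spaces are ignored.
_STYLE_RE = re.compile(r"^\s*([A-Za-z]+)\s*(\d+)?\s*$")


def split_style_label(label: str) -> Tuple[str, Optional[int]]:
    """Split 'Heading 1' / 'Heading1' into ('Heading', 1); no digits -> level None."""
    m = _STYLE_RE.match(label)
    if m is None:
        # Multi-word or unusual labels are returned trimmed, without a level.
        return label.strip(), None
    name, level = m.group(1), m.group(2)
    return name, int(level) if level is not None else None
```

Both spellings then resolve to the same name, so a later check against "Heading" does not depend on whether the label contains a space.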
