Error parsing html file where there's no <body> tag #810

Closed
kkew3 opened this issue Jan 26, 2025 · 2 comments · Fixed by #818
Labels: bug (Something isn't working), html (issue related to html backend)

Comments

kkew3 commented Jan 26, 2025

Bug

When parsing this html file:

syllabus.html.txt

docling raises:

AttributeError: 'NoneType' object has no attribute 'find_all'

However, Chrome renders the same HTML file correctly.

Cause of the bug:

for br in self.soup.body.find_all("br"):

The code assumes the parsed document contains a <body> tag, but this file has none, so self.soup.body is None and the call to find_all fails.
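
To illustrate the cause, here is a minimal sketch, assuming BeautifulSoup with the "html.parser" backend (an assumption; other parsers such as lxml or html5lib may insert a missing <body> themselves). The helper name br_tags is hypothetical and this is not the fix that was merged for this issue; it only shows one way to guard against a missing <body> by falling back to the document root.

```python
from bs4 import BeautifulSoup


def br_tags(html: str):
    """Return all <br> tags, whether or not the markup has a <body>.

    Hypothetical helper for illustration only; not docling's code.
    """
    soup = BeautifulSoup(html, "html.parser")
    # With html.parser, soup.body is None when there is no <body> tag,
    # so fall back to searching the whole document.
    root = soup.body if soup.body is not None else soup
    return root.find_all("br")


no_body = "<p>line one<br>line two</p>"
with_body = "<html><body><p>a<br>b<br>c</p></body></html>"

# Accessing .body on markup without <body> yields None, which is why
# calling .find_all on it raises AttributeError.
print(BeautifulSoup(no_body, "html.parser").body)
print(len(br_tags(no_body)))
print(len(br_tags(with_body)))
```

The fallback keeps the same find_all call for both well-formed and body-less documents, so the rest of the parsing logic would not need to branch.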

Steps to reproduce

Following the demo:

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("syllabus.html")
print(result.document.export_to_markdown())

Docling version

Docling version: 2.15.1
Docling Core version: 2.15.1
Docling IBM Models version: 3.2.1
Docling Parse version: 3.1.1

Python version

Python 3.11.5
kkew3 added the bug label Jan 26, 2025
ceberam self-assigned this Jan 27, 2025
ceberam added the html label Jan 27, 2025

ceberam (Contributor) commented Jan 27, 2025

Thanks @kkew3 for reporting this issue.
While well-formed HTML documents are preferred, we understand that some tags, including body, are optional according to the HTML5 specification.
We will therefore fix this issue.

kkew3 (Author) commented Jan 27, 2025

Thank you so much!
