Commit

fix: parse HTML files without body tag
Parse HTML files without a 'body' tag, since the tag is optional in the HTML5 specification.

Signed-off-by: Cesar Berrospi Ramis <[email protected]>
ceberam committed Jan 27, 2025
1 parent 5332755 commit baf622f
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions docling/backend/html_backend.py
@@ -78,10 +78,11 @@ def convert(self) -> DoclingDocument:
 
         if self.is_valid():
             assert self.soup is not None
+            content = self.soup.body or self.soup
             # Replace <br> tags with newline characters
-            for br in self.soup.body.find_all("br"):
+            for br in content.find_all("br"):
                 br.replace_with("\n")
-            doc = self.walk(self.soup.body, doc)
+            doc = self.walk(content, doc)
         else:
             raise RuntimeError(
                 f"Cannot convert doc with {self.document_hash} because the backend failed to init."
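
The fallback in this change can be tried in isolation. The following is a minimal sketch, not the backend's actual code: it assumes the soup is built with BeautifulSoup's "html.parser" builder (which does not synthesize a missing body element, so `soup.body` can be `None`), and the helper name `extract_text` is hypothetical.

```python
from bs4 import BeautifulSoup


def extract_text(html: str) -> str:
    """Hypothetical helper mirroring the fallback used in this commit."""
    # Assumption: "html.parser" leaves soup.body as None when the input
    # has no <body> element, instead of inserting one.
    soup = BeautifulSoup(html, "html.parser")

    # Use <body> when present, otherwise fall back to the whole document.
    content = soup.body or soup

    # Replace <br> tags with newline characters, as the backend does.
    for br in content.find_all("br"):
        br.replace_with("\n")

    return content.get_text()


# One input with an explicit <body>, one without.
with_body = extract_text("<html><body><p>a<br>b</p></body></html>")
without_body = extract_text("<p>a<br>b</p>")

# With this parser, a body-less fragment has no .body attribute to walk.
body_missing = BeautifulSoup("<p>a<br>b</p>", "html.parser").body is None
```

Before the change, the body-less input would hit `None.find_all(...)` and raise an `AttributeError`; with the `or` fallback both inputs take the same path.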
