fix: Upgrade docling-parse to 1.1.1, safety checks for failed parse on pages (#45)

* Put safety-checks for failed parse of pages

Signed-off-by: Christoph Auer <[email protected]>

* Bump to docling-parse 1.1.1

Signed-off-by: Christoph Auer <[email protected]>

---------

Signed-off-by: Christoph Auer <[email protected]>
cau-git authored Aug 23, 2024
1 parent 1930f08 commit 7e84533
Showing 3 changed files with 39 additions and 28 deletions.
11 changes: 10 additions & 1 deletion docling/backend/docling_parse_backend.py
@@ -23,9 +23,15 @@ def __init__(
         self._ppage = page_obj

         parsed_page = parser.parse_pdf_from_key_on_page(document_hash, page_no)
-        self._dpage = parsed_page["pages"][0]
+
+        self._dpage = None
+        self.broken_page = "pages" not in parsed_page
+        if not self.broken_page:
+            self._dpage = parsed_page["pages"][0]

     def get_text_in_rect(self, bbox: BoundingBox) -> str:
+        if self.broken_page:
+            return ""
         # Find intersecting cells on the page
         text_piece = ""
         page_size = self.get_size()
@@ -60,6 +66,9 @@ def get_text_cells(self) -> Iterable[Cell]:
         cells = []
         cell_counter = 0

+        if self.broken_page:
+            return cells
+
         page_size = self.get_size()

         parser_width = self._dpage["width"]
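
The guard added in this file can be illustrated with a minimal, self-contained sketch. The class name `PageSketch`, the `"cells"`/`"text"` layout of the parsed page, and the empty dict used to simulate a failed parse are assumptions for illustration only; they are not taken from docling or docling-parse. Only the `"pages" not in parsed_page` check and the early returns mirror the diff.

```python
from typing import Any, Dict, List, Optional


class PageSketch:
    """Hypothetical stand-in for the page backend touched by this commit."""

    def __init__(self, parsed_page: Dict[str, Any]) -> None:
        # Mirror the diff: only index into "pages" when the key is present.
        self._dpage: Optional[Dict[str, Any]] = None
        self.broken_page = "pages" not in parsed_page
        if not self.broken_page:
            self._dpage = parsed_page["pages"][0]

    def get_text_in_rect(self) -> str:
        # A broken page yields empty text instead of raising.
        if self.broken_page:
            return ""
        # Assumed cell layout: a list of dicts with a "text" key.
        return " ".join(cell["text"] for cell in self._dpage["cells"])

    def get_text_cells(self) -> List[str]:
        cells: List[str] = []
        # A broken page yields an empty cell list.
        if self.broken_page:
            return cells
        cells.extend(cell["text"] for cell in self._dpage["cells"])
        return cells


ok = PageSketch({"pages": [{"cells": [{"text": "Hello"}, {"text": "world"}]}]})
broken = PageSketch({})  # assumed shape of a failed parse: no "pages" key
```

Note that a check of this form only covers a missing `"pages"` key. A result of `{"pages": []}` would still raise `IndexError` on `["pages"][0]`; whether the parser can return that shape is not stated in this commit.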
54 changes: 28 additions & 26 deletions poetry.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -32,7 +32,7 @@ pydantic-settings = "^2.3.0"
 huggingface_hub = ">=0.23,<1"
 requests = "^2.32.3"
 easyocr = "^1.7"
-docling-parse = "^1.0.0"
+docling-parse = "^1.1.1"
 certifi = ">=2024.7.4"
 rtree = "^1.3.0"
 scipy = "^1.14.1"
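
For readers unfamiliar with Poetry's caret constraints, the effect of moving from `^1.0.0` to `^1.1.1` can be sketched as follows. This is a simplified illustration, not Poetry's resolver: the helpers `parse`, `caret_bounds`, and `satisfies` are hypothetical, accept only plain numeric versions, and do not handle shortened zero-major forms such as `^0.0`.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def parse(v: str) -> Version:
    # Pad missing components with zeros, e.g. "1.7" -> (1, 7, 0).
    parts = [int(p) for p in v.split(".")]
    parts += [0] * (3 - len(parts))
    return (parts[0], parts[1], parts[2])


def caret_bounds(spec: str) -> Tuple[Version, Version]:
    """Return (inclusive lower, exclusive upper) for a spec like "^1.1.1"."""
    lower = parse(spec.lstrip("^"))
    major, minor, patch = lower
    # The caret allows changes that keep the left-most non-zero part fixed.
    if major > 0:
        upper = (major + 1, 0, 0)
    elif minor > 0:
        upper = (0, minor + 1, 0)
    else:
        upper = (0, 0, patch + 1)
    return lower, upper


def satisfies(version: str, spec: str) -> bool:
    lower, upper = caret_bounds(spec)
    return lower <= parse(version) < upper


old_ok = satisfies("1.0.0", "^1.0.0")    # allowed before this commit
new_ok = satisfies("1.0.0", "^1.1.1")    # rejected after this commit
still_ok = satisfies("1.1.1", "^1.1.1")  # the new minimum itself
```

Under this reading, the new constraint raises the minimum accepted docling-parse version to 1.1.1 while still excluding 2.0.0 and later.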
