Commit

merged with main
Signed-off-by: Peter Staar <[email protected]>
PeterStaar-IBM committed Jan 24, 2025
2 parents 8a4d59f + e9768ae commit 745615c
Showing 65 changed files with 3,808 additions and 1,867 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/checks.yml
@@ -10,7 +10,7 @@ jobs:
steps:
- uses: actions/checkout@v4
- name: Install tesseract
run: sudo apt-get update && sudo apt-get install -y tesseract-ocr tesseract-ocr-eng tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa libleptonica-dev libtesseract-dev pkg-config
run: sudo apt-get update && sudo apt-get install -y tesseract-ocr tesseract-ocr-eng tesseract-ocr-fra tesseract-ocr-deu tesseract-ocr-spa tesseract-ocr-script-latn libleptonica-dev libtesseract-dev pkg-config
- name: Set TESSDATA_PREFIX
run: |
echo "TESSDATA_PREFIX=$(dpkg -L tesseract-ocr-eng | grep tessdata$)" >> "$GITHUB_ENV"
7 changes: 5 additions & 2 deletions .github/workflows/docs.yml
@@ -14,7 +14,10 @@ jobs:
- uses: ./.github/actions/setup-poetry
- name: Build docs
run: poetry run mkdocs build --verbose --clean
- name: Make docs LLM ready
if: inputs.deploy
uses: demodrive-ai/llms-txt-action@ad720693843126e6a73910a667d0eba37c1dea4b
- name: Build and push docs
if: inputs.deploy
run: poetry run mkdocs gh-deploy --force

run: poetry run mkdocs gh-deploy --force --dirty

33 changes: 33 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,36 @@
## [v2.15.1](https://github.com/DS4SD/docling/releases/tag/v2.15.1) - 2025-01-10

### Fix

* Improve OCR results, stricten criteria before dropping bitmap areas ([#719](https://github.com/DS4SD/docling/issues/719)) ([`5a060f2`](https://github.com/DS4SD/docling/commit/5a060f237d1decd0ff9db9e73478978419315778))
* Allow earlier requests versions ([#716](https://github.com/DS4SD/docling/issues/716)) ([`e64b5a2`](https://github.com/DS4SD/docling/commit/e64b5a2f628acc340a6d94ee6f1ada2aa267cecc))

### Documentation

* Add pointers to LangChain-side docs ([#718](https://github.com/DS4SD/docling/issues/718)) ([`9a6b5c8`](https://github.com/DS4SD/docling/commit/9a6b5c8c8debc81e0ddcbe91df6afbbeb29e97e6))
* Add LangChain docs ([#717](https://github.com/DS4SD/docling/issues/717)) ([`4fa8028`](https://github.com/DS4SD/docling/commit/4fa8028bd8120d7557e1d45ba31e200e130af698))

## [v2.15.0](https://github.com/DS4SD/docling/releases/tag/v2.15.0) - 2025-01-08

### Feature

* Added http header support for document converter and cli ([#642](https://github.com/DS4SD/docling/issues/642)) ([`0ee849e`](https://github.com/DS4SD/docling/commit/0ee849e8bc8cf24d1c5597af3fe20a7fa19a29e0))

### Fix

* Correct scaling of debug visualizations, tune OCR ([#700](https://github.com/DS4SD/docling/issues/700)) ([`5cb4cf6`](https://github.com/DS4SD/docling/commit/5cb4cf6f19f91e6c87141e93400c4b54b93aa5d7))
* Let BeautifulSoup detect the HTML encoding ([#695](https://github.com/DS4SD/docling/issues/695)) ([`42856fd`](https://github.com/DS4SD/docling/commit/42856fdf79559188ec4617bc5d3a007286f114d2))
* **mspowerpoint:** Handle invalid images in PowerPoint slides ([#650](https://github.com/DS4SD/docling/issues/650)) ([`d49650c`](https://github.com/DS4SD/docling/commit/d49650c54ffa60bc6d6106970e104071689bc7b0))

### Documentation

* Specify docstring types ([#702](https://github.com/DS4SD/docling/issues/702)) ([`ead396a`](https://github.com/DS4SD/docling/commit/ead396ab407f6bbd43176abd6ed2bed7ed8c7c43))
* Add link to rag with granite ([#698](https://github.com/DS4SD/docling/issues/698)) ([`6701f34`](https://github.com/DS4SD/docling/commit/6701f34c855992c52918b210c65a2edb1c827c01))
* Add integrations, revamp docs ([#693](https://github.com/DS4SD/docling/issues/693)) ([`2d24fae`](https://github.com/DS4SD/docling/commit/2d24faecd96bfa656b2b8c80f25cdf251a50526a))
* Add OpenContracts as an integration ([#679](https://github.com/DS4SD/docling/issues/679)) ([`569038d`](https://github.com/DS4SD/docling/commit/569038df4205703f87517ea58da7902d143e7699))
* Add Weaviate RAG recipe notebook ([#451](https://github.com/DS4SD/docling/issues/451)) ([`2b591f9`](https://github.com/DS4SD/docling/commit/2b591f98726ed0d883236dd0550201b95203eebb))
* Document Haystack & Vectara support ([#628](https://github.com/DS4SD/docling/issues/628)) ([`fc645ea`](https://github.com/DS4SD/docling/commit/fc645ea531ddc67959640b428007851d641c923e))

## [v2.14.0](https://github.com/DS4SD/docling/releases/tag/v2.14.0) - 2024-12-18

### Feature
3 changes: 1 addition & 2 deletions README.md
@@ -29,7 +29,7 @@ Docling parses documents and exports them to the desired format with ease and sp
* 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
* 📑 Advanced PDF document understanding including page layout, reading order & table structures
* 🧩 Unified, expressive [DoclingDocument](https://ds4sd.github.io/docling/concepts/docling_document/) representation format
* 🤖 Easy integration with 🦙 LlamaIndex & 🦜🔗 LangChain for powerful RAG / QA applications
* 🤖 Plug-and-play [integrations](https://ds4sd.github.io/docling/integrations/) incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
* 🔍 OCR support for scanned PDFs
* 💻 Simple and convenient CLI

@@ -39,7 +39,6 @@ Explore the [documentation](https://ds4sd.github.io/docling/) to discover plenty

* ♾️ Equation & code extraction
* 📝 Metadata extraction, including title, authors, references & language
* 🦜🔗 Native LangChain extension

## Installation

1 change: 0 additions & 1 deletion docling/backend/asciidoc_backend.py
@@ -24,7 +24,6 @@


class AsciiDocBackend(DeclarativeDocumentBackend):

def __init__(self, in_doc: InputDocument, path_or_stream: Union[BytesIO, Path]):
super().__init__(in_doc, path_or_stream)

4 changes: 2 additions & 2 deletions docling/backend/docling_parse_backend.py
@@ -132,7 +132,7 @@ def draw_clusters_and_cells():
return cells

def get_bitmap_rects(self, scale: float = 1) -> Iterable[BoundingBox]:
AREA_THRESHOLD = 32 * 32
AREA_THRESHOLD = 0 # 32 * 32

for i in range(len(self._dpage["images"])):
bitmap = self._dpage["images"][i]
@@ -163,7 +163,7 @@ def get_page_image(
l=0, r=0, t=0, b=0, coord_origin=CoordOrigin.BOTTOMLEFT
)
else:
padbox = cropbox.to_bottom_left_origin(page_size.height)
padbox = cropbox.to_bottom_left_origin(page_size.height).model_copy()
padbox.r = page_size.width - padbox.r
padbox.t = page_size.height - padbox.t

4 changes: 2 additions & 2 deletions docling/backend/docling_parse_v2_backend.py
@@ -140,7 +140,7 @@ def draw_clusters_and_cells():
return cells

def get_bitmap_rects(self, scale: float = 1) -> Iterable[BoundingBox]:
AREA_THRESHOLD = 32 * 32
AREA_THRESHOLD = 0 # 32 * 32

images = self._dpage["sanitized"]["images"]["data"]
images_header = self._dpage["sanitized"]["images"]["header"]
@@ -178,7 +178,7 @@ def get_page_image(
l=0, r=0, t=0, b=0, coord_origin=CoordOrigin.BOTTOMLEFT
)
else:
padbox = cropbox.to_bottom_left_origin(page_size.height)
padbox = cropbox.to_bottom_left_origin(page_size.height).model_copy()
padbox.r = page_size.width - padbox.r
padbox.t = page_size.height - padbox.t
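
The `.model_copy()` added in both PDF backends points at an aliasing hazard: if `to_bottom_left_origin` returns the very same object when the box is already bottom-left based, the two assignments that follow would overwrite the caller's `cropbox`. The sketch below illustrates that failure mode under this assumption (the diff itself does not show the method body). `Box`, `padding_aliased` and `padding_copied` are hypothetical names, and `dataclasses.replace` stands in for pydantic's `model_copy`.

```python
from dataclasses import dataclass, replace


@dataclass
class Box:
    """Hypothetical stand-in for a bounding box; not the docling_core class."""

    l: float
    r: float
    t: float
    b: float
    bottom_left: bool = True

    def to_bottom_left_origin(self, page_height: float) -> "Box":
        # Assumption for illustration: no conversion needed, so return self.
        if self.bottom_left:
            return self
        return Box(
            l=self.l,
            r=self.r,
            t=page_height - self.t,
            b=page_height - self.b,
            bottom_left=True,
        )


def padding_aliased(cropbox: Box, page_w: float, page_h: float) -> Box:
    # Mirrors the pre-change pattern: padbox may be the same object as cropbox.
    padbox = cropbox.to_bottom_left_origin(page_h)
    padbox.r = page_w - padbox.r
    padbox.t = page_h - padbox.t
    return padbox


def padding_copied(cropbox: Box, page_w: float, page_h: float) -> Box:
    # Mirrors the post-change pattern: work on a copy, leave cropbox untouched.
    padbox = replace(cropbox.to_bottom_left_origin(page_h))
    padbox.r = page_w - padbox.r
    padbox.t = page_h - padbox.t
    return padbox


crop_a = Box(l=10, r=590, t=780, b=20)
padding_aliased(crop_a, page_w=600, page_h=800)  # crop_a.r is silently rewritten

crop_b = Box(l=10, r=590, t=780, b=20)
pad_b = padding_copied(crop_b, page_w=600, page_h=800)  # crop_b stays intact
```

Copying before mutating keeps the padding computation local, which matters if the crop box is read again later for the same page.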

6 changes: 3 additions & 3 deletions docling/backend/html_backend.py
@@ -37,10 +37,10 @@ def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]

try:
if isinstance(self.path_or_stream, BytesIO):
text_stream = self.path_or_stream.getvalue().decode("utf-8")
text_stream = self.path_or_stream.getvalue()
self.soup = BeautifulSoup(text_stream, "html.parser")
if isinstance(self.path_or_stream, Path):
with open(self.path_or_stream, "r", encoding="utf-8") as f:
with open(self.path_or_stream, "rb") as f:
html_content = f.read()
self.soup = BeautifulSoup(html_content, "html.parser")
except Exception as e:
@@ -215,7 +215,7 @@ def handle_code(self, element, idx, doc):
label = DocItemLabel.CODE
if len(text) == 0:
return
doc.add_text(parent=self.parents[self.level], label=label, text=text)
doc.add_code(parent=self.parents[self.level], label=label, text=text)

def handle_paragraph(self, element, idx, doc):
"""Handles paragraph tags (p)."""
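
The first hunk above stops decoding the input as UTF-8 and hands raw bytes to BeautifulSoup, which can then work out the encoding itself when given bytes. The standard-library sketch below shows why the forced decode was fragile and what encoding sniffing amounts to in its simplest form. `sniff_charset` and `decode_html` are hypothetical helpers written for this illustration. They are not BeautifulSoup's detection logic, which is considerably more thorough.

```python
import re
from typing import Optional

# Hypothetical, much-simplified sniffing: look for a charset declared in a
# <meta> tag near the start of the document.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset=["']?([A-Za-z0-9_-]+)""", re.IGNORECASE
)


def sniff_charset(raw: bytes) -> Optional[str]:
    """Return the charset declared in a <meta> tag, if one is found."""
    match = _META_CHARSET.search(raw[:1024])
    return match.group(1).decode("ascii") if match else None


def decode_html(raw: bytes) -> str:
    """Decode with the declared charset, falling back to UTF-8."""
    declared = sniff_charset(raw)
    return raw.decode(declared or "utf-8", errors="replace")


raw_html = (
    '<html><head><meta charset="iso-8859-1"></head>'
    "<body>café</body></html>"
).encode("iso-8859-1")

# The removed code path: forcing UTF-8 fails on non-UTF-8 input.
try:
    raw_html.decode("utf-8")
    forced_utf8_ok = True
except UnicodeDecodeError:
    forced_utf8_ok = False

# Sniffing the declared charset first recovers the text.
decoded = decode_html(raw_html)
```

Reading the file in `"rb"` mode, as the new code does, is what makes this kind of detection possible: once the bytes have been decoded with a fixed codec, the declared charset can no longer be honoured.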
71 changes: 44 additions & 27 deletions docling/backend/md_backend.py
@@ -3,19 +3,22 @@
import warnings
from io import BytesIO
from pathlib import Path
from typing import Set, Union
from typing import List, Optional, Set, Union

import marko
import marko.ext
import marko.ext.gfm
import marko.inline
from docling_core.types.doc import (
DocItem,
DocItemLabel,
DoclingDocument,
DocumentOrigin,
GroupLabel,
NodeItem,
TableCell,
TableData,
TextItem,
)
from marko import Markdown

@@ -27,8 +30,7 @@


class MarkdownDocumentBackend(DeclarativeDocumentBackend):

def shorten_underscore_sequences(self, markdown_text, max_length=10):
def shorten_underscore_sequences(self, markdown_text: str, max_length: int = 10):
# This regex will match any sequence of underscores
pattern = r"_+"

@@ -90,13 +92,13 @@ def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]
) from e
return

def close_table(self, doc=None):
def close_table(self, doc: DoclingDocument):
if self.in_table:
_log.debug("=== TABLE START ===")
for md_table_row in self.md_table_buffer:
_log.debug(md_table_row)
_log.debug("=== TABLE END ===")
tcells = []
tcells: List[TableCell] = []
result_table = []
for n, md_table_row in enumerate(self.md_table_buffer):
data = []
@@ -137,15 +139,19 @@ def close_table(self, doc=None):
self.in_table = False
self.md_table_buffer = [] # clean table markdown buffer
# Initialize Docling TableData
data = TableData(num_rows=num_rows, num_cols=num_cols, table_cells=tcells)
table_data = TableData(
num_rows=num_rows, num_cols=num_cols, table_cells=tcells
)
# Populate
for tcell in tcells:
data.table_cells.append(tcell)
table_data.table_cells.append(tcell)
if len(tcells) > 0:
doc.add_table(data=data)
doc.add_table(data=table_data)
return

def process_inline_text(self, parent_element, doc=None):
def process_inline_text(
self, parent_element: Optional[NodeItem], doc: DoclingDocument
):
# self.inline_text_buffer += str(text_in)
txt = self.inline_text_buffer.strip()
if len(txt) > 0:
@@ -156,14 +162,20 @@ def process_inline_text(self, parent_element, doc=None):
)
self.inline_text_buffer = ""

def iterate_elements(self, element, depth=0, doc=None, parent_element=None):
def iterate_elements(
self,
element: marko.block.Element,
depth: int,
doc: DoclingDocument,
parent_element: Optional[NodeItem] = None,
):
# Iterates over all elements in the AST
# Check for different element types and process relevant details
if isinstance(element, marko.block.Heading):
self.close_table(doc)
self.process_inline_text(parent_element, doc)
_log.debug(
f" - Heading level {element.level}, content: {element.children[0].children}"
f" - Heading level {element.level}, content: {element.children[0].children}" # type: ignore
)
if element.level == 1:
doc_label = DocItemLabel.TITLE
@@ -172,10 +184,10 @@ def iterate_elements(self, element, depth=0, doc=None, parent_element=None):

# Header could have arbitrary inclusion of bold, italic or emphasis,
# hence we need to traverse the tree to get full text of a header
strings = []
strings: List[str] = []

# Define a recursive function to traverse the tree
def traverse(node):
def traverse(node: marko.block.BlockElement):
# Check if the node has a "children" attribute
if hasattr(node, "children"):
# If "children" is a list, continue traversal
@@ -209,9 +221,13 @@ def traverse(node):
self.process_inline_text(parent_element, doc)
_log.debug(" - List item")

snippet_text = str(element.children[0].children[0].children)
snippet_text = str(element.children[0].children[0].children) # type: ignore
is_numbered = False
if parent_element.label == GroupLabel.ORDERED_LIST:
if (
parent_element is not None
and isinstance(parent_element, DocItem)
and parent_element.label == GroupLabel.ORDERED_LIST
):
is_numbered = True
doc.add_list_item(
enumerated=is_numbered, parent=parent_element, text=snippet_text
@@ -221,7 +237,14 @@ def traverse(node):
self.close_table(doc)
self.process_inline_text(parent_element, doc)
_log.debug(f" - Image with alt: {element.title}, url: {element.dest}")
doc.add_picture(parent=parent_element, caption=element.title)

fig_caption: Optional[TextItem] = None
if element.title is not None and element.title != "":
fig_caption = doc.add_text(
label=DocItemLabel.CAPTION, text=element.title
)

doc.add_picture(parent=parent_element, caption=fig_caption)

elif isinstance(element, marko.block.Paragraph):
self.process_inline_text(parent_element, doc)
@@ -252,27 +275,21 @@ def traverse(node):
self.process_inline_text(parent_element, doc)
_log.debug(f" - Code Span: {element.children}")
snippet_text = str(element.children).strip()
doc.add_text(
label=DocItemLabel.CODE, parent=parent_element, text=snippet_text
)
doc.add_code(parent=parent_element, text=snippet_text)

elif isinstance(element, marko.block.CodeBlock):
self.close_table(doc)
self.process_inline_text(parent_element, doc)
_log.debug(f" - Code Block: {element.children}")
snippet_text = str(element.children[0].children).strip()
doc.add_text(
label=DocItemLabel.CODE, parent=parent_element, text=snippet_text
)
snippet_text = str(element.children[0].children).strip() # type: ignore
doc.add_code(parent=parent_element, text=snippet_text)

elif isinstance(element, marko.block.FencedCode):
self.close_table(doc)
self.process_inline_text(parent_element, doc)
_log.debug(f" - Code Block: {element.children}")
snippet_text = str(element.children[0].children).strip()
doc.add_text(
label=DocItemLabel.CODE, parent=parent_element, text=snippet_text
)
snippet_text = str(element.children[0].children).strip() # type: ignore
doc.add_code(parent=parent_element, text=snippet_text)

elif isinstance(element, marko.inline.LineBreak):
self.process_inline_text(parent_element, doc)
1 change: 0 additions & 1 deletion docling/backend/msexcel_backend.py
@@ -45,7 +45,6 @@ class ExcelTable(BaseModel):


class MsExcelDocumentBackend(DeclarativeDocumentBackend):

def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
super().__init__(in_doc, path_or_stream)

26 changes: 15 additions & 11 deletions docling/backend/mspowerpoint_backend.py
@@ -16,7 +16,7 @@
TableCell,
TableData,
)
from PIL import Image
from PIL import Image, UnidentifiedImageError
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE, PP_PLACEHOLDER

@@ -120,6 +120,7 @@ def handle_text_elements(self, shape, parent_slide, slide_ind, doc):
bullet_type = "None"
list_text = ""
list_label = GroupLabel.LIST
doc_label = DocItemLabel.LIST_ITEM
prov = self.generate_prov(shape, slide_ind, shape.text.strip())

# Identify if shape contains lists
@@ -276,16 +277,19 @@ def handle_pictures(self, shape, parent_slide, slide_ind, doc):
im_dpi, _ = image.dpi

# Open it with PIL
pil_image = Image.open(BytesIO(image_bytes))

# shape has picture
prov = self.generate_prov(shape, slide_ind, "")
doc.add_picture(
parent=parent_slide,
image=ImageRef.from_pil(image=pil_image, dpi=im_dpi),
caption=None,
prov=prov,
)
try:
pil_image = Image.open(BytesIO(image_bytes))

# shape has picture
prov = self.generate_prov(shape, slide_ind, "")
doc.add_picture(
parent=parent_slide,
image=ImageRef.from_pil(image=pil_image, dpi=im_dpi),
caption=None,
prov=prov,
)
except (UnidentifiedImageError, OSError) as e:
_log.warning(f"Warning: image cannot be loaded by Pillow: {e}")
return

def handle_tables(self, shape, parent_slide, slide_ind, doc):
1 change: 0 additions & 1 deletion docling/backend/msword_backend.py
@@ -26,7 +26,6 @@


class MsWordDocumentBackend(DeclarativeDocumentBackend):

def __init__(self, in_doc: "InputDocument", path_or_stream: Union[BytesIO, Path]):
super().__init__(in_doc, path_or_stream)
self.XML_KEY = (
2 changes: 0 additions & 2 deletions docling/backend/pdf_backend.py
@@ -12,7 +12,6 @@


class PdfPageBackend(ABC):

@abstractmethod
def get_text_in_rect(self, bbox: BoundingBox) -> str:
pass
@@ -45,7 +44,6 @@ def unload(self):


class PdfDocumentBackend(PaginatedDocumentBackend):

def __init__(self, in_doc: InputDocument, path_or_stream: Union[BytesIO, Path]):
super().__init__(in_doc, path_or_stream)
