feat: Extracting picture data for raster images found in PPTX (#349)
* Added picture data for pptx pictures

Signed-off-by: Maksym Lysak <[email protected]>

* Added tests for pptx

Signed-off-by: Maksym Lysak <[email protected]>

* Inferring image DPI from pptx file

Signed-off-by: Maksym Lysak <[email protected]>

---------

Signed-off-by: Maksym Lysak <[email protected]>
Co-authored-by: Maksym Lysak <[email protected]>
maxmnemonic and Maksym Lysak authored Nov 18, 2024
1 parent 7dbdbde commit 7a97d71
Showing 9 changed files with 2,467 additions and 1 deletion.
17 changes: 16 additions & 1 deletion docling/backend/mspowerpoint_backend.py
@@ -10,11 +10,13 @@
     DoclingDocument,
     DocumentOrigin,
     GroupLabel,
+    ImageRef,
     ProvenanceItem,
     Size,
     TableCell,
     TableData,
 )
+from PIL import Image
 from pptx import Presentation
 from pptx.enum.shapes import MSO_SHAPE_TYPE, PP_PLACEHOLDER

@@ -268,9 +270,22 @@ def handle_title(self, shape, parent_slide, slide_ind, doc):
         return
 
     def handle_pictures(self, shape, parent_slide, slide_ind, doc):
+        # Get the image bytes
+        image = shape.image
+        image_bytes = image.blob
+        im_dpi, _ = image.dpi
+
+        # Open it with PIL
+        pil_image = Image.open(BytesIO(image_bytes))
+
         # shape has picture
         prov = self.generate_prov(shape, slide_ind, "")
-        doc.add_picture(parent=parent_slide, caption=None, prov=prov)
+        doc.add_picture(
+            parent=parent_slide,
+            image=ImageRef.from_pil(image=pil_image, dpi=im_dpi),
+            caption=None,
+            prov=prov,
+        )
         return
 
     def handle_tables(self, shape, parent_slide, slide_ind, doc):
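
The change above takes the horizontal value of `image.dpi` and passes it to `ImageRef.from_pil`. To illustrate where a per-image DPI can come from, the sketch below reads the `pHYs` chunk of a PNG stream using only the standard library. It is not the python-pptx implementation: the names `png_dpi` and `_chunk` and the 72 DPI fallback are assumptions made for this example.

```python
from __future__ import annotations

import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
METERS_PER_INCH = 0.0254
DEFAULT_DPI = 72  # assumed fallback when the file stores no usable density


def _chunk(kind: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: 4-byte length, 4-byte type, data, CRC-32."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def png_dpi(blob: bytes) -> tuple[int, int]:
    """Return (horizontal, vertical) DPI taken from a PNG pHYs chunk."""
    if not blob.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG stream")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(blob):
        length, kind = struct.unpack(">I4s", blob[pos:pos + 8])
        data = blob[pos + 8:pos + 8 + length]
        if kind == b"pHYs" and length == 9:
            x_ppu, y_ppu, unit = struct.unpack(">IIB", data)
            if unit == 1:  # unit 1: values are pixels per metre
                return (
                    round(x_ppu * METERS_PER_INCH),
                    round(y_ppu * METERS_PER_INCH),
                )
            break  # unit 0 only gives an aspect ratio, not a density
        if kind in (b"IDAT", b"IEND"):
            break  # pHYs must appear before the image data
        pos += 12 + length  # length + type + data + CRC
    return (DEFAULT_DPI, DEFAULT_DPI)


# Header-only PNG streams (no pixel data) are enough for this parser.
ihdr = _chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
phys = _chunk(b"pHYs", struct.pack(">IIB", 3780, 3780, 1))
iend = _chunk(b"IEND", b"")

sample_png = PNG_SIGNATURE + ihdr + phys + iend
bare_png = PNG_SIGNATURE + ihdr + iend

im_dpi, _ = png_dpi(sample_png)  # same unpacking pattern as in the diff
print(im_dpi, png_dpi(bare_png))
```

A density of 3780 pixels per metre corresponds to roughly 96 DPI, so the first call yields 96, while the stream without a `pHYs` chunk falls back to the assumed default.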
35 changes: 35 additions & 0 deletions tests/data/groundtruth/docling_v2/powerpoint_sample.pptx.itxt
@@ -0,0 +1,35 @@
+item-0 at level 0: unspecified: group _root_
+  item-1 at level 1: chapter: group slide-0
+    item-2 at level 2: title: Test Table Slide
+    item-3 at level 2: paragraph: With footnote
+    item-4 at level 2: table with [9x7]
+  item-5 at level 1: chapter: group slide-1
+    item-6 at level 2: title: Second slide title
+    item-7 at level 2: paragraph: Let’s introduce a list
+    item-8 at level 2: paragraph: With foo
+    item-9 at level 2: paragraph: Bar
+    item-10 at level 2: paragraph: And baz things
+    item-11 at level 2: paragraph: A rectangle shape with this text inside.
+  item-12 at level 1: chapter: group slide-2
+    item-13 at level 2: ordered_list: group list
+      item-14 at level 3: list_item: List item4
+      item-15 at level 3: list_item: List item5
+      item-16 at level 3: list_item: List item6
+    item-17 at level 2: list: group list
+      item-18 at level 3: list_item: I1
+      item-19 at level 3: list_item: I2
+      item-20 at level 3: list_item: I3
+      item-21 at level 3: list_item: I4
+    item-22 at level 2: paragraph: Some info:
+    item-23 at level 2: list: group list
+      item-24 at level 3: list_item: Item A
+      item-25 at level 3: list_item: Item B
+    item-26 at level 2: paragraph: Maybe a list?
+    item-27 at level 2: ordered_list: group list
+      item-28 at level 3: list_item: List1
+      item-29 at level 3: list_item: List2
+      item-30 at level 3: list_item: List3
+    item-31 at level 2: list: group list
+      item-32 at level 3: list_item: l1
+      item-33 at level 3: list_item: l2
+      item-34 at level 3: list_item: l3
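
Each line of the groundtruth listing above carries an item index, a nesting level, a label, and a short description. The following sketch parses such lines into tuples so a test could compare structure instead of raw text. The line format is inferred only from the lines shown here, and `TreeItem`, `ITEM_RE`, and `parse_itxt` are hypothetical names, not part of docling.

```python
import re
from typing import List, NamedTuple


class TreeItem(NamedTuple):
    index: int
    level: int
    label: str
    text: str


# Matches both "label: text" and "label text" (e.g. "table with [9x7]").
ITEM_RE = re.compile(r"^\s*item-(\d+) at level (\d+): (\w+)(?:: | )(.*)$")


def parse_itxt(content: str) -> List[TreeItem]:
    """Parse indented-tree lines into TreeItem tuples, skipping blank lines."""
    items = []
    for line in content.splitlines():
        if not line.strip():
            continue
        match = ITEM_RE.match(line)
        if match is None:
            raise ValueError(f"unrecognised line: {line!r}")
        index, level, label, text = match.groups()
        items.append(TreeItem(int(index), int(level), label, text))
    return items


SAMPLE = """\
item-0 at level 0: unspecified: group _root_
  item-1 at level 1: chapter: group slide-0
    item-2 at level 2: title: Test Table Slide
    item-3 at level 2: paragraph: With footnote
    item-4 at level 2: table with [9x7]
"""

items = parse_itxt(SAMPLE)
for item in items:
    print(item)
```

Comparing parsed tuples rather than raw strings makes such a check tolerant of indentation differences, at the cost of depending on the assumed line format.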