feat: add page chunking #337


Merged: 1 commit merged into main on Jul 15, 2025

Conversation

@vagenas (Collaborator) commented Jun 20, 2025

No description provided.

Signed-off-by: Panos Vagenas <[email protected]>
DCO Check Passed

Thanks @vagenas, all your commits are properly signed off. 🎉


mergify bot commented Jun 20, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewers for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2


codecov bot commented Jun 20, 2025

Codecov Report

Attention: Patch coverage is 87.50000% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling_core/transforms/chunker/page_chunker.py | 81.81% | 4 Missing ⚠️ |


@vagenas vagenas marked this pull request as ready for review July 1, 2025 11:32
@vagenas vagenas requested review from cau-git and PeterStaar-IBM July 2, 2025 04:20
Review thread on the diff in docling_core/transforms/chunker/page_chunker.py:

```python
)
else:
    # if no pages, treat whole document as single chunk
    ser_res = my_doc_ser.serialize()
```
@PeterStaar-IBM (Contributor) commented:

I think that here we need to have a parameter that sets the max size of the chunk (measured in chars or string length), otherwise we might get into trouble due to a few poisonous documents.

@vagenas (Collaborator, Author) replied:

@PeterStaar-IBM, as the premise is not just to chunk the text, but also to provide the doc items contributing to each chunk, adding such a limit would be somewhat more involved.

👉 I therefore propose introducing this capability in its simple form for now to address strictly page-based use cases; a max_chars mechanism could then be included in the future if deemed necessary.
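The simple page-based form described here amounts to grouping doc items by their page number, one chunk per page. A minimal self-contained sketch, where plain `(page_no, text)` pairs stand in for the actual DoclingDocument items (this is an illustration, not the real PageChunker implementation):

```python
from itertools import groupby


def chunk_by_page(doc_items):
    """Group (page_no, text) pairs into one chunk per page.

    Hypothetical sketch: the real chunker operates on document items
    and also carries item references in each chunk's metadata.
    """
    # sort is stable, so items keep their in-page order
    items = sorted(doc_items, key=lambda it: it[0])
    return [
        {"page_no": page, "text": "\n".join(t for _, t in group)}
        for page, group in groupby(items, key=lambda it: it[0])
    ]
```

A document with no page information would then fall back to a single chunk covering the whole document, as in the diff above.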

A contributor replied:

I agree with the current proposal, and with the intent of enhancing it in the future.

In general, I would be in favour of a maximum size, but we would propose it as a solution which doesn't split doc items and simply adds items until the limit is reached.
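A maximum size that never splits doc items amounts to greedy packing: keep appending whole items until the next one would exceed the limit, then start a new chunk. A minimal sketch, with `max_chars` as the hypothetical parameter name from this discussion:

```python
def pack_items(texts, max_chars):
    """Greedily pack item texts into chunks of at most max_chars.

    Items are never split: an item larger than max_chars gets a
    chunk of its own rather than being cut mid-item.
    """
    chunks, current, size = [], [], 0
    for text in texts:
        # flush the current chunk if adding this item would overflow it
        if current and size + len(text) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append(current)
    return chunks
```

Because oversized items pass through whole, the limit is a soft cap, which matches the "don't split doc items" intent above.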

@nikhildigde commented:

@vagenas, I tried this locally. However, I don't see the images as part of the metadata. Wouldn't it be good to have them?

@vagenas (Collaborator, Author) replied:

@nikhildigde, consistent with our other chunkers, this implementation provides chunk objects, which:

  1. contain the text from the respective items
  2. provide the contextualized version thereof, i.e. including the respective section headers (see docs), and
  3. contain references to the respective items, which can be used for getting all relevant metadata

You can resolve things like images from point 3, e.g. see our visual grounding example.
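Point 3 relies on each chunk's metadata carrying references (JSON-pointer-style strings such as "#/pictures/0") back into the document. A self-contained sketch of resolving such a ref against a document represented as a plain dict; the document structure here is illustrative, not the exact DoclingDocument schema:

```python
def resolve_ref(doc, ref):
    """Resolve a ref like '#/pictures/0' against a nested dict/list."""
    node = doc
    for part in ref.lstrip("#/").split("/"):
        # numeric path segments index into lists, others into dicts
        node = node[int(part)] if part.isdigit() else node[part]
    return node
```

Given the resolved item, metadata such as captions or image payloads can then be read off it, which is the pattern the visual grounding example follows.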

@nikhildigde replied:

Yes, that's what I plan to do, but it would be convenient if the references also contained the "picture" item refs in the metadata. And not only pictures; maybe everything.

@nikhildigde commented Jul 15, 2025

Can this be merged, please? A much-needed feature :) Thanks @vagenas for the PR!

@dolfim-ibm (Contributor) left a review comment:

lgtm

@vagenas vagenas requested a review from PeterStaar-IBM July 15, 2025 08:14
@vagenas (Collaborator, Author) commented Jul 15, 2025

@PeterStaar-IBM please consider providing an updated review based on my comment further above, otherwise this PR appears blocked by the previous review.

@vagenas vagenas dismissed PeterStaar-IBM’s stale review July 15, 2025 08:33

Internally discussed with maintainers.

@vagenas vagenas merged commit 3a0b747 into main Jul 15, 2025
12 checks passed
@vagenas vagenas deleted the add-page-chunking branch July 15, 2025 09:03
@nikhildigde commented:
Any clue on when the next release is planned? Thank you!

5 participants