feat: add page chunking #337


Merged: 1 commit merged into main on Jul 15, 2025

Conversation

@vagenas (Collaborator) commented Jun 20, 2025

No description provided.

Signed-off-by: Panos Vagenas <[email protected]>
DCO Check Passed

Thanks @vagenas, all your commits are properly signed off. 🎉


mergify bot commented Jun 20, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewers for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2


codecov bot commented Jun 20, 2025

Codecov Report

Attention: Patch coverage is 87.50000% with 4 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling_core/transforms/chunker/page_chunker.py | 81.81% | 4 Missing ⚠️ |


@vagenas vagenas marked this pull request as ready for review July 1, 2025 11:32
@vagenas vagenas requested review from cau-git and PeterStaar-IBM July 2, 2025 04:20
Review thread on the diff in docling_core/transforms/chunker/page_chunker.py:

```python
)
else:
    # if no pages, treat whole document as single chunk
    ser_res = my_doc_ser.serialize()
```
@PeterStaar-IBM (Contributor) commented:

I think that here we need to have a parameter that sets the max size of the chunk (measured in chars or string length), otherwise we might get into trouble due to a few poisonous documents.

@vagenas (Collaborator, Author) replied:

@PeterStaar-IBM, as the premise is not just to chunk the text, but also to provide the doc items contributing to each chunk, adding such a limit would be somewhat more involved.

👉 I therefore propose introducing this capability in its simple form for now to address strictly page-based use cases; a max_chars mechanism could then be included in the future if deemed necessary.
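The simple page-based form described here amounts to grouping doc items by their page number, one chunk per page. A minimal self-contained sketch, where plain `(page_no, text)` pairs stand in for the actual DoclingDocument items (this is an illustration, not the real PageChunker implementation):

```python
from itertools import groupby


def chunk_by_page(doc_items):
    """Group (page_no, text) pairs into one chunk per page.

    Hypothetical sketch: the real chunker operates on document items
    and also carries item references in each chunk's metadata.
    """
    # sort is stable, so items keep their in-page order
    items = sorted(doc_items, key=lambda it: it[0])
    return [
        {"page_no": page, "text": "\n".join(t for _, t in group)}
        for page, group in groupby(items, key=lambda it: it[0])
    ]
```

A document with no page information would then fall back to a single chunk covering the whole document, as in the diff above.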

A contributor replied:

I agree with the current proposal, and with the intent of enhancing it in the future.

In general, I would be in favour of a maximum size, but we would propose it as a solution which doesn't split doc items and simply adds items until the limit is reached.
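A maximum size that never splits doc items amounts to greedy packing: keep appending whole items until the next one would exceed the limit, then start a new chunk. A minimal sketch, with `max_chars` as the hypothetical parameter name from this discussion:

```python
def pack_items(texts, max_chars):
    """Greedily pack item texts into chunks of at most max_chars.

    Items are never split: an item larger than max_chars gets a
    chunk of its own rather than being cut mid-item.
    """
    chunks, current, size = [], [], 0
    for text in texts:
        # flush the current chunk if adding this item would overflow it
        if current and size + len(text) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append(current)
    return chunks
```

Because oversized items pass through whole, the limit is a soft cap, which matches the "don't split doc items" intent above.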

@nikhildigde commented:

@vagenas, I tried this locally. However, I don't see the images as part of the metadata. Wouldn't it be good to have them?

@vagenas (Collaborator, Author) replied:

@nikhildigde, consistent with our other chunkers, this implementation provides chunk objects, which:

  1. contain the text from the respective items
  2. provide the contextualized version thereof, i.e. including the respective section headers (see docs), and
  3. contain references to the respective items, which can be used for getting all relevant metadata

You can resolve things like images from point 3, e.g. see our visual grounding example.
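Point 3 relies on each chunk's metadata carrying references (JSON-pointer-style strings such as "#/pictures/0") back into the document. A self-contained sketch of resolving such a ref against a document represented as a plain dict; the document structure here is illustrative, not the exact DoclingDocument schema:

```python
def resolve_ref(doc, ref):
    """Resolve a ref like '#/pictures/0' against a nested dict/list."""
    node = doc
    for part in ref.lstrip("#/").split("/"):
        # numeric path segments index into lists, others into dicts
        node = node[int(part)] if part.isdigit() else node[part]
    return node
```

Given the resolved item, metadata such as captions or image payloads can then be read off it, which is the pattern the visual grounding example follows.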

@nikhildigde replied:

Yes, that's what I plan to do, but it would be convenient if the references also contained the "picture" item refs in the metadata. And not only pictures; maybe everything.

@nikhildigde commented Jul 15, 2025

Can this be merged, please? A much-needed feature :) Thanks @vagenas for the PR!

@dolfim-ibm (Contributor) left a review comment:

lgtm

@vagenas vagenas requested a review from PeterStaar-IBM July 15, 2025 08:14
@vagenas (Collaborator, Author) commented Jul 15, 2025

@PeterStaar-IBM please consider providing an updated review based on my comment further above, otherwise this PR appears blocked by the previous review.

@vagenas vagenas dismissed PeterStaar-IBM’s stale review July 15, 2025 08:33

Internally discussed with maintainers.

@vagenas vagenas merged commit 3a0b747 into main Jul 15, 2025
12 checks passed
@vagenas vagenas deleted the add-page-chunking branch July 15, 2025 09:03
@nikhildigde commented:
Any clue on when the next release is planned? Thank you!

5 participants