feat: add page chunking #337
Conversation
Signed-off-by: Panos Vagenas <[email protected]>
✅ DCO Check Passed. Thanks @vagenas, all your commits are properly signed off. 🎉
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 Require two reviewers for test updates: Wonderful, this rule succeeded. When test data is updated, we require two reviewers.
Codecov Report. Attention: Patch coverage is
)
else:
    # if no pages, treat whole document as single chunk
    ser_res = my_doc_ser.serialize()
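For context, a minimal sketch of the page-grouping idea behind this branch (illustrative only; the function name and item shape are assumptions, not the PR's actual code): items are bucketed by the page of their first provenance entry, and a document without page information falls back to a single whole-document chunk, mirroring the `else` branch above.

```python
from collections import defaultdict

def group_items_by_page(doc_items):
    """Bucket items by the page of their first provenance entry.

    Assumes each item exposes `prov`, a list whose entries carry a
    `page_no` attribute (as DoclingDocument items do); items without
    provenance land under the key None.
    """
    buckets = defaultdict(list)
    for item in doc_items:
        prov = getattr(item, "prov", None)
        buckets[prov[0].page_no if prov else None].append(item)
    return buckets

# One chunk per page; if no item carries page information, the whole
# item list is kept together as a single chunk.
```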
I think that here we need to have a parameter that sets the max size of the chunk (measured in chars or string length), otherwise we might get into trouble due to a few poisonous documents.
@PeterStaar-IBM, as the premise is not just to chunk the text, but also to provide the doc items contributing to each chunk, adding such a limit would be somewhat more involved.
👉 I therefore propose introducing this capability in its simple form for now, to address strictly page-based use cases; a max_chars mechanism could then be included in the future if deemed necessary.
I agree with the current proposal, and with the intent of enhancing it in the future.
In general, I would be in favour of a maximum size, but we would propose it as a solution which doesn't split doc items and simply adds items until the limit is reached.
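To make that behaviour concrete, here is a minimal sketch of the discussed accumulation strategy (hypothetical, not part of this PR; the `(text, item_ref)` pair shape and the `max_chars` parameter are assumptions): items are never split, they are simply appended to the current chunk until the character limit would be exceeded, at which point a new chunk is started.

```python
def accumulate_until_limit(items, max_chars):
    # `items` is assumed to be a list of (text, item_ref) pairs; `max_chars`
    # is the proposed chunk-size cap. An item longer than the cap still ends
    # up in a chunk of its own rather than being split.
    chunks, current, current_len = [], [], 0
    for text, ref in items:
        if current and current_len + len(text) > max_chars:
            chunks.append(current)
            current, current_len = [], 0
        current.append((text, ref))
        current_len += len(text)
    if current:
        chunks.append(current)
    return chunks
```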
@vagenas, I tried this locally. However, I don't see the images as part of the metadata. Wouldn't it be good to have them?
@nikhildigde consistent with our other chunkers, this implementation provides the chunk objects, which:
- contain the text from the respective items
- provide the contextualized version thereof, i.e. including the respective section headers (see docs), and
- contain references to the respective items, which can be used for getting all relevant metadata
You can resolve things like images from point 3, e.g. see our visual grounding example.
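As an illustration of points 1-3, a short usage sketch (shown with the existing HybridChunker for concreteness; the page-based chunker added in this PR is assumed to expose the same chunk interface, and "report.pdf" is a placeholder path):

```python
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

doc = DocumentConverter().convert("report.pdf").document
chunker = HybridChunker()

for chunk in chunker.chunk(dl_doc=doc):
    print(chunk.text)                    # 1. text of the contributing items
    print(chunker.contextualize(chunk))  # 2. same text enriched with section headers
    for item in chunk.meta.doc_items:    # 3. references to the original doc items
        # item-level metadata (label, provenance, bounding boxes, ...) can be
        # resolved from here, e.g. to locate pictures, as in the visual
        # grounding example
        print(item.self_ref, item.label)
```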
Yes, that's what I plan to do, but it would be convenient if the references also contained the "picture" item refs from the metadata. And not only pictures, maybe everything.
Can this be merged please? A much-needed feature :) Thanks @vagenas for the PR!!!
lgtm
@PeterStaar-IBM please consider providing an updated review based on my comment further above; otherwise this PR appears blocked by the previous review.
Internally discussed with maintainers.
Any clue on when the next release is planned? Thank you!