Sample chunking notebook that includes merging, etc. #193

Merged 4 commits on Nov 19, 2024

Conversation

jwm4
Collaborator

@jwm4 jwm4 commented Nov 1, 2024

Some key differences between this proposed chunking notebook and the one in advanced_chunking.ipynb:

  1. This one merges adjacent chunks that share the same headings and captions (e.g., consecutive paragraphs within the same section).
  2. This one splits on doc_items, such as the elements of an itemized list, before falling back to generic text splitting. As a result, chunk boundaries line up with the beginning and end of list items more often.
  3. This one uses DoclingDocument.name as the title of the document instead of assuming that the title will be in the headers. That is probably not a good idea going forward, because in the near future the extracted title will be in the headers. DoclingDocument.name comes from document metadata; it sometimes reflects the title but is often not very useful.
  4. This one uses semchunk as the plain-text splitter when the hierarchical elements are too big. The semchunk repo lays out the argument for why it is a good generic text splitter. I also tried it on some tricky examples and liked the output in practice.
  5. This one does not use yield to stream out the chunks one at a time; it uses lists for everything and wraps them in an iterator at the end to comply with the API. That seems simpler but is probably less efficient, especially at large scale.
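
For readers skimming the thread, here is a rough sketch of items 1 and 5: merging adjacent chunks that share headings and captions, then wrapping the finished list in an iterator. It is not the notebook's code. The `Chunk` class, `count_tokens`, `merge_adjacent`, and `chunk_iter` are hypothetical names made up for illustration, and the whitespace token count is an assumption standing in for a real tokenizer.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk; not the docling chunk type."""
    text: str
    headings: Optional[list[str]] = None
    captions: Optional[list[str]] = None


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate the tokenizer's count.
    return len(text.split())


def merge_adjacent(chunks: list[Chunk], max_tokens: int) -> list[Chunk]:
    """Merge neighbouring chunks with identical headings and captions,
    as long as the merged text stays within max_tokens."""
    merged: list[Chunk] = []
    for chunk in chunks:
        prev = merged[-1] if merged else None
        same_context = (
            prev is not None
            and prev.headings == chunk.headings
            and prev.captions == chunk.captions
        )
        if same_context and count_tokens(prev.text) + count_tokens(chunk.text) <= max_tokens:
            # Same section and still under budget: extend the previous chunk.
            merged[-1] = Chunk(prev.text + "\n" + chunk.text, prev.headings, prev.captions)
        else:
            merged.append(chunk)
    return merged


def chunk_iter(chunks: list[Chunk], max_tokens: int) -> Iterator[Chunk]:
    # Build the full list first, then wrap it in an iterator (item 5).
    return iter(merge_adjacent(chunks, max_tokens))
```

A generator that yields each merged chunk as soon as its successor fails the merge test would avoid holding the whole list in memory, at the cost of slightly more bookkeeping.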

jwm4 and others added 4 commits November 1, 2024 08:41
Signed-off-by: Bill Murdock <[email protected]>
Earlier versions used `doc.name` as the overall title of the document, but the discussion showed that it is probably better to trust `doc_chunk.meta.headings` to carry the title information sooner or later. So I've removed all the special title handling and now rely only on the headers.

Signed-off-by: Bill Murdock <[email protected]>
@PeterStaar-IBM
Contributor

@vagenas Let's review together later today. I do not see any blocker so far.

@PeterStaar-IBM PeterStaar-IBM marked this pull request as draft November 18, 2024 08:32
@vagenas vagenas marked this pull request as ready for review November 19, 2024 22:11
@vagenas vagenas merged commit 5a8186b into advanced-chunking-example Nov 19, 2024
8 checks passed
@vagenas vagenas deleted the jwm4-chunking-example-1 branch November 19, 2024 22:12
vagenas pushed a commit that referenced this pull request Dec 3, 2024
Signed-off-by: Bill Murdock <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Co-authored-by: Peter Staar <[email protected]>