Sample chunking notebook that includes merging, etc. #193

jwm4 · 2024-11-01T12:49:04Z

Some key differences between this proposed chunking notebook and the one in advanced_chunking.ipynb:

This one merges chunks that have the same headings and captions (e.g., adjacent paragraphs within the same section).
This one splits on doc_items, such as the elements of an itemized list, before applying generic text splitting. As a result, chunk boundaries more often coincide with the beginning and end of list items.
This one uses the DoclingDocument.name as the title of the document instead of assuming that the title will be in the headers. That's probably not a great idea going forward, though, because in the near future the extracted title will be in the headers. The DoclingDocument.name comes from document metadata and sometimes reflects the title, but it is often not very useful.
This one uses semchunk as the plain text splitter for use when the hierarchical elements are too big. In the semchunk repo, you can see their argument for why this is a good generic text splitter. Also, I tried it on some tricky examples and I liked the output in practice.
This one does not use yield to stream out the chunks one at a time -- it just uses lists for everything and then wraps them in an iterator at the end to comply with the API. That seems simpler but is probably less efficient, especially at large scale.
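
To make the merging step concrete, here is a minimal sketch of the idea, not the notebook's actual code. The `Chunk` dataclass, the `merge_adjacent` function, and the `max_chars` budget are all hypothetical names introduced for illustration; they are not part of Docling's API, and the notebook itself may size chunks differently (for example by tokens rather than characters). The sketch merges adjacent chunks whose headings and captions match, builds the result as a plain list, and wraps it in an iterator only at the end.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk; not Docling's actual chunk class."""

    text: str
    headings: Tuple[str, ...] = ()
    captions: Tuple[str, ...] = ()


def merge_adjacent(
    chunks: List[Chunk], max_chars: Optional[int] = None
) -> Iterator[Chunk]:
    """Merge neighbouring chunks that share the same headings and captions.

    max_chars is an assumed size budget (characters, for simplicity);
    None means no limit.
    """
    merged: List[Chunk] = []
    for chunk in chunks:
        prev = merged[-1] if merged else None
        same_context = (
            prev is not None
            and prev.headings == chunk.headings
            and prev.captions == chunk.captions
        )
        # The "+ 1" accounts for the newline used to join the two texts.
        fits = same_context and (
            max_chars is None
            or len(prev.text) + 1 + len(chunk.text) <= max_chars
        )
        if fits:
            merged[-1] = Chunk(
                prev.text + "\n" + chunk.text, prev.headings, prev.captions
            )
        else:
            merged.append(chunk)
    # Everything is built as a list; it is wrapped in an iterator only here.
    return iter(merged)
```

Calling `list(merge_adjacent(chunks))` on two paragraphs under the same heading followed by one under a different heading would give two chunks. Replacing the list with `yield` would stream results instead, at the cost of slightly more involved bookkeeping for the pending chunk.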

Signed-off-by: Bill Murdock <[email protected]>

Earlier versions used `doc.name` as the overall title of the document, but the discussion showed that it is probably better to trust `doc_chunk.meta.headings` to carry the title information sooner or later. So I've removed all the special title handling and now rely on the headers alone. Signed-off-by: Bill Murdock <[email protected]>

Signed-off-by: Peter Staar <[email protected]>

PeterStaar-IBM · 2024-11-05T06:30:46Z

@vagenas Let's review together later today. I do not see any blocker so far.

Signed-off-by: Bill Murdock <[email protected]> Signed-off-by: Peter Staar <[email protected]> Co-authored-by: Peter Staar <[email protected]>

jwm4 and others added 4 commits November 1, 2024 08:41

Add files via upload

e41b413

Signed-off-by: Bill Murdock <[email protected]>

Update advanced_chunking_with_merging

7ed4d37

Signed-off-by: Bill Murdock <[email protected]>

reformatted the code to pass the tests

98efb89

Signed-off-by: Peter Staar <[email protected]>

jwm4 mentioned this pull request Nov 8, 2024

Use Docling v2 hierarchical chunking instead of the existing context-aware chunking implementation instructlab/sdg#350

Closed

PeterStaar-IBM marked this pull request as draft November 18, 2024 08:32

vagenas marked this pull request as ready for review November 19, 2024 22:11

vagenas merged commit 5a8186b into advanced-chunking-example Nov 19, 2024
8 checks passed

vagenas deleted the jwm4-chunking-example-1 branch November 19, 2024 22:12

vagenas pushed a commit that referenced this pull request Dec 3, 2024

Sample chunking notebook that includes merging, etc. (#193)

2c06c0d

Signed-off-by: Bill Murdock <[email protected]> Signed-off-by: Peter Staar <[email protected]> Co-authored-by: Peter Staar <[email protected]>

mairin mentioned this pull request Dec 17, 2024

InstructLab Maintainer nomination for Bill Murdock instructlab/instructlab#2931

Open
