Reduce allocations for MarkdownTextChunker by 80-95% #9
Description
This PR dramatically reduces memory allocations in `MarkdownTextChunker.Chunk()` - a hot-path component in the GraphRAG indexing pipeline. The method is called for every document during the `CreateBaseTextUnitsWorkflow` (step 2 of 7), making allocation reduction critical for large-scale indexing.

Proposed Changes
Architecture Overhaul: Range-Based Splitting
The core change replaces string-based recursive splitting with a range-based architecture:
- `FragmentRange` struct - Replaces the `Fragment` record class with `readonly record struct FragmentRange(Range Range, bool IsSeparator)`. Zero-allocation fragment representation using indices into the original text.
- `RecursiveSplitRanges()` - New method that returns `List<Range>` instead of `List<string>`. Tracks positions via `Range` values throughout the recursion tree, deferring string allocation until the final step.
- `GenerateChunksRanges()` - Partner method that accumulates ranges using integer indices instead of a StringBuilder. Eliminates the massive intermediate string allocations from `StringBuilder.ToString()` calls.
- `SplitToFragments()` - Returns `List<FragmentRange>` and uses `SearchValues<char>` for SIMD-vectorized separator detection via `IndexOfAny()`.
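To make the range-based design concrete, here is a minimal sketch of the `FragmentRange` struct and a `SplitToFragments`-style scan built on `SearchValues<char>`. The helper names and the separator set below are illustrative assumptions, not the exact code in this PR; in particular, the real chunker asks its separator trie for the longest match instead of treating a single character as the separator.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;

// Zero-allocation fragment descriptor: indices into the original text,
// plus a flag saying whether the range covers a separator.
public readonly record struct FragmentRange(Range Range, bool IsSeparator);

public static class RangeSplitterSketch
{
    // Illustrative separator first-character set; the real chunker derives
    // this from its separator trie.
    private static readonly SearchValues<char> SeparatorFirstChars =
        SearchValues.Create("\n#>-*");

    // Sketch of a SplitToFragments-style scan: no substrings are created,
    // only Range values pointing back into `text`.
    public static List<FragmentRange> ScanFragments(ReadOnlySpan<char> text)
    {
        var fragments = new List<FragmentRange>();
        int start = 0;

        while (start < text.Length)
        {
            // SIMD-accelerated search for the next potential separator character.
            int hit = text[start..].IndexOfAny(SeparatorFirstChars);
            if (hit < 0)
            {
                fragments.Add(new FragmentRange(start..text.Length, IsSeparator: false));
                break;
            }

            int sepStart = start + hit;
            if (hit > 0)
                fragments.Add(new FragmentRange(start..sepStart, IsSeparator: false));

            // Simplification: take a single character as the separator; the real
            // code asks the trie for the longest matching separator instead.
            fragments.Add(new FragmentRange(sepStart..(sepStart + 1), IsSeparator: true));
            start = sepStart + 1;
        }

        return fragments;
    }
}
```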
Additional Optimizations

- `SeparatorTrie.FirstChars` - Added a `SearchValues<char>` field for vectorized first-character lookup.
- `MatchLongest()` - Returns `int` (match length) instead of `string?`, avoiding substring allocation.
- `NormalizeNewlines()` - Uses `ReplaceLineEndings("\n")` (single-pass) instead of two `string.Replace()` calls.
- `CountTokens` with Span - Replaced `tokenizer.EncodeToIds(text).Count` with `tokenizer.CountTokens(text.AsSpan())` for zero-allocation token counting during recursion.
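Two of these changes are small enough to sketch directly. The snippet below is illustrative only: the separator list is an assumption, and the real `MatchLongest()` walks the `SeparatorTrie` rather than a flat array.

```csharp
using System;

public static class ChunkerMicroOptimizationSketch
{
    // Before: two passes over the text and up to two intermediate strings.
    public static string NormalizeNewlinesOld(string text) =>
        text.Replace("\r\n", "\n").Replace("\r", "\n");

    // After: ReplaceLineEndings normalizes every line-ending kind in a single pass.
    public static string NormalizeNewlinesNew(string text) =>
        text.ReplaceLineEndings("\n");

    // Illustrative separator set; the real code stores these in a trie.
    private static readonly string[] Separators = { "\n\n", "\n", ". ", " " };

    // MatchLongest-style contract: return the length of the longest separator
    // starting at `position` (0 when none matches) instead of allocating the
    // matched substring.
    public static int MatchLongest(ReadOnlySpan<char> text, int position)
    {
        ReadOnlySpan<char> tail = text[position..];
        int best = 0;
        foreach (string separator in Separators)
        {
            if (separator.Length > best &&
                tail.StartsWith(separator.AsSpan(), StringComparison.Ordinal))
            {
                best = separator.Length;
            }
        }
        return best;
    }
}
```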
Benchmark Results

Small Document (~1KB)
80% memory reduction for small documents.
Medium Document (~100KB)
90%+ memory reduction for medium documents. Gen0 collections reduced by 10-20x.
Large Document (~1MB)
90-95% memory reduction for large documents. Gen0 collections reduced from 8000-25000 to 0-1000.
Improvement Summary
Key Insight
dotMemory profiling revealed that `StringBuilder.ToString()` calls in the recursive splitting code were allocating 1.51 GB of intermediate strings for a 1MB document. The range-based architecture eliminates this entirely by tracking indices throughout the recursion and allocating strings only at the final step.

Original: (dotMemory allocation snapshot)

Optimized: (dotMemory allocation snapshot)
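To show the contrast the profile is describing, here is a minimal, simplified sketch of the two accumulation strategies. The method names are illustrative, the size limit is expressed in characters rather than tokens, and the fragments are assumed to be sorted, contiguous ranges over the original text; the actual `GenerateChunksRanges()` differs in those details.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ChunkAssemblySketch
{
    // Before (illustrative): chunks are built through a StringBuilder, so every
    // fragment becomes a substring and every ToString() call materializes an
    // intermediate string.
    public static List<string> BuildChunksWithStringBuilder(
        string text, List<Range> fragments, int maxLength)
    {
        var chunks = new List<string>();
        var builder = new StringBuilder();
        foreach (Range fragment in fragments)
        {
            string piece = text[fragment];                 // substring allocation
            if (builder.Length + piece.Length > maxLength && builder.Length > 0)
            {
                chunks.Add(builder.ToString());            // intermediate string allocation
                builder.Clear();
            }
            builder.Append(piece);
        }
        if (builder.Length > 0) chunks.Add(builder.ToString());
        return chunks;
    }

    // After (illustrative): only start/end indices are accumulated, and the
    // original text is sliced exactly once per final chunk.
    public static List<string> BuildChunksWithRanges(
        string text, List<Range> fragments, int maxLength)
    {
        var chunks = new List<string>();
        int chunkStart = -1, chunkEnd = -1;
        foreach (Range fragment in fragments)
        {
            (int start, int length) = fragment.GetOffsetAndLength(text.Length);
            if (chunkStart < 0) { chunkStart = start; chunkEnd = start; }

            if (chunkEnd - chunkStart + length > maxLength && chunkEnd > chunkStart)
            {
                chunks.Add(text[chunkStart..chunkEnd]);    // only allocation: the final chunk
                chunkStart = start;
            }
            chunkEnd = start + length;
        }
        if (chunkStart >= 0 && chunkEnd > chunkStart) chunks.Add(text[chunkStart..chunkEnd]);
        return chunks;
    }
}
```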

Checklist
Additional Notes
Execution time is dominated by the tokenizer (`CountTokens`), so further time improvements would require tokenizer-level optimizations. On top of the added test cases, I created regression tests to ensure the optimizations do not change the output. After each small optimization, these tests were run to confirm the output remains the same.
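For reference, a regression test of this kind can be sketched as a golden-output comparison. The file names, helper methods, and the way the chunker is invoked below are hypothetical placeholders, not the actual tests added in this PR.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Xunit;

public class ChunkerRegressionSketch
{
    // Placeholder: load the chunk list captured from the pre-optimization
    // chunker, e.g. from a checked-in baseline file.
    private static IReadOnlyList<string> LoadBaselineChunks(string documentName) =>
        throw new NotImplementedException("load golden output captured before the rewrite");

    // Placeholder: invoke the optimized chunker under test, e.g. MarkdownTextChunker.Chunk(...).
    private static IReadOnlyList<string> ChunkWithOptimizedChunker(string documentText) =>
        throw new NotImplementedException("call the chunker under test here");

    [Theory]
    [InlineData("small.md")]
    [InlineData("medium.md")]
    [InlineData("large.md")]
    public void Optimized_chunker_matches_baseline(string documentName)
    {
        string text = File.ReadAllText(Path.Combine("TestDocuments", documentName));
        IReadOnlyList<string> expected = LoadBaselineChunks(documentName);
        IReadOnlyList<string> actual = ChunkWithOptimizedChunker(text);
        Assert.Equal(expected, actual); // output must match chunk for chunk
    }
}
```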
Since this is pretty much a rewrite of the `MarkdownTextChunker` component, I would suggest a thorough review and testing by maintainers before it gets merged.