Reduce allocations for MarkdownTextChunker by 80-95 % #9

gmanvel · 2025-12-10T13:08:41Z

Description

This PR dramatically reduces memory allocations in MarkdownTextChunker.Chunk() - a hot-path component in the GraphRAG indexing pipeline. The method is called for every document during the CreateBaseTextUnitsWorkflow (step 2 of 7), making allocation reduction critical for large-scale indexing.

Proposed Changes

Architecture Overhaul: Range-Based Splitting

The core change replaces string-based recursive splitting with a range-based architecture:

FragmentRange struct - Replaces Fragment record class with readonly record struct FragmentRange(Range Range, bool IsSeparator). Zero-allocation fragment representation using indices into the original text.
RecursiveSplitRanges() - New method that returns List<Range> instead of List<string>. Tracks positions via Range values throughout the recursion tree, deferring string allocation until the final step.
GenerateChunksRanges() - Partner method that accumulates ranges using integer indices instead of StringBuilder. Eliminates the massive intermediate string allocations from StringBuilder.ToString() calls.
SplitToFragments() - Returns List<FragmentRange> and uses SearchValues<char> for SIMD-vectorized separator detection via IndexOfAny().
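
The range-based idea in the list above can be sketched as follows. This is a minimal illustration, not the PR's code: the separator set (`"\n.!?"`), the class name `RangeSplitSketch`, and the restriction to single-character separators are assumptions made for brevity. Only the `FragmentRange` shape and the use of `SearchValues<char>` with `IndexOfAny()` come from the description.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;

// Shape taken from the PR description: a value type holding indices, not text.
public readonly record struct FragmentRange(Range Range, bool IsSeparator);

// Hypothetical sketch class; handles single-character separators only.
public static class RangeSplitSketch
{
    // Assumed separator set. SearchValues<char> lets IndexOfAny pick a
    // vectorized search strategy where the hardware supports it.
    private static readonly SearchValues<char> Separators = SearchValues.Create("\n.!?");

    public static List<FragmentRange> SplitToFragments(string text)
    {
        var fragments = new List<FragmentRange>();
        ReadOnlySpan<char> span = text;
        int start = 0;

        while (start < span.Length)
        {
            int hit = span[start..].IndexOfAny(Separators);
            if (hit < 0)
            {
                // No more separators: the rest of the text is one fragment.
                fragments.Add(new FragmentRange(start..span.Length, false));
                break;
            }

            int sepIndex = start + hit;
            if (sepIndex > start)
            {
                // Text between the previous position and the separator.
                fragments.Add(new FragmentRange(start..sepIndex, false));
            }

            // The separator itself, recorded as its own fragment.
            fragments.Add(new FragmentRange(sepIndex..(sepIndex + 1), true));
            start = sepIndex + 1;
        }

        return fragments;
    }
}
```

No substring is created while splitting. A caller that needs to inspect a fragment can slice without allocating via `text.AsSpan()[fragment.Range]`, and only `text[fragment.Range]` produces a new string.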

Additional Optimizations

SeparatorTrie.FirstChars - Added SearchValues<char> field for vectorized first-character lookup.
MatchLongest() - Returns int (match length) instead of string?, avoiding substring allocation.
NormalizeNewlines() - Uses ReplaceLineEndings("\n") (single-pass) instead of two string.Replace() calls.
CountTokens with Span - Replaced tokenizer.EncodeToIds(text).Count with tokenizer.CountTokens(text.AsSpan()) for zero-allocation token counting during recursion.
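
Two of the items above can be illustrated in isolation. The sketch below is hedged: the separator list, the class name, and the exact form of the original two `Replace()` calls (`"\r\n"` then `"\r"`) are assumptions, and the linear scan in `MatchLongest` stands in for the trie lookup the PR actually uses.

```csharp
using System;

// Hypothetical sketch class, not the PR's code.
public static class SmallOptimizationsSketch
{
    // Assumed separator list for illustration only.
    private static readonly string[] Separators = { "\n\n", "\n", ". " };

    // Assumed shape of the original normalization: two passes, and an
    // intermediate string whenever the first pass changes anything.
    public static string NormalizeTwoPass(string text) =>
        text.Replace("\r\n", "\n").Replace("\r", "\n");

    // Single pass over the input.
    public static string NormalizeSinglePass(string text) =>
        text.ReplaceLineEndings("\n");

    // Returns the length of the longest separator starting at `index`,
    // or 0 when none matches. No substring is allocated for the match.
    public static int MatchLongest(ReadOnlySpan<char> text, int index)
    {
        int best = 0;
        ReadOnlySpan<char> rest = text[index..];
        foreach (string separator in Separators)
        {
            if (separator.Length > best && rest.StartsWith(separator, StringComparison.Ordinal))
            {
                best = separator.Length;
            }
        }
        return best;
    }
}
```

One behavioral difference is worth keeping in mind when comparing outputs: `ReplaceLineEndings` also recognizes NEL (U+0085), LS (U+2028), PS (U+2029), and form feed (U+000C) as newline sequences, so the two normalizers agree on CR, LF, and CRLF input but can differ on text containing those other characters.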

Benchmark Results

BenchmarkDotNet v0.15.8, macOS Sequoia 15.6.1 (24G90) [Darwin 24.6.0]
Apple M4 Pro, 1 CPU, 14 logical and 14 physical cores
.NET SDK 10.0.100
  [Host]     : .NET 10.0.0 (10.0.0, 10.0.25.52411), Arm64 RyuJIT armv8.0-a
  DefaultJob : .NET 10.0.0 (10.0.0, 10.0.25.52411), Arm64 RyuJIT armv8.0-a

Small Document (~1KB)

Method	ChunkSize	Overlap	Mean	Gen0	Allocated	Alloc Ratio
Original	512	0	24.30 μs	0.3052	2,680 B	1.00
Optimized	512	0	24.18 μs	0.0610	544 B	0.20
Original	1024	128	24.66 μs	0.3052	2,680 B	1.00
Optimized	1024	128	24.61 μs	0.0610	544 B	0.20

80% memory reduction for small documents.

Medium Document (~100KB)

Method	ChunkSize	Overlap	Mean	Gen0	Gen1	Allocated	Alloc Ratio
Original	512	0	23,557 μs	781	375	6.77 MB	1.00
Optimized	512	0	22,687 μs	62	-	687 KB	0.10
Original	1024	0	41,617 μs	1333	167	11.75 MB	1.00
Optimized	1024	0	39,724 μs	77	-	800 KB	0.07
Original	2048	128	71,659 μs	2429	429	20.58 MB	1.00
Optimized	2048	128	69,346 μs	125	-	1.60 MB	0.08

90%+ memory reduction for medium documents. Gen0 collections reduced by 10-20x.

Large Document (~1MB)

Method	ChunkSize	Overlap	Mean	Gen0	Gen1	Allocated	Alloc Ratio
Original	512	0	243,527 μs	8000	1000	67.67 MB	1.00
Optimized	512	0	229,046 μs	667	-	6.68 MB	0.10
Original	1024	0	422,356 μs	13000	1000	117.24 MB	1.00
Optimized	1024	0	403,563 μs	0	-	7.83 MB	0.07
Original	2048	0	755,569 μs	25000	1000	216.36 MB	1.00
Optimized	2048	0	728,155 μs	1000	-	10.02 MB	0.05
Original	2048	128	729,194 μs	24000	1000	207.99 MB	1.00
Optimized	2048	128	693,075 μs	1000	-	16.18 MB	0.08

90-95% memory reduction for large documents. Gen0 collections (reported by BenchmarkDotNet per 1,000 operations) reduced from 8000-25000 to 0-1000.

Improvement Summary

Document Size	Configuration	Time Improvement	Memory Reduction	Gen0 Reduction
Small (1KB)	512/0	0.5%	80%	5x
Medium (100KB)	512/0	3.7%	90%	12x
Medium (100KB)	1024/0	4.5%	93%	17x
Large (1MB)	512/0	5.9%	90%	12x
Large (1MB)	1024/0	4.5%	93%	∞ (zero GC)
Large (1MB)	2048/0	3.6%	95%	25x
Large (1MB)	2048/128	5.0%	92%	24x
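
For readers who want to reproduce the summary figures from the raw benchmark columns, the arithmetic can be sketched as below. The helper names are hypothetical; the inputs in the usage note are taken from the benchmark tables above.

```csharp
using System;

// Hypothetical helpers showing how the summary columns follow from the raw ones.
public static class BenchmarkMath
{
    // Percentage by which `optimized` is smaller than `original`.
    public static double PercentReduction(double original, double optimized) =>
        (original - optimized) / original * 100.0;

    // How many times smaller `optimized` is than `original`.
    // Floating-point division by zero yields positive infinity.
    public static double Factor(double original, double optimized) =>
        original / optimized;
}
```

For the Large 512/0 row, `PercentReduction(243527, 229046)` is about 5.9 (time), `PercentReduction(67.67, 6.68)` is about 90 (memory), and `Factor(8000, 667)` is about 12 (Gen0). For the Large 1024/0 row, `Factor(13000, 0)` evaluates to positive infinity, which is where the ∞ entry comes from.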

Key Insight

dotMemory profiling revealed that StringBuilder.ToString() calls in the recursive splitting code were allocating 1.51 GB of intermediate strings for a 1MB document. The range-based architecture eliminates this entirely by tracking indices throughout recursion and allocating strings only at the final step.
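
A minimal sketch of the "track indices, allocate at the end" approach is shown below. It is an illustration under stated assumptions rather than the PR's implementation: the class and method names are hypothetical, character count stands in for the token count the real chunker uses, and the input pieces are assumed to be contiguous and in order.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch class, not the PR's code.
public static class RangeAccumulationSketch
{
    // Merges contiguous, ordered pieces into chunks of at most maxChars
    // characters using only integer bookkeeping. A single piece longer
    // than maxChars becomes its own chunk; this sketch does not split it.
    public static List<Range> MergeRanges(IReadOnlyList<Range> pieces, int textLength, int maxChars)
    {
        var chunks = new List<Range>();
        int chunkStart = -1;
        int chunkEnd = -1;

        foreach (Range piece in pieces)
        {
            (int offset, int length) = piece.GetOffsetAndLength(textLength);

            // Close the current chunk if adding this piece would exceed the limit.
            if (chunkStart >= 0 && offset + length - chunkStart > maxChars)
            {
                chunks.Add(chunkStart..chunkEnd);
                chunkStart = -1;
            }

            if (chunkStart < 0)
            {
                chunkStart = offset;
            }
            chunkEnd = offset + length;
        }

        if (chunkStart >= 0)
        {
            chunks.Add(chunkStart..chunkEnd);
        }
        return chunks;
    }

    // The only place strings are allocated: one per final chunk.
    public static List<string> Materialize(string text, List<Range> chunks)
    {
        var result = new List<string>(chunks.Count);
        foreach (Range chunk in chunks)
        {
            result.Add(text[chunk]);
        }
        return result;
    }
}
```

With a `StringBuilder`, every intermediate chunk candidate costs a `ToString()` allocation at each recursion level. Here the intermediate state is two integers, and the string count equals the number of chunks returned.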

[Two profiler screenshots, labeled "Original" and "Optimized", were attached here; the images are not preserved.]
Checklist

I have tested these changes locally.
I have reviewed the code changes.
I have updated the documentation (if necessary).
I have added appropriate unit tests (if applicable).

Additional Notes

36 comprehensive tests were added covering all internal methods before optimization to ensure correctness.
The optimized code path is now the default - the original implementation remains in place but is unused.
CPU time is dominated by the tokenizer (~29% in CountTokens), so further time improvements would require tokenizer-level optimizations.
This optimization specifically targets memory pressure, reducing GC overhead which benefits overall application responsiveness and throughput.

On top of the added test cases, I created regression tests to ensure the optimizations do not change the output. These tests were run after each small optimization to confirm the output remained the same.

Since this is pretty much a rewrite of the MarkdownTextChunker component, I would suggest a thorough review and test by the maintainers before it gets merged.

codecov · 2025-12-10T13:14:41Z

Codecov Report

❌ Patch coverage is 75.73964% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.32%. Comparing base (22e277d) to head (b6873ee).
⚠️ Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
...nagedCode.GraphRag/Chunking/MarkdownTextChunker.cs	75.73%	35 Missing and 6 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main       #9      +/-   ##
==========================================
- Coverage   75.46%   74.32%   -1.14%     
==========================================
  Files         115      115              
  Lines        4757     4885     +128     
  Branches      798      827      +29     
==========================================
+ Hits         3590     3631      +41     
- Misses        854      947      +93     
+ Partials      313      307       -6


KSemenenko · 2025-12-11T11:56:13Z

one more amazing PR @gmanvel, you are unstoppable! this is so cool!

btw, what do you think about creating a new library just with chunkers? we have ours, with many things like this. what do you think?

gmanvel · 2025-12-11T13:12:09Z

@KSemenenko might be a good idea, but first I'd look into existing libs MS offers.

KSemenenko · 2025-12-11T13:26:57Z

@gmanvel if you know any can you share it with me?

gmanvel · 2025-12-11T13:44:17Z

"@gmanvel if you know any can you share it with me?"

https://devblogs.microsoft.com/dotnet/introducing-data-ingestion-building-blocks-preview/#split-your-data-into-chunks

gmanvel added 2 commits December 10, 2025 13:14

Optimize the MarkdownTextChunk allocation footprint

c6d82c8

Cleanup leftover method

b6873ee

KSemenenko merged commit dc9c98e into managedcode:main Dec 11, 2025
4 of 5 checks passed
