@gmanvel gmanvel commented Dec 10, 2025

Description

This PR dramatically reduces memory allocations in MarkdownTextChunker.Chunk() - a hot-path component in the GraphRAG indexing pipeline. The method is called for every document during the CreateBaseTextUnitsWorkflow (step 2 of 7), making allocation reduction critical for large-scale indexing.

Proposed Changes

Architecture Overhaul: Range-Based Splitting

The core change replaces string-based recursive splitting with a range-based architecture:

  1. FragmentRange struct - Replaces Fragment record class with readonly record struct FragmentRange(Range Range, bool IsSeparator). Zero-allocation fragment representation using indices into the original text.

  2. RecursiveSplitRanges() - New method that returns List<Range> instead of List<string>. Tracks positions via Range values throughout the recursion tree, deferring string allocation until the final step.

  3. GenerateChunksRanges() - Partner method that accumulates ranges using integer indices instead of StringBuilder. Eliminates the massive intermediate string allocations from StringBuilder.ToString() calls.

  4. SplitToFragments() - Returns List<FragmentRange> and uses SearchValues<char> for SIMD-vectorized separator detection via IndexOfAny().
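To make the range-based idea concrete, here is a minimal sketch of index-only fragment splitting. It is not the PR's code: the separator set, the class name `RangeSplitSketch`, and the single-character separator handling are assumptions for illustration, and the `FragmentRange` declaration simply mirrors the signature quoted above.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;

// Mirrors the signature quoted in the PR description; the real type may carry more.
public readonly record struct FragmentRange(Range Range, bool IsSeparator);

public static class RangeSplitSketch
{
    // Assumed separator set, for illustration only.
    private static readonly SearchValues<char> Separators = SearchValues.Create(".!?\n");

    // Splits text into content fragments and single-character separator fragments.
    // Only index ranges are recorded; no substrings are allocated here.
    public static List<FragmentRange> SplitToFragments(ReadOnlySpan<char> text)
    {
        var fragments = new List<FragmentRange>();
        int start = 0;
        while (start < text.Length)
        {
            // IndexOfAny with SearchValues<char> can use vectorized search.
            int hit = text[start..].IndexOfAny(Separators);
            if (hit < 0)
            {
                // No more separators: the rest is one content fragment.
                fragments.Add(new FragmentRange(start..text.Length, false));
                break;
            }

            int sep = start + hit;
            if (sep > start)
                fragments.Add(new FragmentRange(start..sep, false));
            fragments.Add(new FragmentRange(sep..(sep + 1), true));
            start = sep + 1;
        }
        return fragments;
    }
}
```

A caller would materialize a string only when it is actually needed, for example `text[fragment.Range]` on the original `string`.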

Additional Optimizations

  1. SeparatorTrie.FirstChars - Added SearchValues<char> field for vectorized first-character lookup.

  2. MatchLongest() - Returns int (match length) instead of string?, avoiding substring allocation.

  3. NormalizeNewlines() - Uses ReplaceLineEndings("\n") (single-pass) instead of two string.Replace() calls.

  4. CountTokens with Span - Replaced tokenizer.EncodeToIds(text).Count with tokenizer.CountTokens(text.AsSpan()) for zero-allocation token counting during recursion.
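As an illustration of the `MatchLongest()` change, the sketch below returns a match length rather than a substring. The trie layout, the class name `SeparatorTrieSketch`, and the `(text, index)` parameter shape are assumptions; the PR's `SeparatorTrie` may be structured differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal trie over multi-character separators.
public sealed class SeparatorTrieSketch
{
    private sealed class Node
    {
        public Dictionary<char, Node> Children { get; } = new();
        public bool IsTerminal { get; set; }
    }

    private readonly Node _root = new();

    public SeparatorTrieSketch(IEnumerable<string> separators)
    {
        foreach (var separator in separators)
        {
            var node = _root;
            foreach (var c in separator)
            {
                if (!node.Children.TryGetValue(c, out var next))
                {
                    next = new Node();
                    node.Children[c] = next;
                }
                node = next;
            }
            node.IsTerminal = true;
        }
    }

    // Returns the length of the longest separator starting at index, or 0 if none.
    // The caller can slice the original text itself if it ever needs the match.
    public int MatchLongest(ReadOnlySpan<char> text, int index)
    {
        var node = _root;
        int longest = 0;
        for (int i = index; i < text.Length; i++)
        {
            if (!node.Children.TryGetValue(text[i], out var next))
                break;
            node = next;
            if (node.IsTerminal)
                longest = i - index + 1;
        }
        return longest;
    }
}
```

Returning an `int` keeps the hot loop free of allocations; a zero result doubles as the "no match" signal that a `null` string would otherwise carry.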

Benchmark Results

BenchmarkDotNet v0.15.8, macOS Sequoia 15.6.1 (24G90) [Darwin 24.6.0]
Apple M4 Pro, 1 CPU, 14 logical and 14 physical cores
.NET SDK 10.0.100
  [Host]     : .NET 10.0.0 (10.0.0, 10.0.25.52411), Arm64 RyuJIT armv8.0-a
  DefaultJob : .NET 10.0.0 (10.0.0, 10.0.25.52411), Arm64 RyuJIT armv8.0-a

Small Document (~1KB)

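| Method    | ChunkSize | Overlap | Mean     | Gen0   | Allocated | Alloc Ratio |
|-----------|-----------|---------|----------|--------|-----------|-------------|
| Original  | 512       | 0       | 24.30 μs | 0.3052 | 2,680 B   | 1.00        |
| Optimized | 512       | 0       | 24.18 μs | 0.0610 | 544 B     | 0.20        |
| Original  | 1024      | 128     | 24.66 μs | 0.3052 | 2,680 B   | 1.00        |
| Optimized | 1024      | 128     | 24.61 μs | 0.0610 | 544 B     | 0.20        |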

80% memory reduction for small documents.

Medium Document (~100KB)

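| Method    | ChunkSize | Overlap | Mean      | Gen0 | Gen1 | Allocated | Alloc Ratio |
|-----------|-----------|---------|-----------|------|------|-----------|-------------|
| Original  | 512       | 0       | 23,557 μs | 781  | 375  | 6.77 MB   | 1.00        |
| Optimized | 512       | 0       | 22,687 μs | 62   | -    | 687 KB    | 0.10        |
| Original  | 1024      | 0       | 41,617 μs | 1333 | 167  | 11.75 MB  | 1.00        |
| Optimized | 1024      | 0       | 39,724 μs | 77   | -    | 800 KB    | 0.07        |
| Original  | 2048      | 128     | 71,659 μs | 2429 | 429  | 20.58 MB  | 1.00        |
| Optimized | 2048      | 128     | 69,346 μs | 125  | -    | 1.60 MB   | 0.08        |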

90%+ memory reduction for medium documents. Gen0 collections reduced by 10-20x.

Large Document (~1MB)

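| Method    | ChunkSize | Overlap | Mean       | Gen0  | Gen1 | Allocated | Alloc Ratio |
|-----------|-----------|---------|------------|-------|------|-----------|-------------|
| Original  | 512       | 0       | 243,527 μs | 8000  | 1000 | 67.67 MB  | 1.00        |
| Optimized | 512       | 0       | 229,046 μs | 667   | -    | 6.68 MB   | 0.10        |
| Original  | 1024      | 0       | 422,356 μs | 13000 | 1000 | 117.24 MB | 1.00        |
| Optimized | 1024      | 0       | 403,563 μs | 0     | -    | 7.83 MB   | 0.07        |
| Original  | 2048      | 0       | 755,569 μs | 25000 | 1000 | 216.36 MB | 1.00        |
| Optimized | 2048      | 0       | 728,155 μs | 1000  | -    | 10.02 MB  | 0.05        |
| Original  | 2048      | 128     | 729,194 μs | 24000 | 1000 | 207.99 MB | 1.00        |
| Optimized | 2048      | 128     | 693,075 μs | 1000  | -    | 16.18 MB  | 0.08        |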

90-95% memory reduction for large documents. Gen0 collections reduced from 8000-25000 to 0-1000.

Improvement Summary

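| Document Size  | Configuration (ChunkSize/Overlap) | Time Improvement | Memory Reduction | Gen0 Reduction |
|----------------|-----------------------------------|------------------|------------------|----------------|
| Small (1KB)    | 512/0                             | 0.5%             | 80%              | 5x             |
| Medium (100KB) | 512/0                             | 3.7%             | 90%              | 12x            |
| Medium (100KB) | 1024/0                            | 4.5%             | 93%              | 17x            |
| Large (1MB)    | 512/0                             | 5.9%             | 90%              | 12x            |
| Large (1MB)    | 1024/0                            | 4.5%             | 93%              | ∞ (zero GC)    |
| Large (1MB)    | 2048/0                            | 3.6%             | 95%              | 25x            |
| Large (1MB)    | 2048/128                          | 5.0%             | 92%              | 24x            |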

Key Insight

dotMemory profiling revealed that StringBuilder.ToString() calls in the recursive splitting code were allocating 1.51 GB of intermediate strings for a 1MB document. The range-based architecture eliminates this entirely by tracking indices throughout recursion and allocating strings only at the final step.

Original:
Screenshot 2025-12-10 at 13 52 25

Optimized:
Screenshot 2025-12-10 at 13 55 16
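The "track indices, allocate at the end" idea from the Key Insight can be sketched as a greedy merge over ranges. This is a simplified illustration, not the PR's `GenerateChunksRanges()`: the `TokenCounter` delegate, the `MergeToChunks` name, and the assumption that pieces are contiguous and in order are all hypothetical, and chunk overlap is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical token counter signature; a real tokenizer would be plugged in here.
public delegate int TokenCounter(ReadOnlySpan<char> text);

public static class RangeMergeSketch
{
    // Greedily merges adjacent, contiguous ranges into chunks of at most maxTokens.
    // Only start/end indices are tracked; one string is created per final chunk.
    // A single piece larger than maxTokens is emitted as its own chunk.
    public static List<string> MergeToChunks(
        string text, IReadOnlyList<Range> pieces, int maxTokens, TokenCounter countTokens)
    {
        var chunks = new List<string>();
        int chunkStart = -1, chunkEnd = -1;

        foreach (var piece in pieces)
        {
            var (offset, length) = piece.GetOffsetAndLength(text.Length);
            int pieceEnd = offset + length;

            if (chunkStart < 0)
            {
                (chunkStart, chunkEnd) = (offset, pieceEnd);
                continue;
            }

            // Measure the candidate chunk through a span; no substring is allocated.
            if (countTokens(text.AsSpan(chunkStart, pieceEnd - chunkStart)) <= maxTokens)
            {
                chunkEnd = pieceEnd;
            }
            else
            {
                chunks.Add(text[chunkStart..chunkEnd]);
                (chunkStart, chunkEnd) = (offset, pieceEnd);
            }
        }

        if (chunkStart >= 0)
            chunks.Add(text[chunkStart..chunkEnd]);
        return chunks;
    }
}
```

Compared with appending to a `StringBuilder` and calling `ToString()` at each step, the only strings created here are the final chunks themselves.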

Checklist

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have updated the documentation (if necessary).
  • I have added appropriate unit tests (if applicable).

Additional Notes

  • 36 comprehensive tests were added covering all internal methods before optimization to ensure correctness.
  • The optimized code path is now the default - the original implementation remains in place but is unused.
  • CPU time is dominated by the tokenizer (~29% in CountTokens), so further time improvements would require tokenizer-level optimizations.
  • This optimization specifically targets memory pressure, reducing GC overhead which benefits overall application responsiveness and throughput.

In addition to the new test cases, I created regression tests to ensure the optimizations do not change the output. These tests were run after each small optimization to confirm the output stayed the same.

Since this is pretty much a rewrite of the MarkdownTextChunker component, I would suggest a thorough review and testing by maintainers before it gets merged.

@codecov

codecov bot commented Dec 10, 2025

Codecov Report

❌ Patch coverage is 75.73964% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.32%. Comparing base (22e277d) to head (b6873ee).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| ...nagedCode.GraphRag/Chunking/MarkdownTextChunker.cs | 75.73% | 35 Missing and 6 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main       #9      +/-   ##
==========================================
- Coverage   75.46%   74.32%   -1.14%     
==========================================
  Files         115      115              
  Lines        4757     4885     +128     
  Branches      798      827      +29     
==========================================
+ Hits         3590     3631      +41     
- Misses        854      947      +93     
+ Partials      313      307       -6     


@KSemenenko
Member

One more amazing PR, @gmanvel, you are unstoppable! This is so cool!

By the way, what do you think about creating a new library just with chunkers? We have ours, with many things like that. What do you think?

@KSemenenko KSemenenko merged commit dc9c98e into managedcode:main Dec 11, 2025
4 of 5 checks passed
@gmanvel
Author

gmanvel commented Dec 11, 2025

@KSemenenko might be a good idea, but first I'd look into existing libs MS offers.

@KSemenenko
Member

@gmanvel if you know of any, can you share them with me?

@gmanvel
Author

gmanvel commented Dec 11, 2025
