
feat(chunking): code fence pairing, list-aware + XML tag break points#553

Open
galligan wants to merge 6 commits into tobi:main from galligan:feat/chunking-improvements

@galligan
Contributor

This consolidates what was previously four stacked PRs (#538, #539, #540, #541) into a single PR for simpler review. The stacked approach created unnecessary friction for a cross-fork contribution. Full context in the overview gist. Each logical change is its own commit if you prefer to review commit-by-commit.

Summary

Four chunking improvements, each in its own commit:

  1. Fix code fence pairing to follow CommonMark rules (char + length matching, tilde support)
  2. Rename CodeFenceRegion to ProtectedRegion with an optional kind tag (pure mechanical refactor)
  3. List-aware break point scanner with depth-weighted scoring and nested sublist tracking
  4. XML tag break point scanner for agent-prompt tags like <example>, <instructions>, <thinking>

Plus a test isolation fix for a pre-existing flaky test (createStore throws without explicit path in test mode on Bun ubuntu, same root cause as upstream commit 66e70c0).

Splitting

These changes sit on clean seams and can be split into separate PRs if that's preferable. The commits are ordered so any prefix is self-contained:

  • Commits 1-2 (fence fix + rename) are a standalone bugfix with no feature additions
  • Commits 3 (list-aware) and 4 (XML tags) are independent of each other

If you'd rather land these incrementally, I'm happy to split them back out.

Commit 1: fix code fence pairing

findCodeFences had two problems: the regex only matched exactly three backticks (ignoring tildes and longer runs), and pairing was a naive toggle. A fence opened with ```` was never recognized. A stray ``` inside a longer fence prematurely closed it.

Now tracks the opening fence's character and length. A close candidate must use the same char, be at least as long, and carry no info string. BREAK_PATTERNS updated to scan for 3+ backticks or tildes.

Scope: column-0 fences only. Indented fences are not detected (documented).

12 new tests covering 4/5/6-backtick nesting, tilde fences, mixed chars, same-length non-nesting (CommonMark quirk), info-string validation.
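
The pairing rule described above can be sketched roughly as follows. This is a minimal illustration rather than the PR's `findCodeFences`: the name `pairFences`, the `FenceRegion` shape, and the offset convention are assumptions made for the example. It handles column-0 fences only and lets an unclosed fence run to end of input.

```typescript
// Hypothetical region shape; the PR's actual type may carry other fields.
interface FenceRegion {
  start: number; // offset of the first character of the opening fence line
  end: number;   // offset just past the closing fence line, or text length if unclosed
}

// Sketch of CommonMark-style pairing for column-0 fences.
function pairFences(text: string): FenceRegion[] {
  const regions: FenceRegion[] = [];
  let open: { char: string; len: number; start: number } | null = null;
  let offset = 0;

  for (const line of text.split("\n")) {
    // A run of 3+ backticks or 3+ tildes at column 0, then the rest of the line.
    const m = /^(`{3,}|~{3,})(.*)$/.exec(line);
    if (m) {
      const run = m[1] ?? "";
      const rest = m[2] ?? "";
      if (open === null) {
        // Opening fence: remember its character and length.
        open = { char: run.charAt(0), len: run.length, start: offset };
      } else if (
        run.charAt(0) === open.char && // same fence character
        run.length >= open.len &&      // at least as long as the opener
        rest.trim() === ""             // a closer carries no info string
      ) {
        regions.push({ start: open.start, end: offset + line.length });
        open = null;
      }
    }
    offset += line.length + 1; // +1 for the newline consumed by split
  }

  // An unclosed fence extends to end of input.
  if (open !== null) regions.push({ start: open.start, end: text.length });
  return regions;
}
```

With this sketch, a four-backtick fence containing a stray three-backtick line yields a single region, because the shorter run is too short to close the longer opener.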

Commit 2: rename CodeFenceRegion to ProtectedRegion

Pure mechanical rename. CodeFenceRegion becomes ProtectedRegion with optional kind?: string (set to 'fence' by findCodeFences). isInsideCodeFence becomes isInsideProtectedRegion. All parameter names updated. No behavior change. Not re-exported from src/index.ts, so fully contained.
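
For orientation, the renamed type and helper might look roughly like this. Only the names `ProtectedRegion`, `kind`, and `isInsideProtectedRegion` come from this PR; the `start`/`end` fields, the parameter order, and the exclusive-boundary check are assumptions for the sketch.

```typescript
// Assumed field names; the PR only states that an optional `kind` tag is added.
interface ProtectedRegion {
  start: number;
  end: number;
  kind?: string; // 'fence' when produced by findCodeFences, per the PR description
}

// Assumed signature: true when pos falls strictly inside any protected region.
function isInsideProtectedRegion(pos: number, regions: ProtectedRegion[]): boolean {
  return regions.some((r) => pos > r.start && pos < r.end);
}
```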

Commit 3: list-aware break point scanner

Replaces the two naive BREAK_PATTERNS entries (list: 5, numlist: 5) with findListBreakPoints, a stack-based scanner that tracks nested list frames and emits depth-weighted break points:

| Break point | Score |
| --- | --- |
| list-end (list to non-list transition) | 75 |
| Top-level item (depth 0) | 70 |
| Second-level item (depth 1) | 45 |
| Third-level and deeper (depth 2+) | 25 |

Handles: - and * unordered markers, 1. and 1) ordered markers, mixed marker characters at same indent (treated as one list), nested sublists, blank lines inside items, list-end detection.

Deliberately deferred: loose/tight distinction, lazy continuation, 4-space indented code blocks, tab indentation. Each documented in a block comment.

16 new tests including an end-to-end integration test through chunkDocument confirming long lists split at item boundaries.
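
A much-reduced sketch of the depth-weighted idea, using the scores from the table above. It is not the PR's `findListBreakPoints`: the function name, the `BreakPoint` shape, and the choice to report the offset of the line start are assumptions, and list-end at EOF is omitted for brevity.

```typescript
// Assumed shape for illustration only.
interface BreakPoint {
  pos: number;
  score: number;
  type: string;
}

// Space-indented `-`, `*`, `1.` or `1)` marker followed by a space.
const LIST_ITEM = /^( *)(?:[-*]|\d+[.)]) /;

function scoreForDepth(depth: number): number {
  if (depth === 0) return 70;
  if (depth === 1) return 45;
  return 25;
}

function sketchListBreakPoints(text: string): BreakPoint[] {
  const points: BreakPoint[] = [];
  const indents: number[] = []; // one frame per open nesting level
  let offset = 0;

  for (const line of text.split("\n")) {
    const m = LIST_ITEM.exec(line);
    if (m) {
      const indent = (m[1] ?? "").length;
      // Close any sublists that are deeper than this item.
      while (indents.length > 0 && (indents[indents.length - 1] ?? 0) > indent) {
        indents.pop();
      }
      // Open a new frame when this item is deeper than the current level.
      const top = indents[indents.length - 1];
      if (top === undefined || top < indent) indents.push(indent);
      points.push({ pos: offset, score: scoreForDepth(indents.length - 1), type: "list-item" });
    } else if (indents.length > 0 && line.length > 0 && !line.startsWith(" ")) {
      // A column-0 non-list line ends the list; blank lines do not.
      indents.length = 0;
      points.push({ pos: offset, score: 75, type: "list-end" });
    }
    offset += line.length + 1;
  }
  return points;
}
```

For the input `"- a\n  - b\n    - c\n- d\n\npara"` this sketch emits scores 70, 45, 25, 70 for the four items and 75 at the start of `para`.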

Commit 4: XML tag break point scanner

findXmlTagBreakPoints detects line-anchored paired XML tags and emits asymmetric break points:

| Break point | Score |
| --- | --- |
| tag-open | 30 (weak: splitting right before content orphans the opener) |
| tag-close | 75 (strong: splitting after a closed block is great) |

Tag name grammar: [A-Za-z_][A-Za-z0-9_.:-]* (XML Name production, custom elements, namespaced tags).

HTML5 element names are blocked via a case-insensitive blocklist in src/html-elements.ts so inline HTML (<div>, <p>, <br>) isn't confused for structural tags.

Key rules:

  • Line-anchored only (tag must be on its own line, leading whitespace OK)
  • Case-sensitive open/close matching (XML semantics)
  • Stack-based nesting (same-name and different-name)
  • Self-closing <tag/> produces nothing
  • <!-- -->, <!DOCTYPE>, <![CDATA[]]>, <?xml ?> skipped
  • Cross-tag interleaving is malformed (zero break points)
  • Unclosed tags emit nothing
  • Tags inside code fences are ignored (fence scan runs first)

27 new tests including fence precedence, HTML blocklist, case sensitivity, malformed input handling, and an integration test through chunkDocument.
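
The key rules above can be illustrated with a simplified scanner. This is a sketch under stated assumptions, not the PR's `findXmlTagBreakPoints`: the function name, the `BreakPoint` shape, the position convention, the four-entry stand-in for the HTML blocklist, and the "drop every open frame on a mismatched close" handling are all choices made for the example. Fence regions are not consulted here.

```typescript
// Assumed shape for illustration only.
interface BreakPoint {
  pos: number;
  score: number;
  type: string;
}

// Tiny stand-in for the blocklist in src/html-elements.ts (contents assumed).
const HTML_NAMES = new Set(["div", "p", "br", "span"]);

// Line-anchored opener and closer using the tag name grammar from the PR.
const OPEN_TAG = /^\s*<([A-Za-z_][A-Za-z0-9_.:-]*)(?:\s[^>]*)?>\s*$/;
const CLOSE_TAG = /^\s*<\/([A-Za-z_][A-Za-z0-9_.:-]*)\s*>\s*$/;

function sketchXmlTagBreakPoints(text: string): BreakPoint[] {
  const points: BreakPoint[] = [];
  let stack: { name: string; pos: number }[] = [];
  let next = 0;

  for (const line of text.split("\n")) {
    const offset = next;
    next += line.length + 1;

    const open = OPEN_TAG.exec(line);
    const close = open ? null : CLOSE_TAG.exec(line);
    const name = (open ?? close)?.[1] ?? "";
    // Skip non-tag lines and HTML element names (case-insensitive).
    if (name === "" || HTML_NAMES.has(name.toLowerCase())) continue;

    if (open) {
      // Self-closing tags such as <tag /> produce nothing.
      if (!/\/>\s*$/.test(line)) stack.push({ name, pos: offset });
      continue;
    }

    const top = stack[stack.length - 1];
    if (top === undefined) continue; // stray closer
    if (top.name === name) {
      // Points are emitted only once a pair is complete, so unclosed tags emit nothing.
      stack.pop();
      points.push({ pos: top.pos, score: 30, type: "tag-open" });
      points.push({ pos: offset + line.length, score: 75, type: "tag-close" });
    } else {
      stack = []; // interleaved or case-mismatched closer: discard all open frames
    }
  }
  return points.sort((a, b) => a.pos - b.pos);
}
```

Emitting both points only when the closer arrives is what keeps unclosed and interleaved tags silent in this sketch; the PR may reach the same result by a different route.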

Regression analysis

The only pattern that used to score and no longer does is -\t (dash followed by literal tab as marker separator). The old regex \n[-*]\s matched it at score 5; the new scanner requires space-separated markers. Tab-indented list items were never detected by the old regex either (\n\t- foo was invisible). Not a regression in practice.
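
The behavior of the old pattern can be checked directly. In the snippet below, `OLD_LIST` is the old regex as quoted in this PR; `SPACE_SEPARATED` is only a stand-in for "marker followed by a space" and is not the new scanner's actual code.

```typescript
// Old BREAK_PATTERNS entry, as quoted in this PR.
const OLD_LIST = /\n[-*]\s/g;
// Hypothetical stand-in for the new space-separated requirement.
const SPACE_SEPARATED = /\n *[-*] /g;

// Count matches of a global regex in a string.
function hits(re: RegExp, s: string): number {
  return (s.match(re) ?? []).length;
}

const dashTab = "intro\n-\titem";    // dash followed by a tab
const tabIndented = "intro\n\t- foo"; // tab-indented item

console.log(hits(OLD_LIST, dashTab), hits(SPACE_SEPARATED, dashTab));         // 1 0
console.log(hits(OLD_LIST, tabIndented), hits(SPACE_SEPARATED, tabIndented)); // 0 0
```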

With that one exception, everything the old code detected, the new code detects and scores higher. Previously undetected patterns (nested sublists, the 1) form, list-end transitions, XML tags, tilde fences, 4+ backtick fences) are now handled.

Files changed

  • Modified: src/store.ts (fence fix, rename, list scanner, XML scanner, integration)
  • Modified: test/store.test.ts (55 new tests + test isolation fix)
  • Modified: CHANGELOG.md (changelog entries under [Unreleased])
  • New: src/html-elements.ts (HTML5 element blocklist)

Test plan

  • npx vitest run test/store.test.ts passes (246/246, was 203 + 43 new)
  • npx vitest run test/ast-chunking.test.ts passes (12/12)
  • npx tsc -p tsconfig.build.json --noEmit clean
  • CI green

Code fence detection only matched exactly ``` and toggled open/close
on every match, so fences opened with 4+ backticks were never
recognized, tilde fences were ignored, and a stray ``` inside a
longer fence could prematurely close it. Chunks could then split
inside code blocks.

findCodeFences now follows CommonMark pairing: the closing fence
must use the same character as the opener, be at least as long,
and carry no info string. Tilde fences are recognized. Only
column-0 fences are detected; indented fences are not.

Pure rename, no behavior change. CodeFenceRegion becomes ProtectedRegion
with an optional `kind` tag (set to 'fence' by findCodeFences). This
opens the seam for future passes to contribute other kinds of protected
regions without changing the chunker's core contract.

Renames:
- interface CodeFenceRegion -> ProtectedRegion (adds optional kind)
- isInsideCodeFence -> isInsideProtectedRegion
- findBestCutoff param: codeFences -> protectedRegions
- chunkDocumentWithBreakPoints param: codeFences -> protectedRegions

findCodeFences keeps its name as one producer of protected regions.
No external callers — the symbols are not re-exported from src/index.ts,
so the rename is contained.

Mirrors the fix applied in 66e70c0 ("fix(test): reset _productionMode
in getDefaultDbPath test"). The createStore-throws test in store.test.ts
has the same isolation issue as the parallel test in
store.helpers.unit.test.ts: bun runs all test files in a single process
so _productionMode state leaks between files. If a previous test file
sets production mode, this test fails because getDefaultDbPath returns
a real path instead of throwing.

Adds the same _resetProductionModeForTesting() call right before the
expectation. Test passes deterministically regardless of file ordering.

Surfaced when stacked feature branches above this PR shifted bun's
test file ordering enough to trigger the latent failure.
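
The leak described in this message can be reproduced in miniature. Everything below is a simplified stand-in: the flag, the function bodies, and the returned path are invented for illustration and only loosely mirror the names mentioned above.

```typescript
// Hypothetical module-level state, shared by every test file in the same process.
let productionMode = false;

function enableProductionMode(): void {
  productionMode = true;
}

function resetProductionModeForTesting(): void {
  productionMode = false;
}

// Throws unless production mode is on (made-up path, for illustration only).
function getDefaultDbPath(): string {
  if (!productionMode) throw new Error("explicit path required in test mode");
  return "/made-up/path/index.sqlite";
}

function throwsWithoutPath(): boolean {
  try {
    getDefaultDbPath();
    return false;
  } catch {
    return true;
  }
}

enableProductionMode();                    // an earlier test file flips the flag
const leaked = throwsWithoutPath();        // false: the expectation would fail
resetProductionModeForTesting();           // the reset added right before the expectation
const afterReset = throwsWithoutPath();    // true: deterministic again
console.log(leaked, afterReset);
```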

Replaces the two naive list patterns in BREAK_PATTERNS with a
stack-based scanner that tracks nested list frames and emits
depth-weighted break points plus a list-end transition break point.

Old behavior:
  [/\n[-*]\s/g, 5, 'list']
  [/\n\d+\.\s/g, 5, 'numlist']

Both scored every list-item start at 5, so the break point almost
always lost to nearby heading/blank/codeblock scores and chunks
landed mid-item on long lists. Nested sublists and the ordered `1)`
form were not detected at all.

New scanner (findListBreakPoints):
  - depth 0 item (top-level): score 70
  - depth 1 item (first sublist): score 45
  - depth 2+ item (deeper): score 25
  - list-end (list -> non-list transition): score 75

Scope:
  - Unordered markers: `-`, `*` (matches previous behavior; `+` not
    supported — agents and modern docs don't use it)
  - Ordered markers: `1.` and `1)` (new: `1)` was never detected)
  - Mixed marker characters at the same indent are treated as one
    list (simpler than CommonMark's split rule, better for chunking)
  - Nested sublists with proper depth tracking (new)
  - Blank lines inside items don't terminate the list
  - Column-0 non-list lines terminate the list and emit list-end

Deliberately deferred:
  - Loose vs tight list distinction (rendering concern, no chunking
    impact)
  - Lazy continuation (column-0 line that CommonMark folds back into
    the preceding item)
  - 4-space indented code blocks inside items (ambiguous with
    continuation; defer)
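  - Tab-as-marker-separator (`-\t`): the old regex matched this form
    at score 5 and the new scanner does not. Tab-indented items
    (`\n\t- foo`) were never matched by either, so this is not a
    regression in practice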

Integration: chunkDocument and chunkDocumentAsync now merge
findListBreakPoints output with scanBreakPoints before passing to
chunkDocumentWithBreakPoints. mergeBreakPoints already handles
"higher score wins at same position." AST points continue to layer
on top in the async path.
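
The "higher score wins at same position" rule can be sketched as below. This is not the existing `mergeBreakPoints` implementation: `mergeByPosition`, the `BreakPoint` shape, and the sample score of 20 for a blank-line break are assumptions for the example.

```typescript
// Assumed shape for illustration only.
interface BreakPoint {
  pos: number;
  score: number;
  type: string;
}

// Keep the highest-scoring break point at each position, sorted by position.
function mergeByPosition(...lists: BreakPoint[][]): BreakPoint[] {
  const best = new Map<number, BreakPoint>();
  for (const list of lists) {
    for (const bp of list) {
      const seen = best.get(bp.pos);
      if (seen === undefined || bp.score > seen.score) best.set(bp.pos, bp);
    }
  }
  return Array.from(best.values()).sort((a, b) => a.pos - b.pos);
}

// Hypothetical inputs: one regex-derived point and two list-scanner points.
const regexPoints: BreakPoint[] = [{ pos: 10, score: 20, type: "blank" }];
const listPoints: BreakPoint[] = [
  { pos: 10, score: 70, type: "list-item" },
  { pos: 30, score: 75, type: "list-end" },
];
const merged = mergeByPosition(regexPoints, listPoints);
// At pos 10 the list-item point (70) replaces the blank point (20).
```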

16 new tests in test/store.test.ts covering empty input, prose,
unordered/ordered/mixed lists, three-deep nesting, mixed marker
nesting, list-end at prose and EOF, blank-line continuation, `+`
rejection, position convention, and an end-to-end integration test
through chunkDocument confirming long lists split at item boundaries.

Adds findXmlTagBreakPoints for recognizing line-anchored paired XML
tags as split points in the chunker. Agent instruction files and
prompt docs frequently wrap structural blocks in tags like
<example>, <instructions>, <thinking>, <tool_use>, <system>, and
the chunker should prefer to split at the close of those blocks
rather than mid-block.

Scoring (asymmetric, same rationale as the "prefer splits at the
end of structured blocks" principle used elsewhere):
  - tag-open:  30  (weak — splitting right before content is bad)
  - tag-close: 75  (strong — same as list-end, splitting after a
                    closed block is great)

Scope:
  - Line-anchored only. Opening and closing tags must occupy their
    own line (leading whitespace allowed). Mid-line tags like
    `Here's an <example>foo</example>` are ignored. Multi-line tag
    openers like `<tag\n  attr="v">` are also not recognized.
  - Tag name grammar: `[A-Za-z_][A-Za-z0-9_.:-]*`. Covers XML Name
    production, custom elements (`my-widget`), and namespaced tags
    (`xsl:template`).
  - HTML5 element names are blocked via a case-insensitive blocklist
    in src/html-elements.ts. This prevents inline HTML in markdown
    (<div>, <p>, <br>, etc.) from being picked up as structural
    tags. Agent-prompt tags (<example>, <instructions>, <thinking>,
    ...) are not HTML elements and pass through.
  - Open/close matching is case-sensitive (XML semantics). `<Example>`
    does not match `</example>`.
  - Self-closing `<tag/>` and `<tag />` create no region.
  - `<!-- … -->`, `<!DOCTYPE …>`, `<![CDATA[…]]>`, `<?xml … ?>`
    are recognized as non-tag constructs and skipped entirely.
  - Nesting is stack-based. Same-name and different-name nesting
    both work.
  - Cross-tag interleaving (`<a><b></a></b>`) is treated as
    malformed and emits zero break points for all involved tags.
  - Unclosed tags emit no break points (unlike code fences, which
    extend to EOF).
  - Tags inside code fences are ignored — the fence scan runs first
    and its regions are passed to the tag scanner.

Known limitations (documented in the function's doc comment):
  - Attribute parsing is lazy. The opener regex terminates at the
    first `>`, so a `>` inside a quoted attribute value produces a
    malformed match. Real agent-prompt tags use simple attribute
    values, so this is acceptable.
  - Comments, CDATA, and processing instructions must fit on a
    single line. Multi-line comments are not recognized (rare in
    agent docs).
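
The first limitation is easy to see with a regex of the same general shape. The pattern below is an assumed approximation of the opener regex, not copied from the PR; it only shows why stopping at the first `>` cuts a quoted attribute value short.

```typescript
// Assumed approximation of an opener regex that stops at the first '>'.
const OPENER = /^\s*<([A-Za-z_][A-Za-z0-9_.:-]*)(?:\s[^>]*)?>/;

// Returns the text consumed by the opener match, or null when nothing matches.
function firstOpenerMatch(line: string): string | null {
  const m = OPENER.exec(line);
  return m ? m[0] : null;
}

const simple = firstOpenerMatch('<example note="simple">');
// simple is the whole line: '<example note="simple">'

const quotedGt = firstOpenerMatch('<example note="a > b">');
// quotedGt is '<example note="a >': the match ends inside the quoted value,
// leaving ' b">' as trailing text on the line.
```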

Integration: chunkDocument and chunkDocumentAsync compute fences
first, then pass them to findXmlTagBreakPoints, then merge the tag
points with scanBreakPoints output via mergeBreakPoints.

27 new tests in test/store.test.ts covering empty input, prose,
single/multiple/nested blocks, self-closing with and without space,
attributes, HTML blocklist (with case-insensitivity), custom
elements, namespaced tags, case-sensitive matching, unclosed and
stray tags, cross-tag interleaving, fence precedence, mid-line
rejection, leading whitespace, all four non-tag constructs,
first-line skip, position convention, and an end-to-end integration
test through chunkDocument confirming that tag-close positions are
preferred as split points.
@galligan galligan marked this pull request as ready for review April 10, 2026 16:59