
Handle frontmatter in chunking and title extraction #551

Open
surma wants to merge 4 commits into tobi:main from surma-dump:feat/frontmatter-chunks


surma commented Apr 10, 2026

Summary

This teaches qmd to treat leading frontmatter as document metadata instead of ordinary prose.

Before this change, smart chunking could split frontmatter across multiple chunks, and extractTitle() ignored frontmatter entirely. That meant metadata could get mixed into semantic content chunks, and documents that explicitly declared a title in frontmatter still fell back to headings or filenames.

With this PR:

  • leading frontmatter is emitted as its own first chunk
  • markdown title extraction prefers the frontmatter title
  • heading-based title extraction now runs on the body after frontmatter, so frontmatter text cannot be mistaken for a heading

What changed

Frontmatter-aware chunking

  • Added frontmatter detection in src/store.ts
  • When a document starts with frontmatter, qmd now:
    • keeps the entire frontmatter block together as chunk 0
    • re-chunks only the remaining body content
    • offsets body chunk positions back into the original document
  • This preserves metadata boundaries and avoids splitting frontmatter into later semantic chunks
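
The chunking steps above can be sketched roughly as follows. This is a minimal illustration, not the code in src/store.ts: the `Chunk` type and the names `frontmatterLength`, `chunkBody`, and `chunkDocument` are hypothetical, and the fixed-size splitter is only a stand-in for qmd's smart chunking.

```typescript
// Hypothetical shape for illustration; not qmd's actual chunk type.
interface Chunk {
  text: string;
  pos: number; // character offset into the original document
}

// Length of a leading --- or +++ block, including its closing delimiter line.
// Returns 0 when the document does not start with frontmatter.
function frontmatterLength(doc: string): number {
  const re = /^(---|\+\+\+)[A-Za-z]*[ \t]*\r?\n(?:[\s\S]*?\r?\n)?\1[ \t]*(?:\r?\n|$)/;
  const m = re.exec(doc);
  return m ? m[0].length : 0;
}

// Stand-in for the real chunker: naive fixed-size splitting.
function chunkBody(body: string, maxLen: number): Chunk[] {
  const chunks: Chunk[] = [];
  for (let pos = 0; pos < body.length; pos += maxLen) {
    chunks.push({ text: body.slice(pos, pos + maxLen), pos });
  }
  return chunks;
}

function chunkDocument(doc: string, maxLen = 800): Chunk[] {
  const fmLen = frontmatterLength(doc);
  if (fmLen === 0) return chunkBody(doc, maxLen);

  // Chunk 0 is the whole frontmatter block, regardless of maxLen.
  const frontmatter: Chunk = { text: doc.slice(0, fmLen), pos: 0 };

  // Chunk only the body, then shift positions back into the original document.
  const body = chunkBody(doc.slice(fmLen), maxLen).map((c) => ({
    text: c.text,
    pos: c.pos + fmLen,
  }));
  return [frontmatter, ...body];
}
```

In this sketch every chunk satisfies `doc.slice(c.pos, c.pos + c.text.length) === c.text`, which is the property the position offsetting is meant to preserve.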

Frontmatter-aware title extraction

  • extractTitle() now checks frontmatter first and returns data.title when present
  • If no frontmatter title exists, it falls back to the existing title extraction logic
  • For markdown files, heading extraction now runs against the body after frontmatter, so a # inside frontmatter cannot be treated as the document title
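
As a rough illustration of that precedence (frontmatter title, then a heading in the body, then a fallback), here is a self-contained sketch. It deliberately avoids gray-matter and uses a simplified line-based `title:` lookup instead of a real YAML parser. All function names, including `extractTitleSketch`, are hypothetical and do not mirror the actual extractTitle() signature.

```typescript
// Split a leading --- ... --- block from the body (simplified; YAML-style only).
function splitFrontmatter(doc: string): { matter: string; body: string } {
  const m = /^---[ \t]*\r?\n(?:([\s\S]*?)\r?\n)?---[ \t]*(?:\r?\n|$)/.exec(doc);
  if (!m) return { matter: "", body: doc };
  return { matter: m[1] ?? "", body: doc.slice(m[0].length) };
}

// Assumption: a top-level "title: ..." line; a real parser handles far more.
function frontmatterTitle(matter: string): string | undefined {
  const m = /^title:[ \t]*(.+?)[ \t]*$/m.exec(matter);
  if (!m) return undefined;
  return m[1].replace(/^(["'])(.*)\1$/, "$2"); // strip matching quotes
}

// First ATX heading in the given text.
function firstHeading(body: string): string | undefined {
  const m = /^#{1,6}[ \t]+(.+?)[ \t]*$/m.exec(body);
  return m ? m[1] : undefined;
}

function extractTitleSketch(doc: string, filename: string): string {
  const { matter, body } = splitFrontmatter(doc);
  // Headings are searched in the body only, so a "# ..." line inside
  // frontmatter (for example a YAML comment) is never picked up as the title.
  return frontmatterTitle(matter) ?? firstHeading(body) ?? filename;
}
```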

Parsing behavior

  • Added gray-matter for frontmatter parsing
  • YAML and JSON frontmatter are parsed via gray-matter
  • +++ ... +++ frontmatter is also recognized for chunk separation
  • For non-YAML frontmatter that gray-matter does not decode into an object, qmd falls back to parsing the raw matter as YAML or JSON when looking for title
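
The fallback in the last bullet might look something like the sketch below. It assumes the raw matter string is already available (gray-matter exposes it as the `matter` property of its result). The function name `titleFromRawMatter` is hypothetical, and the line-based `title:` lookup is a stand-in for a real YAML parser.

```typescript
// Try JSON first, then a simplified YAML-style "title:" line.
function titleFromRawMatter(raw: string): string | undefined {
  const text = raw.trim();
  try {
    const parsed: unknown = JSON.parse(text);
    if (parsed !== null && typeof parsed === "object" && !Array.isArray(parsed)) {
      const title = (parsed as Record<string, unknown>).title;
      return typeof title === "string" ? title : undefined;
    }
  } catch {
    // Not JSON; fall through to the YAML-style lookup below.
  }
  const m = /^title:[ \t]*(.+?)[ \t]*$/m.exec(text);
  return m ? m[1].replace(/^(["'])(.*)\1$/, "$2") : undefined;
}
```

Returning `undefined` rather than throwing lets the caller continue on to heading-based or filename-based title extraction when no title can be recovered.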

Tests

Added coverage for:

  • YAML frontmatter title extraction
  • JSON frontmatter title extraction
  • ignoring frontmatter when scanning markdown headings
  • keeping YAML frontmatter as its own chunk
  • keeping +++ frontmatter as its own chunk

Validated with:

  • bunx vitest run test/store.test.ts -t frontmatter
  • bun run build

Why this is useful

A lot of markdown collections rely on frontmatter for canonical titles and metadata. Keeping that metadata in a dedicated chunk makes retrieval cleaner, and honoring frontmatter title makes qmd’s document display/title behavior line up better with how static-site and note-taking tools already structure documents.

surma force-pushed the feat/frontmatter-chunks branch from 79af5c7 to dcd8ce3 on April 10, 2026 17:08
socket-security bot commented Apr 11, 2026

Review the following changes in direct dependencies.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| Added | npm/gray-matter@4.0.3 | 99 | 100 | 100 | 83 | 100 |


surma commented Apr 11, 2026

@tobi I pushed a fix for the seemingly flaky CI test.
