fix(embed): honor -c <collection> with scoped force-clear and shared-hash preservation #580

Open
lukeboyett wants to merge 6 commits into tobi:main from lukeboyett:fix/embed-collection-filter

@lukeboyett

Summary

qmd embed -c <collection> accepts the flag but ignores it: the handler never
reads cli.opts.collection, so a scoped embed ends up embedding every pending
document in the index. This PR makes -c behave on embed the way it does on
search / query / #566's forthcoming update -c, and tightens the adjacent
--force and vec-table paths that scoped embedding exposes.

Five focused commits:

  1. fix(embed): honor -c <collection> by filtering pending docs — thread
    options.collection through generateEmbeddings -> getPendingEmbeddingDocs
    / getHashesNeedingEmbedding with an added AND d.collection = ? filter.
    CLI handler forwards cli.opts.collection.

  2. fix(embed): scope --force clear to the requested collection — when
    called with force: true and collection, clearAllEmbeddings now
    deletes only rows whose hash belongs to a doc in that collection and prunes
    the matching vectors_vec hash_seq entries, instead of wiping every
    collection's vectors. (vec0 virtual tables don't reliably accept
    IN-subquery DELETEs, so we enumerate and delete per row.)

  3. fix(embed): preserve shared hashes when force-clearing a collection:
    content_vectors is keyed globally by content hash; two docs in different
    collections with identical bodies share one row. The scoped clear now only
    removes hashes owned exclusively by active docs in the target collection,
    so shared vectors stay valid for sibling collections.

  4. fix(embed): validate -c against configured collections — route
    cli.opts.collection through resolveCollectionFilter(..., false) so an
    unknown collection errors consistently with the rest of the CLI, instead
    of silently reporting "0 to embed." This matches the pattern that #566
    (feat(update): add -c/--collection filter to qmd update) uses for
    update -c.

  5. fix(embed): drop vectors_vec when scoped force empties content_vectors
    — if the scoped clear empties content_vectors entirely (e.g. the target
    is the only active collection), drop vectors_vec so the next
    generateEmbeddings run recreates it via ensureVecTable with the current
    model's dimensions, matching the unscoped branch's behavior.

Test plan

  • npm run build passes (depends on #579, "fix(db): declare transaction() on local Database interface", for the Database.transaction() type; see below)
  • npm test — passes with new regression coverage in test/store.test.ts:
    • scoped embed embeds only target collection, sibling untouched
    • -f -c preserves sibling collection vectors
    • shared-hash preservation across collections
    • unknown -c value errors cleanly
    • scoped force rebuilds vectors_vec when content_vectors is emptied

Dependencies and related work

Luke Boyett added 6 commits April 14, 2026 08:34
The narrow cross-runtime Database interface in src/db.ts defines the
subset of better-sqlite3 / bun:sqlite methods used throughout QMD.
Commit fee576b ("fix: migrate legacy lowercase paths on reindex")
introduced a db.transaction(...) call in src/store.ts but did not
extend the interface, breaking `tsc -p tsconfig.build.json`:

  src/store.ts(2142,22): error TS2339: Property 'transaction' does
  not exist on type 'Database'.

Both underlying engines expose transaction(fn), so this just makes
the type reflect reality.
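
A minimal sketch of the kind of declaration this commit adds. The interface below is trimmed and hypothetical (`DatabaseLike` stands in for the real `Database` in src/db.ts, which declares more members), and `makeFakeDb` is an in-memory stand-in used only to show the call shape: `transaction(fn)` returns a wrapper, and the work runs when the wrapper is called.

```typescript
// Trimmed, hypothetical shape: the real interface in src/db.ts has more members.
interface DatabaseLike {
  exec(sql: string): void;
  // better-sqlite3 and bun:sqlite both return a wrapper function that runs
  // fn inside a transaction when the wrapper is called.
  transaction<A extends unknown[], R>(fn: (...args: A) => R): (...args: A) => R;
}

// In-memory stand-in (an assumption for illustration). It only records what a
// real engine would do: BEGIN, then COMMIT on success or ROLLBACK on throw.
function makeFakeDb(log: string[]): DatabaseLike {
  return {
    exec(sql: string): void {
      log.push(sql);
    },
    transaction<A extends unknown[], R>(fn: (...args: A) => R): (...args: A) => R {
      return (...args: A): R => {
        log.push("BEGIN");
        try {
          const result = fn(...args);
          log.push("COMMIT");
          return result;
        } catch (err) {
          log.push("ROLLBACK");
          throw err;
        }
      };
    },
  };
}
```

With the method declared on the interface, a call such as `db.transaction(() => { ... })()` type-checks against the narrow interface instead of raising TS2339.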

`qmd embed -c <collection>` accepted the flag but ignored it: the CLI
handler never read cli.opts.collection, and getPendingEmbeddingDocs
selected all unembedded documents regardless of collection. A user
asking to embed a single collection would end up embedding every
pending document in the index.

Changes:
- Add optional `collection` to `EmbedOptions` and thread it through
  `generateEmbeddings` -> `getPendingEmbeddingDocs` with an added
  `AND d.collection = ?` filter.
- Add the same optional filter to `getHashesNeedingEmbedding` so the
  "nothing to do" pre-check reflects the requested scope.
- CLI embed handler forwards `cli.opts.collection` (first value if the
  parser hands back an array) to `vectorIndex` -> `generateEmbeddings`.

Test:
- New `test/store.test.ts` regression covering a two-collection fixture:
  `embed { collection: "alpha" }` embeds only alpha docs; a subsequent
  `embed { collection: "beta" }` adds beta without disturbing alpha.
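
The filter described above can be sketched as a small query builder. This is an illustration under assumptions, not the code in src/store.ts: the `documents` table, the `active` and `path` columns, and both function names are hypothetical; only `d.collection = ?` and `content_vectors` come from this PR.

```typescript
// The CLI parser may hand back a single string or an array for -c;
// embed targets one collection, so only the first value is kept.
function firstCollection(value: string | string[] | undefined): string | undefined {
  return Array.isArray(value) ? value[0] : value;
}

// Builds the "pending documents" query. Table and column names other than
// d.collection and content_vectors are assumptions for illustration.
function buildPendingDocsQuery(collection?: string): { sql: string; params: string[] } {
  const conditions = ["d.active = 1", "v.hash IS NULL"];
  const params: string[] = [];
  if (collection !== undefined) {
    // The added filter: restrict pending docs to the requested collection.
    conditions.push("d.collection = ?");
    params.push(collection);
  }
  const sql =
    "SELECT d.hash, d.path FROM documents d " +
    "LEFT JOIN content_vectors v ON v.hash = d.hash " +
    "WHERE " + conditions.join(" AND ");
  return { sql, params };
}
```

Keeping the collection as a bound parameter rather than interpolating it into the SQL means the scoped and unscoped paths share one statement shape and the name never needs escaping.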

Codex review on #2: when generateEmbeddings is called with both
`force: true` and `collection: "alpha"`, the original patch still
called `clearAllEmbeddings(db)` which wipes every collection's
vectors before the filtered re-embed runs — leaving every sibling
collection unembedded until a later full run.

- `clearAllEmbeddings` now accepts an optional `collection`. When
  set, it deletes only the rows in `content_vectors` whose hash
  belongs to a document in that collection, and removes the matching
  `hash_seq` entries from `vectors_vec` (vec0 virtual tables do not
  reliably accept IN-subquery DELETEs, so we enumerate first and
  delete per row). `vectors_vec` is preserved so other collections
  keep working.
- `generateEmbeddings` forwards `options.collection` to
  `clearAllEmbeddings` when `force` is set.
- New regression test covers the `-f -c` case: embed both
  collections, then force-re-embed only alpha and assert beta's
  vector count is unchanged.
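
A rough sketch of the enumerate-then-delete approach for the vec0 table. It assumes `content_vectors` has `hash` and `seq` columns and that `hash_seq` is formed as `<hash>_<seq>`; both are assumptions here, as are the `VecDb` interface and the function name. The list of target hashes is taken as an input.

```typescript
// Narrow, hypothetical view of the database handle used by this sketch.
interface VecStatement {
  all(...params: unknown[]): unknown[];
  run(...params: unknown[]): unknown;
}
interface VecDb {
  prepare(sql: string): VecStatement;
}

// Deletes vectors for the given hashes one row at a time instead of using
// DELETE ... WHERE hash_seq IN (SELECT ...). Returns the number of vec rows removed.
function clearVectorsForHashes(db: VecDb, hashes: string[]): number {
  const selectChunks = db.prepare("SELECT hash, seq FROM content_vectors WHERE hash = ?");
  const deleteVec = db.prepare("DELETE FROM vectors_vec WHERE hash_seq = ?");
  const deleteRows = db.prepare("DELETE FROM content_vectors WHERE hash = ?");
  let removed = 0;
  for (const hash of hashes) {
    // Enumerate first, so each vec0 delete is a plain key lookup.
    const chunks = selectChunks.all(hash) as { hash: string; seq: number }[];
    for (const chunk of chunks) {
      deleteVec.run(`${chunk.hash}_${chunk.seq}`); // assumed hash_seq format
      removed++;
    }
    deleteRows.run(hash);
  }
  return removed;
}
```

In practice the whole loop would presumably run inside a single transaction so a failure partway through does not leave `content_vectors` and `vectors_vec` out of step.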

Codex second-round review on #2: `content_vectors` is keyed globally
by a content hash. Two documents in different collections with
identical bodies share a single `content_vectors` row. The previous
patch deleted every row whose hash appears in the target collection,
which would remove vectors still referenced by active documents in
other collections — the filtered re-embed would not rebuild them.

- `clearAllEmbeddings(db, collection)` now deletes only hashes
  "owned" exclusively by active documents in the target collection,
  i.e. no active document in any other collection references the same
  hash. Shared hashes are left in place because their vectors are
  still valid for the sibling collections.
- Inactive rows in other collections do not block deletion (they do
  not need vector service); the check is active-only.
- New regression test `generateEmbeddings with force + collection
  preserves shared hashes`: two collections each contain a document
  with identical content plus one unique document; force-re-embed of
  alpha must leave the shared hash's vector present in
  `content_vectors`.
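
The ownership rule can be illustrated in memory. This is only a sketch of the selection logic, assuming a simplified `DocRow` shape (hypothetical); the real change expresses the same rule in SQL against the documents table.

```typescript
// Simplified, hypothetical view of a document row.
interface DocRow {
  hash: string;
  collection: string;
  active: boolean;
}

// Returns the hashes that only active docs in `target` reference.
// A hash also referenced by an active doc elsewhere is "shared" and is kept.
function exclusivelyOwnedHashes(docs: DocRow[], target: string): string[] {
  const inTarget = new Set<string>();
  const heldElsewhere = new Set<string>();
  for (const doc of docs) {
    if (!doc.active) continue; // inactive rows never block deletion
    if (doc.collection === target) {
      inTarget.add(doc.hash);
    } else {
      heldElsewhere.add(doc.hash);
    }
  }
  return Array.from(inTarget).filter((hash) => !heldElsewhere.has(hash));
}
```

For the fixture in the regression test (one shared document plus one unique document per collection), a force re-embed of alpha would select only alpha's unique hash for deletion, so the shared row stays in `content_vectors`.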

Codex P2 feedback on #2: the embed handler forwarded
`cli.opts.collection` directly, unlike search/query which funnel
through `resolveCollectionFilter`. A typo or nonexistent name was
accepted silently — `getHashesNeedingEmbedding(db, badName)` returns
0 and the command reported success with no work done, masking the
user's mistake.

- Route `cli.opts.collection` through `resolveCollectionFilter(..., false)`
  so an unknown collection errors with the same "Collection not found:
  <name>" behavior as other commands.
- embed targets a single collection; only the first validated value is
  used. When no `-c` is passed, the filter returns `[]` and embed runs
  its normal global pass.
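
To make the behavior concrete, here is a stand-in for the validation step. It is not `resolveCollectionFilter` itself, whose signature is not shown in this PR; `validateCollections` and `embedTarget` are hypothetical names, and the list of configured collections is passed in explicitly for the sake of a self-contained example.

```typescript
// Hypothetical stand-in for the validation that resolveCollectionFilter performs.
// Returns [] when no -c was passed; throws on an unknown name.
function validateCollections(
  requested: string | string[] | undefined,
  configured: readonly string[],
): string[] {
  if (requested === undefined) return [];
  const names = Array.isArray(requested) ? requested : [requested];
  const known = new Set(configured);
  for (const name of names) {
    if (!known.has(name)) {
      throw new Error(`Collection not found: ${name}`);
    }
  }
  return names;
}

// embed targets a single collection: use the first validated value,
// or undefined to run the normal global pass.
function embedTarget(
  requested: string | string[] | undefined,
  configured: readonly string[],
): string | undefined {
  return validateCollections(requested, configured)[0];
}
```

Validating before any database work means a typo such as `-c alhpa` fails immediately rather than reporting a successful run with nothing embedded.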

Codex P2 on #2: `clearAllEmbeddings(db, collection)` preserved
`vectors_vec` in the scoped branch. If the target collection is the
only active collection (or owns every active hash), the scoped clear
empties `content_vectors` but leaves `vectors_vec` bound to the old
vec0 dimensions. A follow-up
`generateEmbeddings({ force: true, collection, model: <different-dim> })`
then fails on dimension mismatch instead of recreating the table.

- After the scoped clear, check whether `content_vectors` is empty
  and drop `vectors_vec` if so. The next embed run recreates it with
  the current model's dimensions via the usual `ensureVecTable`
  initialization path, matching the unscoped branch.
- New regression test: with a single active collection, scoped force
  re-embed leaves `vectors_vec` healthy and `content_vectors`
  populated post-rebuild.
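
A sketch of the post-clear check, under the assumption that the handle exposes `prepare(...).get()` and `exec(...)` as both SQLite bindings do. The `CountDb` interface and the function name are hypothetical.

```typescript
// Narrow, hypothetical view of the database handle used by this sketch.
interface CountDb {
  prepare(sql: string): { get(...params: unknown[]): unknown };
  exec(sql: string): void;
}

// After a scoped clear: if no embeddings remain at all, drop the vec0 table so
// the next embed run can recreate it with the current model's dimensions.
// Returns true when the table was dropped.
function dropVecTableIfEmpty(db: CountDb): boolean {
  const row = db.prepare("SELECT COUNT(*) AS n FROM content_vectors").get() as { n: number };
  if (row.n > 0) {
    return false; // other collections still have vectors; keep the table
  }
  db.exec("DROP TABLE IF EXISTS vectors_vec");
  return true;
}
```

The table is dropped only when `content_vectors` is completely empty, so sibling collections with surviving vectors keep a usable `vectors_vec`.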