fix(embed): honor -c <collection> with scoped force-clear and shared-hash preservation#580
Open
lukeboyett wants to merge 6 commits intotobi:mainfrom
Open
fix(embed): honor -c <collection> with scoped force-clear and shared-hash preservation#580lukeboyett wants to merge 6 commits intotobi:mainfrom
lukeboyett wants to merge 6 commits intotobi:mainfrom
Conversation
added 6 commits
April 14, 2026 08:34
The narrow cross-runtime Database interface in src/db.ts defines the subset of better-sqlite3 / bun:sqlite methods used throughout QMD. Commit fee576b ("fix: migrate legacy lowercase paths on reindex") introduced a db.transaction(...) call in src/store.ts but did not extend the interface, breaking `tsc -p tsconfig.build.json`: src/store.ts(2142,22): error TS2339: Property 'transaction' does not exist on type 'Database'. Both underlying engines expose transaction(fn), so this just makes the type reflect reality.
`qmd embed -c <collection>` accepted the flag but ignored it: the CLI
handler never read cli.opts.collection, and getPendingEmbeddingDocs
selected all unembedded documents regardless of collection. A user
asking to embed a single collection would end up embedding every
pending document in the index.
Changes:
- Add optional `collection` to `EmbedOptions` and thread it through
`generateEmbeddings` -> `getPendingEmbeddingDocs` with an added
`AND d.collection = ?` filter.
- Add the same optional filter to `getHashesNeedingEmbedding` so the
"nothing to do" pre-check reflects the requested scope.
- CLI embed handler forwards `cli.opts.collection` (first value if the
parser hands back an array) to `vectorIndex` -> `generateEmbeddings`.
Test:
- New `test/store.test.ts` regression covering a two-collection fixture:
`embed { collection: "alpha" }` embeds only alpha docs; a subsequent
`embed { collection: "beta" }` adds beta without disturbing alpha.
Codex review on #2: when generateEmbeddings is called with both `force: true` and `collection: "alpha"`, the original patch still called `clearAllEmbeddings(db)` which wipes every collection's vectors before the filtered re-embed runs — leaving every sibling collection unembedded until a later full run. - `clearAllEmbeddings` now accepts an optional `collection`. When set, it deletes only the rows in `content_vectors` whose hash belongs to a document in that collection, and removes the matching `hash_seq` entries from `vectors_vec` (vec0 virtual tables do not reliably accept IN-subquery DELETEs, so we enumerate first and delete per row). `vectors_vec` is preserved so other collections keep working. - `generateEmbeddings` forwards `options.collection` to `clearAllEmbeddings` when `force` is set. - New regression test covers the `-f -c` case: embed both collections, then force-re-embed only alpha and assert beta's vector count is unchanged.
Codex second-round review on #2: `content_vectors` is keyed globally by a content hash. Two documents in different collections with identical bodies share a single `content_vectors` row. The previous patch deleted every row whose hash appears in the target collection, which would remove vectors still referenced by active documents in other collections — the filtered re-embed would not rebuild them. - `clearAllEmbeddings(db, collection)` now deletes only hashes "owned" exclusively by active documents in the target collection, i.e. no active document in any other collection references the same hash. Shared hashes are left in place because their vectors are still valid for the sibling collections. - Inactive rows in other collections do not block deletion (they do not need vector service); the check is active-only. - New regression test `generateEmbeddings with force + collection preserves shared hashes`: two collections each contain a document with identical content plus one unique document; force-re-embed of alpha must leave the shared hash's vector present in `content_vectors`.
Codex P2 feedback on #2: the embed handler forwarded `cli.opts.collection` directly, unlike search/query which funnel through `resolveCollectionFilter`. A typo or nonexistent name was accepted silently — `getHashesNeedingEmbedding(db, badName)` returns 0 and the command reported success with no work done, masking the user's mistake. - Route `cli.opts.collection` through `resolveCollectionFilter(..., false)` so an unknown collection errors with the same "Collection not found: <name>" behavior as other commands. - embed targets a single collection; only the first validated value is used. When no `-c` is passed, the filter returns `[]` and embed runs its normal global pass.
Codex P2 on #2: `clearAllEmbeddings(db, collection)` preserved `vectors_vec` in the scoped branch. If the target collection is the only active collection (or owns every active hash), the scoped clear empties `content_vectors` but leaves `vectors_vec` bound to the old vec0 dimensions. A follow-up `generateEmbeddings({ force: true, collection, model: <different-dim> })` then fails on dimension mismatch instead of recreating the table. - After the scoped clear, check whether `content_vectors` is empty and drop `vectors_vec` if so. The next embed run recreates it with the current model's dimensions via the usual `ensureVecTable` initialization path, matching the unscoped branch. - New regression test: with a single active collection, scoped force re-embed leaves `vectors_vec` healthy and `content_vectors` populated post-rebuild.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
qmd embed -c <collection>accepts the flag but ignores it: the handler neverreads
cli.opts.collection, so a scoped embed ends up embedding every pendingdocument in the index. This PR makes
-cbehave onembedthe way it does onsearch/query/ #566's forthcomingupdate -c, and tightens the adjacent--forceand vec-table paths that scoped embedding exposes.Five focused commits:
fix(embed): honor -c <collection> by filtering pending docs— threadoptions.collectionthroughgenerateEmbeddings→getPendingEmbeddingDocs/
getHashesNeedingEmbeddingwith an addedAND d.collection = ?filter.CLI handler forwards
cli.opts.collection.fix(embed): scope --force clear to the requested collection— whencalled with
force: trueandcollection,clearAllEmbeddingsnowdeletes only rows whose hash belongs to a doc in that collection and prunes
the matching
vectors_vechash_seq entries, instead of wiping everycollection's vectors. (vec0 virtual tables don't reliably accept
IN-subquery DELETEs, so we enumerate and delete per row.)
fix(embed): preserve shared hashes when force-clearing a collection—content_vectorsis keyed globally by content hash; two docs in differentcollections with identical bodies share one row. The scoped clear now only
removes hashes owned exclusively by active docs in the target collection,
so shared vectors stay valid for sibling collections.
fix(embed): validate -c against configured collections— routecli.opts.collectionthroughresolveCollectionFilter(..., false)so anunknown collection errors consistently with the rest of the CLI, instead
of silently reporting "0 to embed." This matches the pattern feat(update): add -c/--collection filter to qmd update #566 uses for
update -c.fix(embed): drop vectors_vec when scoped force empties content_vectors— if the scoped clear empties
content_vectorsentirely (e.g. the targetis the only active collection), drop
vectors_vecso the nextgenerateEmbeddingsrun recreates it viaensureVecTablewith the currentmodel's dimensions, matching the unscoped branch's behavior.
Test plan
npm run buildpasses (depends on fix(db): declare transaction() on local Database interface #579 forDatabase.transaction()type — see below)npm test— passes with new regression coverage intest/store.test.ts:-f -cpreserves sibling collection vectors-cvalue errors cleanlyvectors_vecwhencontent_vectorsis emptiedDependencies and related work
fix(db): declare transaction() on local Database interface).maincurrently failstsc -p tsconfig.build.jsonon an unrelated call site, so CI on this PR will be red until fix(db): declare transaction() on local Database interface #579 merges; I'll rebase onto post-mergemainthen.feat(update): add -c/--collection filter to qmd update). SameresolveCollectionFilter(..., false)validation path, so the two should compose without conflict regardless of merge order.