fix(embed): honor -c <collection> by filtering pending docs#2
fix(embed): honor -c <collection> by filtering pending docs#2lukeboyett wants to merge 5 commits intofix/db-transaction-typefrom
Conversation
`qmd embed -c <collection>` accepted the flag but ignored it: the CLI
handler never read cli.opts.collection, and getPendingEmbeddingDocs
selected all unembedded documents regardless of collection. A user
asking to embed a single collection would end up embedding every
pending document in the index.
Changes:
- Add optional `collection` to `EmbedOptions` and thread it through
`generateEmbeddings` -> `getPendingEmbeddingDocs` with an added
`AND d.collection = ?` filter.
- Add the same optional filter to `getHashesNeedingEmbedding` so the
"nothing to do" pre-check reflects the requested scope.
- CLI embed handler forwards `cli.opts.collection` (first value if the
parser hands back an array) to `vectorIndex` -> `generateEmbeddings`.
Test:
- New `test/store.test.ts` regression covering a two-collection fixture:
`embed { collection: "alpha" }` embeds only alpha docs; a subsequent
`embed { collection: "beta" }` adds beta without disturbing alpha.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d3e87ed340
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| clearAllEmbeddings(db); | ||
| } | ||
|
|
||
| const docsToEmbed = getPendingEmbeddingDocs(db); | ||
| const docsToEmbed = getPendingEmbeddingDocs(db, options?.collection); |
There was a problem hiding this comment.
Restrict forced re-embed to selected collection
When generateEmbeddings is called with both force and collection, it still executes clearAllEmbeddings(db) before applying the new collection filter. In a multi-collection index, qmd embed -f -c alpha will delete vectors for every collection and then only rebuild alpha hashes from getPendingEmbeddingDocs(..., options?.collection), leaving other collections unembedded until a later full embed run.
Useful? React with 👍 / 👎.
Codex review on #2: when `generateEmbeddings` is called with both `force: true` and `collection: "alpha"`, the original patch still called `clearAllEmbeddings(db)` which wipes every collection's vectors before the filtered re-embed runs — leaving every sibling collection unembedded until a later full run. - `clearAllEmbeddings` now accepts an optional `collection`. When set, it deletes only the rows in `content_vectors` whose hash belongs to a document in that collection, and removes the matching hash_seq entries from `vectors_vec` (vec0 virtual tables don't reliably accept IN-subquery DELETEs, so we enumerate first and delete per row. is preserved so other collections keep working. - forwards to when is set. - New regression test covers the case: embed both collections, then force-re-embed only alpha and assert beta's vector count is unchanged. EOF )
Codex review on #2: when generateEmbeddings is called with both `force: true` and `collection: "alpha"`, the original patch still called `clearAllEmbeddings(db)` which wipes every collection's vectors before the filtered re-embed runs — leaving every sibling collection unembedded until a later full run. - `clearAllEmbeddings` now accepts an optional `collection`. When set, it deletes only the rows in `content_vectors` whose hash belongs to a document in that collection, and removes the matching `hash_seq` entries from `vectors_vec` (vec0 virtual tables do not reliably accept IN-subquery DELETEs, so we enumerate first and delete per row). `vectors_vec` is preserved so other collections keep working. - `generateEmbeddings` forwards `options.collection` to `clearAllEmbeddings` when `force` is set. - New regression test covers the `-f -c` case: embed both collections, then force-re-embed only alpha and assert beta's vector count is unchanged.
a7a91be to
5467198
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5467198a82
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| DELETE FROM content_vectors | ||
| WHERE hash IN (SELECT DISTINCT d.hash FROM documents d WHERE d.collection = ?) | ||
| `).run(collection); |
There was a problem hiding this comment.
Avoid deleting hashes still used outside target collection
content_vectors is keyed globally by hash, but this delete removes every hash that appears in the requested collection without checking whether the same hash is still referenced by active documents in other collections. In the --force --collection path, generateEmbeddings then only re-embeds active docs from that one collection, so a hash shared with (or only retained via inactive rows in) the target collection can be removed and left missing for sibling collections, degrading vector search until another embed run repairs it.
Useful? React with 👍 / 👎.
Codex second-round review on #2: `content_vectors` is keyed globally by a content hash. Two documents in different collections with identical bodies share a single `content_vectors` row. The previous patch deleted every row whose hash appears in the target collection, which would remove vectors still referenced by active documents in other collections — the filtered re-embed would not rebuild them. - `clearAllEmbeddings(db, collection)` now deletes only hashes "owned" exclusively by active documents in the target collection, i.e. no active document in any other collection references the same hash. Shared hashes are left in place because their vectors are still valid for the sibling collections. - Inactive rows in other collections do not block deletion (they do not need vector service); the check is active-only. - New regression test `generateEmbeddings with force + collection preserves shared hashes`: two collections each contain a document with identical content plus one unique document; force-re-embed of alpha must leave the shared hash's vector present in `content_vectors`.
|
Addressed in 532fae4: @codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 532fae4f96
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const embedCollectionOpt = cli.opts.collection; | ||
| const embedCollection = Array.isArray(embedCollectionOpt) | ||
| ? embedCollectionOpt[0] | ||
| : embedCollectionOpt; | ||
| await vectorIndex(DEFAULT_EMBED_MODEL_URI, !!cli.values.force, { |
There was a problem hiding this comment.
Validate embed collection names before dispatch
The embed handler now forwards cli.opts.collection directly, but unlike search/query flows it never runs resolveCollectionFilter, so a typo or nonexistent collection silently becomes a no-op: getHashesNeedingEmbedding(db, badName) returns 0 and the command reports success while leaving pending docs untouched. This is a behavior regression for -c users because invalid collection input is no longer rejected consistently across commands.
Useful? React with 👍 / 👎.
Codex P2 feedback on #2: the embed handler forwarded `cli.opts.collection` directly, unlike search/query which funnel through `resolveCollectionFilter`. A typo or nonexistent name was accepted silently — `getHashesNeedingEmbedding(db, badName)` returns 0 and the command reported success with no work done, masking the user's mistake. - Route `cli.opts.collection` through `resolveCollectionFilter(..., false)` so an unknown collection errors with the same "Collection not found: <name>" behavior as other commands. - embed targets a single collection; only the first validated value is used. When no `-c` is passed, the filter returns `[]` and embed runs its normal global pass.
|
Two pieces of feedback on 532fae4: P2 "Validate embed collection names before dispatch" — addressed in 54ab824. Embed now routes P1 "Avoid deleting hashes still used outside target collection" — I believe this is already handled as of 532fae4 and may be a re-flag of the earlier concern. The delete now uses SELECT DISTINCT d.hash
FROM documents d
WHERE d.collection = ? AND d.active = 1
AND NOT EXISTS (
SELECT 1 FROM documents d2
WHERE d2.hash = d.hash
AND d2.active = 1
AND d2.collection != d.collection
)A hash active in the target AND active in any sibling fails the If I'm missing a case — e.g. hash only retained via inactive rows in the target while active in a sibling — happy to add a test and cover it. Can you reproduce a scenario where 532fae4 drops a hash that's still needed? @codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 54ab82422b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (!collection) { | ||
| db.exec(`DELETE FROM content_vectors`); | ||
| db.exec(`DROP TABLE IF EXISTS vectors_vec`); | ||
| return; |
There was a problem hiding this comment.
Rebuild vec table after scoped force clears all hashes
In clearAllEmbeddings(db, collection), only the unscoped branch drops vectors_vec; the scoped branch always preserves it. When the target collection is the only active collection (or owns all active hashes), a scoped force run can remove every content_vectors row but leave vectors_vec with the old dimension, so a subsequent generateEmbeddings({ force: true, collection, model }) with a different embedding size fails on dimension mismatch instead of rebuilding from scratch. This makes scoped force re-embed unreliable for model/dimension migrations.
Useful? React with 👍 / 👎.
Codex P2 on #2: `clearAllEmbeddings(db, collection)` preserved `vectors_vec` in the scoped branch. If the target collection is the only active collection (or owns every active hash), the scoped clear empties `content_vectors` but leaves `vectors_vec` bound to the old vec0 dimensions. A follow-up `generateEmbeddings({ force: true, collection, model: <different-dim> })` then fails on dimension mismatch instead of recreating the table. - After the scoped clear, check whether `content_vectors` is empty and drop `vectors_vec` if so. The next embed run recreates it with the current model's dimensions via the usual `ensureVecTable` initialization path, matching the unscoped branch. - New regression test: with a single active collection, scoped force re-embed leaves `vectors_vec` healthy and `content_vectors` populated post-rebuild.
|
Two items on 54ab824: P2 "Rebuild vec table after scoped force clears all hashes" — valid, addressed in 0b28ee0. After the scoped clear, if P1 "Avoid deleting hashes still used outside target collection" — I still believe this is already handled by the @codex review |
|
Codex Review: Didn't find any major issues. Chef's kiss. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Summary
`qmd embed -c ` accepted the `-c` flag but silently ignored it.
Before
Running `qmd embed -c alpha` against an index with two collections (alpha + beta, each with 3 pending docs) embeds all 6 documents across both collections. The flag is parsed but never threaded into the pending-doc query.
After
Running `qmd embed -c alpha` embeds only the 3 alpha docs. A subsequent `qmd embed -c beta` adds 3 beta vectors without disturbing alpha.
Why
Two upstream code paths silently dropped the `collection` value:
`EmbedOptions` also didn't carry a `collection` field, so even SDK users couldn't scope embedding to one collection.
Changes
Test plan
Notes