fix(embed): honor -c <collection> by filtering pending docs by lukeboyett · Pull Request #2 · lukeboyett/qmd

lukeboyett · 2026-04-14T12:56:28Z

Summary

`qmd embed -c ` accepted the `-c` flag but silently ignored it.

Before

Running `qmd embed -c alpha` against an index with two collections (alpha + beta, each with 3 pending docs) embeds all 6 documents across both collections. The flag is parsed but never threaded into the pending-doc query.

After

Running `qmd embed -c alpha` embeds only the 3 alpha docs. A subsequent `qmd embed -c beta` adds 3 beta vectors without disturbing alpha.

Why

Two upstream code paths silently dropped the `collection` value:

`src/cli/qmd.ts` `case "embed":` never read `cli.opts.collection`.
`src/store.ts` `getPendingEmbeddingDocs(db)` and `getHashesNeedingEmbedding(db)` had no parameter to filter on.

`EmbedOptions` also didn't carry a `collection` field, so even SDK users couldn't scope embedding to one collection.

Changes

Add optional `collection` to `EmbedOptions`.
Thread it through `generateEmbeddings` -> `getPendingEmbeddingDocs` with `AND d.collection = ?`.
Apply the same optional filter to `getHashesNeedingEmbedding` so the "nothing to do" pre-check reflects the requested scope.
CLI embed handler forwards `cli.opts.collection` (first value when the parser hands back an array) to `vectorIndex` -> `generateEmbeddings`.

Test plan

New regression test `test/store.test.ts`: two-collection fixture, `embed { collection: "alpha" }` embeds only alpha; follow-up `embed { collection: "beta" }` adds beta without re-touching alpha; a re-run with an already-embedded collection is a no-op.
`npm run build` passes.
Full CI test suite (other tests unaffected since the filter is opt-in).

Notes

Based on `fix/db-transaction-type` because a clean `main` currently fails to build; that baseline build fix is the parent PR.
Reproduced against a real QMD index before and after this change; numbers quoted in Summary are from that repro.

`qmd embed -c <collection>` accepted the flag but ignored it: the CLI handler never read cli.opts.collection, and getPendingEmbeddingDocs selected all unembedded documents regardless of collection. A user asking to embed a single collection would end up embedding every pending document in the index. Changes: - Add optional `collection` to `EmbedOptions` and thread it through `generateEmbeddings` -> `getPendingEmbeddingDocs` with an added `AND d.collection = ?` filter. - Add the same optional filter to `getHashesNeedingEmbedding` so the "nothing to do" pre-check reflects the requested scope. - CLI embed handler forwards `cli.opts.collection` (first value if the parser hands back an array) to `vectorIndex` -> `generateEmbeddings`. Test: - New `test/store.test.ts` regression covering a two-collection fixture: `embed { collection: "alpha" }` embeds only alpha docs; a subsequent `embed { collection: "beta" }` adds beta without disturbing alpha.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3e87ed340

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-14T13:00:18Z

    clearAllEmbeddings(db);
  }

-  const docsToEmbed = getPendingEmbeddingDocs(db);
+  const docsToEmbed = getPendingEmbeddingDocs(db, options?.collection);


Restrict forced re-embed to selected collection

When generateEmbeddings is called with both force and collection, it still executes clearAllEmbeddings(db) before applying the new collection filter. In a multi-collection index, qmd embed -f -c alpha will delete vectors for every collection and then only rebuild alpha hashes from getPendingEmbeddingDocs(..., options?.collection), leaving other collections unembedded until a later full embed run.

Useful? React with 👍 / 👎.

Codex review on #2: when `generateEmbeddings` is called with both `force: true` and `collection: "alpha"`, the original patch still called `clearAllEmbeddings(db)` which wipes every collection's vectors before the filtered re-embed runs — leaving every sibling collection unembedded until a later full run. - `clearAllEmbeddings` now accepts an optional `collection`. When set, it deletes only the rows in `content_vectors` whose hash belongs to a document in that collection, and removes the matching hash_seq entries from `vectors_vec` (vec0 virtual tables don't reliably accept IN-subquery DELETEs, so we enumerate first and delete per row. is preserved so other collections keep working. - forwards to when is set. - New regression test covers the case: embed both collections, then force-re-embed only alpha and assert beta's vector count is unchanged. EOF )

Codex review on #2: when generateEmbeddings is called with both `force: true` and `collection: "alpha"`, the original patch still called `clearAllEmbeddings(db)` which wipes every collection's vectors before the filtered re-embed runs — leaving every sibling collection unembedded until a later full run. - `clearAllEmbeddings` now accepts an optional `collection`. When set, it deletes only the rows in `content_vectors` whose hash belongs to a document in that collection, and removes the matching `hash_seq` entries from `vectors_vec` (vec0 virtual tables do not reliably accept IN-subquery DELETEs, so we enumerate first and delete per row). `vectors_vec` is preserved so other collections keep working. - `generateEmbeddings` forwards `options.collection` to `clearAllEmbeddings` when `force` is set. - New regression test covers the `-f -c` case: embed both collections, then force-re-embed only alpha and assert beta's vector count is unchanged.

lukeboyett · 2026-04-14T13:07:12Z

Addressed in 5467198: clearAllEmbeddings now takes an optional collection and the caller forwards it under force, so embed -f -c alpha no longer wipes beta's vectors. New test generateEmbeddings with force + collection only clears that collection covers the case.

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5467198a82

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-14T13:10:39Z

+    DELETE FROM content_vectors
+    WHERE hash IN (SELECT DISTINCT d.hash FROM documents d WHERE d.collection = ?)
+  `).run(collection);


Avoid deleting hashes still used outside target collection

content_vectors is keyed globally by hash, but this delete removes every hash that appears in the requested collection without checking whether the same hash is still referenced by active documents in other collections. In the --force --collection path, generateEmbeddings then only re-embeds active docs from that one collection, so a hash shared with (or only retained via inactive rows in) the target collection can be removed and left missing for sibling collections, degrading vector search until another embed run repairs it.

Useful? React with 👍 / 👎.

Codex second-round review on #2: `content_vectors` is keyed globally by a content hash. Two documents in different collections with identical bodies share a single `content_vectors` row. The previous patch deleted every row whose hash appears in the target collection, which would remove vectors still referenced by active documents in other collections — the filtered re-embed would not rebuild them. - `clearAllEmbeddings(db, collection)` now deletes only hashes "owned" exclusively by active documents in the target collection, i.e. no active document in any other collection references the same hash. Shared hashes are left in place because their vectors are still valid for the sibling collections. - Inactive rows in other collections do not block deletion (they do not need vector service); the check is active-only. - New regression test `generateEmbeddings with force + collection preserves shared hashes`: two collections each contain a document with identical content plus one unique document; force-re-embed of alpha must leave the shared hash's vector present in `content_vectors`.

lukeboyett · 2026-04-14T13:17:37Z

Addressed in 532fae4: clearAllEmbeddings(db, collection) now only removes hashes whose only active references are in the target collection — hashes shared with active docs in other collections are left alone, because content_vectors is keyed globally by content hash. New test generateEmbeddings with force + collection preserves shared hashes covers identical-body docs across two collections.

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 532fae4f96

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-14T13:23:50Z

+        const embedCollectionOpt = cli.opts.collection;
+        const embedCollection = Array.isArray(embedCollectionOpt)
+          ? embedCollectionOpt[0]
+          : embedCollectionOpt;
        await vectorIndex(DEFAULT_EMBED_MODEL_URI, !!cli.values.force, {


Validate embed collection names before dispatch

The embed handler now forwards cli.opts.collection directly, but unlike search/query flows it never runs resolveCollectionFilter, so a typo or nonexistent collection silently becomes a no-op: getHashesNeedingEmbedding(db, badName) returns 0 and the command reports success while leaving pending docs untouched. This is a behavior regression for -c users because invalid collection input is no longer rejected consistently across commands.

Useful? React with 👍 / 👎.

Codex P2 feedback on #2: the embed handler forwarded `cli.opts.collection` directly, unlike search/query which funnel through `resolveCollectionFilter`. A typo or nonexistent name was accepted silently — `getHashesNeedingEmbedding(db, badName)` returns 0 and the command reported success with no work done, masking the user's mistake. - Route `cli.opts.collection` through `resolveCollectionFilter(..., false)` so an unknown collection errors with the same "Collection not found: <name>" behavior as other commands. - embed targets a single collection; only the first validated value is used. When no `-c` is passed, the filter returns `[]` and embed runs its normal global pass.

lukeboyett · 2026-04-14T13:52:22Z

Two pieces of feedback on 532fae4:

P2 "Validate embed collection names before dispatch" — addressed in 54ab824. Embed now routes -c through resolveCollectionFilter(..., false), so qmd embed -c alhpa errors with "Collection not found: alhpa" instead of silently reporting success. Verified against the local repro.

P1 "Avoid deleting hashes still used outside target collection" — I believe this is already handled as of 532fae4 and may be a re-flag of the earlier concern. The delete now uses exclusiveHashesQuery:

SELECT DISTINCT d.hash
FROM documents d
WHERE d.collection = ? AND d.active = 1
  AND NOT EXISTS (
    SELECT 1 FROM documents d2
    WHERE d2.hash = d.hash
      AND d2.active = 1
      AND d2.collection != d.collection
  )

A hash active in the target AND active in any sibling fails the NOT EXISTS check and is skipped. If the hash is only inactive in siblings, the sibling's vector service isn't affected (inactive docs are excluded from search). Test generateEmbeddings with force + collection preserves shared hashes covers the common case: identical-body docs in alpha + beta, force-re-embed alpha, shared hash stays in content_vectors.

If I'm missing a case — e.g. hash only retained via inactive rows in the target while active in a sibling — happy to add a test and cover it. Can you reproduce a scenario where 532fae4 drops a hash that's still needed?

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54ab82422b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-14T13:57:33Z

+  if (!collection) {
+    db.exec(`DELETE FROM content_vectors`);
+    db.exec(`DROP TABLE IF EXISTS vectors_vec`);
+    return;


Rebuild vec table after scoped force clears all hashes

In clearAllEmbeddings(db, collection), only the unscoped branch drops vectors_vec; the scoped branch always preserves it. When the target collection is the only active collection (or owns all active hashes), a scoped force run can remove every content_vectors row but leave vectors_vec with the old dimension, so a subsequent generateEmbeddings({ force: true, collection, model }) with a different embedding size fails on dimension mismatch instead of rebuilding from scratch. This makes scoped force re-embed unreliable for model/dimension migrations.

Useful? React with 👍 / 👎.

Codex P2 on #2: `clearAllEmbeddings(db, collection)` preserved `vectors_vec` in the scoped branch. If the target collection is the only active collection (or owns every active hash), the scoped clear empties `content_vectors` but leaves `vectors_vec` bound to the old vec0 dimensions. A follow-up `generateEmbeddings({ force: true, collection, model: <different-dim> })` then fails on dimension mismatch instead of recreating the table. - After the scoped clear, check whether `content_vectors` is empty and drop `vectors_vec` if so. The next embed run recreates it with the current model's dimensions via the usual `ensureVecTable` initialization path, matching the unscoped branch. - New regression test: with a single active collection, scoped force re-embed leaves `vectors_vec` healthy and `content_vectors` populated post-rebuild.

lukeboyett · 2026-04-14T14:00:27Z

Two items on 54ab824:

P2 "Rebuild vec table after scoped force clears all hashes" — valid, addressed in 0b28ee0. After the scoped clear, if content_vectors is empty we now drop vectors_vec as well, so ensureVecTable recreates it with the current model's dimensions on the next embed. New test generateEmbeddings scoped force drops vectors_vec when no rows remain covers the empty-table path.

P1 "Avoid deleting hashes still used outside target collection" — I still believe this is already handled by the NOT EXISTS clause in exclusiveHashesQuery (hashes active in a sibling collection are skipped; inactive rows in siblings have no vector need). I asked for a concrete failing case in my previous comment — happy to add a test if one exists. Treating this as a re-flag of the resolved concern unless someone can point to a scenario where the current delete picks a hash that a sibling still needs.

@codex review

chatgpt-codex-connector · 2026-04-14T14:06:57Z

Codex Review: Didn't find any major issues. Chef's kiss.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector bot reviewed Apr 14, 2026

View reviewed changes

lukeboyett force-pushed the fix/embed-collection-filter branch from a7a91be to 5467198 Compare April 14, 2026 13:07

chatgpt-codex-connector bot reviewed Apr 14, 2026

View reviewed changes

This was referenced Apr 14, 2026

feat(llm): add optional OpenAI embedding backend #3

Open

perf(store): add opt-in TTL cache to getStatus for polling callers #4

Open

feat(bin): opt-in daemon fast-path for search and vsearch #5

Open

Conversation

lukeboyett commented Apr 14, 2026

Summary

Before

After

Why

Changes

Test plan

Notes

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

lukeboyett commented Apr 14, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

lukeboyett commented Apr 14, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

lukeboyett commented Apr 14, 2026

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

lukeboyett commented Apr 14, 2026

Uh oh!

chatgpt-codex-connector bot commented Apr 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant