Skip to content

fix(embed): honor -c <collection> by filtering pending docs#2

Open
lukeboyett wants to merge 5 commits intofix/db-transaction-typefrom
fix/embed-collection-filter
Open

fix(embed): honor -c <collection> by filtering pending docs#2
lukeboyett wants to merge 5 commits intofix/db-transaction-typefrom
fix/embed-collection-filter

Conversation

@lukeboyett
Copy link
Copy Markdown
Owner

Summary

`qmd embed -c ` accepted the `-c` flag but silently ignored it.

Before

Running `qmd embed -c alpha` against an index with two collections (alpha + beta, each with 3 pending docs) embeds all 6 documents across both collections. The flag is parsed but never threaded into the pending-doc query.

After

Running `qmd embed -c alpha` embeds only the 3 alpha docs. A subsequent `qmd embed -c beta` adds 3 beta vectors without disturbing alpha.

Why

Two upstream code paths silently dropped the `collection` value:

  1. `src/cli/qmd.ts` `case "embed":` never read `cli.opts.collection`.
  2. `src/store.ts` `getPendingEmbeddingDocs(db)` and `getHashesNeedingEmbedding(db)` had no parameter to filter on.

`EmbedOptions` also didn't carry a `collection` field, so even SDK users couldn't scope embedding to one collection.

Changes

  • Add optional `collection` to `EmbedOptions`.
  • Thread it through `generateEmbeddings` -> `getPendingEmbeddingDocs` with `AND d.collection = ?`.
  • Apply the same optional filter to `getHashesNeedingEmbedding` so the "nothing to do" pre-check reflects the requested scope.
  • CLI embed handler forwards `cli.opts.collection` (first value when the parser hands back an array) to `vectorIndex` -> `generateEmbeddings`.

Test plan

  • New regression test `test/store.test.ts`: two-collection fixture, `embed { collection: "alpha" }` embeds only alpha; follow-up `embed { collection: "beta" }` adds beta without re-touching alpha; a re-run with an already-embedded collection is a no-op.
  • `npm run build` passes.
  • Full CI test suite (other tests unaffected since the filter is opt-in).

Notes

  • Based on `fix/db-transaction-type` because a clean `main` currently fails to build; that baseline build fix is the parent PR.
  • Reproduced against a real QMD index before and after this change; numbers quoted in Summary are from that repro.

`qmd embed -c <collection>` accepted the flag but ignored it: the CLI
handler never read cli.opts.collection, and getPendingEmbeddingDocs
selected all unembedded documents regardless of collection. A user
asking to embed a single collection would end up embedding every
pending document in the index.

Changes:
- Add optional `collection` to `EmbedOptions` and thread it through
  `generateEmbeddings` -> `getPendingEmbeddingDocs` with an added
  `AND d.collection = ?` filter.
- Add the same optional filter to `getHashesNeedingEmbedding` so the
  "nothing to do" pre-check reflects the requested scope.
- CLI embed handler forwards `cli.opts.collection` (first value if the
  parser hands back an array) to `vectorIndex` -> `generateEmbeddings`.

Test:
- New `test/store.test.ts` regression covering a two-collection fixture:
  `embed { collection: "alpha" }` embeds only alpha docs; a subsequent
  `embed { collection: "beta" }` adds beta without disturbing alpha.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3e87ed340

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/store.ts Outdated
Comment on lines +1427 to +1430
clearAllEmbeddings(db);
}

const docsToEmbed = getPendingEmbeddingDocs(db);
const docsToEmbed = getPendingEmbeddingDocs(db, options?.collection);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restrict forced re-embed to selected collection

When generateEmbeddings is called with both force and collection, it still executes clearAllEmbeddings(db) before applying the new collection filter. In a multi-collection index, qmd embed -f -c alpha will delete vectors for every collection and then only rebuild alpha hashes from getPendingEmbeddingDocs(..., options?.collection), leaving other collections unembedded until a later full embed run.

Useful? React with 👍 / 👎.

lukeboyett added a commit that referenced this pull request Apr 14, 2026
Codex review on #2: when `generateEmbeddings` is called with both
`force: true` and `collection: "alpha"`, the original patch still
called `clearAllEmbeddings(db)` which wipes every collection's vectors
before the filtered re-embed runs — leaving every sibling collection
unembedded until a later full run.

- `clearAllEmbeddings` now accepts an optional `collection`. When set,
  it deletes only the rows in `content_vectors` whose hash belongs to
  a document in that collection, and removes the matching hash_seq
  entries from `vectors_vec` (vec0 virtual tables don't reliably accept
  IN-subquery DELETEs, so we enumerate first and delete per row.
   is preserved so other collections keep working.
-  forwards  to
   when  is set.
- New regression test covers the  case: embed both collections,
  then force-re-embed only alpha and assert beta's vector count is
  unchanged.
EOF
)
Codex review on #2: when generateEmbeddings is called with both
`force: true` and `collection: "alpha"`, the original patch still
called `clearAllEmbeddings(db)` which wipes every collection's
vectors before the filtered re-embed runs — leaving every sibling
collection unembedded until a later full run.

- `clearAllEmbeddings` now accepts an optional `collection`. When
  set, it deletes only the rows in `content_vectors` whose hash
  belongs to a document in that collection, and removes the matching
  `hash_seq` entries from `vectors_vec` (vec0 virtual tables do not
  reliably accept IN-subquery DELETEs, so we enumerate first and
  delete per row). `vectors_vec` is preserved so other collections
  keep working.
- `generateEmbeddings` forwards `options.collection` to
  `clearAllEmbeddings` when `force` is set.
- New regression test covers the `-f -c` case: embed both
  collections, then force-re-embed only alpha and assert beta's
  vector count is unchanged.
@lukeboyett lukeboyett force-pushed the fix/embed-collection-filter branch from a7a91be to 5467198 Compare April 14, 2026 13:07
@lukeboyett
Copy link
Copy Markdown
Owner Author

Addressed in 5467198: clearAllEmbeddings now takes an optional collection and the caller forwards it under force, so embed -f -c alpha no longer wipes beta's vectors. New test generateEmbeddings with force + collection only clears that collection covers the case.

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5467198a82

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/store.ts
Comment on lines +3264 to +3266
DELETE FROM content_vectors
WHERE hash IN (SELECT DISTINCT d.hash FROM documents d WHERE d.collection = ?)
`).run(collection);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid deleting hashes still used outside target collection

content_vectors is keyed globally by hash, but this delete removes every hash that appears in the requested collection without checking whether the same hash is still referenced by active documents in other collections. In the --force --collection path, generateEmbeddings then only re-embeds active docs from that one collection, so a hash shared with (or only retained via inactive rows in) the target collection can be removed and left missing for sibling collections, degrading vector search until another embed run repairs it.

Useful? React with 👍 / 👎.

Codex second-round review on #2: `content_vectors` is keyed globally
by a content hash. Two documents in different collections with
identical bodies share a single `content_vectors` row. The previous
patch deleted every row whose hash appears in the target collection,
which would remove vectors still referenced by active documents in
other collections — the filtered re-embed would not rebuild them.

- `clearAllEmbeddings(db, collection)` now deletes only hashes
  "owned" exclusively by active documents in the target collection,
  i.e. no active document in any other collection references the same
  hash. Shared hashes are left in place because their vectors are
  still valid for the sibling collections.
- Inactive rows in other collections do not block deletion (they do
  not need vector service); the check is active-only.
- New regression test `generateEmbeddings with force + collection
  preserves shared hashes`: two collections each contain a document
  with identical content plus one unique document; force-re-embed of
  alpha must leave the shared hash's vector present in
  `content_vectors`.
@lukeboyett
Copy link
Copy Markdown
Owner Author

Addressed in 532fae4: clearAllEmbeddings(db, collection) now only removes hashes whose only active references are in the target collection — hashes shared with active docs in other collections are left alone, because content_vectors is keyed globally by content hash. New test generateEmbeddings with force + collection preserves shared hashes covers identical-body docs across two collections.

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 532fae4f96

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/cli/qmd.ts Outdated
Comment on lines 3111 to 3115
const embedCollectionOpt = cli.opts.collection;
const embedCollection = Array.isArray(embedCollectionOpt)
? embedCollectionOpt[0]
: embedCollectionOpt;
await vectorIndex(DEFAULT_EMBED_MODEL_URI, !!cli.values.force, {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate embed collection names before dispatch

The embed handler now forwards cli.opts.collection directly, but unlike search/query flows it never runs resolveCollectionFilter, so a typo or nonexistent collection silently becomes a no-op: getHashesNeedingEmbedding(db, badName) returns 0 and the command reports success while leaving pending docs untouched. This is a behavior regression for -c users because invalid collection input is no longer rejected consistently across commands.

Useful? React with 👍 / 👎.

Codex P2 feedback on #2: the embed handler forwarded
`cli.opts.collection` directly, unlike search/query which funnel
through `resolveCollectionFilter`. A typo or nonexistent name was
accepted silently — `getHashesNeedingEmbedding(db, badName)` returns
0 and the command reported success with no work done, masking the
user's mistake.

- Route `cli.opts.collection` through `resolveCollectionFilter(..., false)`
  so an unknown collection errors with the same "Collection not found:
  <name>" behavior as other commands.
- embed targets a single collection; only the first validated value is
  used. When no `-c` is passed, the filter returns `[]` and embed runs
  its normal global pass.
@lukeboyett
Copy link
Copy Markdown
Owner Author

Two pieces of feedback on 532fae4:

P2 "Validate embed collection names before dispatch" — addressed in 54ab824. Embed now routes -c through resolveCollectionFilter(..., false), so qmd embed -c alhpa errors with "Collection not found: alhpa" instead of silently reporting success. Verified against the local repro.

P1 "Avoid deleting hashes still used outside target collection" — I believe this is already handled as of 532fae4 and may be a re-flag of the earlier concern. The delete now uses exclusiveHashesQuery:

SELECT DISTINCT d.hash
FROM documents d
WHERE d.collection = ? AND d.active = 1
  AND NOT EXISTS (
    SELECT 1 FROM documents d2
    WHERE d2.hash = d.hash
      AND d2.active = 1
      AND d2.collection != d.collection
  )

A hash active in the target AND active in any sibling fails the NOT EXISTS check and is skipped. If the hash is only inactive in siblings, the sibling's vector service isn't affected (inactive docs are excluded from search). Test generateEmbeddings with force + collection preserves shared hashes covers the common case: identical-body docs in alpha + beta, force-re-embed alpha, shared hash stays in content_vectors.

If I'm missing a case — e.g. hash only retained via inactive rows in the target while active in a sibling — happy to add a test and cover it. Can you reproduce a scenario where 532fae4 drops a hash that's still needed?

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54ab82422b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/store.ts
Comment on lines +3243 to +3246
if (!collection) {
db.exec(`DELETE FROM content_vectors`);
db.exec(`DROP TABLE IF EXISTS vectors_vec`);
return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Rebuild vec table after scoped force clears all hashes

In clearAllEmbeddings(db, collection), only the unscoped branch drops vectors_vec; the scoped branch always preserves it. When the target collection is the only active collection (or owns all active hashes), a scoped force run can remove every content_vectors row but leave vectors_vec with the old dimension, so a subsequent generateEmbeddings({ force: true, collection, model }) with a different embedding size fails on dimension mismatch instead of rebuilding from scratch. This makes scoped force re-embed unreliable for model/dimension migrations.

Useful? React with 👍 / 👎.

Codex P2 on #2: `clearAllEmbeddings(db, collection)` preserved
`vectors_vec` in the scoped branch. If the target collection is the
only active collection (or owns every active hash), the scoped clear
empties `content_vectors` but leaves `vectors_vec` bound to the old
vec0 dimensions. A follow-up
`generateEmbeddings({ force: true, collection, model: <different-dim> })`
then fails on dimension mismatch instead of recreating the table.

- After the scoped clear, check whether `content_vectors` is empty
  and drop `vectors_vec` if so. The next embed run recreates it with
  the current model's dimensions via the usual `ensureVecTable`
  initialization path, matching the unscoped branch.
- New regression test: with a single active collection, scoped force
  re-embed leaves `vectors_vec` healthy and `content_vectors`
  populated post-rebuild.
@lukeboyett
Copy link
Copy Markdown
Owner Author

Two items on 54ab824:

P2 "Rebuild vec table after scoped force clears all hashes" — valid, addressed in 0b28ee0. After the scoped clear, if content_vectors is empty we now drop vectors_vec as well, so ensureVecTable recreates it with the current model's dimensions on the next embed. New test generateEmbeddings scoped force drops vectors_vec when no rows remain covers the empty-table path.

P1 "Avoid deleting hashes still used outside target collection" — I still believe this is already handled by the NOT EXISTS clause in exclusiveHashesQuery (hashes active in a sibling collection are skipped; inactive rows in siblings have no vector need). I asked for a concrete failing case in my previous comment — happy to add a test if one exists. Treating this as a re-flag of the resolved concern unless someone can point to a scenario where the current delete picks a hash that a sibling still needs.

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Chef's kiss.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant