Skip to content

fix(api): vchord ANN — use cosine opclass and dispatch tuning GUCs per backend#1668

Open
isac322 wants to merge 2 commits into
vectorize-io:mainfrom
isac322:fix/vchord-cosine-opclass
Open

fix(api): vchord ANN — use cosine opclass and dispatch tuning GUCs per backend#1668
isac322 wants to merge 2 commits into
vectorize-io:mainfrom
isac322:fix/vchord-cosine-opclass

Conversation

@isac322
Copy link
Copy Markdown
Contributor

@isac322 isac322 commented May 20, 2026

Closes #1667.

This PR fixes two related vchord-only correctness bugs that together explain the high CPU on vchord deployments reported in #1667:

  1. Opclass mismatch (the hot one). _vector_index.py mapped vchord to vchordrq (embedding vector_l2_ops), but every ANN query in the engine uses <=> (cosine distance), which vector_l2_ops does not match. The planner ignored the index entirely and fell back to sequential scan + per-row cosine computation, so the production primary sat at ~870m CPU during retain.
  2. Tuning GUC mismatch. SET LOCAL hnsw.ef_search = 60 (and SET hnsw.ef_search = 200 at pool init) only exists in pgvector. On a vchord deployment both statements silently no-op'd, which meant the "recall-vs-latency" trade-off the code was designed around had never actually been applied to vchord backends.

What changes

Switch the vchord opclass to vector_cosine_ops:

  • _vector_index.py:_INDEX_USING_CLAUSES["vchord"] flipped to vector_cosine_ops.
  • The four historical migrations that create vchord indexes inline get the same flip so fresh installs end up on cosine ops from the first migration onward: 5a366d414dce, a4b5c6d7e8f9, d5e6f7a8b9c0, n9i0j1k2l3m4.
  • A new online migration b8c9d0e1f2a3_vchord_cosine_opclass rebuilds any existing vector_l2_ops vchordrq index with vector_cosine_ops via CREATE INDEX CONCURRENTLY + drop + rename. No-op when the configured extension is not vchord or when no L2-ops vchordrq indexes are found.

Dispatch ANN tuning GUCs by backend:

  • New helper _vector_index.ann_search_tuning_settings(ext, kind) returns (guc_name, value) pairs per backend. kind="low_latency" powers the retain-side ANN probe (was hnsw.ef_search = 60), kind="high_recall" powers the pool init (was hnsw.ef_search = 200).
  • Defaults: pgvector keeps hnsw.ef_search = 60 / 200; vchord uses vchordrq.probes = 10 / 30. pgvectorscale, pg_diskann, scann return no statements (no equivalent knob exposed today).
  • engine/retain/link_utils.py and engine/memory_engine.py switched to the dispatcher. Callers still own SET vs SET LOCAL since the two sites have different lifetime requirements.

Why

See #1667 for the full diagnosis. On the production cluster the same ANN query goes from 143 ms with <=> against vector_l2_ops (sequential scan, Buffers: shared hit=71531) to 25 ms with the same query body against a vector_cosine_ops index (Index Scan using idx_mu_emb_expr_<hash>, Buffers: shared hit=770 read=520) — ~5.6× per-query, and because the retain pipeline wraps this in CROSS JOIN LATERAL over _ann_seeds the penalty multiplies by seed count.

VectorChord's docs recommend vector_cosine_ops for cosine-similarity datasets; the engine's embeddings are L2-normalized in practice, so the existing <=> SQL is what we want. pgvector / pgvectorscale / pg_diskann were already on cosine ops, so this also restores cross-backend consistency.

Test plan

  • uv run ruff check on the changed files — pass
  • uv run ty check hindsight_api/ — pass
  • uv run pytest tests/test_vector_index.py tests/test_link_utils.py tests/test_migration_shape.py — 119 passed (new regression assertions: test_index_using_clause_vchord_uses_cosine_ops, test_ann_search_tuning_settings_{pgvector,vchord,…}, and the existing test_uses_set_local_for_ann_tuning parametrized over pgvector/vchord)
  • Manual on a vchord cluster: run the new migration, verify pg_indexes shows vector_cosine_ops, confirm EXPLAIN now picks the vchordrq index for the retain ANN query, and observe Postgres CPU return to baseline.

Migration impact

For vchord deployments the new migration rebuilds each vchordrq index online with CREATE INDEX CONCURRENTLY. Disk usage doubles for each index during its rebuild; queries continue to run on the old index until the swap finishes. Downgrade rebuilds back to vector_l2_ops using the same online flow. After the migration completes, vchord pools start applying vchordrq.probes instead of the silently-ignored hnsw.ef_search.

…r backend

Closes vectorize-io#1667.

vchordrq operator classes are bound 1:1 to operators: vector_l2_ops only
matches `<->`, while every Hindsight ANN query uses `<=>` (cosine distance).
The previous vchord mapping used vector_l2_ops, so the planner ignored the
index entirely and fell back to a sequential scan + per-row cosine
computation. Separately, `SET LOCAL hnsw.ef_search = 60` (retain) and
`SET hnsw.ef_search = 200` (pool init) only exist in pgvector and silently
no-op'd under vchord, so the recall-vs-latency trade-off had never been
applied to vchord deployments at all.

This switches the vchord opclass to vector_cosine_ops (matching the
engine's `<=>` queries), updates the four historical migrations that
create vchord indexes inline so fresh installs land on cosine ops, and
adds an online migration that rebuilds any existing L2-ops vchordrq
indexes via CREATE INDEX CONCURRENTLY + drop + rename. Also introduces an
ann_search_tuning_settings dispatcher so link_utils and the pool init
pick the right GUC per backend (hnsw.ef_search for pgvector,
vchordrq.probes for vchord).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@isac322
Copy link
Copy Markdown
Contributor Author

isac322 commented May 20, 2026

The verify-generated-files failure on this PR is pre-existing on main and unrelated to the changes here. The job reports drift on:

  • hindsight-api-slim/hindsight_api/engine/providers/openai_compatible_llm.py
  • skills/hindsight-docs/references/developer/models.md
  • skills/hindsight-docs/references/faq.md

…which are LLM-provider and docs-generation outputs. This PR only touches _vector_index.py, two engine files (engine/retain/link_utils.py, engine/memory_engine.py), five Alembic migrations under hindsight_api/alembic/versions/, and tests/test_vector_index.py + tests/test_link_utils.py — none of which are inputs to those generators.

Happy to rebase once main is regenerated, or to include the regenerated files in this PR if that's preferable. Let me know which direction you'd prefer.

Copy link
Copy Markdown

@elliotclee elliotclee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for noticing this issue and submitting the patch. Just one small change requested below...



def _configured_vector_extension() -> str:
return os.getenv("HINDSIGHT_API_VECTOR_EXTENSION", "pgvector").lower()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a function (such as this one) in the main hindsight codebase that all places call to find out the configured vector extension. I don't think an alembic migration file is the right place to host such a function, and there are a few places in this patchset that should be using a function like this but are doing the getenv() call directly instead.

This matters because the method of accessing the configuration setting could concievably change in the future, but more importantly because the default value needs to be configurable from one spot.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in c4011bc. New configured_vector_extension() lives in hindsight_api/_vector_index.py and both runtime call sites (link_utils.py, memory_engine.py) and this migration now call it instead of doing os.getenv() directly.

One thing I wanted to flag before merging: hindsight has a deliberate test (test_alembic_vector_migrations_freeze_vector_sql_locally) that explicitly forbids the existing vchord-related migrations from importing hindsight_api._vector_index. The intent reads as "each historical migration's vector SQL must be frozen so it can't drift later if _vector_index.py is changed". The four legacy migrations (5a366d414dce, a4b5c6d7e8f9, d5e6f7a8b9c0, n9i0j1k2l3m4) each keep their own inline _vector_index_using_clause() helper for the same reason.

For this PR I narrowed that frozen-state list to those four legacy migrations and let the new b8c9d0e1f2a3 opt into the shared helper, since that matches what you asked for. But I can imagine three ways to draw the line:

  1. Current PR — only the new migration uses the shared helper; legacy 4 stay frozen. ("Don't change history, do use the shared helper going forward.")
  2. Strict frozen-state — revert the new migration to its own inline helper too, so all five vchord migrations look identical. The runtime call-site dedup (link_utils.py, memory_engine.py) still stands.
  3. Drop the frozen-state policy entirely — port the legacy four to the shared helper too and delete the freeze test.

If (2) is what you'd prefer (i.e. you'd like the new migration to stay frozen too), happy to swap. (3) is a bigger policy call that probably belongs in a separate PR if it's worth doing at all. Which direction would you like?

# value tuned for top-50 semantic link creation (lower recall but much
# lower latency than the recall-side default). SET LOCAL auto-reverts
# at commit, so we don't pollute the pool for subsequent queries.
ext = os.getenv("HINDSIGHT_API_VECTOR_EXTENSION", "pgvector").lower()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in c4011bc — see reply on the first thread for the trade-off discussion around the frozen-state test.

# returns the right one for the configured extension, tuned for
# the higher recall the per-fact_type semantic queries in
# retrieve_semantic_bm25_combined() need.
ext = os.getenv("HINDSIGHT_API_VECTOR_EXTENSION", "pgvector").lower()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in c4011bc — see reply on the first thread for the trade-off discussion around the frozen-state test.

Per review on vectorize-io#1668: the env-var lookup that decides which vector backend
is configured was duplicated in three places (the new migration plus the
two runtime call sites in engine/retain/link_utils.py and
engine/memory_engine.py). Centralize the read + validation in
hindsight_api._vector_index.configured_vector_extension() so the default
value and the access mechanism live in one spot.

The new migration b8c9d0e1f2a3_vchord_cosine_opclass now imports the
shared helper instead of inlining its own. The four legacy vchord
migrations stay frozen (they keep their inline helpers); the frozen-state
test is narrowed to that legacy set so future vchord migrations can opt
into the shared helper on a per-migration basis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@elliotclee elliotclee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Copy link
Copy Markdown
Collaborator

@nicoloboschi nicoloboschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice diagnosis and a well-scoped fix — the EXPLAIN evidence in the description (143ms seq scan → 25ms index scan) makes the root cause unambiguous, and the per-backend dispatcher is a cleaner abstraction than the previous hnsw.ef_search-everywhere assumption. Particularly liked:

  • Flipping the four historical migrations to vector_cosine_ops so fresh vchord installs don't bootstrap-then-rebuild.
  • configured_vector_extension() centralizing the env lookup instead of scattering os.getenv calls.
  • Parametrized test_uses_set_local_for_ann_tuning over pgvector/vchord, and the explicit empty-tuple assertions for backends without an equivalent knob.
  • Renaming changed_migrationsfrozen_migrations in test_alembic_vector_migrations_freeze_vector_sql_locally with a comment carve-out for the new migration — that's the right call.

Requesting changes on a few items before merge:

Must address

1. CREATE INDEX CONCURRENTLY failure-recovery in b8c9d0e1f2a3_vchord_cosine_opclass.py.
If _rebuild_vchordrq_indexes errors midway (disk pressure, lock conflict, etc.), Postgres leaves the partial index as INVALID. On re-run:

  • CREATE INDEX CONCURRENTLY IF NOT EXISTS skips the INVALID temp index (it exists).
  • DROP INDEX IF EXISTS drops the old original.
  • ALTER INDEX … RENAME promotes the INVALID temp index into the canonical name.

Result: a broken index masquerading as the original, and queries silently fall back to seq scan again — exactly the bug we're fixing. Suggest either (a) DROP INDEX IF EXISTS … __opclass_swap at the top of each loop iteration, or (b) querying pg_index.indisvalid after the CONCURRENTLY create and raising if false.

2. Schema-prefix inconsistency. The new migration uses context.config.get_main_option(\"target_schema\") or \"public\", while every other PG migration in the tree uses _pg_schema_prefix() which returns \"\" when unset. Align with the existing convention so behavior is the same when target_schema is not configured.

3. Manual vchord verification is unchecked. The test plan's last bullet (run on a real vchord cluster, verify pg_indexes shows vector_cosine_ops, confirm EXPLAIN picks the index, observe CPU return to baseline) is still [ ]. Given this is the whole point of the PR, please complete it before merge and post the result here — automated tests don't exercise the migration against a real vchord backend.

Nice to have

4. GUC values lack a citation. vchordrq.probes = 10 / 30 look heuristic compared to the pgvector 60 / 200 numbers that presumably have history. If the chosen values land on the wrong knee of the recall/latency curve, we'd be substituting one silent miscalibration for another. Either a quick benchmark note in the PR or a follow-up tuning task would be enough.

5. indexdef.replace(idx_name, temp_name, 1) relies on the index name being the first textual occurrence in pg_indexes.indexdef — true in practice for the canonical reconstructed form, but worth a one-line comment noting the assumption.

6. _init_connection exception swallowing. Now that the dispatcher returns an empty tuple for backends without a knob, the broad except Exception is doing less work than it used to. Tightening to asyncpg.exceptions.PostgresError would avoid masking real bugs.

Overall: solid fix, scoped tightly to vchord, with the rest of the backends provably unchanged. The CONCURRENTLY recovery gap is the only thing I'd block on.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

vchord backend: ANN queries cannot use vchordrq indexes due to opclass/operator mismatch (vector_l2_ops vs <=>)

3 participants