fix(api): vchord ANN — use cosine opclass and dispatch tuning GUCs per backend#1668
fix(api): vchord ANN — use cosine opclass and dispatch tuning GUCs per backend#1668isac322 wants to merge 2 commits into
Conversation
…r backend Closes vectorize-io#1667. vchordrq operator classes are bound 1:1 to operators: vector_l2_ops only matches `<->`, while every Hindsight ANN query uses `<=>` (cosine distance). The previous vchord mapping used vector_l2_ops, so the planner ignored the index entirely and fell back to a sequential scan + per-row cosine computation. Separately, `SET LOCAL hnsw.ef_search = 60` (retain) and `SET hnsw.ef_search = 200` (pool init) only exist in pgvector and silently no-op'd under vchord, so the recall-vs-latency trade-off had never been applied to vchord deployments at all. This switches the vchord opclass to vector_cosine_ops (matching the engine's `<=>` queries), updates the four historical migrations that create vchord indexes inline so fresh installs land on cosine ops, and adds an online migration that rebuilds any existing L2-ops vchordrq indexes via CREATE INDEX CONCURRENTLY + drop + rename. Also introduces an ann_search_tuning_settings dispatcher so link_utils and the pool init pick the right GUC per backend (hnsw.ef_search for pgvector, vchordrq.probes for vchord). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
The
…which are LLM-provider and docs-generation outputs. This PR only touches Happy to rebase once |
elliotclee
left a comment
There was a problem hiding this comment.
Thank you for noticing this issue and submitting the patch. Just one small change requested below...
|
|
||
|
|
||
| def _configured_vector_extension() -> str: | ||
| return os.getenv("HINDSIGHT_API_VECTOR_EXTENSION", "pgvector").lower() |
There was a problem hiding this comment.
There should be a function (such as this one) in the main hindsight codebase that all places call to find out the configured vector extension. I don't think an alembic migration file is the right place to host such a function, and there are a few places in this patchset that should be using a function like this but are doing the getenv() call directly instead.
This matters because the method of accessing the configuration setting could concievably change in the future, but more importantly because the default value needs to be configurable from one spot.
There was a problem hiding this comment.
Done in c4011bc. New configured_vector_extension() lives in hindsight_api/_vector_index.py and both runtime call sites (link_utils.py, memory_engine.py) and this migration now call it instead of doing os.getenv() directly.
One thing I wanted to flag before merging: hindsight has a deliberate test (test_alembic_vector_migrations_freeze_vector_sql_locally) that explicitly forbids the existing vchord-related migrations from importing hindsight_api._vector_index. The intent reads as "each historical migration's vector SQL must be frozen so it can't drift later if _vector_index.py is changed". The four legacy migrations (5a366d414dce, a4b5c6d7e8f9, d5e6f7a8b9c0, n9i0j1k2l3m4) each keep their own inline _vector_index_using_clause() helper for the same reason.
For this PR I narrowed that frozen-state list to those four legacy migrations and let the new b8c9d0e1f2a3 opt into the shared helper, since that matches what you asked for. But I can imagine three ways to draw the line:
- Current PR — only the new migration uses the shared helper; legacy 4 stay frozen. ("Don't change history, do use the shared helper going forward.")
- Strict frozen-state — revert the new migration to its own inline helper too, so all five vchord migrations look identical. The runtime call-site dedup (
link_utils.py,memory_engine.py) still stands. - Drop the frozen-state policy entirely — port the legacy four to the shared helper too and delete the freeze test.
If (2) is what you'd prefer (i.e. you'd like the new migration to stay frozen too), happy to swap. (3) is a bigger policy call that probably belongs in a separate PR if it's worth doing at all. Which direction would you like?
| # value tuned for top-50 semantic link creation (lower recall but much | ||
| # lower latency than the recall-side default). SET LOCAL auto-reverts | ||
| # at commit, so we don't pollute the pool for subsequent queries. | ||
| ext = os.getenv("HINDSIGHT_API_VECTOR_EXTENSION", "pgvector").lower() |
There was a problem hiding this comment.
Addressed in c4011bc — see reply on the first thread for the trade-off discussion around the frozen-state test.
| # returns the right one for the configured extension, tuned for | ||
| # the higher recall the per-fact_type semantic queries in | ||
| # retrieve_semantic_bm25_combined() need. | ||
| ext = os.getenv("HINDSIGHT_API_VECTOR_EXTENSION", "pgvector").lower() |
There was a problem hiding this comment.
Addressed in c4011bc — see reply on the first thread for the trade-off discussion around the frozen-state test.
Per review on vectorize-io#1668: the env-var lookup that decides which vector backend is configured was duplicated in three places (the new migration plus the two runtime call sites in engine/retain/link_utils.py and engine/memory_engine.py). Centralize the read + validation in hindsight_api._vector_index.configured_vector_extension() so the default value and the access mechanism live in one spot. The new migration b8c9d0e1f2a3_vchord_cosine_opclass now imports the shared helper instead of inlining its own. The four legacy vchord migrations stay frozen (they keep their inline helpers); the frozen-state test is narrowed to that legacy set so future vchord migrations can opt into the shared helper on a per-migration basis. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nicoloboschi
left a comment
There was a problem hiding this comment.
Nice diagnosis and a well-scoped fix — the EXPLAIN evidence in the description (143ms seq scan → 25ms index scan) makes the root cause unambiguous, and the per-backend dispatcher is a cleaner abstraction than the previous hnsw.ef_search-everywhere assumption. Particularly liked:
- Flipping the four historical migrations to
vector_cosine_opsso fresh vchord installs don't bootstrap-then-rebuild. configured_vector_extension()centralizing the env lookup instead of scatteringos.getenvcalls.- Parametrized
test_uses_set_local_for_ann_tuningover pgvector/vchord, and the explicit empty-tuple assertions for backends without an equivalent knob. - Renaming
changed_migrations→frozen_migrationsintest_alembic_vector_migrations_freeze_vector_sql_locallywith a comment carve-out for the new migration — that's the right call.
Requesting changes on a few items before merge:
Must address
1. CREATE INDEX CONCURRENTLY failure-recovery in b8c9d0e1f2a3_vchord_cosine_opclass.py.
If _rebuild_vchordrq_indexes errors midway (disk pressure, lock conflict, etc.), Postgres leaves the partial index as INVALID. On re-run:
CREATE INDEX CONCURRENTLY IF NOT EXISTSskips the INVALID temp index (it exists).DROP INDEX IF EXISTSdrops the old original.ALTER INDEX … RENAMEpromotes the INVALID temp index into the canonical name.
Result: a broken index masquerading as the original, and queries silently fall back to seq scan again — exactly the bug we're fixing. Suggest either (a) DROP INDEX IF EXISTS … __opclass_swap at the top of each loop iteration, or (b) querying pg_index.indisvalid after the CONCURRENTLY create and raising if false.
2. Schema-prefix inconsistency. The new migration uses context.config.get_main_option(\"target_schema\") or \"public\", while every other PG migration in the tree uses _pg_schema_prefix() which returns \"\" when unset. Align with the existing convention so behavior is the same when target_schema is not configured.
3. Manual vchord verification is unchecked. The test plan's last bullet (run on a real vchord cluster, verify pg_indexes shows vector_cosine_ops, confirm EXPLAIN picks the index, observe CPU return to baseline) is still [ ]. Given this is the whole point of the PR, please complete it before merge and post the result here — automated tests don't exercise the migration against a real vchord backend.
Nice to have
4. GUC values lack a citation. vchordrq.probes = 10 / 30 look heuristic compared to the pgvector 60 / 200 numbers that presumably have history. If the chosen values land on the wrong knee of the recall/latency curve, we'd be substituting one silent miscalibration for another. Either a quick benchmark note in the PR or a follow-up tuning task would be enough.
5. indexdef.replace(idx_name, temp_name, 1) relies on the index name being the first textual occurrence in pg_indexes.indexdef — true in practice for the canonical reconstructed form, but worth a one-line comment noting the assumption.
6. _init_connection exception swallowing. Now that the dispatcher returns an empty tuple for backends without a knob, the broad except Exception is doing less work than it used to. Tightening to asyncpg.exceptions.PostgresError would avoid masking real bugs.
Overall: solid fix, scoped tightly to vchord, with the rest of the backends provably unchanged. The CONCURRENTLY recovery gap is the only thing I'd block on.
Closes #1667.
This PR fixes two related vchord-only correctness bugs that together explain the high CPU on vchord deployments reported in #1667:
_vector_index.pymapped vchord tovchordrq (embedding vector_l2_ops), but every ANN query in the engine uses<=>(cosine distance), whichvector_l2_opsdoes not match. The planner ignored the index entirely and fell back to sequential scan + per-row cosine computation, so the production primary sat at ~870m CPU during retain.SET LOCAL hnsw.ef_search = 60(andSET hnsw.ef_search = 200at pool init) only exists in pgvector. On a vchord deployment both statements silently no-op'd, which meant the "recall-vs-latency" trade-off the code was designed around had never actually been applied to vchord backends.What changes
Switch the vchord opclass to
vector_cosine_ops:_vector_index.py:_INDEX_USING_CLAUSES["vchord"]flipped tovector_cosine_ops.5a366d414dce,a4b5c6d7e8f9,d5e6f7a8b9c0,n9i0j1k2l3m4.b8c9d0e1f2a3_vchord_cosine_opclassrebuilds any existingvector_l2_opsvchordrq index withvector_cosine_opsviaCREATE INDEX CONCURRENTLY+ drop + rename. No-op when the configured extension is not vchord or when no L2-ops vchordrq indexes are found.Dispatch ANN tuning GUCs by backend:
_vector_index.ann_search_tuning_settings(ext, kind)returns(guc_name, value)pairs per backend.kind="low_latency"powers the retain-side ANN probe (washnsw.ef_search = 60),kind="high_recall"powers the pool init (washnsw.ef_search = 200).hnsw.ef_search = 60 / 200; vchord usesvchordrq.probes = 10 / 30. pgvectorscale, pg_diskann, scann return no statements (no equivalent knob exposed today).engine/retain/link_utils.pyandengine/memory_engine.pyswitched to the dispatcher. Callers still ownSETvsSET LOCALsince the two sites have different lifetime requirements.Why
See #1667 for the full diagnosis. On the production cluster the same ANN query goes from 143 ms with
<=>againstvector_l2_ops(sequential scan,Buffers: shared hit=71531) to 25 ms with the same query body against avector_cosine_opsindex (Index Scan using idx_mu_emb_expr_<hash>,Buffers: shared hit=770 read=520) — ~5.6× per-query, and because the retain pipeline wraps this inCROSS JOIN LATERALover_ann_seedsthe penalty multiplies by seed count.VectorChord's docs recommend
vector_cosine_opsfor cosine-similarity datasets; the engine's embeddings are L2-normalized in practice, so the existing<=>SQL is what we want. pgvector / pgvectorscale / pg_diskann were already on cosine ops, so this also restores cross-backend consistency.Test plan
uv run ruff checkon the changed files — passuv run ty check hindsight_api/— passuv run pytest tests/test_vector_index.py tests/test_link_utils.py tests/test_migration_shape.py— 119 passed (new regression assertions:test_index_using_clause_vchord_uses_cosine_ops,test_ann_search_tuning_settings_{pgvector,vchord,…}, and the existingtest_uses_set_local_for_ann_tuningparametrized over pgvector/vchord)pg_indexesshowsvector_cosine_ops, confirmEXPLAINnow picks the vchordrq index for the retain ANN query, and observe Postgres CPU return to baseline.Migration impact
For vchord deployments the new migration rebuilds each vchordrq index online with
CREATE INDEX CONCURRENTLY. Disk usage doubles for each index during its rebuild; queries continue to run on the old index until the swap finishes. Downgrade rebuilds back tovector_l2_opsusing the same online flow. After the migration completes, vchord pools start applyingvchordrq.probesinstead of the silently-ignoredhnsw.ef_search.