vchord backend: ANN queries cannot use vchordrq indexes due to opclass/operator mismatch (vector_l2_ops vs <=>)

## Summary

When Hindsight is deployed with the **vchord** vector backend (`HINDSIGHT_API_VECTOR_EXTENSION=vchord`), every ANN query in the codebase falls back to a sequential scan because the `vchordrq` indexes are built with `vector_l2_ops` but all application SQL uses the cosine-distance operator `<=>`. The opclass and the operator are not compatible in PostgreSQL, so the planner never picks the vector index, and the database CPU-walks every embedding in the partition for each query.

I observed this in production on a small dataset (~16k rows, 3072-dim embeddings, `tensorchord/cloudnative-vectorchord:18.3-1.1.1`): the primary Postgres pod sits at ~870m CPU because the retain pipeline's Phase 1 ANN link search (a `CROSS JOIN LATERAL` over `_ann_seeds`) is multiplying the slowdown by the seed count. Other backends (`pgvector`, `pgvectorscale`, `pg_diskann`) are unaffected because they map to `vector_cosine_ops` and match the `<=>` operator.

## Environment

- Hindsight API: `main` (verified on commit `bd86e7ea`, which includes the most recent merge as of the report date)
- Vector backend: `vchord` (`tensorchord/cloudnative-vectorchord:18.3-1.1.1`, extension `vchord 1.1.1`, `vector 0.8.2`)
- PostgreSQL: 18
- Deployment: CloudNativePG cluster, single primary
- Embedding dimension: 3072 (OpenAI `text-embedding-3-large`-style, L2-normalized; see evidence below)

## Symptoms

- `cnpg-*-1` primary Pod CPU stays at ~870m steady-state during retain operations.
- `pg_stat_activity` shows the Phase 1 ANN query (`engine/retain/link_utils.py`) running for **50+ seconds** with empty `wait_event` (pure CPU, not I/O- or lock-bound).
- The same shape of query is used in recall, link expansion, and reflect paths — they share the latency penalty but are less visible because they are not wrapped in a `CROSS JOIN LATERAL`.

## Root Cause

`vchordrq` operator classes are strictly bound 1:1 to operators ([VectorChord docs — Operator Classes](https://docs.vectorchord.ai/vectorchord/usage/indexing.html)):

| opclass | usable operator |
|---|---|
| `vector_l2_ops` | `<->` only |
| `vector_cosine_ops` | `<=>` only |
| `vector_ip_ops` | `<#>` only |

Hindsight's vchord index mapping is in `hindsight-api-slim/hindsight_api/_vector_index.py`:

```python
INDEX_USING: dict[str, str] = {
    "pgvector":      "USING hnsw    (embedding vector_cosine_ops)",
    "pgvectorscale": "USING diskann (embedding vector_cosine_ops) WITH (num_neighbors = 50)",
    "pg_diskann":    "USING diskann (embedding vector_cosine_ops) WITH (max_neighbors = 50)",
    "vchord":        "USING vchordrq (embedding vector_l2_ops)",   # ← only L2
}
```

…but the ANN SQL across the engine uses cosine distance:

```sql
-- hindsight-api-slim/hindsight_api/engine/retain/link_utils.py:737-744
SELECT mu.id,
       1 - (mu.embedding <=> s.emb_text::vector) AS similarity
FROM memory_units mu
WHERE mu.bank_id = $1 AND mu.fact_type = $2 AND mu.embedding IS NOT NULL
ORDER BY mu.embedding <=> s.emb_text::vector
LIMIT $3
```

Because `<=>` is not in the operator family of `vector_l2_ops`, the planner cannot use the vchordrq index and falls back to a sequential scan with a per-row cosine computation, sorted by `top-N heapsort`.
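To make the mismatch concrete, here is a minimal sketch that cross-checks each backend's index definition against the operator the application SQL uses. The two dicts are copied from the table and the mapping quoted above; `mismatched_backends` is a hypothetical helper for illustration, not something that exists in Hindsight.

```python
# Operator each opclass can serve, per the table above.
OPCLASS_OPERATOR = {
    "vector_l2_ops": "<->",
    "vector_cosine_ops": "<=>",
    "vector_ip_ops": "<#>",
}

# Index definitions as quoted from _vector_index.py.
INDEX_USING = {
    "pgvector": "USING hnsw    (embedding vector_cosine_ops)",
    "pgvectorscale": "USING diskann (embedding vector_cosine_ops) WITH (num_neighbors = 50)",
    "pg_diskann": "USING diskann (embedding vector_cosine_ops) WITH (max_neighbors = 50)",
    "vchord": "USING vchordrq (embedding vector_l2_ops)",
}


def mismatched_backends(query_operator: str) -> list[str]:
    """Return backends whose index opclass cannot serve `query_operator`."""
    bad = []
    for backend, using in INDEX_USING.items():
        # Find which known opclass name appears in the index definition.
        opclass = next(name for name in OPCLASS_OPERATOR if name in using)
        if OPCLASS_OPERATOR[opclass] != query_operator:
            bad.append(backend)
    return bad


print(mismatched_backends("<=>"))  # ['vchord']
```

A check of this shape could run as a unit test so that a future backend entry cannot silently drift away from the operator used in the queries.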

## Evidence (production cluster, 2927-row partition)

`opencode::plugbear` / `fact_type = 'experience'` (2927 rows of 3072-dim embeddings):

**Current query (`<=>`, cosine):**

```
Limit  (... actual time=142.418..142.443 rows=50.00)
  Buffers: shared hit=71531 read=1
  -> Sort  (... actual time=142.415..142.426 rows=50.00)
       Sort Key: (mu.embedding <=> ...)
       Sort Method: top-N heapsort  Memory: 31kB
       -> Seq Scan on memory_units mu  (... actual time=2.251..141.358 rows=2927.00)
            Filter: (embedding IS NOT NULL AND bank_id = '...' AND fact_type = 'experience')
            Rows Removed by Filter: 13538
            Buffers: shared hit=71528 read=1
Planning Time: 15.001 ms
Execution Time: 143.828 ms
```

**Same data, equivalent query rewritten with `<->` (L2) — index is used:**

```
Limit  (... actual time=15.993..22.968 rows=50.00)
  Buffers: shared hit=770 read=520
  -> Index Scan using idx_mu_emb_expr_<hash> on memory_units mu
       Order By: (embedding <-> ...)
Planning Time: 16.354 ms
Execution Time: 25.646 ms
```

**~5.6× single-query difference**, and because the retain pipeline wraps this in `CROSS JOIN LATERAL` over a temp table of seeds, the penalty multiplies by seed count. A run with ~350 seeds takes ~50 s with the current opclass; the same workload on a matching opclass would finish in ~9 s.
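The figures above follow from simple multiplication. The sketch below reproduces them, assuming each seed pays the measured single-query execution time exactly once and ignoring planning time and cache effects, so it is an estimate rather than a measurement.

```python
SEQ_SCAN_MS = 143.828   # Execution Time of the `<=>` plan above
INDEX_SCAN_MS = 25.646  # Execution Time of the `<->` plan above
SEEDS = 350             # approximate seed count from the observed run


def lateral_total_seconds(per_seed_ms: float, seeds: int) -> float:
    """Total time if every seed pays the single-query cost once."""
    return per_seed_ms * seeds / 1000.0


speedup = SEQ_SCAN_MS / INDEX_SCAN_MS
print(f"speedup:    {speedup:.1f}x")                                       # 5.6x
print(f"seq scan:   {lateral_total_seconds(SEQ_SCAN_MS, SEEDS):.1f} s")    # 50.3 s
print(f"index scan: {lateral_total_seconds(INDEX_SCAN_MS, SEEDS):.1f} s")  # 9.0 s
```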

Embedding norm statistics (verifies that L2 and cosine are monotonically equivalent for this corpus, so the fix is mathematically safe to discuss in either direction):

```
min_norm   avg_norm   max_norm   stddev_norm
0.999316   0.999998   1.000723   0.000197
```
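For reference, the equivalence rests on the following identity for unit-norm vectors, where $d_{\cos}(a, b) = 1 - \cos\theta$ is the value `<=>` returns and $\lVert a - b \rVert_2$ is the value `<->` returns:

$$
\lVert a - b \rVert_2^2
  = \lVert a \rVert_2^2 + \lVert b \rVert_2^2 - 2\, a \cdot b
  = 2 - 2\cos\theta
  = 2\, d_{\cos}(a, b)
  \qquad \text{when } \lVert a \rVert_2 = \lVert b \rVert_2 = 1 .
$$

Since $\lVert a - b \rVert_2 = \sqrt{2\, d_{\cos}(a, b)}$ is strictly increasing in $d_{\cos}$, both operators rank neighbours in the same order for exactly normalized embeddings. The norms measured above deviate from 1 by less than $10^{-3}$, so the identity holds only approximately here, and the two orderings could in principle differ between candidates whose distances are nearly tied.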

## Affected code

All call sites using `<=>` (every one of these hits the same fallback when the backend is vchord):

- `engine/retain/link_utils.py:737, 742` — **Phase 1 ANN link creation (hottest path)**
- `engine/search/retrieval.py:410, 411, 415, 517, 532` — semantic recall
- `engine/search/link_expansion_retrieval.py:86, 91, 95` — graph link expansion
- `engine/reflect/tools.py:86, 89` — reflect tool retrieval
- `engine/sql/postgresql.py:157, 164, 168` — SQL helper used to compose other queries

## Secondary finding: `SET LOCAL hnsw.ef_search` is a no-op under vchord

`engine/retain/link_utils.py:711-713`:

```python
# Default 400 is tuned for recall precision but at 164k units each HNSW probe
# takes 94ms. ef_search=60 gives 2.7ms per probe (35x faster) ...
await conn.execute("SET LOCAL hnsw.ef_search = 60")
```

The `hnsw.ef_search` GUC belongs to pgvector's HNSW index access method. `vchordrq` does not read it, so the `SET LOCAL` has no effect when a `vchordrq` index is in use. The intended "recall-vs-latency" tuning has therefore never been applied to vchord deployments. The equivalent vchord dials are `vchordrq.probes` and `vchordrq.epsilon` (see [Fallback Parameters](https://docs.vectorchord.ai/vectorchord/usage/fallback-parameters.html)).

## Why this only affects vchord users

The other three backends (`pgvector`, `pgvectorscale`, `pg_diskann`) all map to `vector_cosine_ops` in `_vector_index.py`, so their indexes match the `<=>` operator and the same code paths perform as intended.

## What the fix probably looks like (for discussion)

VectorChord's documentation [explicitly recommends `vector_cosine_ops` for normalized/cosine embeddings](https://docs.vectorchord.ai/vectorchord/usage/indexing.html):

> "For most datasets using cosine similarity, enabling `residual_quantization` and `build.internal.spherical_centroids` improves both QPS and recall."

So the cleanest fix has two parts. First, switch the vchord mapping in `_vector_index.py` to `vector_cosine_ops`, optionally with `residual_quantization = true` and `build.internal.spherical_centroids = true`. Second, provide an Alembic migration that re-creates the existing vchordrq indexes with the new opclass: both the global one (`idx_memory_units_embedding_vchordrq`) and the many per-`(bank_id, fact_type)` partial indexes. Ideally the migration uses `CREATE INDEX CONCURRENTLY` and drops each old index once its replacement is built.
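As a starting point for that discussion, here is a minimal sketch of the SQL such a migration could emit for one index. `reindex_statements`, the `_cos` suffix, and the example predicate are hypothetical; only `idx_memory_units_embedding_vchordrq`, `memory_units`, and `embedding` come from this report. The index options mentioned above are left out because their exact syntax should be taken from the VectorChord documentation.

```python
from typing import Optional


def reindex_statements(old_name: str, predicate: Optional[str] = None) -> list[str]:
    """Build SQL that swaps one vchordrq index from vector_l2_ops to vector_cosine_ops.

    `old_name` and `predicate` are supplied by the caller. The `_cos` suffix
    is a hypothetical naming choice, not something Hindsight defines.
    """
    new_name = f"{old_name}_cos"
    where = f" WHERE {predicate}" if predicate else ""
    return [
        # Build the replacement first so queries always have some index available.
        f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {new_name} "
        f"ON memory_units USING vchordrq (embedding vector_cosine_ops){where}",
        # Drop the L2 index only after the cosine one exists.
        f"DROP INDEX CONCURRENTLY IF EXISTS {old_name}",
    ]


for stmt in reindex_statements("idx_memory_units_embedding_vchordrq"):
    print(stmt + ";")
```

`CREATE INDEX CONCURRENTLY` cannot run inside a transaction block, so in Alembic these statements would likely need to be executed inside `op.get_context().autocommit_block()`. For the partial indexes, the caller would pass each index's existing `WHERE` predicate unchanged.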

The secondary `hnsw.ef_search` issue can be fixed by dispatching the tuning GUC by backend (`vchordrq.probes` / `vchordrq.epsilon` for vchord, `hnsw.ef_search` for pgvector, etc.).
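One possible shape for that dispatch is sketched below. `TUNING_GUCS` and `tuning_statements` are hypothetical names. Only `hnsw.ef_search = 60` comes from the existing code; the `vchordrq.probes` and `vchordrq.epsilon` values are placeholders rather than recommendations, and `vchordrq.probes` in particular has to match how the index's lists were configured at build time.

```python
# Per-backend tuning GUCs as (name, SQL literal) pairs.
# The vchord values are illustrative placeholders and would need benchmarking.
TUNING_GUCS: dict[str, list[tuple[str, str]]] = {
    "pgvector": [("hnsw.ef_search", "60")],
    "vchord": [("vchordrq.probes", "'10'"), ("vchordrq.epsilon", "1.9")],
}


def tuning_statements(backend: str) -> list[str]:
    """Return the SET LOCAL statements to run for `backend` (empty if none are defined)."""
    return [
        f"SET LOCAL {name} = {value}"
        for name, value in TUNING_GUCS.get(backend, [])
    ]


for stmt in tuning_statements("vchord"):
    print(stmt)
```

The call site in `link_utils.py` could then loop over `tuning_statements(backend)` and pass each statement to `conn.execute`, instead of hard-coding the HNSW setting.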

I'm happy to send a PR along these lines — opening this issue first so the diagnosis is documented and so anyone hitting the same symptom can find it.

## Reproduction

1. Deploy Hindsight against PostgreSQL with the `vchord` extension and `HINDSIGHT_API_VECTOR_EXTENSION=vchord`.
2. Ingest enough memories that any single `(bank_id, fact_type)` partition contains a few thousand rows.
3. Run any retain (or recall) that triggers ANN search.
4. Observe Postgres CPU climb and the activity table show long-running queries with `<=>` and no `wait_event`.
5. `EXPLAIN (ANALYZE, BUFFERS)` shows `Seq Scan` over the partition with the vchordrq index ignored.
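Step 5 can be automated with a small check over the text output of `EXPLAIN`. The sketch below is a hypothetical helper that only looks for a `Seq Scan` node in the plan text; the two sample plans are abbreviated from the Evidence section.

```python
def uses_seq_scan(plan_text: str) -> bool:
    """True if any node in EXPLAIN text output is a sequential scan."""
    return any("Seq Scan on" in line for line in plan_text.splitlines())


# Abbreviated from the `<=>` plan in the Evidence section.
COSINE_PLAN = """\
Limit
  -> Sort
       Sort Key: (mu.embedding <=> ...)
       -> Seq Scan on memory_units mu
"""

# Abbreviated from the `<->` plan in the Evidence section.
L2_PLAN = """\
Limit
  -> Index Scan using idx_mu_emb_expr_<hash> on memory_units mu
       Order By: (embedding <-> ...)
"""

print(uses_seq_scan(COSINE_PLAN))  # True
print(uses_seq_scan(L2_PLAN))      # False
```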
