
Conversation


@tjgreen42 tjgreen42 commented Jan 1, 2026

Summary

  • Implement Block-Max WAND (BMW) for top-k retrieval using block-level upper bounds in V2 segments
  • Add a top-k min-heap with O(1) threshold access and O(log k) updates for efficient result tracking (sketched below)
  • Add a BMW fast path for single-term queries (skip blocks that can't contribute to the top-k)
  • Add a BMW fast path for multi-term queries with WAND-style doc-ID ordered traversal
  • Batch doc_freq lookups to reduce segment open/close overhead for multi-term queries
  • Add GUC variables pg_textsearch.enable_bmw and log_bmw_stats for debugging/benchmarking
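
For reference, the heap behind the O(1)/O(log k) claim can be pictured as a fixed-capacity min-heap keyed on score: the root holds the current k-th best score, so reading the pruning threshold is one array access and replacing the root is one sift-down. The sketch below is illustrative only (these are not the actual bmw.h names, and tie-breaking on CTID is omitted):

```c
/* Minimal top-k min-heap sketch (illustrative; not the actual bmw.h API).
 * The root holds the k-th best score seen so far, so the pruning
 * threshold is an O(1) read and each replacement is an O(log k) sift. */
typedef struct { double score; unsigned long doc_id; } TopKEntry;
typedef struct { TopKEntry *items; int size; int capacity; } TopKHeap;

static double topk_threshold(const TopKHeap *h)
{
    /* Until the heap is full, nothing can be pruned. */
    return h->size < h->capacity ? 0.0 : h->items[0].score;
}

static void topk_sift_down(TopKHeap *h, int i)
{
    for (;;)
    {
        int l = 2 * i + 1, r = 2 * i + 2, m = i;
        if (l < h->size && h->items[l].score < h->items[m].score) m = l;
        if (r < h->size && h->items[r].score < h->items[m].score) m = r;
        if (m == i) break;
        TopKEntry tmp = h->items[i]; h->items[i] = h->items[m]; h->items[m] = tmp;
        i = m;
    }
}

static void topk_insert(TopKHeap *h, double score, unsigned long doc_id)
{
    if (h->size < h->capacity)
    {
        /* Sift up while we are still filling the heap. */
        int i = h->size++;
        h->items[i] = (TopKEntry){score, doc_id};
        while (i > 0 && h->items[(i - 1) / 2].score > h->items[i].score)
        {
            TopKEntry tmp = h->items[i];
            h->items[i] = h->items[(i - 1) / 2];
            h->items[(i - 1) / 2] = tmp;
            i = (i - 1) / 2;
        }
    }
    else if (score > h->items[0].score)
    {
        /* Evict the current k-th best and restore the heap property. */
        h->items[0] = (TopKEntry){score, doc_id};
        topk_sift_down(h, 0);
    }
}
```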

Performance (MS MARCO 8.8M docs, p50 latency)

| Query Length | pg_textsearch | System X | Result |
|---|---|---|---|
| 1 token | 10.14 ms | 18.05 ms | 1.8x faster |
| 2 tokens | 12.54 ms | 17.24 ms | 1.4x faster |
| 3 tokens | 15.14 ms | 22.94 ms | 1.5x faster |
| 4 tokens | 19.72 ms | 24.01 ms | 1.2x faster |
| 5 tokens | 25.79 ms | 26.21 ms | ~same |
| 6 tokens | 32.88 ms | 33.47 ms | ~same |
| 7 tokens | 42.09 ms | 32.23 ms | 1.3x slower |
| 8+ tokens | 63.41 ms | 39.17 ms | 1.6x slower |

Cranfield: 225 queries in 57ms (0.26 ms/query avg)

Note: These benchmarks establish baselines for pg_textsearch rather than a head-to-head comparison. System X ships with different defaults and tuning options, so further iteration on configurations is still required.

Implementation Details

New files:

  • src/query/bmw.h - Top-k heap, BMW stats, and scoring function interfaces
  • src/query/bmw.c - Min-heap implementation, block max score computation, single-term and multi-term BMW scoring

Key algorithm (a rough sketch follows the list):

  1. Compute block max BM25 score from skip entry metadata (block_max_tf, block_max_norm)
  2. Only score blocks where block_max_score >= current_threshold
  3. Update threshold as better results are found
  4. Memtable scored exhaustively (no skip index)
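
As an illustration of steps 1-2, a block's BM25 upper bound can be computed directly from the skip-entry fields and compared against the heap threshold. This is a hedged sketch with assumed field semantics (in particular, block_max_norm is assumed to be the precomputed length-normalisation value that maximises the score), reusing the TopKHeap sketch above:

```c
/* Sketch of steps 1-2 (illustrative names; the real fields live in the
 * V2 skip entries and the real loop in src/query/bmw.c). */
typedef struct
{
    float block_max_tf;      /* highest term frequency in the block */
    float block_max_norm;    /* assumed: the most favourable precomputed
                              * (1 - b + b * dl/avgdl) value in the block */
} SkipEntry;

/* Step 1: upper-bound BM25 contribution for one block of one term. */
static float
block_max_bm25(const SkipEntry *e, float idf, float k1)
{
    return idf * (e->block_max_tf * (k1 + 1.0f))
               / (e->block_max_tf + k1 * e->block_max_norm);
}

/* Step 2: a block is only decoded and scored when its upper bound can
 * still displace the current k-th best result. */
static int
block_can_contribute(const SkipEntry *e, float idf, float k1,
                     const TopKHeap *topk)
{
    return block_max_bm25(e, idf, k1) >= topk_threshold(topk);
}
```

Step 3 then falls out of the heap itself: every topk_insert() can raise topk_threshold(), so later blocks face a stricter bound.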

Multi-term optimization (sketched below):

  • Sort terms by IDF (highest first) for faster threshold convergence
  • WAND-style doc-ID ordered traversal across terms' posting lists
  • Batch doc_freq lookups: opens each segment once instead of once per term
  • Reduces segment opens from O(terms × segments) to O(segments)
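
The doc-ID ordered traversal can be pictured with the self-contained toy below, reusing the heap sketch above. Names and types are illustrative: the real code drives segment posting iterators and seeks whole blocks via skip entries, rather than holding postings in arrays and advancing one entry at a time.

```c
/* Toy sketch of WAND-style doc-ID ordered traversal (illustrative names).
 * Each term exposes a doc_id-sorted posting list plus per-block score
 * upper bounds; the pivot is the minimum current doc_id across terms. */
#include <limits.h>

#define BLOCK_SIZE 128          /* postings per block, as in the V2 format */

typedef struct { unsigned long doc_id; float score; } Posting;

typedef struct
{
    const Posting *postings;    /* sorted by doc_id */
    int count;
    int pos;                    /* current position; pos == count means exhausted */
    const float *block_max;     /* upper bound per block of BLOCK_SIZE postings */
} TermIter;

static unsigned long term_cur(const TermIter *t)
{
    return t->pos < t->count ? t->postings[t->pos].doc_id : ULONG_MAX;
}

static float term_block_max(const TermIter *t)
{
    return t->pos < t->count ? t->block_max[t->pos / BLOCK_SIZE] : 0.0f;
}

static void
score_multi_term(TermIter *terms, int nterms, TopKHeap *topk)
{
    for (;;)
    {
        /* Pivot = smallest current doc_id across all term iterators. */
        unsigned long pivot = ULONG_MAX;
        for (int t = 0; t < nterms; t++)
            if (term_cur(&terms[t]) < pivot)
                pivot = term_cur(&terms[t]);
        if (pivot == ULONG_MAX)
            break;                                   /* all exhausted */

        /* Sum the block-level upper bounds of the terms positioned on the
         * pivot; if they cannot beat the k-th best, skip scoring it. */
        float ub = 0.0f;
        for (int t = 0; t < nterms; t++)
            if (term_cur(&terms[t]) == pivot)
                ub += term_block_max(&terms[t]);

        if (ub >= topk_threshold(topk))
        {
            /* Accumulate the complete score from every matching term. */
            float score = 0.0f;
            for (int t = 0; t < nterms; t++)
                if (term_cur(&terms[t]) == pivot)
                    score += terms[t].postings[terms[t].pos].score;
            topk_insert(topk, score, pivot);
        }

        /* Advance every iterator sitting on the pivot. */
        for (int t = 0; t < nterms; t++)
            if (term_cur(&terms[t]) == pivot)
                terms[t].pos++;
    }
}
```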

Testing

  • All regression tests pass
  • Shell-based tests pass (concurrency, recovery, segment)
  • Results match exhaustive scoring path with correct tie-breaking

@tjgreen42 tjgreen42 force-pushed the feat/block-max-wand branch from e047e80 to 64a0270 on January 1, 2026 at 21:56
@tjgreen42 tjgreen42 changed the title Add Block-Max WAND (BMW) optimization for single-term queries Add Block-Max WAND (BMW) optimization for top-k queries Jan 2, 2026
@tjgreen42 tjgreen42 requested a review from Copilot January 5, 2026 23:49
Copilot AI left a comment

Pull request overview

This PR implements Block-Max WAND (BMW) optimization for top-k BM25 queries, enabling significant performance improvements through intelligent block skipping. The implementation adds specialized fast paths for single-term and multi-term (2-8 terms) queries that compute block-level upper bounds and skip blocks that cannot contribute to results.

Key changes:

  • Single-term BMW: Pre-computes block max scores from skip entry metadata and skips blocks below the top-k threshold
  • Multi-term BMW: Uses WAND-style block skipping with batched doc_freq lookups to minimize segment I/O
  • Exhaustive fallback: Queries with >8 terms continue using the existing exhaustive scoring path

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| src/query/bmw.h | Top-k min-heap interface, BMW statistics, and scoring function declarations |
| src/query/bmw.c | Complete BMW implementation with heap operations and block-level scoring |
| src/query/score.c | Integration of BMW fast paths with fallback to exhaustive scoring |
| src/segment/segment.h | Exported segment iterator and skip entry reader for BMW block access |
| src/segment/scan.c | Batch doc_freq lookup and iterator function exports |
| test/sql/bmw.sql | Comprehensive BMW test suite covering edge cases and correctness validation |
| test/expected/bmw.out | Expected test outputs |
| test/expected/segment.out | Updated for deterministic CTID ordering |
| test/expected/merge.out | Updated for deterministic CTID ordering |
| benchmarks/datasets/msmarco/queries.sql | Enhanced benchmarks using real MS MARCO queries |
| benchmarks/datasets/msmarco/parallel_query.pgbench | pgbench script for parallel query testing |
| benchmarks/datasets/msmarco/parallel_bench.sh | Parallel benchmark harness with scaling tests |
| benchmarks/datasets/msmarco/load.sql | Fixed CSV loading and query set handling |
| Makefile | Added BMW object files to build |
| .github/workflows/benchmark.yml | Integrated parallel benchmark into CI |


@tjgreen42

Addressed PR feedback in fc3fce5:

  1. test/sql/bmw.sql:298 - Fixed misleading comment about 'previous version'
  2. src/query/bmw.c:450 - Added justification for BMW_MAX_TERMS=8 threshold
  3. src/query/bmw.c:480 - Clarified DOC_ACCUM_HASH_SIZE comment

Note: Comments on parallel_bench.sh are moot - that file was removed (benchmark changes moved to separate PR).

tjgreen42 and others added 14 commits January 5, 2026 17:41
Implement early termination for top-k retrieval using block-level upper
bounds stored in V2 segments. This skips entire blocks (128 docs) that
cannot contribute to top-k results, improving query performance.

Changes:
- Add bmw.h/bmw.c with top-k min-heap and BMW scoring functions
- Export segment iterator functions for BMW block access
- Integrate BMW fast path in tp_score_documents() for single-term queries
- Fix test expected files for deterministic tie-breaking order

The BMW path computes block max scores from skip entry metadata and only
scores blocks where block_max_score >= current threshold. Memtable is
scored exhaustively (no skip index). Multi-term WAND to follow.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Extend BMW to handle queries with 2-8 terms. For each segment:
- Pre-compute block max scores for all query terms
- Skip blocks where sum of block_max_scores < threshold
- Use hash table to accumulate scores per document within blocks

The multi-term path uses the same top-k heap as single-term BMW.
Memtable is scored exhaustively since it lacks skip index metadata.
Falls back to exhaustive scoring for queries with >8 terms.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
V1 segments are no longer supported - all segments use the V2 format
with skip index metadata for block-max scoring.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Refactor code organization:
- Move bmw.c, bmw.h from src/segment/ to src/query/
- Move score.c, score.h from src/types/ to src/query/
- Update include paths across codebase
- Add comprehensive BMW test suite (test/sql/bmw.sql)

The query/ directory is a more appropriate location for query-time
optimization code (BMW, scoring) rather than segment/ which is for
storage format.
The posting iterator's tp_segment_posting_iterator_next() auto-advances
to subsequent blocks when the current block is exhausted. This caused
BMW to process ALL blocks once the first non-skipped block was entered,
defeating the entire block-skipping optimization.

Fix by:
1. Resetting iter.finished before each block so subsequent blocks can be
   processed
2. Breaking out of the inner while loop when the iterator advances past
   the current block, allowing the outer for loop to apply threshold
   checks to decide whether to process subsequent blocks

This restores BMW's ability to skip blocks whose block_max_score is
below the current top-k threshold.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
For multi-term queries, the previous implementation called
tp_get_unified_doc_freq() in a loop, which opened/closed each segment
once per term. With 5 terms and 6 segments, that's 30 segment
open/close cycles per query.

Add tp_batch_get_segment_doc_freq() which opens each segment once and
looks up all terms, reducing segment opens from O(terms * segments)
to O(segments).

This addresses the ~32% time spent in tp_segment_get_doc_freq seen in
profiling.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
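
The loop inversion behind tp_batch_get_segment_doc_freq() can be illustrated with a small standalone sketch; the open/lookup functions are stubs and only the loop structure matters here.

```c
/* Standalone illustration of batching doc_freq lookups per segment
 * (stubbed functions; not the real segment reader API). */
#include <stdio.h>

#define NTERMS 5
#define NSEGMENTS 6

static int opens;                                   /* counts segment opens */

static void open_segment(int s)  { (void) s; opens++; }
static void close_segment(int s) { (void) s; }
static int  segment_doc_freq(int s, int term) { (void) s; (void) term; return 1; }

int main(void)
{
    int doc_freq[NTERMS] = {0};

    /* Before: per-term lookups reopen every segment,
     * O(terms x segments) opens (5 x 6 = 30 here). */
    for (int t = 0; t < NTERMS; t++)
        for (int s = 0; s < NSEGMENTS; s++)
        {
            open_segment(s);
            doc_freq[t] += segment_doc_freq(s, t);
            close_segment(s);
        }
    printf("naive: %d opens\n", opens);

    /* After: look up all terms per open, O(segments) opens (6 here). */
    opens = 0;
    for (int s = 0; s < NSEGMENTS; s++)
    {
        open_segment(s);
        for (int t = 0; t < NTERMS; t++)
            doc_freq[t] += segment_doc_freq(s, t);
        close_segment(s);
    }
    printf("batched: %d opens\n", opens);
    return 0;
}
```
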
Instead of synthetic queries like "what is the capital of france",
use the actual MS-MARCO dev query set (10K queries) for more
representative benchmarking.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Use CREATE TABLE/INDEX IF NOT EXISTS to skip if already present
- Avoids rebuilding index on re-runs (saves 30-60 min locally)
- TRUNCATE only if data missing, not on every run
My idempotent changes broke data loading (0 passages loaded).
Reverting to the simple DROP/CREATE approach that works.
The $1 and $2 need backslash escaping to prevent shell expansion
when used inside psql's \copy FROM PROGRAM.
Adds pgbench-based parallel throughput testing:
- parallel_query.pgbench: pgbench script for BM25 queries
- parallel_bench.sh: Runs scaling test with 1-N clients (N = CPUs)
- benchmark.yml: Runs parallel benchmark after sequential queries

Local results (8.8M passages, release build):
- 1 client: 53 QPS (18.8 ms)
- 4 clients: 163 QPS (24.6 ms)
- 8 clients: 210 QPS (38.2 ms)
- 12 clients: 219 QPS (54.9 ms)
- Remove WAND algorithm (score_segment_wand and related cursor code)
- Remove WAND iterator functions from segment/scan.c
- Revert benchmark/MSMARCO changes (moved to separate PR)
- Keep: GUC variables log_bmw_stats and enable_bmw
- Keep: Improved stats tracking (memtable_docs vs segment_docs_scored)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- Fix misleading comment about "previous version" in bmw.sql test
- Add justification for BMW_MAX_TERMS=8 threshold
- Clarify DOC_ACCUM_HASH_SIZE comment (2x block size for hash efficiency)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
@tjgreen42 tjgreen42 force-pushed the feat/block-max-wand branch from 2a1a849 to a0eb255 on January 6, 2026 at 01:42
tjgreen42 and others added 5 commits January 5, 2026 18:24
Benchmarks show BMW outperforms exhaustive scoring even for 8+ term
queries. The previous assumption that "exhaustive wins beyond 8 terms"
was incorrect - bucket 8 queries were 2.7x slower than System X due
to falling back to exhaustive scanning.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- Mark v0.3.0 BMW items as completed
- Document the doc-ID ordered traversal limitation for long queries
- Note: current BMW iterates by block index, not doc ID, which limits
  effectiveness for 8+ term queries

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Detail the two related limitations:
1. Block-index iteration vs doc-ID iteration for multi-term queries
2. Sequential single-block skipping vs binary search multi-block seeking

Include code example and explain where O(log n) seeking would help.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
- Fix score.c: change query_term_count <= 8 to >= 2 (use BMW for all
  multi-term queries)
- Add tp_topk_free() to properly clean up heap memory
- Fix hash table leak: destroy doc_accum inside block loop, not after
- Rename NYE banner to v0.2.0, add v0.3.0-dev banner

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Update image path after rename from nye_2026 to v0.3.0-dev.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@tjgreen42 tjgreen42 requested a review from svenklemm January 6, 2026 04:44
@tjgreen42 tjgreen42 marked this pull request as draft January 6, 2026 15:52
tjgreen42 and others added 3 commits January 6, 2026 09:29
Documents at different block positions across terms get partial scores
instead of complete scores. The wand.sql test documents this bug -
doc 201 should be #1 but BMW misses it entirely.

Expected output shows current buggy behavior. Will be updated when
WAND-style doc-ID traversal is implemented.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The previous block-index iteration assumed blocks were aligned by doc ID
across terms' posting lists. They aren't - each term has its own block
structure. Documents at different block positions across terms got
partial scores instead of complete scores.

This implements WAND-style doc-ID ordered traversal:
- Iterate by minimum doc_id across all term iterators (pivot)
- Accumulate complete scores from all terms at the pivot
- Maintain block-max optimization for early threshold pruning

Added tp_segment_posting_iterator_seek() and current_doc_id() functions
to support efficient doc-ID based iteration with binary search on skip
entries.
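
The seek described above amounts to a lower-bound binary search over the skip entries. A sketch, assuming each skip entry records the last doc_id contained in its block (the actual field layout may differ):

```c
/* Sketch of doc-ID seek over skip entries (illustrative; assumes each
 * entry stores the last doc_id of its block). */
typedef struct { unsigned long last_doc_id; /* ... block offsets ... */ } SkipEntrySeek;

/* Return the index of the first block that could contain target, i.e. the
 * first block whose last_doc_id >= target; returns nblocks if none does. */
static int
seek_block(const SkipEntrySeek *skips, int nblocks, unsigned long target)
{
    int lo = 0, hi = nblocks;              /* half-open range [lo, hi) */
    while (lo < hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (skips[mid].last_doc_id < target)
            lo = mid + 1;                  /* target lies in a later block */
        else
            hi = mid;
    }
    return lo;
}
```
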
DSA memory size varies by environment, causing CI failures.
Bare bm25query casts don't work reliably when BMW is disabled.
@tjgreen42 tjgreen42 marked this pull request as ready for review January 6, 2026 21:25
tjgreen42 and others added 6 commits January 6, 2026 13:27
- Move v0.2.0 to Released (Dec 2025)
- Update v0.3.0 to reflect completed BMW with doc-ID traversal
- Set Jan 2026 dates for v0.3.0 and v0.4.0
- Update benchmark results (4.3x faster than exhaustive)
The zero-copy buffer management in tp_segment_get_direct() was sharing
buffer pins with reader->current_buffer. When multiple term iterators
were active and a different term's initialization read the dictionary
from a different page, tp_segment_read() would release the shared
buffer, invalidating the first iterator's block_postings pointer.

Fix: tp_segment_get_direct() now creates its own independent buffer pin
rather than sharing with reader->current_buffer. The release function
now properly releases the pin it owns.

Also adds a 3-term query test to wand.sql to prevent regression.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
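
A hedged sketch of what an independent pin looks like using the core buffer-manager calls; this is not the actual tp_segment_get_direct(), and the struct, function names, and offset handling are illustrative only.

```c
/* Illustrative sketch of taking an independent buffer pin instead of
 * sharing reader->current_buffer (not the real tp_segment_get_direct()). */
#include "postgres.h"
#include "storage/bufmgr.h"

typedef struct TpDirectRefSketch { Buffer buf; char *data; } TpDirectRefSketch;

static TpDirectRefSketch
segment_get_direct_sketch(Relation index, BlockNumber blkno, Size off)
{
    TpDirectRefSketch ref;

    /* Own pin: a later tp_segment_read() on a different page cannot
     * release this buffer out from under the caller's pointer. */
    ref.buf = ReadBuffer(index, blkno);
    LockBuffer(ref.buf, BUFFER_LOCK_SHARE);
    ref.data = (char *) BufferGetPage(ref.buf) + off;
    /* Segment pages are assumed immutable once written, so the content
     * lock can be dropped while the pin keeps the page resident. */
    LockBuffer(ref.buf, BUFFER_LOCK_UNLOCK);
    return ref;
}

static void
segment_release_direct_sketch(TpDirectRefSketch *ref)
{
    ReleaseBuffer(ref->buf);       /* release the pin this reference owns */
    ref->buf = InvalidBuffer;
    ref->data = NULL;
}
```
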
New test sections:
- Section 18: Partial last block (150 docs = 128 + 22)
- Section 19: Term only in memtable (not in segments)
- Section 20: Sparse posting lists (few postings per block)
- Section 21: Multi-term with asymmetric block structures
- Section 22: Threshold exactly equals block_max
- Section 23: All iterators exhaust simultaneously

These tests cover edge cases in the BMW algorithm that were
identified as potential gaps during code review.
- Extract 7 helpers from score_segment_multi_term_bmw (228→51 lines):
  - advance_term_iterator
  - init_segment_term_states / cleanup_segment_term_states
  - find_pivot_doc_id / compute_pivot_max_score
  - score_pivot_document / skip_pivot_document

- Extract 2 helpers from tp_score_all_terms_in_segment_chain (161→28 lines):
  - score_term_postings
  - score_segment_for_all_terms

- Eliminate goto statement by using return value from init function
- Remove duplicated binary search code, reuse tp_segment_posting_iterator_init
- Rewrite wand.sql header as regression test (remove "critical bug" verbiage)
- Remove non-verified "Verify:" comments from bmw.sql tests
- Update expected output files
@tjgreen42 tjgreen42 left a comment

Addressed review comments in e87f443:

wand.out:3 - Rewrote header to be a regression test description, removed "critical bug" verbiage.

bmw.out:40, 116 - Removed non-verified "Verify:" comments since they weren't being checked programmatically.

bmw.out:374 (WHERE clauses) - The WHERE clause `content <@> 'term'::bm25query < 0` is necessary: without it, non-matching documents would appear with score 0. The clause ensures we only get documents that actually match the query term.

bmw.out:374 (score stability) - Scores are deterministic based on BM25 formula with fixed k1/b parameters. The full precision is useful for detecting subtle scoring regressions.

bmw.out:390 (validation.sql) - The SECTION 10 approach compares BMW-optimized vs exhaustive paths using EXCEPT, which directly validates that both paths produce identical results. The validation.sql functions compare against a reference SQL implementation, which is a different (complementary) form of validation. Both are valuable; the current approach specifically tests the BMW optimization.

@tjgreen42

Correction on WHERE clauses:

After investigation, the WHERE clauses serve two different purposes:

  1. Non-existent query terms: When the term doesn't exist in the index, the planner falls back to Seq Scan and returns all docs with score 0. WHERE is necessary to get empty results for these tests (e.g., SECTION 2 'elephant' test).

  2. Existing query terms: Index Scan is used and only returns matching documents. WHERE is redundant but harmless.

So for correctness tests with matching queries, WHERE can be removed. For 'empty results' tests, WHERE must stay.

Given that removing WHERE from matching queries would be a large change with no functional benefit (and WHERE with Index Scan adds only a trivial Filter step), I'd suggest keeping them for consistency unless you feel strongly about removing them.

WHERE clause `content <@> query < 0` is only needed for small tables
where the planner uses seq scan. For large tables (100+ rows), the
index scan is used and only returns matching documents, making WHERE
redundant.

Removed WHERE from tests with large tables:
- SECTION 9: bmw_hybrid (1002 rows)
- SECTION 10: bmw_validate (1000 rows)
- SECTION 11: bmw_monotonic (500 rows)
- SECTION 12: bmw_blocks (643 rows)
- SECTION 15-16: bmw_scattered, bmw_multi_scattered (1000 rows)
- SECTION 17: bmw_multiseg (700 rows)
- SECTION 18-23: bmw_partial through bmw_simul (100-500 rows)

Kept WHERE for small table tests (< 50 rows) where seq scan is used:
- SECTIONS 1-8, 13-14: small functional tests
- Empty result tests: 'elephant', 'xxx yyy zzz'
@tjgreen42

> Correction on WHERE clauses:
>
> After investigation, the WHERE clauses serve two different purposes:
>
>   1. Non-existent query terms: When the term doesn't exist in the index, the planner falls back to Seq Scan and returns all docs with score 0. WHERE is necessary to get empty results for these tests (e.g., SECTION 2 'elephant' test).
>   2. Existing query terms: Index Scan is used and only returns matching documents. WHERE is redundant but harmless.
>
> So for correctness tests with matching queries, WHERE can be removed. For 'empty results' tests, WHERE must stay.
>
> Given that removing WHERE from matching queries would be a large change with no functional benefit (and WHERE with Index Scan adds only a trivial Filter step), I'd suggest keeping them for consistency unless you feel strongly about removing them.

Please remove any new ones, they just muddy the waters.

@tjgreen42 tjgreen42 merged commit 87f6ae4 into main Jan 7, 2026
11 checks passed
@tjgreen42 tjgreen42 deleted the feat/block-max-wand branch January 7, 2026 04:07
Collaborator

Just wanna say I love this Stranger Things-style banner 💯

Collaborator Author

Aww, I was hoping someone would notice!

Collaborator Author

BTW, the "not yet optimized" tagline is not entirely true anymore, as can be seen from the numbers in the PR description. This little extension is starting to kick some butt.
