Add Block-Max WAND (BMW) optimization for top-k queries #102
Conversation
Pull request overview
This PR implements Block-Max WAND (BMW) optimization for top-k BM25 queries, enabling significant performance improvements through intelligent block skipping. The implementation adds specialized fast paths for single-term and multi-term (2-8 terms) queries that compute block-level upper bounds and skip blocks that cannot contribute to results.
Key changes:
- Single-term BMW: Pre-computes block max scores from skip entry metadata and skips blocks below the top-k threshold
- Multi-term BMW: Uses WAND-style block skipping with batched doc_freq lookups to minimize segment I/O
- Exhaustive fallback: Queries with >8 terms continue using the existing exhaustive scoring path
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/query/bmw.h | Top-k min-heap interface, BMW statistics, and scoring function declarations |
| src/query/bmw.c | Complete BMW implementation with heap operations and block-level scoring |
| src/query/score.c | Integration of BMW fast paths with fallback to exhaustive scoring |
| src/segment/segment.h | Exported segment iterator and skip entry reader for BMW block access |
| src/segment/scan.c | Batch doc_freq lookup and iterator function exports |
| test/sql/bmw.sql | Comprehensive BMW test suite covering edge cases and correctness validation |
| test/expected/bmw.out | Expected test outputs |
| test/expected/segment.out | Updated for deterministic CTID ordering |
| test/expected/merge.out | Updated for deterministic CTID ordering |
| benchmarks/datasets/msmarco/queries.sql | Enhanced benchmarks using real MS MARCO queries |
| benchmarks/datasets/msmarco/parallel_query.pgbench | pgbench script for parallel query testing |
| benchmarks/datasets/msmarco/parallel_bench.sh | Parallel benchmark harness with scaling tests |
| benchmarks/datasets/msmarco/load.sql | Fixed CSV loading and query set handling |
| Makefile | Added BMW object files to build |
| .github/workflows/benchmark.yml | Integrated parallel benchmark into CI |
Addressed PR feedback in fc3fce5.
Implement early termination for top-k retrieval using block-level upper bounds stored in V2 segments. This skips entire blocks (128 docs) that cannot contribute to top-k results, improving query performance.

Changes:
- Add bmw.h/bmw.c with top-k min-heap and BMW scoring functions
- Export segment iterator functions for BMW block access
- Integrate BMW fast path in tp_score_documents() for single-term queries
- Fix test expected files for deterministic tie-breaking order

The BMW path computes block max scores from skip entry metadata and only scores blocks where block_max_score >= current threshold. Memtable is scored exhaustively (no skip index). Multi-term WAND to follow.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Extend BMW to handle queries with 2-8 terms. For each segment:
- Pre-compute block max scores for all query terms
- Skip blocks where sum of block_max_scores < threshold
- Use hash table to accumulate scores per document within blocks

The multi-term path uses the same top-k heap as single-term BMW. Memtable is scored exhaustively since it lacks skip index metadata. Falls back to exhaustive scoring for queries with >8 terms.
V1 segments are no longer supported - all segments use the V2 format with skip index metadata for block-max scoring.
Refactor code organization:
- Move bmw.c, bmw.h from src/segment/ to src/query/
- Move score.c, score.h from src/types/ to src/query/
- Update include paths across codebase
- Add comprehensive BMW test suite (test/sql/bmw.sql)

The query/ directory is a more appropriate location for query-time optimization code (BMW, scoring) than segment/, which is for storage format.
The posting iterator's tp_segment_posting_iterator_next() auto-advances to subsequent blocks when the current block is exhausted. This caused BMW to process ALL blocks once the first non-skipped block was entered, defeating the entire block-skipping optimization.

Fix by:
1. Resetting iter.finished before each block so subsequent blocks can be processed
2. Breaking out of the inner while loop when the iterator advances past the current block, allowing the outer for loop to apply threshold checks to decide whether to process subsequent blocks

This restores BMW's ability to skip blocks whose block_max_score is below the current top-k threshold.
For multi-term queries, the previous implementation called tp_get_unified_doc_freq() in a loop, which opened and closed each segment once per term. With 5 terms and 6 segments, that's 30 segment open/close cycles per query.

Add tp_batch_get_segment_doc_freq(), which opens each segment once and looks up all terms, reducing segment opens from O(terms * segments) to O(segments). This addresses the ~32% of time spent in tp_segment_get_doc_freq seen in profiling.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Instead of synthetic queries like "what is the capital of france", use the actual MS MARCO dev query set (10K queries) for more representative benchmarking.
- Use CREATE TABLE/INDEX IF NOT EXISTS to skip if already present
- Avoids rebuilding index on re-runs (saves 30-60 min locally)
- TRUNCATE only if data missing, not on every run
My idempotent changes broke data loading (0 passages loaded). Reverting to the simple DROP/CREATE approach that works.
The $1 and $2 need backslash escaping to prevent shell expansion when used inside psql's \copy FROM PROGRAM.
Adds pgbench-based parallel throughput testing:
- parallel_query.pgbench: pgbench script for BM25 queries
- parallel_bench.sh: Runs scaling test with 1-N clients (N = CPUs)
- benchmark.yml: Runs parallel benchmark after sequential queries

Local results (8.8M passages, release build):
- 1 client: 53 QPS (18.8 ms)
- 4 clients: 163 QPS (24.6 ms)
- 8 clients: 210 QPS (38.2 ms)
- 12 clients: 219 QPS (54.9 ms)
- Remove WAND algorithm (score_segment_wand and related cursor code)
- Remove WAND iterator functions from segment/scan.c
- Revert benchmark/MSMARCO changes (moved to separate PR)
- Keep: GUC variables log_bmw_stats and enable_bmw
- Keep: Improved stats tracking (memtable_docs vs segment_docs_scored)
- Fix misleading comment about "previous version" in bmw.sql test
- Add justification for BMW_MAX_TERMS=8 threshold
- Clarify DOC_ACCUM_HASH_SIZE comment (2x block size for hash efficiency)
Benchmarks show BMW outperforms exhaustive scoring even for 8+ term queries. The previous assumption that "exhaustive wins beyond 8 terms" was incorrect - bucket 8 queries were 2.7x slower than System X due to falling back to exhaustive scanning.
- Mark v0.3.0 BMW items as completed
- Document the doc-ID ordered traversal limitation for long queries
- Note: current BMW iterates by block index, not doc ID, which limits effectiveness for 8+ term queries
Detail the two related limitations:
1. Block-index iteration vs doc-ID iteration for multi-term queries
2. Sequential single-block skipping vs binary search multi-block seeking

Include a code example and explain where O(log n) seeking would help.
- Fix score.c: change query_term_count <= 8 to >= 2 (use BMW for all multi-term queries)
- Add tp_topk_free() to properly clean up heap memory
- Fix hash table leak: destroy doc_accum inside block loop, not after
- Rename NYE banner to v0.2.0, add v0.3.0-dev banner
Update image path after rename from nye_2026 to v0.3.0-dev.
Documents at different block positions across terms get partial scores instead of complete scores. The wand.sql test documents this bug - doc 201 should be #1 but BMW misses it entirely. The expected output shows the current buggy behavior and will be updated when WAND-style doc-ID traversal is implemented.
The previous block-index iteration assumed blocks were aligned by doc ID across terms' posting lists. They aren't - each term has its own block structure. Documents at different block positions across terms got partial scores instead of complete scores.

This implements WAND-style doc-ID ordered traversal:
- Iterate by minimum doc_id across all term iterators (the pivot)
- Accumulate complete scores from all terms at the pivot
- Maintain block-max optimization for early threshold pruning

Added tp_segment_posting_iterator_seek() and current_doc_id() functions to support efficient doc-ID based iteration with binary search on skip entries.
DSA memory size varies by environment, causing CI failures.
Bare bm25query casts don't work reliably when BMW is disabled.
- Move v0.2.0 to Released (Dec 2025)
- Update v0.3.0 to reflect completed BMW with doc-ID traversal
- Set Jan 2026 dates for v0.3.0 and v0.4.0
- Update benchmark results (4.3x faster than exhaustive)
The zero-copy buffer management in tp_segment_get_direct() was sharing buffer pins with reader->current_buffer. When multiple term iterators were active and a different term's initialization read the dictionary from a different page, tp_segment_read() would release the shared buffer, invalidating the first iterator's block_postings pointer.

Fix: tp_segment_get_direct() now creates its own independent buffer pin rather than sharing with reader->current_buffer. The release function now properly releases the pin it owns.

Also adds a 3-term query test to wand.sql to prevent regression.
New test sections:
- Section 18: Partial last block (150 docs = 128 + 22)
- Section 19: Term only in memtable (not in segments)
- Section 20: Sparse posting lists (few postings per block)
- Section 21: Multi-term with asymmetric block structures
- Section 22: Threshold exactly equals block_max
- Section 23: All iterators exhaust simultaneously

These tests cover edge cases in the BMW algorithm that were identified as potential gaps during code review.
- Extract 7 helpers from score_segment_multi_term_bmw (228→51 lines):
  - advance_term_iterator
  - init_segment_term_states / cleanup_segment_term_states
  - find_pivot_doc_id / compute_pivot_max_score
  - score_pivot_document / skip_pivot_document
- Extract 2 helpers from tp_score_all_terms_in_segment_chain (161→28 lines):
  - score_term_postings
  - score_segment_for_all_terms
- Eliminate goto statement by using return value from init function
- Remove duplicated binary search code, reuse tp_segment_posting_iterator_init
- Rewrite wand.sql header as regression test (remove "critical bug" verbiage)
- Remove non-verified "Verify:" comments from bmw.sql tests
- Update expected output files
tjgreen42 left a comment
Addressed review comments in e87f443:
wand.out:3 - Rewrote header to be a regression test description, removed "critical bug" verbiage.
bmw.out:40, 116 - Removed non-verified "Verify:" comments since they weren't being checked programmatically.
bmw.out:374 (WHERE clauses) - The WHERE clause WHERE content <@> 'term'::bm25query < 0 is necessary: without it, non-matching documents would appear with score 0. The clause ensures we only get documents that actually match the query term.
bmw.out:374 (score stability) - Scores are deterministic based on BM25 formula with fixed k1/b parameters. The full precision is useful for detecting subtle scoring regressions.
bmw.out:390 (validation.sql) - The SECTION 10 approach compares BMW-optimized vs exhaustive paths using EXCEPT, which directly validates that both paths produce identical results. The validation.sql functions compare against a reference SQL implementation, which is a different (complementary) form of validation. Both are valuable; the current approach specifically tests the BMW optimization.
Correction on WHERE clauses: After investigation, the WHERE clauses serve two different purposes:
So for correctness tests with matching queries, WHERE can be removed. For 'empty results' tests, WHERE must stay. Given that removing WHERE from matching queries would be a large change with no functional benefit (and WHERE with Index Scan adds only a trivial Filter step), I'd suggest keeping them for consistency unless you feel strongly about removing them.
WHERE clause `content <@> query < 0` is only needed for small tables where the planner uses a seq scan. For large tables (100+ rows), the index scan is used and only returns matching documents, making WHERE redundant.

Removed WHERE from tests with large tables:
- SECTION 9: bmw_hybrid (1002 rows)
- SECTION 10: bmw_validate (1000 rows)
- SECTION 11: bmw_monotonic (500 rows)
- SECTION 12: bmw_blocks (643 rows)
- SECTION 15-16: bmw_scattered, bmw_multi_scattered (1000 rows)
- SECTION 17: bmw_multiseg (700 rows)
- SECTION 18-23: bmw_partial through bmw_simul (100-500 rows)

Kept WHERE for small-table tests (< 50 rows) where a seq scan is used:
- SECTIONS 1-8, 13-14: small functional tests
- Empty result tests: 'elephant', 'xxx yyy zzz'
Please remove any new ones, they just muddy the waters.
Just wanna say I love this Stranger Things-style banner 💯
Aww, I was hoping someone would notice!
BTW, the "not yet optimized" tagline is not entirely true anymore, as can be seen from the numbers in the PR description. This little extension is starting to kick some butt.
Summary
- New GUCs pg_textsearch.enable_bmw and log_bmw_stats for debugging/benchmarking

Performance (MS MARCO 8.8M docs, p50 latency)
Cranfield: 225 queries in 57ms (0.26 ms/query avg)
Note: These benchmarks establish baselines for pg_textsearch—not a head-to-head comparison. System X has different defaults and tuning options; further iteration on configurations required.
Implementation Details
New files:
- src/query/bmw.h - Top-k heap, BMW stats, and scoring function interfaces
- src/query/bmw.c - Min-heap implementation, block max score computation, single-term and multi-term BMW scoring

Key algorithm:
- Skip entries store block-level metadata (block_max_tf, block_max_norm)
- Blocks are scored only when block_max_score >= current_threshold

Multi-term optimization:
Testing