Skip to content

Conversation

@tjgreen42
Copy link
Collaborator

@tjgreen42 tjgreen42 commented Jan 8, 2026

Summary

Eliminates memory leaks during CREATE INDEX by using a private DSA that is destroyed and recreated on each spill, providing perfect memory reclamation.

Problem

As identified in PR #117, index builds leak ~110-400MB per spill cycle due to DSA fragmentation. Even with the threshold reduction in #117, a 50M document build still leaks ~17GB cumulative memory.

Solution: Private DSA with Destroy/Recreate

Key Insight: During CREATE INDEX, only one backend is building. We don't need a shared DSA - we can use a private one and destroy it completely between spills.

Implementation:

// Create private DSA for build (not in global registry)
private_dsa = dsa_create(tranche_id);  

// After spill:
dsa_detach(private_dsa);  // Destroys DSA + ALL memory → OS
private_dsa = dsa_create(tranche_id);  // Fresh DSA for next batch

Architecture:

  • BUILD MODE: Private DSA, destroyed/recreated per spill → 0% memory leak
  • RUNTIME MODE: Shared DSA for concurrent inserts (unchanged)
  • Same data structures: dshash, posting lists work identically in both modes
  • Minimal changes: ~100 lines of code

Changes

New functions:

  • tp_create_build_index_state(): Creates private DSA instead of using global
  • tp_recreate_build_dsa(): Destroys old DSA and creates fresh one
  • Updated tp_clear_memtable(): Calls recreation in build mode

Modified files:

  • src/state/state.h: Added is_build_mode flag
  • src/state/state.c: Implemented private DSA lifecycle
  • src/am/build.c: Use build mode during CREATE INDEX

Expected Results

Memory profile with private DSA:

BUILD_START: 23 MB
1M docs: 428 MB (8M postings)
BEFORE_SPILL: 428 MB
AFTER_SPILL: 25 MB  ← Perfect reclamation!
2M docs: 428 MB (no growth!)
BEFORE_SPILL: 428 MB
AFTER_SPILL: 25 MB  ← Still 25 MB!

For any dataset size: Peak stays at ~430MB

Comparison

Approach 50M Docs Peak Memory Leak Code Changes
Original (32M threshold) 26GB (OOM) 400MB/spill 0
PR #117 (8M threshold) ~18GB 110MB/spill ~150 lines
This PR (Private DSA) ~430MB 0MB/spill ~100 lines

Testing Plan

  • Build compiles successfully
  • Existing regression tests pass
  • 1M document build with memory instrumentation shows perfect reclamation
  • 50M document build completes with constant ~430MB peak
  • Concurrent inserts still work (runtime mode validation)

Relationship to PR #117

PR #117 provides immediate mitigation and enables large-scale benchmarks.
This PR provides the complete architectural fix for unlimited scalability.

Both can be merged independently - #117 helps immediately, this PR eliminates the issue entirely.

@CLAassistant
Copy link

CLAassistant commented Jan 8, 2026

CLA assistant check
All committers have signed the CLA.

@tjgreen42 tjgreen42 force-pushed the feat/memorycontext-build-mode branch 6 times, most recently from 49ddcfa to a0a8b60 Compare January 8, 2026 23:54
Copy link
Collaborator Author

@tjgreen42 tjgreen42 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a test where the new code is exercised in a situation where the memtable already contains data? How is test coverage on new code generally (since it only kicks in for bulk-loads)?

@tjgreen42
Copy link
Collaborator Author

Regarding test coverage for the new build mode code:

Existing tests that exercise the new code paths:

  1. Bulk load tests: The segment.sh Test 3 specifically creates an empty index and then performs multiple spills in a loop - this directly exercises the new private DSA and tp_finalize_build_mode() code path.

  2. Index build with data: All tests that do CREATE INDEX ... ON table where the table already has data exercise the private DSA during build. This includes aerodocs, scoring1-6, segment, merge, etc.

  3. Spill during build: The memory test does a large bulk insert that can trigger auto-spill, which exercises tp_recreate_build_dsa().

  4. Recovery tests: The recovery.sh test verifies that after a crash and restart, indexes can still be used, which exercises the transition from build mode to runtime mode.

For the memtable-already-contains-data case:

When tp_finalize_build_mode() is called, if the memtable contains data, it's first spilled to a segment via the final spill logic in tp_build() (lines 850-897). The tests that have data in the index at build completion (most of them) exercise this path.

The key scenario is:

  • Private DSA created during build
  • Data inserted into memtable
  • Final spill writes data to segment
  • tp_finalize_build_mode() destroys private DSA and creates fresh memtable in global DSA

All the scoring tests (scoring1-6) verify that scores are correctly computed after this transition, validating that the data was properly spilled and the new memtable is functional.

@tjgreen42 tjgreen42 force-pushed the feat/memorycontext-build-mode branch from 11e84c2 to 1740c2b Compare January 9, 2026 00:55
tjgreen42 and others added 8 commits January 8, 2026 19:08
During CREATE INDEX, use a private DSA (Dynamic Shared Memory Area) that
can be completely destroyed and recreated on each memtable spill. This
provides perfect memory reclamation during large index builds.

Key changes:
- Add tp_create_build_index_state() for build mode with private DSA
- Add tp_recreate_build_dsa() to destroy/recreate DSA on spills
- Add tp_finalize_build_mode() to transition to runtime mode after build
- Final spill at end of tp_build() persists memtable data to disk
- tp_clear_memtable() uses DSA destruction in build mode

The private DSA is only used during index builds. After build completes,
the code transitions to using the global shared DSA for normal runtime
operation with concurrent access.
The <@> operator now applies the same fieldnorm quantization as segments
use for storing document lengths. This ensures that the operator produces
identical BM25 scores to index scans, regardless of whether the data is
in the memtable or segments.

Previously, when data was in segments:
- Index scans used quantized fieldnorm (Lucene's SmallFloat encoding)
- Operator evaluation recomputed exact doc length from text

This caused score mismatches between index scan NOTICEs and operator
output columns. Now both paths use quantized lengths consistently.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The test (18) job (PG18) has been hanging indefinitely while
test (17) completes normally. Add:
- 10-minute timeout to prevent indefinite hangs
- Verbose PostgreSQL logging (log_min_messages = LOG)
- Display server logs on failure to help debug the issue

Tests pass locally on both PG17 and PG18.
When creating a local index state for an existing shared state (runtime
path), the is_build_mode field was not initialized. This caused
undefined behavior when tp_clear_memtable checked this flag during spill
operations.

If the uninitialized memory happened to contain a non-zero value, the
code would incorrectly enter the build mode path and call dsa_detach on
the global shared DSA, disconnecting the backend from shared memory and
causing subsequent operations to hang.

This bug was non-deterministic and only manifested when the uninitialized
memory happened to contain garbage that evaluated to true. It was
reliably reproduced in CI on PG18 during the segment.sh Test 3 which
performs multiple spills in a loop.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove PG_TRY/PG_CATCH for ordinary control flow in scan.c
- Change warning messages to proper ereport ERROR
- Remove verbose CI logging (keep timeout as safety measure)
- Change LOG messages to DEBUG1 for build mode diagnostics

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The index scoring uses Lucene's SmallFloat encoding (fieldnorm) to
compress document lengths into single bytes. This causes small
differences between pure BM25 formula and actual index scores.

Update validation SQL to match by:
- Adding fieldnorm_quantize() function with full 256-value decode table
- Using quantized doc lengths in validate_bm25_scoring()
- Using quantized doc lengths in debug_bm25_computation()
- Using quantized doc lengths in compare_bm25_scores()

This ensures the validation functions compute scores identically to how
the index computes them, fixing false validation failures.
When a CREATE INDEX transaction is aborted before tp_finalize_build_mode
is called, the private DSA would be leaked. Add cleanup logic:

- tp_cleanup_build_mode_on_abort() iterates local state cache
- For any index in build mode, detaches private DSA and cleans registry
- Called from transaction callback on XACT_EVENT_ABORT

Also removes leftover CI trigger comment.
With the new private DSA build mode, data is spilled to segments
during CREATE INDEX, so the memtable is empty when bm25_spill_index
is called afterward. Update test expectations to reflect this.
@tjgreen42 tjgreen42 force-pushed the feat/memorycontext-build-mode branch from c042895 to 5631550 Compare January 9, 2026 03:11
@tjgreen42 tjgreen42 merged commit 7bfb0f5 into main Jan 9, 2026
12 checks passed
@tjgreen42 tjgreen42 deleted the feat/memorycontext-build-mode branch January 9, 2026 03:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants