Skip to content

Handle-Based I/O and Path Caching#13

Merged
vbp1 merged 6 commits intomainfrom
011-handle-caching
Nov 29, 2025
Merged

Handle-Based I/O and Path Caching#13
vbp1 merged 6 commits intomainfrom
011-handle-caching

Conversation

@vbp1
Copy link
Copy Markdown
Owner

@vbp1 vbp1 commented Nov 28, 2025

Summary

Handle-Based I/O, FIFO operation ordering, and materialize_on_read policy.

  • Introduce handle-based pread/pwrite I/O to reduce repeated file opens
  • Add sequence-based barrier mechanism (PendingOps) for FIFO write ordering
  • Implement materialize_on_read policy to keep diff small for read-heavy workloads
  • Fix PostgreSQL "read only 0 of 8192 bytes" error with sparse diff files
  • Remove attr_cache to eliminate race conditions with async writes
  • Add memory cleanup via forget() to prevent accumulation during long sessions

Key Changes

1. Handle-Based I/O (fuse.rs)

  • open() creates persistent file handles stored in handles HashMap
  • read()/write() use pread/pwrite via handles instead of repeated fs::read/fs::write
  • For pg_datafile: handles are NOT created — reads always go through overlay.read_range() to correctly handle sparse diff files and block-level materialization

2. FIFO Operation Ordering (pending.rs)

  • New PendingOps module with sequence-based barrier mechanism
  • Guarantees READ sees all preceding WRITEs without waiting for concurrent ones
  • Uses DashMap + parking_lot::Condvar for efficient per-inode synchronization
  • wait_barrier() in open/read/getattr ensures consistency
  • wait_for_preceding() in write workers ensures FIFO execution order

3. Materialize-on-Read Policy (overlay.rs)

  • PBKFS_MATERIALIZE_ON_READ env var controls block materialization behavior
  • Default false: reads from pg_datafile do NOT copy blocks to diff
  • true: eager materialization (legacy behavior, useful for debugging)
  • New read_range_nonmaterializing() path for non-materializing reads

4. Correctness Fixes

  • Remove attr_cache — was invalidated BEFORE async writes completed, causing stale sizes
  • Remove logical_len/diff_len caching from BlockCacheEntry — file sizes always read fresh
  • Parse n_blocks/full_size from backup_content.control for accurate incremental backup lengths
  • Add forget() to clean up PendingOps state when kernel forgets inode

5. Metrics & Observability (logging/mod.rs)

  • OverlayIoSnapshot struct for cache/handle metrics
  • log_overlay_io_metrics() emits periodic snapshots
  • Track handle_hits/handle_misses and cache_hits/cache_misses

Files Changed

  • src/fs/pending.rs — NEW: FIFO ordering mechanism
  • src/fs/fuse.rs — Handle management, barrier integration, forget()
  • src/fs/overlay.rsmaterialize_on_read policy, remove size caching
  • src/fs/mod.rs — Export pending module
  • src/cli/mount.rs — Debug logging for layer order
  • src/logging/mod.rs — New metrics types
  • Cargo.toml — Add dashmap, parking_lot dependencies

Post-Implementation Notes

  • Attr cache removed: Correctness > performance (~1-5μs per stat is acceptable)
  • File size caching removed: Fresh fs::metadata() prevents "unexpected data beyond EOF"
  • No file handles for pg_datafile: Sparse diff files require overlay.read_range() for correct block sourcing

vbp1 and others added 6 commits November 28, 2025 21:47
This commit addresses the "read only 0 of 8192 bytes" PostgreSQL error
that occurred when reading unmaterialized blocks from sparse diff files.

Root cause: When a pg_datafile was opened with write intent, a file
handle was created pointing to the diff file. Subsequent reads through
this handle would return zeros for blocks that existed in backup but
hadn't been written to diff yet (sparse file holes).

Key changes:

1. Never create file handles for pg_datafile
   - For pg_datafile, reads always go through overlay.read_range()
   - This ensures proper block materialization from backup layers
   - Writes still work correctly via on-demand file opening in worker

2. Remove attr_cache entirely
   - The cache was invalidated BEFORE async writes completed
   - Race condition: concurrent getattr could cache stale file sizes
   - Minor performance impact (~1-5μs per stat) not worth correctness risk

3. Use uncached logical_len for pg_datafile
   - logical_len_for() now calls logical_len() directly for datafiles
   - Prevents stale cached sizes from truncating reads

4. Add FIFO ordering for write operations (pending.rs)
   - Sequence-based barrier mechanism ensures operation ordering
   - READs see all preceding WRITEs, don't wait for concurrent ones
   - Uses dashmap + parking_lot for efficient per-inode synchronization

5. Parse n_blocks/full_size from backup_content.control
   - Enables accurate logical length computation for incremental backups
   - Fixes size reporting for relations with page-level backup streams

Tests:
- sparse_diff_reads_unmaterialized_blocks_from_backup: regression test
  that verifies reading unmaterialized blocks returns backup data
After PostgreSQL extends a file via writes, getattr must return the
actual file size, not a stale cached value. The previous fix addressed
sparse diff reads but still used cached diff_len in datafile_logical_len()
and cached logical_len for non-datafile paths.

Changes:
- Always read diff file size from disk via fs::metadata() in logical_len()
- Remove unused caching infrastructure: cached_logical_len(),
  cached_diff_len(), update_cached_diff_len() functions
- Remove logical_len and diff_len fields from BlockCacheEntry
- Simplify record_write() to only track materialized blocks

The remaining cache tracks:
- materialized: which blocks are already copied to diff
- sources: which layer each block came from (for fallback/debugging)
- pg_block_index: pg_probackup page stream indexes (static, built once)
- file_meta: backup_content.control metadata (static, from backup)

These are safe to cache as they don't change after writes.
Triggered on version tags (v*), builds a fully static binary using musl:
- Uses Rust 1.80 with x86_64-unknown-linux-musl target
- Strips binary for minimal size (~3.3MB)
- Creates GitHub Release with binary and SHA256 checksums
- Auto-generates release notes from commits

Usage: git tag v0.1.0 && git push origin v0.1.0
- Remove unnecessary let binding in pending.rs test (let_and_return)
- Add explicit truncate(true) to OpenOptions in overlay test (suspicious_open_options)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Add forget() FUSE method that calls pending_ops.remove(ino)
- Clean up per-inode state when kernel forgets about an inode
- Add inode_count() method for testing
- Add tests for remove() and memory cleanup behavior

This prevents DashMap entries from accumulating during long mount
sessions with many temporary files.
@vbp1 vbp1 changed the title Fix sparse diff file reads and improve caching correctness Phase 11: Handle-Based I/O and Path Caching Nov 29, 2025
@vbp1 vbp1 changed the title Phase 11: Handle-Based I/O and Path Caching Handle-Based I/O and Path Caching Nov 29, 2025
@vbp1 vbp1 merged commit 7404797 into main Nov 29, 2025
1 check passed
@vbp1 vbp1 mentioned this pull request Dec 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant