Handle-Based I/O and Path Caching by vbp1 · Pull Request #13 · vbp1/pbkfs

vbp1 · 2025-11-28T18:48:07Z

Summary

Handle-Based I/O, FIFO operation ordering, and materialize_on_read policy.

Introduce handle-based pread/pwrite I/O to reduce repeated file opens
Add sequence-based barrier mechanism (PendingOps) for FIFO write ordering
Implement materialize_on_read policy to keep diff small for read-heavy workloads
Fix PostgreSQL "read only 0 of 8192 bytes" error with sparse diff files
Remove attr_cache to eliminate race conditions with async writes
Add memory cleanup via forget() to prevent accumulation during long sessions

Key Changes

1. Handle-Based I/O (`fuse.rs`)

open() creates persistent file handles stored in handles HashMap
read()/write() use pread/pwrite via handles instead of repeated fs::read/fs::write
For pg_datafile: handles are NOT created — reads always go through overlay.read_range() to correctly handle sparse diff files and block-level materialization

2. FIFO Operation Ordering (`pending.rs`)

New PendingOps module with sequence-based barrier mechanism
Guarantees READ sees all preceding WRITEs without waiting for concurrent ones
Uses DashMap + parking_lot::Condvar for efficient per-inode synchronization
wait_barrier() in open/read/getattr ensures consistency
wait_for_preceding() in write workers ensures FIFO execution order

3. Materialize-on-Read Policy (`overlay.rs`)

PBKFS_MATERIALIZE_ON_READ env var controls block materialization behavior
Default false: reads from pg_datafile do NOT copy blocks to diff
true: eager materialization (legacy behavior, useful for debugging)
New read_range_nonmaterializing() path for non-materializing reads

4. Correctness Fixes

Remove attr_cache — was invalidated BEFORE async writes completed, causing stale sizes
Remove logical_len/diff_len caching from BlockCacheEntry — file sizes always read fresh
Parse n_blocks/full_size from backup_content.control for accurate incremental backup lengths
Add forget() to clean up PendingOps state when kernel forgets inode

5. Metrics & Observability (`logging/mod.rs`)

OverlayIoSnapshot struct for cache/handle metrics
log_overlay_io_metrics() emits periodic snapshots
Track handle_hits/handle_misses and cache_hits/cache_misses

Files Changed

src/fs/pending.rs — NEW: FIFO ordering mechanism
src/fs/fuse.rs — Handle management, barrier integration, forget()
src/fs/overlay.rs — materialize_on_read policy, remove size caching
src/fs/mod.rs — Export pending module
src/cli/mount.rs — Debug logging for layer order
src/logging/mod.rs — New metrics types
Cargo.toml — Add dashmap, parking_lot dependencies

Post-Implementation Notes

Attr cache removed: Correctness > performance (~1-5μs per stat is acceptable)
File size caching removed: Fresh fs::metadata() prevents "unexpected data beyond EOF"
No file handles for pg_datafile: Sparse diff files require overlay.read_range() for correct block sourcing

This commit addresses the "read only 0 of 8192 bytes" PostgreSQL error that occurred when reading unmaterialized blocks from sparse diff files. Root cause: When a pg_datafile was opened with write intent, a file handle was created pointing to the diff file. Subsequent reads through this handle would return zeros for blocks that existed in backup but hadn't been written to diff yet (sparse file holes). Key changes: 1. Never create file handles for pg_datafile - For pg_datafile, reads always go through overlay.read_range() - This ensures proper block materialization from backup layers - Writes still work correctly via on-demand file opening in worker 2. Remove attr_cache entirely - The cache was invalidated BEFORE async writes completed - Race condition: concurrent getattr could cache stale file sizes - Minor performance impact (~1-5μs per stat) not worth correctness risk 3. Use uncached logical_len for pg_datafile - logical_len_for() now calls logical_len() directly for datafiles - Prevents stale cached sizes from truncating reads 4. Add FIFO ordering for write operations (pending.rs) - Sequence-based barrier mechanism ensures operation ordering - READs see all preceding WRITEs, don't wait for concurrent ones - Uses dashmap + parking_lot for efficient per-inode synchronization 5. Parse n_blocks/full_size from backup_content.control - Enables accurate logical length computation for incremental backups - Fixes size reporting for relations with page-level backup streams Tests: - sparse_diff_reads_unmaterialized_blocks_from_backup: regression test that verifies reading unmaterialized blocks returns backup data

After PostgreSQL extends a file via writes, getattr must return the actual file size, not a stale cached value. The previous fix addressed sparse diff reads but still used cached diff_len in datafile_logical_len() and cached logical_len for non-datafile paths. Changes: - Always read diff file size from disk via fs::metadata() in logical_len() - Remove unused caching infrastructure: cached_logical_len(), cached_diff_len(), update_cached_diff_len() functions - Remove logical_len and diff_len fields from BlockCacheEntry - Simplify record_write() to only track materialized blocks The remaining cache tracks: - materialized: which blocks are already copied to diff - sources: which layer each block came from (for fallback/debugging) - pg_block_index: pg_probackup page stream indexes (static, built once) - file_meta: backup_content.control metadata (static, from backup) These are safe to cache as they don't change after writes.

Triggered on version tags (v*), builds a fully static binary using musl: - Uses Rust 1.80 with x86_64-unknown-linux-musl target - Strips binary for minimal size (~3.3MB) - Creates GitHub Release with binary and SHA256 checksums - Auto-generates release notes from commits Usage: git tag v0.1.0 && git push origin v0.1.0

- Remove unnecessary let binding in pending.rs test (let_and_return) - Add explicit truncate(true) to OpenOptions in overlay test (suspicious_open_options) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>

- Add forget() FUSE method that calls pending_ops.remove(ino) - Clean up per-inode state when kernel forgets about an inode - Add inode_count() method for testing - Add tests for remove() and memory cleanup behavior This prevents DashMap entries from accumulating during long mount sessions with many temporary files.

vbp1 and others added 6 commits November 28, 2025 21:47

Apply cargo fmt

a376ca7

vbp1 changed the title ~~Fix sparse diff file reads and improve caching correctness~~ Phase 11: Handle-Based I/O and Path Caching Nov 29, 2025

vbp1 changed the title ~~Phase 11: Handle-Based I/O and Path Caching~~ Handle-Based I/O and Path Caching Nov 29, 2025

vbp1 merged commit 7404797 into main Nov 29, 2025
1 check passed

vbp1 mentioned this pull request Dec 12, 2025

--no-wal mount option #15

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle-Based I/O and Path Caching#13

Handle-Based I/O and Path Caching#13
vbp1 merged 6 commits intomainfrom
011-handle-caching

vbp1 commented Nov 28, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

vbp1 commented Nov 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key Changes

1. Handle-Based I/O (fuse.rs)

2. FIFO Operation Ordering (pending.rs)

3. Materialize-on-Read Policy (overlay.rs)

4. Correctness Fixes

5. Metrics & Observability (logging/mod.rs)

Files Changed

Post-Implementation Notes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

vbp1 commented Nov 28, 2025 •

edited

Loading

1. Handle-Based I/O (`fuse.rs`)

2. FIFO Operation Ordering (`pending.rs`)

3. Materialize-on-Read Policy (`overlay.rs`)

5. Metrics & Observability (`logging/mod.rs`)