@oandreeva-nv
Contributor

@oandreeva-nv oandreeva-nv commented Nov 20, 2025

Overview:

This PR fixes KVBM disk cache allocation failures on Lustre filesystems.
DIS-1085

Details:

The zero-fill fallback wrote from an unaligned vec![0u8] buffer, which fails with EINVAL on Lustre when the file is opened with O_DIRECT.

I introduced AlignedBuffer to guarantee page-aligned (4096-byte) buffers for O_DIRECT writes.
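
For readers without the diff open, here is a minimal sketch of what a page-aligned, zero-filled buffer can look like. It is not the PR's actual code: the struct body, the `new`/`as_slice` names, and the choice to return `Option` are assumptions made for illustration, and `PAGE_SIZE` is hard-coded to the 4096 bytes mentioned above.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// Assumed page size, taken from the 4096-byte figure in the PR description.
const PAGE_SIZE: usize = 4096;

/// Hypothetical stand-in for the PR's `AlignedBuffer`: zero-filled memory whose
/// start address and length are both multiples of `PAGE_SIZE`.
struct AlignedBuffer {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuffer {
    /// Allocates at least `size` zeroed bytes, padded up to a whole page.
    /// Returns `None` for a zero-sized request, an oversized request, or
    /// an allocation failure.
    fn new(size: usize) -> Option<Self> {
        if size == 0 {
            return None; // zero-sized allocations are not allowed by `alloc_zeroed`
        }
        // `from_size_align` rejects sizes that would overflow once padded,
        // so `pad_to_align` can safely round the length up to a page multiple.
        let layout = Layout::from_size_align(size, PAGE_SIZE).ok()?.pad_to_align();
        // SAFETY: `layout` has a non-zero size.
        let ptr = unsafe { alloc_zeroed(layout) };
        if ptr.is_null() {
            None
        } else {
            Some(Self { ptr, layout })
        }
    }

    /// Read-only view of the whole (padded) buffer.
    fn as_slice(&self) -> &[u8] {
        // SAFETY: `ptr` is valid for `layout.size()` bytes, all initialized to zero.
        unsafe { std::slice::from_raw_parts(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedBuffer {
    fn drop(&mut self) {
        // SAFETY: `ptr` was allocated with exactly this `layout`.
        unsafe { dealloc(self.ptr, self.layout) };
    }
}
```

A request for 5000 bytes would, under these assumptions, yield an 8192-byte slice whose address is a multiple of 4096, which is the shape of buffer an O_DIRECT write expects.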

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added environment variable flag to control disk I/O optimization behavior, improving compatibility with different filesystem types and configurations.
  • Tests

    • Expanded test coverage for disk storage operations, including alignment validation and I/O optimization behavior verification.

Signed-off-by: Olga Andreeva <[email protected]>
@oandreeva-nv oandreeva-nv requested a review from a team as a code owner November 20, 2025 17:36
@oandreeva-nv
Contributor Author

/ok to test 67aeefb

@oandreeva-nv oandreeva-nv changed the title from "fix : Page-align O_DIRECT writes for Lustre compatibility" to "fix: Page-align O_DIRECT writes for Lustre compatibility" Nov 20, 2025
@coderabbitai
Contributor

coderabbitai bot commented Nov 20, 2025

Walkthrough

The change introduces page-aligned disk I/O support with O_DIRECT handling for zero-fill allocation, including a new AlignedBuffer type, environment-controlled O_DIRECT toggling, improved error handling, and comprehensive alignment constraint tests.

Changes

Cohort: Disk I/O alignment and O_DIRECT support
File(s): lib/llm/src/block_manager/storage/disk.rs
Summary:

  • Introduces AlignedBuffer type for page-aligned zero-filled allocations
  • Adds DISK_DISABLE_O_DIRECT_KEY environment flag for O_DIRECT toggling
  • Implements page-aligned zero-fill fallback path with partial-write alignment handling and enhanced error reporting
  • Extends DiskStorage::new to honor disable flag
  • Expands test suite with aligned buffer behavior validation, O_DIRECT constraint checks, and zerofill loop alignment tests

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Buffer alignment and safety: Verify correct page-aligned allocation, lifetime management, and that the AlignedBuffer properly handles partial writes with alignment constraints
  • O_DIRECT semantics: Ensure environment flag toggling correctly enables/disables O_DIRECT and that fallback behavior is properly implemented
  • Error handling paths: Validate comprehensive error reporting for write failures, partial writes, and EINVAL guidance
  • Test coverage: Confirm tests adequately cover alignment edge cases, O_DIRECT constraint simulation, and enable/disable scenarios

Poem

🐰 With buffers aligned to pages so neat,
And O_DIRECT toggled by environment's beat,
Zero-fill writes march in perfect stride,
No gaps of misalignment can hide,
Our disk I/O dreams are now complete! 🎉✨

Pre-merge checks

✅ Passed checks (3 passed)

  • Title check: ✅ Passed. The title clearly and concisely summarizes the main change: fixing page-aligned O_DIRECT writes for Lustre compatibility, which matches the core objective of the PR.
  • Description check: ✅ Passed. The description includes Overview, Details, and Related Issues sections as required by template, though the 'Where should the reviewer start?' section is empty/incomplete.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
lib/llm/src/block_manager/storage/disk.rs (4)

20-21: AlignedBuffer and PAGE_SIZE-based alignment look correct; minor portability/overflow nits

The AlignedBuffer implementation (manual Layout, alloc_zeroed, and Drop) looks memory-safe and matches the O_DIRECT alignment needs, and marking it Send/Sync is consistent with the read-only slice API.

Two small follow-ups to consider:

  • The aligned_size computation (size + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE could theoretically overflow for very large size. Current call sites (e.g., ZERO_BUF_SIZE) are well within bounds, but a defensive checked_* sequence or an explicit limit would make this more future-proof.
  • Hard-coding PAGE_SIZE = 4096 is fine for the current Linux/Lustre target, but if this ever runs on systems with different page or block sizes, discovering alignment at runtime (e.g., via the OS page size or filesystem block size) or making it configurable would improve portability.

Also applies to: 35-95
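
The overflow nit above can be illustrated with a small overflow-safe rounding helper. This is only a sketch: `checked_round_up` is a hypothetical name, not a function from the PR, and it takes the alignment as a parameter so that a runtime-discovered page or block size could be passed in place of a hard-coded constant.

```rust
/// Rounds `size` up to the next multiple of `align`, returning `None` if the
/// result would not fit in a `usize` or if `align` is zero.
/// Hypothetical helper; the PR computes
/// `(size + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE` inline instead.
fn checked_round_up(size: usize, align: usize) -> Option<usize> {
    if align == 0 {
        return None;
    }
    let rem = size % align;
    if rem == 0 {
        // Already aligned: nothing to add, so nothing can overflow.
        Some(size)
    } else {
        // `align - rem` is the padding needed; `checked_add` catches overflow.
        size.checked_add(align - rem)
    }
}
```

With a 4096-byte alignment, 1 rounds to 4096 and 4097 rounds to 8192, while `usize::MAX` yields `None` instead of silently wrapping.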


98-102: Zero-fill fallback alignment & error handling are robust; be aware of >100% progress and overshoot

The updated zero-fill path does a good job of:

  • Enforcing page-aligned buffer address and I/O size (via AlignedBuffer and to_write rounding).
  • Detecting partial writes and bailing with detailed diagnostics.
  • Emitting clear guidance when encountering EINVAL, including the suggestion to disable O_DIRECT via the new env var.
  • Truncating back to the exact requested size when the last aligned write overshoots.

One behavioral quirk to be aware of: because written tracks the actual bytes written (including the final aligned overshoot), the trace log "Zero-fill progress: {written}/{size}" may briefly show values >100% and written > size for non–page-aligned sizes (before the subsequent truncate). That’s logically correct but could look surprising in logs; if that’s confusing in practice, splitting “logical size” vs “physical bytes written” in the logging would clarify it.

Also applies to: 109-117, 122-135, 136-198, 200-217
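
To make the logical-versus-physical distinction concrete, here is a simplified sketch of an aligned zero-fill loop that reports both counters. It is an illustration under assumptions, not the production loop: `zero_fill` and its return tuple are hypothetical, the writer is any `std::io::Write`, and the caller is expected to supply a page-aligned zero buffer and to truncate the file back to `size` afterwards.

```rust
use std::io::{self, Write};

/// Assumed page size (4096 bytes, as in the PR description).
const PAGE_SIZE: usize = 4096;

/// Writes zeros until at least `size` bytes are covered, keeping every write
/// length a multiple of `PAGE_SIZE`. Returns `(logical, physical)`:
/// `logical` never exceeds `size`, `physical` includes the final overshoot.
fn zero_fill<W: Write>(out: &mut W, zeros: &[u8], size: usize) -> io::Result<(usize, usize)> {
    assert!(!zeros.is_empty() && zeros.len() % PAGE_SIZE == 0);
    let mut physical = 0usize;
    while physical < size {
        let remaining = size - physical;
        // Round the chunk up to a whole page so the last write stays aligned.
        let chunk = remaining.min(zeros.len());
        let to_write = (chunk + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE;
        let n = out.write(&zeros[..to_write])?;
        if n != to_write {
            // A short write would leave the file offset unaligned, so bail out.
            return Err(io::Error::new(
                io::ErrorKind::WriteZero,
                format!("partial write: {n} of {to_write} bytes"),
            ));
        }
        physical += n;
        // Clamp what gets reported so progress never exceeds 100%.
        let _logical_progress = physical.min(size);
    }
    Ok((physical.min(size), physical))
}
```

For a 10,000-byte request and an 8192-byte zero buffer, this performs an 8192-byte write followed by a 4096-byte write, so the physical count ends at 12,288 while the logical count stays at 10,000.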


380-545: Test scaffolding for O_DIRECT alignment is strong; one test is more demonstrative than assertive

The StrictODirectWriter and the tests around it provide good coverage:

  • test_aligned_buffer_with_strict_writer effectively validates that AlignedBuffer::new satisfies the strict “address and size must be page-aligned” contract.
  • test_zerofill_write_loop_alignment mirrors the production zero-fill loop and ensures all simulated writes satisfy the strict O_DIRECT constraints.

One nuance in test_unaligned_vec_fails_strict_writer:

  • The test intentionally allows both outcomes and only asserts if an error occurs, otherwise it just logs that vec! “happened to be aligned”. This makes the test non-deterministic from a behavioral standpoint (it doesn’t enforce any invariant beyond logging).
  • If you want this to be a true assertion test rather than a demonstration, you could force misalignment by taking a subslice with an offset (e.g., from a known-aligned buffer) so the strict writer always rejects it, and then assert on EINVAL unconditionally.

If the goal is just documentation of the prior bug, the current form is fine; if the goal is automated regression detection, tightening this test would be worthwhile.
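
One way to make the misalignment deterministic, sketched under assumptions: locate a page-aligned offset inside an ordinary allocation, then start the slice one byte past it. `StrictWriter` and `first_aligned_offset` below are hypothetical stand-ins (the PR's StrictODirectWriter is not reproduced here), and `ErrorKind::InvalidInput` stands in for EINVAL.

```rust
use std::io::{self, Write};

/// Assumed page size (4096 bytes, as in the PR description).
const PAGE_SIZE: usize = 4096;

/// Hypothetical writer that mimics strict O_DIRECT rules: both the buffer
/// address and the buffer length must be multiples of `PAGE_SIZE`.
struct StrictWriter {
    accepted: usize,
}

impl Write for StrictWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let addr_ok = buf.as_ptr() as usize % PAGE_SIZE == 0;
        let len_ok = buf.len() % PAGE_SIZE == 0;
        if !(addr_ok && len_ok) {
            // Stand-in for the EINVAL a real O_DIRECT write would return.
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "unaligned buffer address or length",
            ));
        }
        self.accepted += buf.len();
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Offset of the first page-aligned byte inside `buf`.
fn first_aligned_offset(buf: &[u8]) -> usize {
    let addr = buf.as_ptr() as usize;
    (PAGE_SIZE - addr % PAGE_SIZE) % PAGE_SIZE
}
```

Given a backing allocation of three pages, the slice starting at `first_aligned_offset(..)` is accepted, while the slice starting one byte later is rejected regardless of where the allocator happened to place the buffer, so the test can assert on the error unconditionally.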


547-609: Ignored integration tests exercise the right paths; consider guarding env vars more defensively

The ignored integration tests for test_zerofill_with_o_direct and test_disable_o_direct are well-targeted:

  • They explicitly drive the DISK_ZEROFILL_FALLBACK_KEY and DISK_DISABLE_O_DIRECT_KEY paths and validate both allocation and readback behavior.
  • Env vars are cleaned up with remove_var at the end of each test.

Two minor considerations if these ever become more widely used:

  • If the test panics before cleanup, env vars could be left set for subsequent tests in the same process; a small RAII guard or helper to set+restore env vars would make this more robust.
  • You might also want an assertion that O_DIRECT is actually toggled as expected (e.g., via a helper that inspects fd flags), to ensure the env wiring continues to behave as intended.

These are non-blocking but could be useful as the O_DIRECT behavior evolves.
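
The RAII guard suggested above could look roughly like this. It is a sketch, not code from the PR: `EnvVarGuard` is a hypothetical name, and the environment variable key used with it would be whatever string the DISK_DISABLE_O_DIRECT_KEY constant holds, which this thread does not show.

```rust
use std::env;
use std::ffi::OsString;

/// Hypothetical guard that sets an environment variable and restores the
/// previous state (old value, or absence) when dropped, including on panic.
struct EnvVarGuard {
    key: String,
    previous: Option<OsString>,
}

#[allow(unused_unsafe)]
impl EnvVarGuard {
    fn set(key: &str, value: &str) -> Self {
        let previous = env::var_os(key);
        // `unsafe` is required in the Rust 2024 edition and redundant before it.
        // Mutating the environment is only sound if no other thread reads or
        // writes it concurrently.
        unsafe { env::set_var(key, value) };
        Self {
            key: key.to_string(),
            previous,
        }
    }
}

#[allow(unused_unsafe)]
impl Drop for EnvVarGuard {
    fn drop(&mut self) {
        match self.previous.take() {
            // Restore the value that was present before the guard was created.
            Some(old) => unsafe { env::set_var(&self.key, old) },
            // The variable did not exist before, so remove it again.
            None => unsafe { env::remove_var(&self.key) },
        }
    }
}
```

A test would hold the guard in a local binding (for example `let _guard = EnvVarGuard::set(key, "1");`) so that cleanup happens when the binding goes out of scope. Because the process environment is shared, tests that rely on such a guard probably still need to run serially.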

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 8a14f9e and 67aeefb.

📒 Files selected for processing (1)
  • lib/llm/src/block_manager/storage/disk.rs (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
lib/llm/src/block_manager/storage/disk.rs (1)
lib/llm/src/block_manager/storage.rs (4)
  • size (165-165)
  • size (385-387)
  • size (465-467)
  • size (506-508)

@oandreeva-nv
Contributor Author

/ok to test 67aeefb

Signed-off-by: Olga Andreeva <[email protected]>
@oandreeva-nv
Contributor Author

/ok to test 5385df5

Signed-off-by: Olga Andreeva <[email protected]>
@oandreeva-nv
Contributor Author

/ok to test d382123

Signed-off-by: Olga Andreeva <[email protected]>
@oandreeva-nv oandreeva-nv force-pushed the oandreeva_o_direct_alignment branch from 05e63bc to 21bba95 Compare November 22, 2025 01:18
Signed-off-by: Olga Andreeva <[email protected]>
@oandreeva-nv
Contributor Author

/ok to test 8ba2211

/// A page-aligned buffer for O_DIRECT I/O operations.
/// On filesystems like Lustre, O_DIRECT requires both buffer address and I/O size
/// to be aligned to the filesystem block size (typically page size).
struct AlignedBuffer {
Contributor

What's the difference between this and what the aligned-vec crate does? https://crates.io/crates/aligned-vec

Contributor Author

I think that for a single-purpose disk cache allocation with O_DIRECT, a limited implementation here should be enough, unless we already use aligned-vec somewhere. Besides, I'm not eager to add an extra dependency just for this.

Contributor

We already have aligned-vec as a dependency:

block-manager = ["dep:nixl-sys", "dep:cudarc", "dep:nix", "dep:aligned-vec"]

Contributor Author

great, let me re-use it then

Contributor Author

updated the code, please review

Contributor

@ryanolson ryanolson left a comment

we will need to back-port this change into dynamo-memory

@oandreeva-nv oandreeva-nv force-pushed the oandreeva_o_direct_alignment branch from c0e9e78 to 715e17a Compare November 25, 2025 18:27
Signed-off-by: Olga Andreeva <[email protected]>
@oandreeva-nv
Contributor Author

/ok to test 91a5068

Comment on lines +217 to +219
// mkostemp only supports flags like O_CLOEXEC, not O_RDWR/O_DIRECT.
// The file is always opened O_RDWR by mkostemp.
// We'll use fcntl to set O_DIRECT after creation.
Contributor

This is a great insight. Does it mean we were not effectively using O_DIRECT before?

Contributor Author

Rather, the problem was mkostemp: passing it flags it does not support leads to potentially undefined behavior.
