-
Notifications
You must be signed in to change notification settings - Fork 710
fix: Page-align O_DIRECT writes for Lustre compatibility #4508
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Olga Andreeva <[email protected]>
|
/ok to test 67aeefb |
WalkthroughThe change introduces page-aligned disk I/O support with O_DIRECT handling for zero-fill allocation, including a new AlignedBuffer type, environment-controlled O_DIRECT toggling, improved error handling, and comprehensive alignment constraint tests. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Poem
Pre-merge checks✅ Passed checks (3 passed)
Tip 📝 Customizable high-level summaries are now available in beta!You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.
Example instruction:
Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (4)
lib/llm/src/block_manager/storage/disk.rs (4)
20-21: AlignedBuffer and PAGE_SIZE-based alignment look correct; minor portability/overflow nitsThe AlignedBuffer implementation (manual
Layout,alloc_zeroed, andDrop) looks memory-safe and matches the O_DIRECT alignment needs, and marking itSend/Syncis consistent with the read-only slice API.Two small follow-ups to consider:
- The
aligned_sizecomputation(size + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZEcould theoretically overflow for very largesize. Current call sites (e.g.,ZERO_BUF_SIZE) are well within bounds, but a defensivechecked_*sequence or an explicit limit would make this more future-proof.- Hard-coding
PAGE_SIZE = 4096is fine for the current Linux/Lustre target, but if this ever runs on systems with different page or block sizes, discovering alignment at runtime (e.g., via the OS page size or filesystem block size) or making it configurable would improve portability.Also applies to: 35-95
98-102: Zero-fill fallback alignment & error handling are robust; be aware of >100% progress and overshootThe updated zero-fill path does a good job of:
- Enforcing page-aligned buffer address and I/O size (via
AlignedBufferandto_writerounding).- Detecting partial writes and bailing with detailed diagnostics.
- Emitting clear guidance when encountering
EINVAL, including the suggestion to disable O_DIRECT via the new env var.- Truncating back to the exact requested size when the last aligned write overshoots.
One behavioral quirk to be aware of: because
writtentracks the actual bytes written (including the final aligned overshoot), the trace log"Zero-fill progress: {written}/{size}"may briefly show values >100% andwritten > sizefor non–page-aligned sizes (before the subsequent truncate). That’s logically correct but could look surprising in logs; if that’s confusing in practice, splitting “logical size” vs “physical bytes written” in the logging would clarify it.Also applies to: 109-117, 122-135, 136-198, 200-217
380-545: Test scaffolding for O_DIRECT alignment is strong; one test is more demonstrative than assertiveThe
StrictODirectWriterand the tests around it provide good coverage:
test_aligned_buffer_with_strict_writereffectively validates thatAlignedBuffer::newsatisfies the strict “address and size must be page-aligned” contract.test_zerofill_write_loop_alignmentmirrors the production zero-fill loop and ensures all simulated writes satisfy the strict O_DIRECT constraints.One nuance in
test_unaligned_vec_fails_strict_writer:
- The test intentionally allows both outcomes and only asserts if an error occurs, otherwise it just logs that
vec!“happened to be aligned”. This makes the test non-deterministic from a behavioral standpoint (it doesn’t enforce any invariant beyond logging).- If you want this to be a true assertion test rather than a demonstration, you could force misalignment by taking a subslice with an offset (e.g., from a known-aligned buffer) so the strict writer always rejects it, and then assert on
EINVALunconditionally.If the goal is just documentation of the prior bug, the current form is fine; if the goal is automated regression detection, tightening this test would be worthwhile.
547-609: Ignored integration tests exercise the right paths; consider guarding env vars more defensivelyThe ignored integration tests for
test_zerofill_with_o_directandtest_disable_o_directare well-targeted:
- They explicitly drive the
DISK_ZEROFILL_FALLBACK_KEYandDISK_DISABLE_O_DIRECT_KEYpaths and validate both allocation and readback behavior.- Env vars are cleaned up with
remove_varat the end of each test.Two minor considerations if these ever become more widely used:
- If the test panics before cleanup, env vars could be left set for subsequent tests in the same process; a small RAII guard or helper to set+restore env vars would make this more robust.
- You might also want an assertion that O_DIRECT is actually toggled as expected (e.g., via a helper that inspects fd flags), to ensure the env wiring continues to behave as intended.
These are non-blocking but could be useful as the O_DIRECT behavior evolves.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lib/llm/src/block_manager/storage/disk.rs(4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
lib/llm/src/block_manager/storage/disk.rs (1)
lib/llm/src/block_manager/storage.rs (4)
size(165-165)size(385-387)size(465-467)size(506-508)
|
/ok to test 67aeefb |
Signed-off-by: Olga Andreeva <[email protected]>
|
/ok to test 5385df5 |
Signed-off-by: Olga Andreeva <[email protected]>
|
/ok to test d382123 |
Signed-off-by: Olga Andreeva <[email protected]>
05e63bc to
21bba95
Compare
Signed-off-by: Olga Andreeva <[email protected]>
|
/ok to test 8ba2211 |
| /// A page-aligned buffer for O_DIRECT I/O operations. | ||
| /// On filesystems like Lustre, O_DIRECT requires both buffer address and I/O size | ||
| /// to be aligned to the filesystem block size (typically page size). | ||
| struct AlignedBuffer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between this and what the aligned-vec crate does? https://crates.io/crates/aligned-vec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for a single-purpose disk cache allocation with O_DIRECT, a limited implementation here should be enough, unless we already use aligned-vec somewhere. Besides that, I'm not eager to add extra dependency just for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have aligned-vec as a dependency:
Line 24 in e75bcf6
| block-manager = ["dep:nixl-sys", "dep:cudarc", "dep:nix", "dep:aligned-vec"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great, let me re-use it then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the code, please review
ryanolson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we will need to back-port this change into dynamo-memory
Signed-off-by: Olga Andreeva <[email protected]>
c0e9e78 to
715e17a
Compare
Signed-off-by: Olga Andreeva <[email protected]>
|
/ok to test 91a5068 |
| // mkostemp only supports flags like O_CLOEXEC, not O_RDWR/O_DIRECT. | ||
| // The file is always opened O_RDWR by mkostemp. | ||
| // We'll use fcntl to set O_DIRECT after creation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a great insight. does it mean we are not effectively using O_DIRECT before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather mkostemp, leading to a potentially undefined behavior
Overview:
This PR Fixes KVBM disk cache allocation failures on Lustre filesystems.
DIS-1085
Details:
Zero-fill fallback used unaligned
vec![0u8]which fails withEINVALon Lustre when usingO_DIRECT.I introduced
AlignedBufferto guarantee page-aligned (4096-byte) buffers for O_DIRECT writes.Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
New Features
Tests
✏️ Tip: You can customize this high-level summary in your review settings.