-
Notifications
You must be signed in to change notification settings - Fork 463
feat[Store]: Introduce shm helper for dummy #1177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Summary of ChangesHello @YiXR, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refactors the shared memory management within the Mooncake Store by introducing a dedicated Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a ShmHelper singleton to centralize shared memory allocation, which is a good step towards code reuse. However, the current implementation of the singleton pattern has several critical flaws. It leaks the singleton object and the underlying shared memory resources. The allocation method is not thread-safe against multiple calls, which can lead to resource leaks. Most critically, client code (DummyClient) incorrectly manages the lifecycle of the global singleton, leading to potential use-after-free bugs. I've provided detailed comments and suggestions to address these issues by adopting a modern C++ singleton pattern and correcting the resource management logic.
mooncake-store/src/dummy_client.cpp
Outdated
| LOG(ERROR) << "Failed to register SHM via IPC"; | ||
| munmap(shm_base_addr_, shm_size_); | ||
| close(shm_fd_); | ||
| shm_helper_->cleanup(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling shm_helper_->cleanup() from DummyClient::setup_dummy on registration failure is incorrect. ShmHelper is a process-wide singleton, and one client failing should not tear down a global resource that other components (like MooncakeHostMemAllocatorPyWrapper or other DummyClient instances) might be using. This can lead to use-after-free bugs and crashes. The singleton should manage its own lifecycle. Please remove this call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
mooncake-store/src/dummy_client.cpp
Outdated
| close(shm_fd_); | ||
| shm_fd_ = -1; | ||
| } | ||
| shm_helper_->cleanup(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling shm_helper_->cleanup() from DummyClient::tearDownAll is incorrect. ShmHelper is a process-wide singleton, and tearing down one client should not destroy a global resource that other components might still be using. This can lead to use-after-free bugs and crashes. The singleton should manage its own lifecycle and clean up its resources only when the process exits. Please remove this call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| ShmHelper* ShmHelper::getInstance() { | ||
| if (instance_ == nullptr) { | ||
| std::lock_guard<std::mutex> lock(instance_mutex_); | ||
| if (instance_ == nullptr) { | ||
| instance_ = new ShmHelper(); | ||
| } | ||
| } | ||
| return instance_; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current singleton implementation for ShmHelper leaks memory because the instance created with new is never deleted. This also prevents the destructor from running, which would leak the shared memory file descriptor and mapping.
A better approach in modern C++ is to use a static local variable (Meyers' Singleton). This is thread-safe since C++11 and guarantees proper cleanup at program exit.
To apply this, you should also:
- Remove the
instance_andinstance_mutex_static members from theShmHelperclass definition indummy_client.h. - Remove their definitions at the top of
dummy_client.cpp(lines 18-19).
ShmHelper* ShmHelper::getInstance() {
static ShmHelper instance;
return &instance;
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| } | ||
|
|
||
| void* ShmHelper::allocate(size_t size) { | ||
| shm_size_ = size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ShmHelper::allocate method is not safe to be called more than once. If it's called a second time, it will overwrite the internal shm_fd_ and shm_base_addr_ without cleaning up the old resources, leading to leaked file descriptors and memory mappings. The check in DummyClient::setup_dummy is not sufficient to prevent this in a multi-threaded environment, as it creates a race condition.
To make this thread-safe and prevent resource leaks, you should add a guard inside allocate to prevent re-allocation.
if (shm_fd_ != -1) {
throw std::runtime_error("Shared memory has already been allocated.");
}
shm_size_ = size;There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| #else | ||
| errno = ENOSYS; | ||
| return -1; | ||
| return -1; // Or appropriate fallback/error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The memfd_create_wrapper function no longer sets errno when __NR_memfd_create is not available. This can lead to misleading error messages in ShmHelper::allocate, as strerror(errno) will report an unrelated or old error. Please restore setting errno to ENOSYS in the else branch to ensure correct error reporting on systems that do not support memfd_create.
| return -1; // Or appropriate fallback/error | |
| errno = ENOSYS; | |
| return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep this,it‘s ok
| } | ||
|
|
||
| ShmHelper* ShmHelper::getInstance() { | ||
| static ShmHelper instance; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why make ShmHelper a singleton implementation .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shm is a global mem pool. It can only init once. Once app like SGLang call shm alloc, the dummy client will reuse the shm and doesn't need to alloc a new one.
| py::gil_scoped_release release; | ||
| if (!self.shm_helper_) { | ||
| LOG(ERROR) << "Shared memory allocator is not initialized"; | ||
| return static_cast<uintptr_t>(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise a fatal error may be better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
|
||
| py::class_<MooncakeHostMemAllocatorPyWrapper>(m, "MooncakeHostMemAllocator") | ||
| .def(py::init<>()) | ||
| .def("alloc", [](MooncakeHostMemAllocatorPyWrapper &self, size_t size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a free api is also needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| shm_size_ = 0; | ||
| } | ||
|
|
||
| void* ShmHelper::allocate(size_t size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a mutex guard
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| int shm_fd_ = -1; | ||
| void *shm_base_addr_ = nullptr; | ||
| size_t shm_size_ = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we need an optional shm_name here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we use abstract shm, thus shm name can not be the identifier, we don't use it any more.
mooncake-store/src/dummy_client.cpp
Outdated
|
|
||
| void ShmHelper::cleanup() { | ||
| if (shm_base_addr_) { | ||
| munmap(shm_base_addr_, shm_size_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
munmap failure requires error handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces a new ShmHelper singleton class to extract and centralize shared memory allocation logic from DummyClient, making it reusable by external applications like SGLang. The refactoring simplifies the IPC protocol by removing the shm_name parameter (now generated internally by RealClient) and exposes the helper through Python bindings via MooncakeHostMemAllocator.
Key Changes
- Introduced
ShmHelpersingleton class for shared memory allocation usingmemfd_create - Refactored
DummyClient::setup_dummy()to useShmHelperinstead of inline allocation - Removed
shm_namefromShmRegisterRequeststruct andmap_shm_internal()signature - Added Python bindings for
MooncakeHostMemAllocatorPyWrapperto expose shared memory allocation
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| mooncake-store/include/dummy_client.h | Defines new ShmHelper singleton class with allocation/cleanup methods and updates DummyClient to use ShmHelper* instead of shm_name |
| mooncake-store/src/dummy_client.cpp | Implements ShmHelper with memfd-based allocation, refactors setup_dummy() to use singleton, simplifies tearDownAll() cleanup |
| mooncake-store/include/real_client.h | Updates map_shm_internal() signature to remove shm_name parameter |
| mooncake-store/src/real_client.cpp | Generates shm_name internally in map_shm_internal(), removes formatting-only changes to log statements |
| mooncake-store/include/pyclient.h | Removes shm_name field from ShmRegisterRequest struct |
| mooncake-integration/store/store_py.cpp | Adds MooncakeHostMemAllocatorPyWrapper class to expose ShmHelper to Python with alloc() method |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| close(shm_fd_); | ||
| shm_fd_ = -1; | ||
| shm_base_addr_ = nullptr; | ||
| shm_size_ = 0; |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When IPC registration fails, the code resets the local member variables but doesn't clean up the allocated shared memory in the singleton ShmHelper. Since ShmHelper is a singleton with process-wide lifetime, the allocated memory will remain until process termination. Consider calling shm_helper_->cleanup() on failure, or document this behavior if intentional.
| shm_size_ = 0; | |
| shm_size_ = 0; | |
| if (shm_helper_) { | |
| shm_helper_->cleanup(); | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not cleanup here, as the shm_helper_ is still valid
| void* ShmHelper::allocate(size_t size) { | ||
| if (shm_fd_ != -1) { | ||
| throw std::runtime_error("Shared memory has already been allocated."); | ||
| } | ||
|
|
||
| shm_size_ = size; | ||
|
|
||
| // Create memfd | ||
| shm_fd_ = memfd_create_wrapper("mooncake_shm", MFD_CLOEXEC); | ||
| if (shm_fd_ == -1) { | ||
| throw std::runtime_error("Failed to create anonymous shared memory: " + | ||
| std::string(strerror(errno))); | ||
| } | ||
|
|
||
| // Set size | ||
| if (ftruncate(shm_fd_, shm_size_) == -1) { | ||
| close(shm_fd_); | ||
| throw std::runtime_error("Failed to set shared memory size: " + | ||
| std::string(strerror(errno))); | ||
| } | ||
|
|
||
| // Map memory | ||
| shm_base_addr_ = mmap(nullptr, shm_size_, PROT_READ | PROT_WRITE, | ||
| MAP_SHARED, shm_fd_, 0); | ||
| if (shm_base_addr_ == MAP_FAILED) { | ||
| close(shm_fd_); | ||
| throw std::runtime_error("Failed to map shared memory: " + | ||
| std::string(strerror(errno))); | ||
| } | ||
|
|
||
| return shm_base_addr_; | ||
| } |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The allocate method is not thread-safe. Multiple threads could concurrently call this method on the singleton instance, leading to race conditions when checking shm_fd_ != -1 and when modifying member variables. Consider adding a mutex to protect the allocation process, or document that this method should only be called once from a single thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| shm_base_addr_ = mmap(nullptr, shm_size_, PROT_READ | PROT_WRITE, | ||
| MAP_SHARED, shm_fd_, 0); | ||
| if (shm_base_addr_ == MAP_FAILED) { | ||
| close(shm_fd_); |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When mmap fails, the function closes shm_fd_ but doesn't reset it to -1 or reset shm_size_ before throwing. This leaves the object in an inconsistent state. Add shm_fd_ = -1; and shm_size_ = 0; after close(shm_fd_); to ensure proper cleanup on error.
| close(shm_fd_); | |
| close(shm_fd_); | |
| shm_fd_ = -1; | |
| shm_size_ = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| if (!self.shm_helper_) { | ||
| LOG(ERROR) << "Shared memory allocator is not initialized"; | ||
| return static_cast<uintptr_t>(0); | ||
| } |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the previous issue, the null check if (!self.shm_helper_) will never be true because ShmHelper::getInstance() always returns a non-null pointer. This check can be removed, or if you want to handle initialization failures, this should be checked after the getInstance() call in the constructor.
| if (!self.shm_helper_) { | |
| LOG(ERROR) << "Shared memory allocator is not initialized"; | |
| return static_cast<uintptr_t>(0); | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
|
||
| // Set size | ||
| if (ftruncate(shm_fd_, shm_size_) == -1) { | ||
| close(shm_fd_); |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When ftruncate fails, the function closes shm_fd_ but doesn't reset it to -1 before throwing. This leaves shm_fd_ in an invalid state (pointing to a closed file descriptor). Add shm_fd_ = -1; after close(shm_fd_); to maintain consistency with cleanup() pattern.
| close(shm_fd_); | |
| close(shm_fd_); | |
| shm_fd_ = -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
mooncake-store/src/dummy_client.cpp
Outdated
| if (!shm_helper_) { | ||
| LOG(ERROR) << "Failed to get shared memory allocator"; | ||
| return -1; | ||
| } |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The null check if (!shm_helper_) is unnecessary because ShmHelper::getInstance() always returns a non-null pointer to a static instance. This check will never be true and can be removed.
| if (!shm_helper_) { | |
| LOG(ERROR) << "Failed to get shared memory allocator"; | |
| return -1; | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
mooncake-store/src/dummy_client.cpp
Outdated
| shm_base_addr_ = shm_helper_->allocate(shm_size_); | ||
| if (shm_base_addr_ == MAP_FAILED) { | ||
| LOG(ERROR) << "Failed to allocate shared memory"; |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check if (shm_base_addr_ == MAP_FAILED) will never be true because allocate() throws an exception on failure and never returns MAP_FAILED. Either handle the exception with a try-catch block, or remove this check. Currently, if allocation fails, the exception will propagate and the error log won't be printed.
| shm_base_addr_ = shm_helper_->allocate(shm_size_); | |
| if (shm_base_addr_ == MAP_FAILED) { | |
| LOG(ERROR) << "Failed to allocate shared memory"; | |
| try { | |
| shm_base_addr_ = shm_helper_->allocate(shm_size_); | |
| } catch (const std::exception& e) { | |
| LOG(ERROR) << "Failed to allocate shared memory: " << e.what(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| .def("alloc", [](MooncakeHostMemAllocatorPyWrapper &self, size_t size) { | ||
| py::gil_scoped_release release; | ||
| if (!self.shm_helper_) { | ||
| LOG(ERROR) << "Shared memory allocator is not initialized"; | ||
| return static_cast<uintptr_t>(0); | ||
| } | ||
| void *ptr = self.shm_helper_->allocate(size); | ||
| return reinterpret_cast<uintptr_t>(ptr); | ||
| }); |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The alloc method should handle exceptions from shm_helper_->allocate(size), which can throw std::runtime_error. Currently, if allocation fails, an exception will propagate to Python. Consider wrapping the call in a try-catch block to log the error and return 0, or let the exception propagate but document this behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mem is necessary for Python, so it must throw error.
| } else { | ||
| // Shared memory already allocated, just get the base address and size | ||
| shm_base_addr_ = shm_helper_->get_base_addr(); | ||
| shm_size_ = shm_helper_->get_size(); |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The branch logic assumes that if shared memory is already allocated, the existing allocation size (shm_helper_->get_size()) matches the requested size (local_buffer_size + mem_pool_size). However, if setup_dummy is called with different parameters on subsequent calls, this could lead to mismatches between the actual buffer size and the expected size. Consider validating that shm_size_ matches the requested size, or document that the first allocation parameters are permanent.
| shm_size_ = shm_helper_->get_size(); | |
| shm_size_ = shm_helper_->get_size(); | |
| size_t requested_size = local_buffer_size + mem_pool_size; | |
| if (shm_size_ != requested_size) { | |
| LOG(ERROR) << "Shared memory size mismatch: existing size (" << shm_size_ | |
| << ") does not match requested size (" << requested_size << ")"; | |
| return -1; | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Signed-off-by: Xingrui Yi <[email protected]>
Description
Split shm alloc logic into a helper class.
Thus, the APP like SGLang can directly use this helper to create share mem.
Type of Change
How Has This Been Tested?
Checklist