@alogfans
Collaborator

@alogfans alogfans commented Nov 27, 2025

Description

Type of Change

  • Types
    • Bug fix
    • New feature
      • Transfer Engine
      • Mooncake Store
      • Mooncake EP
      • Integration
      • P2P Store
      • Python Wheel
    • Breaking change
    • CI/CD
    • Documentation update
    • Other

How Has This Been Tested?

This patch tries to mitigate problems in EndPoint Store management. The changes include:

  • When the EP count exceeds a soft limit (rather than only at the hard limit), eagerly evict some unused EPs.
  • Delete RDMA resources immediately when deleteEndpoint is called, if it is safe to do so.
  • Make "pending-to-close" a separate state alongside "handshake" and "connected".
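
To make the three points above concrete, here is a minimal sketch of the intended behavior. It is not the code in this patch: the class `EndpointStoreSketch`, the `EpState` enum, the `in_flight` counter, and all method names are hypothetical, and the `has_qp` flag only stands in for real RDMA resources.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical lifecycle states; the names are illustrative only.
enum class EpState { Handshake, Connected, PendingClose };

struct Endpoint {
    EpState state = EpState::Handshake;
    int in_flight = 0;   // outstanding transfers that still use this endpoint
    bool has_qp = true;  // stands in for the RDMA resources (QPs) it owns
};

class EndpointStoreSketch {
   public:
    EndpointStoreSketch(std::size_t soft_limit, std::size_t hard_limit)
        : soft_limit_(soft_limit), hard_limit_(hard_limit) {
        assert(soft_limit_ <= hard_limit_);
    }

    // Adds an endpoint. Eviction starts at the soft limit; the insert is
    // refused only if the store is still at the hard limit afterwards.
    bool insert(const std::string &peer) {
        if (eps_.count(peer)) return true;
        if (eps_.size() >= soft_limit_) evictUnused();
        if (eps_.size() >= hard_limit_) return false;
        eps_[peer] = Endpoint{};
        return true;
    }

    void markConnected(const std::string &peer) {
        eps_.at(peer).state = EpState::Connected;
    }

    void acquire(const std::string &peer) { ++eps_.at(peer).in_flight; }

    // Drops one reference; a pending-to-close endpoint is reclaimed
    // as soon as its last transfer finishes.
    void release(const std::string &peer) {
        auto it = eps_.find(peer);
        if (it == eps_.end()) return;
        if (--it->second.in_flight == 0 &&
            it->second.state == EpState::PendingClose) {
            destroy(it);
        }
    }

    // Frees resources immediately when the endpoint is idle; otherwise
    // it is only marked pending-to-close and reclaimed later in release().
    void remove(const std::string &peer) {
        auto it = eps_.find(peer);
        if (it == eps_.end()) return;
        if (it->second.in_flight == 0) {
            destroy(it);
        } else {
            it->second.state = EpState::PendingClose;
        }
    }

    bool contains(const std::string &peer) const { return eps_.count(peer) > 0; }
    EpState state(const std::string &peer) const { return eps_.at(peer).state; }
    std::size_t size() const { return eps_.size(); }

   private:
    using Map = std::unordered_map<std::string, Endpoint>;

    // Releases the (simulated) QP and erases the entry; returns the next iterator.
    Map::iterator destroy(Map::iterator it) {
        it->second.has_qp = false;
        return eps_.erase(it);
    }

    // Evicts idle, connected endpoints until the store is below the soft limit.
    void evictUnused() {
        for (auto it = eps_.begin();
             it != eps_.end() && eps_.size() >= soft_limit_;) {
            bool idle = it->second.state == EpState::Connected &&
                        it->second.in_flight == 0;
            it = idle ? destroy(it) : std::next(it);
        }
    }

    std::size_t soft_limit_;
    std::size_t hard_limit_;
    Map eps_;
};
```

In this sketch an endpoint that is still handshaking or has transfers in flight is never evicted, so the hard limit can still be reached; that assumption is mine and may differ from the actual eviction policy.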

Tested with a simple run on a local testbed.

Checklist

  • I have performed a self-review of my own code.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

@alogfans alogfans marked this pull request as ready for review December 1, 2025 08:51
Comment on lines 232 to 239
// if (!endpoint->active()) {
//     if (endpoint->inactiveTime() > 1.0)
//         context_.deleteEndpoint(
//             entry.first); // enable for re-establishation
//     for (auto &slice : entry.second)
//         failed_slice_list.push_back(slice); entry.second.clear();
//     continue;
// }
Collaborator

The same

@stmatengss
Collaborator

Does this PR resolve the limited QP number issue?

@alogfans
Collaborator Author

alogfans commented Dec 8, 2025

Does this PR resolve the limited QP number issue?

This patch eagerly destroys QPs whenever possible, so I think it helps mitigate this issue. However, the endpoint capacity remains unchanged.

@stmatengss
Collaborator

Full review is required before merging this PR.
