feat: support expert dynamic load balancing for DeepSeek. #26
Conversation
ep_size_ = parallel_args.ep_size();
ep_local_tp_size_ = parallel_args.world_size() / ep_size_;
CHECK_EQ(parallel_args.world_size(), ep_size_ * ep_local_tp_size_);
ep_local_tp_rank_ = parallel_args.rank() % ep_local_tp_size_;
num_experts_per_partition_ = model_args.n_routed_experts() / ep_size_;
if (FLAGS_enable_eplb) {
  num_experts_per_partition_++;
Would it be better to make the number of redundant experts a configurable parameter?
The number of redundant experts has been made a configurable parameter.
std::vector<int32_t> values;

for (const auto& [k, v] : expert_routing_map) {
  keys.emplace_back(k);
Can it be combined with the loop above?
std::vector<int64_t> keys;
std::vector<int32_t> values;
for (auto& [key, indices] : expert_routing_map) {
int num_of_duplications = indices.size();
int selected_index = ep_rank_ % num_of_duplications;
indices = {indices[selected_index]};
keys.emplace_back(key);
values.emplace_back(static_cast<int32_t>(indices[0]));
}
Done. Thanks for the suggestion!
    matches_pos.emplace_back(
        std::distance(device_expert_list_.begin(), iter) - start_idx);
  }
}
std::lock_guard<std::mutex> lock(experts_mutex_);
torch::Tensor tmp_tensor =
    is_sharded ? get_sharded_tensor(state_dict,
tmp_tensor and tmp_tensor_shm are the same tensor. We should be able to use one of them; maybe there is no need to retrieve it again from the sharded tensor?
Done. Thanks for the suggestion!
    bool transpose) {
  auto merge_experts_weights_sart = std::chrono::high_resolution_clock::now();
nit: The variable merge_experts_weights_sart is unused.
target_buffer = at_npu::native::npu_format_cast(target_buffer.contiguous(), 2)
                    .reshape({num_experts, gate_dim + up_dim, hidden_dim});

prepare_experts_weights_start = std::chrono::high_resolution_clock::now();
nit: prepare_experts_weights_start is not used.
        {num_layers, num_device_experts},
        torch::TensorOptions().dtype(torch::kInt64))
        .clone());
layer_ids[worker_rank] = result.value().prepared_layer_id;
I don’t understand this part. The size of the layer_ids array is num_device_experts - 1, but the loop below uses the size of results (where results.size() equals worker_clients_.size()). Can these two sizes match?
The number of redundant experts has been made a configurable parameter.
xllm/core/runtime/llm_engine.h (Outdated)
// For multi-node serving:
// All workers connect to the engine brpc server (engine_server_).
// engine_server_ sends a UniqueId that workers use to create the
// process group, and workers send their worker brpc server address
// back to the engine, which creates a WorkerClient for each worker.
// The engine calls workers to step via these WorkerClients.
std::shared_ptr<DistManager> dist_manager_ = nullptr;

std::unique_ptr<EplbManager> eplb_manager_ = nullptr;
std::unique_ptr<EplbPolicy> eplb_policy_ = nullptr;
nit: eplb_policy_ is only used in eplb_manager_; maybe it's more elegant to put it inside eplb_manager_.
auto prev_max_val = torch::max(prev_load).item<double>() + 1e-6f;

current_load = (current_load / current_max_val).unsqueeze(0);
;
nit: delete ;
state_.expert_load_queue.pop();
int64_t current_time = absl::ToUnixSeconds(absl::Now());
if (current_time - latest_record_time >= FLAGS_eplb_update_rate) {
  latest_record_time = current_time;
nit: FLAGS_eplb_update_rate -> FLAGS_eplb_update_interval ?
xllm/core/runtime/params_utils.cpp (Outdated)
    std::vector<int32_t>(pb_forward_input->eplb_info().expert_ids().begin(),
                         pb_forward_input->eplb_info().expert_ids().end());
eplb_info.update_layer_id = pb_forward_input->eplb_info().update_layer_id();
forward_inputs.eplb_info = eplb_info;
nit: forward_inputs.eplb_info = eplb_info; is redundant. eplb_info is a reference to forward_inputs.eplb_info, so this assignment copies the object onto itself.
@@ -236,4 +236,10 @@ struct JsonTool {
      : type(tool_type), function(func) {}
};

struct EplbInfo {
Please add comments for the struct/class and its public fields/methods.
Done! Added detailed comments.
#pragma once

#include <torch/torch.h>
Please list the declarations in alphabetical order.
This is clang-format's output based on our current rules. The ordering follows these specific priority rules:
1. Headers with .h suffix (like torch/torch.h) get highest priority
2. Other system headers (like functional) come next
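For reference, ordering like this is typically expressed through IncludeCategories in a .clang-format file; a hypothetical fragment matching the priorities described above (the project's actual configuration may differ):

```yaml
# Hypothetical .clang-format fragment; lower Priority sorts first.
SortIncludes: true
IncludeCategories:
  - Regex: '^<.*\.h>'   # system headers ending in .h, e.g. <torch/torch.h>
    Priority: 1
  - Regex: '^<.*>'      # other system headers, e.g. <functional>
    Priority: 2
  - Regex: '.*'         # remaining (project) headers
    Priority: 3
```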
Force-pushed from 2172181 to c48d6f9.
Force-pushed from 6943a2d to bfd630d.
Force-pushed from bfd630d to 906d13b.
No description provided.