feat: support multi-priority and on/offline unified request schedule. #48
Conversation
Note that a simpler alternative strategy is to continue using
const std::string& service_request_id,
bool offline,
int32_t slo_ms,
xllm::proto::Priority priority)
Create a new enum Priority to replace xllm::proto::Priority; we'd better avoid using protocol buffer types during serving.
int32_t slo_ms = 0;
xllm::proto::Priority priority = xllm::proto::Priority::NORMAL;
As mentioned above, a new enum Priority type should be used here.
std::shared_ptr<Request> request = prefill_request_queue_.pop();
auto poped_result = prefill_request_queue_.try_pop();
// OPTIMIZE: later, change this to try reading online prefill requests multiple times
nit: emmm... comments in English. :)
Sorry :(
DEFINE_string(priority_strategy, "FCFS", "priority strategy for requests");
DEFINE_bool(enable_on_preempt_off,
nit: rename the flag, as it may be misread as enable, on_preempt, off at first glance. Use enable_online_preempt_offline or some other name.
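For illustration, the renamed flag could read as follows (a sketch using the gflags macro the file already uses; the default value and help string are assumptions):

```cpp
#include <gflags/gflags.h>

// Hypothetical rename: spelling out online/offline avoids the misreading
// of enable_on_preempt_off as "enable, on_preempt, off".
DEFINE_bool(enable_online_preempt_offline,
            true,
            "allow online requests to preempt offline requests");
```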
num_blocks_can_evict += seq->kv_state().num_kv_blocks();
}
if ((num_blocks_needed <= num_blocks_can_evict) ||
    has_enough_blocks(num_blocks_needed - num_blocks_can_evict)) {
We don't need to call has_enough_blocks every time. Before check_if_enough_to_evict is called, the allocate operation must have failed, and the entire execution path runs serially with no concurrency (in the PD-separation scenario there is concurrency when allocating blocks for remote prompts, but that only involves block allocation, not block release). Therefore the number of free blocks cannot increase.
has_enough_blocks triggers the prefix_cache eviction logic on every call, and this may cost too much.
Based on the above considerations, we can place check_if_enough_to_evict directly in the continuous scheduler file, since we only need to judge whether the total num_kv_blocks in running_queue_to_evict meets the size requirement of num_request_to_evict.
BlockManagerImpl is inherently for managing block allocations; it is not appropriate to pass parameters such as DecodePriorityQueue to it.
while (!running_queue_.empty() &&
size_t& num_offd_preempt_off_requests,
size_t& num_ond_preempt_on_requests,
size_t& num_ond_preempt_off_requests,
nit: offd and ond, it's not immediately clear what they mean. :(
I just wanted to keep the names short and aligned with num_preempt_requests, orz. OK, I will change them to full names.
size_t& num_preempted_requests) {
// Do nothing: have new prefill requests to handle, or have no running
// requests
if (!running_sequences_.empty() || running_queue_.empty()) {
emmm... What is the purpose of removing this logic? In ContinuousScheduler, prefill will be executed with high priority.
I moved this logic to prepare_batch and removed the redundant second empty check. I want to wrap the two handle_decode_requests calls to ensure that only prefill or decode is processed.
size_t updated_num_tokens =
    sequence->num_tokens() + options_.num_speculative_tokens() + 1;
    sequence->num_tokens() + options_.num_speculative_tokens();
The target number of newly generated tokens is options_.num_speculative_tokens() + 1.
In my test script, I confirmed the new token has already been added to the sequence's num_tokens, so there is no need to +1 again. There was previously a logic difference between ContinuousScheduler and ChunkedPrefillScheduler here.
@@ -537,7 +691,8 @@ std::vector<Batch> ContinuousScheduler::schedule_request(
  return batch;
}
if (!waiting_priority_queue_.empty() || !running_queue_.empty()) {
if (!waiting_priority_queue_.empty() || !running_queue_->empty() ||
    !waiting_priority_queue_offline_.empty()) {
Add running_queue_offline_ ?
Yes!
poped_result = prefill_request_queue_offline_.try_pop();
if (!poped_result.has_value()) {
  // no offline request, sleep for a while and try again
  absl::SleepFor(absl::Milliseconds(100));
This may block online requests.
A more appropriate strategy is to design the pop interface as a blocking interface with a timeout (for example, it automatically returns null after 100 ms), and then use try_pop to read offline requests. This way, we block on offline requests rather than online ones.
auto poped_result = prefill_request_queue_.pop(absl::Milliseconds(100));
if (!poped_result.has_value()) {
  // timed out waiting for an online request; fall back to offline
  poped_result = prefill_request_queue_offline_.try_pop();
}
pop function:
absl::optional<T> pop(absl::Duration timeout) {
  absl::MutexLock lock(&mutex_);
  bool has_value = mutex_.AwaitWithTimeout(
      absl::Condition(
          +[](std::queue<T>* queue) { return !queue->empty(); },
          &queue_),
      timeout);
  if (!has_value) {
    return absl::nullopt;
  }
  T value = std::move(queue_.front());
  queue_.pop();
  return value;
}
Implement multi-priority and on/offline unified request scheduling.