Improve turbomind's prefix cache #3332


Open · wants to merge 40 commits into main

Conversation

@lvhan028 (Collaborator) commented on Mar 25, 2025

This Pull Request implements several improvements to the tm engine:

  • cache prompt tokens and generated tokens when prefix caching is enabled (see the sketch after this list)
  • remove stateful inference
  • update the chat API, CLI, get_logits, and get_ppl accordingly
  • enable_prefix_caching defaults to True
  • merge pt's chat.py and tm's chat.py into one
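
Below is a minimal sketch, not the PR's actual implementation, of the idea behind the first bullet: KV blocks are keyed in a trie by hashes of fixed-size token blocks, and after decoding the full token stream (prompt plus completion) is inserted back, so a follow-up request that repeats the previous turn reuses those blocks. All names here (`TrieNode`, `MatchPrefix`, `block_id`) are hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

// Hypothetical trie keyed on hashes of fixed-size token blocks; each node
// remembers the KV block that backs it.
struct TrieNode {
    int                                           block_id = -1;
    std::map<uint64_t, std::shared_ptr<TrieNode>> children;
};

// Return the KV block ids backing the longest cached prefix of a request.
std::vector<int> MatchPrefix(const std::shared_ptr<TrieNode>& root,
                             const std::vector<uint64_t>&     block_hashes)
{
    std::vector<int> matched;
    auto             node = root;
    for (uint64_t h : block_hashes) {
        auto it = node->children.find(h);
        if (it == node->children.end()) {
            break;  // longest shared prefix ends here
        }
        matched.push_back(it->second->block_id);
        node = it->second;
    }
    return matched;
}
```

Caching the generated tokens, not just the prompt, is what lets a multi-turn conversation (whose next prompt embeds the previous completion) hit the cache.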

Remaining refactoring will be addressed in a follow-up PR:

  • remove get_prompt in model.py
  • deprecate the built-in chat template and use AutoTokenizer's apply_chat_template instead

test cases:

@lvhan028 lvhan028 changed the base branch from main to dev March 25, 2025 11:13
@lvhan028 lvhan028 changed the base branch from dev to main April 3, 2025 01:54
@lvhan028 lvhan028 requested review from lzhangzz and irexyc April 8, 2025 04:33
@xliangwu commented:

How much does performance improve?

```diff
@@ -1011,16 +958,16 @@ void LlamaBatch::OutputLogits(const Tensor& logits, int first, int last, Generat
+    int diff = (history_len + offset) - cache_len;

-    const int valid_len = input_len - std::max(0, (history_len + offset) - cache_len);
+    const int valid_len = input_len - std::max(0, diff);
```
Collaborator:
Do not replace `(history_len + offset) - cache_len` with `diff`; it makes the code harder to understand. `diff` is only needed to print debug info.
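
A minimal sketch of the form the reviewer asks for; the variable names come from the hunk above, while the wrapper function is invented here for self-containment:

```cpp
#include <algorithm>

// Keep the explicit expression where the value is computed; `diff` exists
// only so a debug message can print it.
int ComputeValidLen(int input_len, int history_len, int offset, int cache_len)
{
    const int diff = (history_len + offset) - cache_len;  // debug output only
    (void)diff;  // e.g. passed to turbomind's debug logging in the real code

    return input_len - std::max(0, (history_len + offset) - cache_len);
}
```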

```cpp
freed_.insert(freed_.end(), seq.blocks.begin(), seq.blocks.end());
}
it = sequences_.erase(it);
else {
```
Collaborator:
The else branch has nothing to do with the function parameter, and it is going to be called multiple times when multiple sequences are erased.
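
A minimal sketch of the concern, under assumed types and an invented signature: anything placed in the loop's `else` branch runs once per retained entry, so work that does not depend on the function's parameter should be hoisted out of the loop.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <vector>

struct Sequence {
    uint64_t         id;
    std::vector<int> blocks;  // simplified stand-in for KV block handles
};

void EraseMatching(std::map<uint64_t, Sequence>&               sequences,
                   std::vector<int>&                           freed,
                   const std::function<bool(const Sequence&)>& pred)  // hypothetical
{
    for (auto it = sequences.begin(); it != sequences.end();) {
        if (pred(it->second)) {
            auto& seq = it->second;
            freed.insert(freed.end(), seq.blocks.begin(), seq.blocks.end());
            it = sequences.erase(it);
        }
        else {
            ++it;  // parameter-independent work placed here would repeat
        }
    }
    // bookkeeping unrelated to `pred` belongs here, so it runs exactly once
}
```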

```cpp
BlockIds block_ids;
UniqueIds block_unique_ids;
std::vector<std::shared_ptr<TrieNode>> nodes;
std::tie(block_ids, block_unique_ids, nodes) = block_trie_->Cache(seq, seq.prompt);
```
Collaborator:
Consider pruning the tree while matching prefixes, instead of the match-verify-remove pattern.
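
A minimal sketch of the suggestion, with hypothetical node and validity types: stale branches are erased the moment the walk encounters them, rather than being matched first, verified, and removed in a separate pass.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct Node {
    bool                                      valid = true;  // stand-in for real invalidation state
    std::map<uint64_t, std::shared_ptr<Node>> children;
};

// Walk the trie along the request's block hashes; prune stale branches as
// soon as they are seen and stop, instead of match-verify-remove.
int MatchAndPrune(const std::shared_ptr<Node>& root,
                  const std::vector<uint64_t>& hashes)
{
    int  matched = 0;
    auto node    = root;
    for (uint64_t h : hashes) {
        auto it = node->children.find(h);
        if (it == node->children.end()) {
            break;  // cached prefix ends here
        }
        if (!it->second->valid) {
            node->children.erase(it);  // prune in place during the match
            break;
        }
        ++matched;
        node = it->second;
    }
    return matched;  // number of reusable blocks
}
```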
