Improve turbomind's prefix cache #3332
base: main
Conversation
How much does performance improve?
```diff
@@ -1011,16 +958,16 @@ void LlamaBatch::OutputLogits(const Tensor& logits, int first, int last, Generat
+    int diff = (history_len + offset) - cache_len;

-    const int valid_len = input_len - std::max(0, (history_len + offset) - cache_len);
+    const int valid_len = input_len - std::max(0, diff);
```
Do not replace `(history_len + offset) - cache_len` with `diff`; it makes the code harder to understand. `diff` is only used to print debug info.
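A minimal sketch of the shape the reviewer is asking for: keep the full expression in the computation and confine `diff` to debug output. The wrapper function and the printf-style logging are illustrative assumptions, not turbomind's actual code; only the variable names come from the diff above.

```cpp
#include <algorithm>
#include <cstdio>

// Sketch only: a hypothetical stand-in for the relevant part of
// LlamaBatch::OutputLogits.
int ComputeValidLen(int input_len, int history_len, int offset, int cache_len)
{
    // Keep the explicit expression here so the intent stays readable.
    const int valid_len = input_len - std::max(0, (history_len + offset) - cache_len);

    // `diff` exists only to make the debug message easy to read.
    const int diff = (history_len + offset) - cache_len;
    std::printf("[debug] diff=%d valid_len=%d\n", diff, valid_len);

    return valid_len;
}
```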
```cpp
            freed_.insert(freed_.end(), seq.blocks.begin(), seq.blocks.end());
        }
        it = sequences_.erase(it);
        else {
```
The else branch has nothing to do with the function parameter and is going to be called multiple times when multiple sequences are erased.
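A hedged illustration of the concern; the loop shape, types, and the `erase_flag` field are assumptions based on the excerpt above, not the actual function. The point is that a branch inside the erase loop fires once per iteration, so work tied to the call as a whole should be hoisted out of the loop.

```cpp
#include <list>
#include <vector>

struct Sequence {
    bool             erase_flag;  // hypothetical: marks sequences to drop
    std::vector<int> blocks;
};

void EraseMarked(std::list<Sequence>& sequences, std::vector<int>& freed)
{
    for (auto it = sequences.begin(); it != sequences.end();) {
        if (it->erase_flag) {
            freed.insert(freed.end(), it->blocks.begin(), it->blocks.end());
            it = sequences.erase(it);
        }
        else {
            // This branch executes on every iteration, so anything unrelated
            // to the current sequence (e.g. work driven by the function
            // parameter) would run multiple times per call.
            ++it;
        }
    }
    // Per-call logic belongs here instead, where it runs exactly once.
}
```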
```cpp
    BlockIds block_ids;
    UniqueIds block_unique_ids;
    std::vector<std::shared_ptr<TrieNode>> nodes;
    std::tie(block_ids, block_unique_ids, nodes) = block_trie_->Cache(seq, seq.prompt);
```
Consider pruning the tree while matching prefixes instead of using the match-verify-remove pattern.
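A rough sketch of prune-while-matching on a hash-keyed trie. All types, the `block_id` and `valid` fields, and the function itself are assumptions for illustration; the real `block_trie_` interface is not shown in this excerpt.

```cpp
#include <map>
#include <memory>
#include <vector>

// Hypothetical node shape; the real TrieNode in block_trie differs.
struct TrieNode {
    int                                         block_id = -1;
    bool                                        valid    = true;  // assumed validity flag
    std::map<size_t, std::shared_ptr<TrieNode>> children;         // keyed by block hash
};

// Walk the trie along the request's block hashes. When an invalid child is
// met, erase it (and hence its whole subtree) on the spot, instead of
// matching first, verifying afterwards, and removing in a third pass.
std::vector<int> MatchAndPrune(TrieNode& root, const std::vector<size_t>& hashes)
{
    std::vector<int> matched_blocks;
    TrieNode*        curr = &root;
    for (size_t h : hashes) {
        auto it = curr->children.find(h);
        if (it == curr->children.end()) {
            break;  // prefix diverges; nothing more to match
        }
        if (!it->second->valid) {
            curr->children.erase(it);  // prune the stale subtree in-place
            break;
        }
        matched_blocks.push_back(it->second->block_id);
        curr = it->second.get();
    }
    return matched_blocks;
}
```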
This Pull Request implements several improvements to the tm engine:

- `enable_prefix_caching` defaults to True

Remaining refactoring will be processed in another PR:

- `get_prompt` in model.py: use `apply_chat_template` instead

test cases:
https://aicarrier.feishu.cn/wiki/JxoDwiOO7i0xnvkZHzxc3PdGnJh?sheet=95YWnW