The inference latency of model MiniMA-3B #6

@GeneZC Why is the inference latency of model MiniMA-3B longer than that of model Llama-7B?

Comments
Where did you obtain the results? We have not conducted such experiments in our paper.
I used model …
Somewhat weird. Could you please check that their settings are the same? For example, that they are both using flash attention and both using the KV cache.
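One quick way to compare the two setups is to inspect the loaded models directly. A minimal sketch, with placeholder Hugging Face model ids standing in for whichever checkpoints are actually being benchmarked:

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder ids: substitute the exact checkpoints being compared.
for name in ("GeneZC/MiniMA-3B", "huggyllama/llama-7b"):
    model = AutoModelForCausalLM.from_pretrained(
        name, torch_dtype=torch.float16, device_map="cuda"
    )
    # The attention class name reveals which backend is active
    # (e.g. LlamaFlashAttention2 vs. LlamaAttention / LlamaSdpaAttention),
    # and config.use_cache shows whether the KV cache is enabled by default.
    print(name, type(model.model.layers[0].self_attn).__name__, model.config.use_cache)
```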
And you can provide more information so that I can help you identify the problem.
And one point that should be noted: if more tokens are generated, the latency will naturally be larger. In that case, the latency is better normalized by the number of generated tokens.
Here is the code for measuring latency:
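A minimal sketch of this kind of latency measurement, assuming Hugging Face transformers, a CUDA device, and greedy decoding (the model id, prompt, and token budget are illustrative placeholders); it reports latency normalized by the number of generated tokens, as suggested above:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GeneZC/MiniMA-3B"  # placeholder: swap in the checkpoint being benchmarked
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
).eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")

# Warm-up run so one-off CUDA initialization does not distort the timing.
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=8)

torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False, use_cache=True)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

num_generated = out.shape[1] - inputs["input_ids"].shape[1]
print(f"total: {elapsed:.3f} s, per generated token: {elapsed / num_generated * 1000:.2f} ms")
```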
I set the attribute …
I do not see a clear reason why the latency would be that large. How about trying to load MiniMA with LlamaForCausalLM instead, since MiniMA uses the LLaMA architecture? And please kindly check whether flash attention is turned on for LLaMA but off for MiniMA.
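For instance, something along these lines (a sketch assuming a transformers version that accepts the `attn_implementation` argument; older versions used `use_flash_attention_2=True` instead):

```python
import torch
from transformers import LlamaForCausalLM

# MiniMA follows the LLaMA architecture, so the dedicated class should load it directly.
model = LlamaForCausalLM.from_pretrained(
    "GeneZC/MiniMA-3B",                       # placeholder id
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # keep the attention backend identical for both models
    device_map="cuda",
)
```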
I have tried using …
By …
Thanks, the result of …
The FLOPs in the table are training FLOPs. However, MiniMA should also have lower inference FLOPs than LLaMA-7B due to its smaller model scale, and therefore a smaller latency than LLaMA-7B in expectation (if they are tested under exactly the same settings). So I suspect there is still an uncovered difference somewhere. Perhaps the vocabulary size? MiniMA indeed has a slightly larger vocabulary than LLaMA-7B (~50000 vs ~30000), but I would not have expected the impact to be that large.
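The vocabulary sizes are easy to confirm from the model configs (a sketch with placeholder ids):

```python
from transformers import AutoConfig

# Per the discussion above, MiniMA's vocabulary is roughly 50k tokens while LLaMA-7B's is ~32k,
# which enlarges the output projection and softmax but is not expected to dominate latency.
for name in ("GeneZC/MiniMA-3B", "huggyllama/llama-7b"):  # placeholder ids
    print(name, AutoConfig.from_pretrained(name).vocab_size)
```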
I'm not sure. The vocab_size of …
Rather than modifying the vocabulary size, you can directly use the LLaMA-7B tokenizer for MiniMA-3B and run a test, since the two models share the first 32000 tokens. Good luck!
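A sketch of that test, relying on the shared first 32000 token ids mentioned above (model ids are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Tokenize with the LLaMA-7B tokenizer but run the MiniMA-3B model.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # placeholder id
model = AutoModelForCausalLM.from_pretrained(
    "GeneZC/MiniMA-3B",  # placeholder id
    torch_dtype=torch.float16,
    device_map="cuda",
).eval()

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32)
# Decoding with the LLaMA tokenizer is only meaningful while the generated ids
# stay inside the shared 32000-token range; for a pure latency test that is enough.
print(tokenizer.decode(out[0], skip_special_tokens=True))
```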
😂 Directly using the tokenizer of model …
Let me have a try ; ) |
Here are the results I obtained in a similar way to yours. Below is the code snippet:
…
Thanks!
Thank you for your answer, I know what the reason is now. It's due to the accuracy of the model: the accuracy of the two models is different.
You mean precision, right? I.e., FP16 or BF16?
That's also interesting. I did not expect the precision to impact the latency that much ;)
I also saw that your code set a specific precision, so I thought about trying it out. I didn't expect that this was indeed the reason.
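One way to confirm the effect is to time the same generation under each dtype explicitly (a sketch with a placeholder id; note that without an explicit `torch_dtype`, `from_pretrained` loads the weights in float32 by default):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GeneZC/MiniMA-3B"  # placeholder id
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=dtype, device_map="cuda"
    ).eval()
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=128, do_sample=False)
    torch.cuda.synchronize()
    print(dtype, f"{time.perf_counter() - start:.3f} s")
    # Free the weights before loading the next precision.
    del model
    torch.cuda.empty_cache()
```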
I see, good luck with your work!