Question about Exo #199
Comments
Hey, thanks for reporting this. The reason here is likely: I have created an issue here (with a $300 bounty) to add quantized model support to the tinygrad inference engine: #148. Once this is fixed, it should use less memory and be as fast or faster than ollama. If you want, you can also try running with
Thank you, I will try again later; it's not convenient to try right now. I have another question: when I run inference, it feels like it is being executed on the CPU. Words appear one by one and there is a noticeable lag. I set the environment variable CUDA=1, and I can see a main.py process using GPU memory. What is the reason for the slow inference? Does this also align with the reason you mentioned above?
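One quick way to confirm whether inference is really hitting the GPU is to check which device tinygrad (exo's inference engine here) actually selected. This is a minimal sketch, assuming tinygrad is importable in the same environment you launch exo from; the expected device names ("CUDA" vs. a CPU backend such as "CLANG") may differ slightly between tinygrad versions:

```python
# Sketch: verify tinygrad picks the CUDA backend instead of silently
# falling back to CPU. Set the env var before importing tinygrad.
import os
os.environ.setdefault("CUDA", "1")  # same variable set in the report above

from tinygrad import Device, Tensor

# Expect something like "CUDA" here; a CPU backend name suggests the
# GPU build/driver is not being picked up.
print("tinygrad default device:", Device.DEFAULT)

# Tiny matmul to confirm kernels actually execute on that device.
a = Tensor.rand(1024, 1024)
b = Tensor.rand(1024, 1024)
print((a @ b).sum().item())
```

If the default device reports a CPU backend even with CUDA=1 set, the slowness is a backend-selection problem rather than (only) the quantization issue discussed below.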
@yuqiao9 the benefit of quantisation is not just that a "big" LLM can fit into a smaller amount of memory; it also reduces the amount of data that has to be moved into the GPU for each calculation, which makes it much faster. An FP16 (i.e. 16-bit) model has far more data to move and math to do per token than a 4-bit quantised one, so every token takes a lot longer. You can probably close this issue if your questions are answered now.
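To make that concrete, here is a rough back-of-envelope sketch. The parameter count (8B, as in Llama 3.1 8B) and the 300 GB/s memory bandwidth are illustrative assumptions, not measurements; the point is only that decode is roughly bandwidth-bound, so tokens/s scales inversely with weight size:

```python
# Illustrative estimate: per generated token, (roughly) all weights are
# streamed from VRAM once, so tokens/s ~ bandwidth / weight_size.
params = 8e9            # assumed: Llama 3.1 8B parameter count
bandwidth_gb_s = 300    # assumed: GPU memory bandwidth in GB/s

for name, bytes_per_param in [("FP16", 2.0), ("4-bit", 0.5)]:
    weight_gb = params * bytes_per_param / 1e9
    tokens_per_s = bandwidth_gb_s / weight_gb
    print(f"{name}: ~{weight_gb:.0f} GB of weights, ~{tokens_per_s:.0f} tokens/s upper bound")

# On these assumed numbers: FP16 ~16 GB and ~19 tok/s vs 4-bit ~4 GB and ~75 tok/s,
# i.e. roughly a 3-4x difference in both size and per-token speed.
```

The same arithmetic also answers the size question below: 2 bytes per parameter (FP16) versus roughly 0.5 bytes per parameter (4-bit) is about a 3-4x difference in model size.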
I have successfully deployed Exo. Could you explain why, for the same Llama 3.1, Exo's model size is about three times larger than Ollama's, and also why Exo's execution speed is far inferior to that of Ollama?