
Question about Exo #199

Closed · yuqiao9 opened this issue Sep 5, 2024 · 4 comments

yuqiao9 commented Sep 5, 2024

I have successfully deployed Exo. Could you explain why, for the same Llama 3.1 model, Exo's model download is about three times larger than Ollama's, and also why Exo's inference speed is far slower than Ollama's?


yuqiao9 commented Sep 5, 2024

[screenshot attached]
Is it caused by a configuration error on my side?


AlexCheema commented Sep 5, 2024

Hey, thanks for reporting this.

The reason here is likely that ollama is using a 4-bit quantized model, while exo is using the unquantized fp16 model.

I have created an issue here (with a $300 bounty) to add quantized model support to the tinygrad inference engine: #148. Once that lands, exo should use less memory and be as fast as or faster than ollama. In the meantime you can also try running with BEAM=2, e.g. BEAM=2 python3 main.py, which should be quite a bit faster (I just tried it myself on one MacBook and it seems ~20% faster). Note that running with BEAM=2 may be a bit slower at the start, then faster.
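
To see where the roughly 3x size gap comes from, here is a quick back-of-envelope sketch (not exo code; the ~8.03B parameter count for Llama 3.1 8B and the ~4.5 effective bits per weight for a blocked 4-bit format are my own assumptions):

```python
# Rough estimate of weight storage for Llama 3.1 8B at different precisions.
# Assumes ~8.03e9 parameters; real fp16 safetensors and 4-bit files also
# carry metadata and per-block scale factors, so actual sizes differ a bit.
PARAMS = 8.03e9

def weights_gib(bits_per_weight: float) -> float:
    """Approximate weight storage in GiB at a given precision."""
    return PARAMS * bits_per_weight / 8 / 2**30

fp16 = weights_gib(16)   # unquantized fp16, what exo loads today
q4 = weights_gib(4.5)    # ~4-bit quantization including per-block scales

print(f"fp16: {fp16:.1f} GiB, 4-bit: {q4:.1f} GiB, ratio: {fp16 / q4:.1f}x")
# -> fp16: 15.0 GiB, 4-bit: 4.2 GiB, ratio: 3.6x -- roughly the gap observed
```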


yuqiao9 commented Sep 6, 2024

Thank you, I will try that later; it's not convenient to test right now. I have another question. During inference it feels like everything is running on the CPU: tokens appear one by one with a noticeable lag. I set the environment variable CUDA=1, and the main.py process does show up in GPU memory. What could make inference this slow? Does it also come down to the reason you mentioned above?


Rjvs commented Oct 21, 2024

@yuqiao9 the benefit of quantisation is not just that a "big" LLM fits into a smaller amount of memory; it also reduces the amount of data that has to be moved into the GPU for each calculation, which makes it much faster. An FP16 (i.e. 16-bit) model has far more data to move and arithmetic to do per token than a 4-bit quantised one, so every token simply takes a lot longer.
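
To put rough numbers on that, here is a minimal sketch of the bandwidth ceiling (the 200 GB/s figure and the assumption that every weight is read once per generated token are illustrative, not measurements of exo or ollama):

```python
# Rough per-token memory-bandwidth ceiling for Llama 3.1 8B decoding.
# Assumes every weight is read once per generated token and ~200 GB/s of
# usable memory bandwidth -- both numbers are illustrative assumptions.
PARAMS = 8.03e9
BANDWIDTH = 200e9  # bytes per second (hypothetical machine)

def ceiling_tokens_per_s(bits_per_weight: float) -> float:
    """Upper bound on decode speed if weight reads are the bottleneck."""
    bytes_per_token = PARAMS * bits_per_weight / 8
    return BANDWIDTH / bytes_per_token

print(f"fp16 ceiling : {ceiling_tokens_per_s(16):.1f} tok/s")
print(f"4-bit ceiling: {ceiling_tokens_per_s(4.5):.1f} tok/s")
# The 4-bit ceiling is ~3.5x higher simply because ~3.5x fewer bytes have
# to cross the memory bus for every token.
```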

You can probably close this issue if your questions are answered now.

yuqiao9 closed this as completed Oct 21, 2024