
Question about Exo #199

Closed · yuqiao9 opened this issue Sep 5, 2024 · 4 comments

yuqiao9 commented Sep 5, 2024

I have successfully deployed Exo. Could you explain why, for the same Llama 3.1 model, Exo's model download is about three times larger than Ollama's, and also why Exo's inference speed is far slower than Ollama's?


yuqiao9 commented Sep 5, 2024

[screenshot attached]
Is it caused by a configuration error on my side?


AlexCheema commented Sep 5, 2024

Hey, thanks for reporting this.

The reason here is likely that ollama is using a 4-bit quantized model, while exo is using the unquantized fp16 model.

I have created an issue here (with a $300 bounty) to add quantized model support to the tinygrad inference engine: #148. Once that lands, exo should use less memory and be as fast as or faster than ollama. In the meantime you can also try running with BEAM=2, e.g. BEAM=2 python3 main.py, which should be quite a bit faster (I just tried it myself on one MacBook and it seems ~20% faster). Note that running with BEAM=2 may be a bit slower at the start, then faster.
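
To see where the roughly 3x size gap comes from, here is a quick back-of-envelope sketch (not exo code; the ~8.03B parameter count for Llama 3.1 8B and the ~4.5 effective bits per weight for a blocked 4-bit format are my own assumptions):

```python
# Rough estimate of weight storage for Llama 3.1 8B at different precisions.
# Assumes ~8.03e9 parameters; real fp16 safetensors and 4-bit files also
# carry metadata and per-block scale factors, so actual sizes differ a bit.
PARAMS = 8.03e9

def weights_gib(bits_per_weight: float) -> float:
    """Approximate weight storage in GiB at a given precision."""
    return PARAMS * bits_per_weight / 8 / 2**30

fp16 = weights_gib(16)   # unquantized fp16, what exo loads today
q4 = weights_gib(4.5)    # ~4-bit quantization including per-block scales

print(f"fp16: {fp16:.1f} GiB, 4-bit: {q4:.1f} GiB, ratio: {fp16 / q4:.1f}x")
# -> fp16: 15.0 GiB, 4-bit: 4.2 GiB, ratio: 3.6x -- roughly the gap observed
```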


yuqiao9 commented Sep 6, 2024

Thank you, I will try that later; it's not convenient to test right now. I have another question. During inference it feels like everything is running on the CPU: tokens appear one by one with a noticeable lag. I set the environment variable CUDA=1, and the main.py process does show up in GPU memory. What could make inference this slow? Does it also come down to the reason you mentioned above?


Rjvs commented Oct 21, 2024

@yuqiao9 the benefit of quantisation is not just that a "big" LLM fits into a smaller amount of memory; it also reduces the amount of data that has to be moved into the GPU for each calculation, which makes it much faster. An FP16 (i.e. 16-bit) model has far more data to move and arithmetic to do per token than a 4-bit quantised one, so every token simply takes a lot longer.
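
To put rough numbers on that, here is a minimal sketch of the bandwidth ceiling (the 200 GB/s figure and the assumption that every weight is read once per generated token are illustrative, not measurements of exo or ollama):

```python
# Rough per-token memory-bandwidth ceiling for Llama 3.1 8B decoding.
# Assumes every weight is read once per generated token and ~200 GB/s of
# usable memory bandwidth -- both numbers are illustrative assumptions.
PARAMS = 8.03e9
BANDWIDTH = 200e9  # bytes per second (hypothetical machine)

def ceiling_tokens_per_s(bits_per_weight: float) -> float:
    """Upper bound on decode speed if weight reads are the bottleneck."""
    bytes_per_token = PARAMS * bits_per_weight / 8
    return BANDWIDTH / bytes_per_token

print(f"fp16 ceiling : {ceiling_tokens_per_s(16):.1f} tok/s")
print(f"4-bit ceiling: {ceiling_tokens_per_s(4.5):.1f} tok/s")
# The 4-bit ceiling is ~3.5x higher simply because ~3.5x fewer bytes have
# to cross the memory bus for every token.
```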

You can probably close this issue if your questions are answered now.

yuqiao9 closed this as completed Oct 21, 2024