Help to understand #14

Hi!
I don't quite understand how this project works. I guess my main question is: what is a draft model?
For example, I would like to speed up the inference of OwlViT (https://huggingface.co/google/owlvit-base-patch32), which I use through the transformers library. Can I do that with GPTFast? Thanks!
Comments
Hey, apologies for the late response! Vision Transformers are not used for text generation, so they are not supported at this moment. As for your question about speculative decoding, it essentially uses a smaller draft model to predict the outputs of the larger model. Hope that this is helpful!
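(Not GPTFast itself, but as a rough illustration of the same draft/target pairing: Hugging Face transformers exposes assisted generation through the `assistant_model` argument of `generate`. The model pair below is only an example of a big model and a much smaller draft model that share a tokenizer.)

```python
# Sketch of speculative/assisted decoding with Hugging Face transformers
# (assumes a recent transformers release; gpt2-xl/distilgpt2 are just an
# example of a compatible big/draft pair, not GPTFast's own setup).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
target = AutoModelForCausalLM.from_pretrained("gpt2-xl")    # slow, accurate
draft = AutoModelForCausalLM.from_pretrained("distilgpt2")  # fast draft, same vocab

inputs = tokenizer("Explain what a rainbow is.", return_tensors="pt")

# The draft model proposes a few tokens ahead; the target model checks them in
# a single forward pass and keeps only the prefix it agrees with.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=60)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```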
Hi, same question. Thanks
Having the same issue. For example, this just loads the zephyr model twice, so I am not sure whether I am using it properly.
Take a look at this video, where Horace He explains the concept nicely. In essence, you would want to use a smaller model that is an order of magnitude faster, let it run for a number of steps, and then check its results in parallel with the big model (the check costs almost the same as generating a single token, because the big model is memory-bound by its weights).
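To make that step concrete, here is a minimal greedy sketch in plain PyTorch/transformers (an illustration of the idea only, not GPTFast's actual code; gpt2-large and distilgpt2 are just example models that share a vocabulary):

```python
# One draft-then-verify step of greedy speculative decoding (illustration only,
# no KV cache, not GPTFast internals).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
big = AutoModelForCausalLM.from_pretrained("gpt2-large")    # slow, accurate
small = AutoModelForCausalLM.from_pretrained("distilgpt2")  # fast draft, same vocab

@torch.no_grad()
def speculative_step(input_ids, k=4):
    # 1) Let the cheap draft model run ahead for k greedy steps.
    draft_ids = input_ids
    for _ in range(k):
        logits = small(draft_ids).logits[:, -1, :]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)
    draft = draft_ids[:, input_ids.shape[1]:]  # the k proposed tokens

    # 2) One forward pass of the big model over prompt + all k draft tokens
    #    verifies every draft position in parallel.
    logits = big(draft_ids).logits
    verify = logits[:, input_ids.shape[1] - 1:-1, :].argmax(-1)  # big model's choice per position

    # 3) Accept the longest prefix where draft and big model agree, then append
    #    the big model's own token at the first mismatch.
    n_accept = 0
    for i in range(k):
        if draft[0, i] == verify[0, i]:
            n_accept += 1
        else:
            break
    accepted = draft[:, :n_accept]
    correction = verify[:, n_accept:n_accept + 1]  # empty if all k were accepted
    return torch.cat([input_ids, accepted, correction], dim=-1)

ids = tok("Explain what a rainbow is.", return_tensors="pt").input_ids
for _ in range(8):  # a few speculative steps
    ids = speculative_step(ids)
print(tok.decode(ids[0], skip_special_tokens=True))
```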
@andreas-solti thanks for the response. I watched the video and understood that the smaller model is the one that should be relatively fast, so the actual model only has to select the best candidates, as explained in the video. So what I really need to understand is the criterion for selecting the smaller/draft model.
Yes, you need a draft model that is compatible with the tokens, but as Horace mentioned in his presentation, you are also free to use something else that is able to predict the next tokens quickly. He said it is possible to use a trigram model, for example (in theory you can use any model, but you would have to convert its output into token ids compatible with what your big model expects). Think of this example: you have the prompt context "Explain what a rainbow is." and the model is now in the decoding phase. If your small model predicts "A rainbow is a beautiful phenomenon...", you can feed the corresponding tokens into your big model in parallel. Then you check whether the big model's outputs also correspond to the next predicted tokens. If they do not, you need to recompute based on the output the big model actually generated. Good luck. Also try asking ChatGPT about the error messages that you get.
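As a small illustration of that compatibility point (the model names below are only examples, not a recommendation):

```python
# A draft model's token ids can only be verified directly by the big model if
# both models share the same vocabulary (example model names, chosen arbitrarily).
from transformers import AutoTokenizer

target_tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
draft_tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

if target_tok.get_vocab() == draft_tok.get_vocab():
    print("Same vocabulary: draft token ids can be fed to the big model as-is.")
else:
    # Otherwise (e.g. a trigram model over words), decode the draft's output to
    # text and re-encode it with the big model's tokenizer before verifying.
    print("Different vocabularies: decode to text and re-encode with the target tokenizer.")
```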