Extend on-device sampling support for dual QPC VLMs #597
base: main
Conversation
Force-pushed from af8e673 to df3501a
quic-hemagnih left a comment
Can you please add the CI test cases?
Signed-off-by: quic-xiyushi <[email protected]>
Force-pushed from df3501a to d722a5a
Signed-off-by: quic-xiyushi <[email protected]>
Force-pushed from d722a5a to e06e175
Signed-off-by: quic-sanising <[email protected]>
Signed-off-by: sanising <[email protected]>
Force-pushed from 900aee5 to 3e242ce
Signed-off-by: quic-xiyushi <[email protected]>
Signed-off-by: sanising <[email protected]>
if kwargs.pop("qaic_config", None):
    raise NotImplementedError("On-device sampling is not supported for single QPC multimodal models yet.")
Isn't qaic_config used for spec decoding as well?
Are we not supporting spec decode for single QPC?
Should we just error out, or check what's actually passed in the config?
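A minimal sketch of the "check what's passed" option, assuming qaic_config is a dict and that a key such as include_sampler is what gates on-device sampling (the key name and handling below are illustrative assumptions, not the repository's verified schema):

```python
# Sketch only: "include_sampler" is assumed to be the qaic_config key that
# enables on-device sampling; other options (e.g. speculative decoding) are
# forwarded instead of being rejected wholesale.
qaic_config = kwargs.pop("qaic_config", None)
if qaic_config is not None:
    if qaic_config.get("include_sampler"):
        raise NotImplementedError(
            "On-device sampling is not supported for single QPC multimodal models yet."
        )
    # Keep the remaining config so features other than sampling still work.
    kwargs["qaic_config"] = qaic_config
```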
  self.config = model.config
  self.vision_model = QEffVisionEncoderForTextImageToTextModel(model, **kwargs)
- self.lang_model = QEffCausalLMForTextImageToTextModel(model, **kwargs)
+ self.lang_model = QEffCausalLMForTextImageToTextModel(model, continuous_batching=continuous_batching, **kwargs)
Replace lines 984-985 with the following:

if kwargs.pop("full_batch_size", None):
    continuous_batching = True
    warnings.warn(
        "full_batch_size argument is deprecated. Use continuous_batching=True instead.", DeprecationWarning, 2
    )
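For context, this is the caller-side migration the deprecation implies. A hedged example, assuming from_pretrained forwards these keyword arguments as the diff above suggests, and using a placeholder checkpoint ID:

```python
from QEfficient import QEFFAutoModelForImageTextToText

model_id = "OpenGVLab/InternVL2_5-1B"  # placeholder checkpoint for illustration

# Deprecated style: would now trigger the DeprecationWarning suggested above.
model = QEFFAutoModelForImageTextToText.from_pretrained(model_id, kv_offload=True, full_batch_size=4)

# Preferred style.
model = QEFFAutoModelForImageTextToText.from_pretrained(model_id, kv_offload=True, continuous_batching=True)
```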
def get_sampling_inputs_and_outputs(
    self,
    example_inputs: Dict[str, torch.Tensor],
    output_names: List[str],
    dynamic_axes: Dict[str, Dict[int, str]],
):
How is this method different from QEFFAutoModelForCausalLM.get_sampling_inputs_and_outputs?
Can we combine these and create a shared method in utils or spd_utils to remove the code duplication?
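One possible shape for such a shared helper. This is a sketch only: the module path, function name, parameter list, and tensor/output names are illustrative assumptions, not the existing API, and the real helper would add the full set of sampling parameters rather than the single tensor shown here.

```python
# Hypothetical shared utility, e.g. QEfficient/utils/sampler_utils.py.
# Both QEFFAutoModelForCausalLM and the dual-QPC VLM language model could call
# this instead of duplicating the logic in each class.
from typing import Dict, List, Tuple

import torch


def extend_with_sampling_inputs_and_outputs(
    example_inputs: Dict[str, torch.Tensor],
    output_names: List[str],
    dynamic_axes: Dict[str, Dict[int, str]],
    batch_size: int = 1,
) -> Tuple[Dict[str, torch.Tensor], List[str], Dict[str, Dict[int, str]]]:
    """Extend the ONNX export spec with tensors required by on-device sampling."""
    # Illustrative example of one added input; the real helper would also add
    # temperatures, top_ks, top_ps, etc.
    example_inputs["last_accepted_output_tokens"] = torch.zeros((batch_size, 1), dtype=torch.int64)
    dynamic_axes["last_accepted_output_tokens"] = {0: "batch_size", 1: "num_logits_to_keep"}
    output_names.append("next_tokens")
    return example_inputs, output_names, dynamic_axes
```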
Can you add a test for the Intern model, i.e., a VLM in dual QPC mode?
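A rough sketch of what such a CI case might look like; the model ID, pytest marker, qaic_config keys, and compile arguments are assumptions and would need to follow the repo's existing VLM test fixtures:

```python
# Hypothetical test; identifiers below are placeholders, not verified fixtures.
import pytest

from QEfficient import QEFFAutoModelForImageTextToText


@pytest.mark.on_qaic
def test_vlm_dual_qpc_on_device_sampling():
    model = QEFFAutoModelForImageTextToText.from_pretrained(
        "OpenGVLab/InternVL2_5-1B",  # assumed Intern checkpoint
        kv_offload=True,             # dual QPC path
        qaic_config={"include_sampler": True, "return_pdfs": False},
    )
    model.compile(num_cores=16, num_devices=1)
    # A short generate() call with sampling parameters would go here; the exact
    # processor and inputs depend on the existing Intern test utilities.
```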
QEffGPTJForCausalLM,
QEffGraniteForCausalLM,
QEffGraniteMoeForCausalLM,
QEffInternDecoderWrapper,
Does this mean we are enabling sampling only for the Intern model?
Will other VLMs also be supported?
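If the list above is the sampler transform's module registry (an assumption on my part), extending coverage beyond Intern would presumably just mean registering the other VLM language-decoder wrappers there, along these lines:

```python
# Illustrative only: the container name and the commented-out class names are
# assumptions; this PR itself only adds QEffInternDecoderWrapper.
_module_mapping = {
    QEffGPTJForCausalLM,
    QEffGraniteForCausalLM,
    QEffGraniteMoeForCausalLM,
    QEffInternDecoderWrapper,
    # QEffLlavaDecoderWrapper,   # hypothetical future VLM entry
    # QEffMllamaDecoderWrapper,  # hypothetical future VLM entry
}
```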
No description provided.