Added quantization utils to allow extending FP16 CoreML models to FP32 #637
base: main
Conversation
Thanks for the changes @tianrui! Can you comment on the use cases for FP16 -> FP32? |
@tianrui I think there is already a mode that does this: the quantization_mode="dequantization" path in quantize_weights. This mode hasn't been documented, but it is currently used by the unit tests that exercise the weight quantization feature. |
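A minimal sketch of how that existing mode might be invoked, assuming the undocumented quantization_mode="dequantization" value is accepted by quantize_weights (the nbits value and file names are illustrative only):

```python
# Sketch only: relies on the undocumented quantization_mode="dequantization"
# value mentioned in this thread; the exact behaviour may differ.
import coremltools
from coremltools.models.neural_network import quantization_utils

fp16_model = coremltools.models.MLModel("BERTSQUADFP16.mlmodel")

# nbits is required by the public signature; for dequantization it is
# presumably ignored, so 32 is only a placeholder here.
fp32_model = quantization_utils.quantize_weights(
    fp16_model, nbits=32, quantization_mode="dequantization"
)
fp32_model.save("BERTSQUAD_FP32.mlmodel")
```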
@1duo I was working on demoing Apple's Core ML BERT model, which can be optimized with MPS, but only for FP32 parameters at the moment. I did notice that a dequantization mode is available, and I will verify it against my PR. |
When trying to dequantize using _dequantize_nn_spec() from quantization_utils.py with the spec extracted from model.get_spec(), I hit AttributeError: layers. Is there another way to dequantize the model that I'm not aware of? I've verified that the performance of the dequantized model matches the FP16 model downloaded from https://docs-assets.developer.apple.com/coreml/models/Text/QuestionAnswering/BERT_SQUAD/BERTSQUADFP16.mlmodel. |
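For reference, a sketch of the call path being described. _dequantize_nn_spec is a private helper, so the guess that it expects the nested neural-network message (which actually carries a layers field) rather than the top-level Model spec is an assumption:

```python
# Sketch: _dequantize_nn_spec appears to operate on the nested neural-network
# spec (which has a `layers` field), not on the top-level Model spec.
import coremltools
from coremltools.models.neural_network.quantization_utils import _dequantize_nn_spec

model = coremltools.models.MLModel("BERTSQUADFP16.mlmodel")
spec = model.get_spec()

# The top-level spec has no `layers` attribute, which would explain the
# AttributeError; the nested message does (use neuralNetworkClassifier or
# neuralNetworkRegressor instead if the model is one of those types).
_dequantize_nn_spec(spec.neuralNetwork)

fp32_model = coremltools.models.MLModel(spec)
```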
Did you also try using |
Hi @aseemw, I tried the function you suggested, but this mode fails when dequantizing an embedding layer in BERT: it makes a call to _dequantize_wp(), assumes there is a LUT where none exists, and so the call to _dequantize_lut() fails. The FP16 weight parameter has a float16Value field holding the byte array of weights, while its rawValue and floatValue fields are empty. Do you have any suggestions on further verifying the feature? |
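A small sketch of how that observation can be checked directly against the protobuf, using the WeightParams fields named above (the embedding layer-type filter is an assumption about how the BERT embeddings are represented):

```python
# Sketch: report which storage field each embedding layer's weights use.
import coremltools

spec = coremltools.models.MLModel("BERTSQUADFP16.mlmodel").get_spec()
for layer in spec.neuralNetwork.layers:
    ltype = layer.WhichOneof("layer")
    if ltype in ("embedding", "embeddingND"):  # assumed embedding layer types
        wp = getattr(layer, ltype).weights
        print(
            layer.name,
            "float16Value bytes:", len(wp.float16Value),
            "rawValue bytes:", len(wp.rawValue),
            "floatValue entries:", len(wp.floatValue),
            "has quantization:", wp.HasField("quantization"),
        )
```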
Instead of adding a new API, fix the quantization_mode="dequantization" path in the existing API (quantize_weights(quantization_mode="dequantization")).
@tianrui There seems to be a bug where it assumes a LUT even when there isn't one. Can you look into fixing that bug? Which line surfaces that error? (Maybe the check for whether the quantization type is linear or LUT is missing.) |
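A sketch of the kind of guard this seems to point at, dispatching on the quantization type instead of assuming a LUT. _dequantize_lut mirrors the helper named in the thread, while _dequantize_linear is a hypothetical stand-in for the linear path, and the whole function is an assumption about the fix, not the actual patch:

```python
# Sketch of a possible fix: pick the dequantization path from the
# WeightParams contents instead of assuming a lookup table.
import numpy as np

def _dequantize_wp_sketch(wp):
    # FP16 weights live in float16Value with no QuantizationParams at all,
    # so handle that case before touching the quantization message.
    if len(wp.float16Value) > 0:
        return np.frombuffer(wp.float16Value, dtype=np.float16).astype(np.float32)

    qtype = wp.quantization.WhichOneof("QuantizationType")
    if qtype == "lookupTableQuantization":
        return _dequantize_lut(wp)      # existing LUT helper from the thread
    if qtype == "linearQuantization":
        return _dequantize_linear(wp)   # hypothetical linear-path helper
    raise ValueError("Unsupported or missing quantization type: %r" % qtype)
```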
Extend the parameters of an FP16 MLModel to FP32 by typecasting in numpy and converting the definitions in the model's graph definition protobuf.
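A minimal sketch of that conversion, under the assumption that each FP16 layer stores its weights in WeightParams.float16Value; the helper, the weights/bias field walk, and the file names are illustrative rather than the actual diff:

```python
# Sketch of the FP16 -> FP32 widening described above: typecast the float16
# byte array with numpy and rewrite the weights as explicit float32 values.
import numpy as np
import coremltools

def _widen_weight_params(wp):
    """Hypothetical helper: widen one WeightParams message in place."""
    if len(wp.float16Value) == 0:
        return  # nothing stored as FP16 here
    fp32 = np.frombuffer(wp.float16Value, dtype=np.float16).astype(np.float32)
    wp.floatValue.extend(fp32.tolist())
    wp.float16Value = b""  # clear FP16 storage so only one field is populated

model = coremltools.models.MLModel("BERTSQUADFP16.mlmodel")
spec = model.get_spec()
for layer in spec.neuralNetwork.layers:
    params = getattr(layer, layer.WhichOneof("layer"))
    # Only the common weights/bias fields are handled in this sketch.
    for field in ("weights", "bias"):
        if hasattr(params, field):
            _widen_weight_params(getattr(params, field))

coremltools.models.MLModel(spec).save("BERTSQUAD_FP32.mlmodel")
```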