Try some QD8-BF16 Experiments #11466
We've prototyped some new QD8-BF16-QB4W kernels in XNNPACK. Let's try leveraging them in ExecuTorch and see what our performance looks like:
Exports:
There is no change in model size, since this only affects activations. To make the comparison fairer, we removed delegation of all operators other than the qd8-bf16-qb4w GEMMs, because XNNPACK does not yet have bf16 implementations of those other operators.
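For reference, the lowering flow we're experimenting with looks roughly like the sketch below. This is a minimal illustration, not the exact export script from this PR: `make_quantized_model` is a placeholder for whatever qd8/qb4w-quantized model is being exported, the example input shape and output filename are made up, and the `config_precisions` / `ConfigPrecisionType` filter is my understanding of how to keep only the dynamically quantized GEMMs delegated.

```python
import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.backends.xnnpack.partition.config.xnnpack_config import ConfigPrecisionType

# Placeholder: an nn.Module whose linear layers are quantized to
# 8-bit dynamic activations / 4-bit blockwise weights (qd8-*-qb4w).
model = make_quantized_model().to(torch.bfloat16).eval()
example_inputs = (torch.randint(0, 32000, (1, 64)),)  # assumed token-id input

exported = torch.export.export(model, example_inputs)

# Only delegate the dynamically quantized GEMMs to XNNPACK, leaving every
# other op un-delegated, mirroring the comparison described above.
edge = to_edge_transform_and_lower(
    exported,
    partitioner=[XnnpackPartitioner(config_precisions=[ConfigPrecisionType.DYNAMIC_QUANT])],
)

et_program = edge.to_executorch()
with open("model_bf16_qb4w.pte", "wb") as f:
    f.write(et_program.buffer)
```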
A few things to notice here:
- The BF16 model uses 1/3 of the memory of the fp32 model.
- We see some performance drop with BF16. This is likely because the GEMM kernel does an extra shift to produce bf16 outputs: the math is still done in f32, and the result is just right-shifted before storing (see the sketch below).
- Additionally, the quantize kernel for bf16 --> qd8 is still a naive implementation, so it is a bit slower.
- The generated results currently seem to be nonsensical.
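To make the "computed in f32, right-shifted before storing" point concrete, here is a small numpy sketch of that conversion. This is only an illustration of the bit-level truncation, not the actual XNNPACK kernel code, and the real kernel's rounding behavior may differ (e.g. round-to-nearest-even rather than plain truncation).

```python
import numpy as np

def f32_to_bf16_truncate(x: np.ndarray) -> np.ndarray:
    """Store an f32 value as bf16 by dropping the low 16 mantissa bits
    (the 'right shift before storing' described above)."""
    bits = x.astype(np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)  # bf16 bit pattern

def bf16_to_f32(b: np.ndarray) -> np.ndarray:
    """Widen a bf16 bit pattern back to f32 by zero-filling the low 16 bits."""
    return (b.astype(np.uint32) << 16).view(np.float32)

acc = np.array([3.1415927, 0.0001234, 12345.678], dtype=np.float32)  # fake accumulator values
stored = bf16_to_f32(f32_to_bf16_truncate(acc))
print(acc - stored)  # the precision lost at the store
```

The 16 dropped mantissa bits are where the accuracy difference relative to storing f32 outputs comes from; the extra shift itself is the added work in the GEMM's store path.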