This repository was archived by the owner on Aug 28, 2024. It is now read-only.

what int8 inference pipeline looks like #12

Open
dengzheng-cloud opened this issue Aug 17, 2022 · 0 comments

Comments

@dengzheng-cloud

I am trying to achieve int8 quantization for a submodule (RelPositionalMHA) of the WeNet base network (Conformer). Now I have some questions about how to implement a custom int8 quantization TensorRT plugin.
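For context, here is a minimal sketch of what I mean by int8 quantization at the tensor level (my own illustration in NumPy, not code from this repo): symmetric per-tensor quantize/dequantize with a max-abs scale.

```python
import numpy as np

def quantize_per_tensor(x, scale):
    # symmetric per-tensor quantization: q = clip(round(x / scale), -127, 127)
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize_per_tensor(q, scale):
    # dequantize back to float: x ≈ q * scale
    return q.astype(np.float32) * scale

x = np.random.randn(4, 8).astype(np.float32)
scale = np.abs(x).max() / 127.0          # max-abs calibration for the scale
q = quantize_per_tensor(x, scale)
x_hat = dequantize_per_tensor(q, scale)
print(np.abs(x - x_hat).max())           # quantization error
```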

About the input: I read the code of FasterTransformer and the WeNet TensorRT plugin, and you used invokeQuantization. Does that mean you change the model, put the quantization op (its scales) into the plugin weights, and read it during inference initialization?
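I am not sure of the exact mechanism, but one pattern I can imagine (a sketch with hypothetical file and scale names, not the plugin's actual code) is to calibrate the scales offline and serialize them next to the weights, so the plugin can load them at initialization and pass them to its int8 kernels:

```python
import numpy as np

def calibrate_scale(samples):
    # per-tensor max-abs calibration over a list of FP32 activation samples (hypothetical helper)
    return max(np.abs(s).max() for s in samples) / 127.0

# suppose `calib_batches` are FP32 activations collected from running the FP32 model
calib_batches = [np.random.randn(16, 256).astype(np.float32) for _ in range(8)]
scales = {"attn_input_scale": calibrate_scale(calib_batches)}

# serialize the scales next to the weights; a plugin could read this file in its
# initialization / deserialization step and hand the scale to an int8 GEMM kernel
np.savez("int8_scales.npz", **scales)
```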

About pos_emb: I referred to the WeNet code to produce a NumPy version of RelPositionalMHA inference, but I did not find any CUDA code that handles pos_emb. Does that mean the pos_emb input is None during inference?
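For reference, this is roughly how I compute pos_emb in my NumPy version (a sketch assuming the WeNet/ESPnet-style sinusoidal table of shape (1, T, d_model); the real CUDA kernel may generate it internally or differ):

```python
import numpy as np

def rel_positional_embedding(seq_len, d_model):
    # sinusoidal table of shape (1, seq_len, d_model), computed from seq_len alone
    pos = np.arange(seq_len, dtype=np.float32)[:, None]               # (T, 1)
    div = np.exp(np.arange(0, d_model, 2, dtype=np.float32)
                 * (-np.log(10000.0) / d_model))                      # (d_model / 2,)
    pe = np.zeros((seq_len, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(pos * div)
    pe[:, 1::2] = np.cos(pos * div)
    return pe[None, :, :]

pos_emb = rel_positional_embedding(seq_len=64, d_model=256)
print(pos_emb.shape)  # (1, 64, 256)
```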

About PPQ: I used onnxruntime to quantize my submodule model to speed it up, but it is slower than the raw model converted to a TensorRT engine. The quantized model looks like this:
[image: quantized ONNX model graph]
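For completeness, this is roughly how such a QDQ model can be produced with onnxruntime (a sketch with hypothetical model paths and a dummy random calibration reader; real calibration should feed representative speech features):

```python
import numpy as np
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      QuantType, quantize_static)

class RandomCalibReader(CalibrationDataReader):
    # hypothetical calibration reader feeding a few random batches
    def __init__(self, input_name, shape, n=8):
        self.data = iter([{input_name: np.random.randn(*shape).astype(np.float32)}
                          for _ in range(n)])

    def get_next(self):
        return next(self.data, None)

quantize_static(
    model_input="rel_pos_mha_fp32.onnx",       # hypothetical path
    model_output="rel_pos_mha_int8.onnx",      # hypothetical path
    calibration_data_reader=RandomCalibReader("input", (1, 64, 256)),
    quant_format=QuantFormat.QDQ,              # explicit Q/DQ nodes in the graph
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)
```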

Looking forward to your reply.
