This repository was archived by the owner on Aug 28, 2024. It is now read-only.

what int8 inference pipeline looks like #12

Open
dengzheng-cloud opened this issue Aug 17, 2022 · 0 comments

Comments

@dengzheng-cloud

I am trying to achieve int8 quantization for a submodule (RelPositionalMHA) of the WeNet base network (Conformer). Now I have some questions about how to implement a custom int8 quantization TensorRT plugin.
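For context, here is a minimal sketch of what I mean by int8 quantization at the tensor level (my own illustration in NumPy, not code from this repo): symmetric per-tensor quantize/dequantize with a max-abs scale.

```python
import numpy as np

def quantize_per_tensor(x, scale):
    # symmetric per-tensor quantization: q = clip(round(x / scale), -127, 127)
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize_per_tensor(q, scale):
    # dequantize back to float: x ≈ q * scale
    return q.astype(np.float32) * scale

x = np.random.randn(4, 8).astype(np.float32)
scale = np.abs(x).max() / 127.0          # max-abs calibration for the scale
q = quantize_per_tensor(x, scale)
x_hat = dequantize_per_tensor(q, scale)
print(np.abs(x - x_hat).max())           # quantization error
```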

About the input: I read the code of FasterTransformer and the WeNet TensorRT plugin, and you used invokeQuantization. Does that mean you change the model, put the quantization op (its scales) into the plugin weights, and read it during inference initialization?
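I am not sure of the exact mechanism, but one pattern I can imagine (a sketch with hypothetical file and scale names, not the plugin's actual code) is to calibrate the scales offline and serialize them next to the weights, so the plugin can load them at initialization and pass them to its int8 kernels:

```python
import numpy as np

def calibrate_scale(samples):
    # per-tensor max-abs calibration over a list of FP32 activation samples (hypothetical helper)
    return max(np.abs(s).max() for s in samples) / 127.0

# suppose `calib_batches` are FP32 activations collected from running the FP32 model
calib_batches = [np.random.randn(16, 256).astype(np.float32) for _ in range(8)]
scales = {"attn_input_scale": calibrate_scale(calib_batches)}

# serialize the scales next to the weights; a plugin could read this file in its
# initialization / deserialization step and hand the scale to an int8 GEMM kernel
np.savez("int8_scales.npz", **scales)
```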

About pos_emb: I referred to the WeNet code to produce a NumPy version of RelPositionalMHA inference, but I did not find any CUDA code that handles pos_emb. Does that mean the pos_emb input is None during inference?
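For reference, this is roughly how I compute pos_emb in my NumPy version (a sketch assuming the WeNet/ESPnet-style sinusoidal table of shape (1, T, d_model); the real CUDA kernel may generate it internally or differ):

```python
import numpy as np

def rel_positional_embedding(seq_len, d_model):
    # sinusoidal table of shape (1, seq_len, d_model), computed from seq_len alone
    pos = np.arange(seq_len, dtype=np.float32)[:, None]               # (T, 1)
    div = np.exp(np.arange(0, d_model, 2, dtype=np.float32)
                 * (-np.log(10000.0) / d_model))                      # (d_model / 2,)
    pe = np.zeros((seq_len, d_model), dtype=np.float32)
    pe[:, 0::2] = np.sin(pos * div)
    pe[:, 1::2] = np.cos(pos * div)
    return pe[None, :, :]

pos_emb = rel_positional_embedding(seq_len=64, d_model=256)
print(pos_emb.shape)  # (1, 64, 256)
```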

About PPQ: I used onnxruntime to quantize my submodule model to speed it up, but it is slower than the raw model converted to a TensorRT engine. The quantized model looks like this:
[image: quantized ONNX model graph]
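For completeness, this is roughly how such a QDQ model can be produced with onnxruntime (a sketch with hypothetical model paths and a dummy random calibration reader; real calibration should feed representative speech features):

```python
import numpy as np
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      QuantType, quantize_static)

class RandomCalibReader(CalibrationDataReader):
    # hypothetical calibration reader feeding a few random batches
    def __init__(self, input_name, shape, n=8):
        self.data = iter([{input_name: np.random.randn(*shape).astype(np.float32)}
                          for _ in range(n)])

    def get_next(self):
        return next(self.data, None)

quantize_static(
    model_input="rel_pos_mha_fp32.onnx",       # hypothetical path
    model_output="rel_pos_mha_int8.onnx",      # hypothetical path
    calibration_data_reader=RandomCalibReader("input", (1, 64, 256)),
    quant_format=QuantFormat.QDQ,              # explicit Q/DQ nodes in the graph
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)
```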

Looking forward to your reply.
