Why is the inference speed of the quantized model using QAT so slow? #1050

Open
@elfisworking

Description

I quantized a model using the torchtune package.
The test log shows: `INFO:torchtune.utils._logging:Time for inference: 66.56 sec total, 4.51 tokens/sec`
4.51 tokens/sec is even lower than the unquantized model's throughput.
Is this normal for QAT? Thank you very much if someone is willing to answer.
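For context (not from the original issue): quantization only speeds up inference when the runtime has optimized low-precision kernels for the target hardware; otherwise the extra quantize/dequantize work can make the quantized model slower than fp32. A minimal sketch below illustrates this effect using PyTorch's stock `torch.ao.quantization.quantize_dynamic` API on a toy model, not torchtune's QAT path, and the observed speedup or slowdown will depend entirely on the CPU backend (e.g. fbgemm vs. qnnpack):

```python
import time

import torch
import torch.nn as nn

# Toy linear stack standing in for a transformer block (illustration only,
# not the torchtune model from the issue).
model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)]).eval()

# Dynamic int8 quantization of the Linear layers. Whether this runs faster
# than fp32 depends on the backend kernels available for this CPU; on
# unsupported hardware it can be slower, mirroring the issue above.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def bench(m: nn.Module, iters: int = 50) -> float:
    """Return average forward-pass time in seconds."""
    x = torch.randn(1, 1024)
    with torch.inference_mode():
        m(x)  # warmup
        start = time.perf_counter()
        for _ in range(iters):
            m(x)
    return (time.perf_counter() - start) / iters

print(f"fp32: {bench(model) * 1e3:.2f} ms/iter")
print(f"int8: {bench(qmodel) * 1e3:.2f} ms/iter")
```

Comparing the two printed timings on the machine in question would show whether int8 matmuls are actually faster there, which is one plausible explanation for the slowdown reported above.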
