Hardware support is required to achieve better performance with quantization on GPUs.
You need a device that supports Tensor Core INT8 computation, such as the NVIDIA T4 or A100.
ONNX Runtime leverages the TensorRT Execution Provider for quantization on GPU.
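As a rough illustration, the TensorRT Execution Provider is selected per session through the `providers` argument of `onnxruntime.InferenceSession`, with INT8 switched on through the provider option `trt_int8_enable`. The sketch below assumes an FP32 model plus a previously generated INT8 calibration table. The file names `model.onnx` and `calibration.flatbuffers` and the helper `build_trt_int8_providers` are hypothetical. It lists CUDA and CPU after TensorRT so that nodes TensorRT cannot handle still run. Check the TensorRT EP documentation for how your onnxruntime version resolves the calibration table path.

```python
import os

# Hypothetical file names, used only for illustration.
MODEL_PATH = "model.onnx"
CALIBRATION_TABLE = "calibration.flatbuffers"


def build_trt_int8_providers(calibration_table=None, device_id=0):
    """Hypothetical helper: build an ONNX Runtime provider list that prefers
    TensorRT with INT8 enabled, falling back to CUDA and then CPU."""
    trt_options = {
        "device_id": device_id,
        "trt_int8_enable": True,  # let TensorRT build INT8 kernels (needs Tensor Core INT8 hardware)
    }
    if calibration_table is not None:
        # INT8 calibration table used for non-QDQ (full-precision) models.
        trt_options["trt_int8_calibration_table_name"] = calibration_table
    return [
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ]


if __name__ == "__main__":
    providers = build_trt_int8_providers(CALIBRATION_TABLE)
    try:
        import onnxruntime as ort
    except ImportError:
        ort = None

    if ort is None:
        print("onnxruntime is not installed")
    elif "TensorrtExecutionProvider" not in ort.get_available_providers():
        print("this onnxruntime build has no TensorRT Execution Provider")
    elif not os.path.exists(MODEL_PATH):
        print(f"model not found: {MODEL_PATH}")
    else:
        session = ort.InferenceSession(MODEL_PATH, providers=providers)
        # Shows which providers were actually attached to the session.
        print(session.get_providers())
```

On a GPU without Tensor Core INT8 support, enabling INT8 is unlikely to bring a speedup, which is why the hardware requirement above matters.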
Seonglae Cho