The goal of these steps is to improve quantization quality.
- Symbolic shape inference. This is best suited for transformer models.
- Model optimization: This step uses the ONNX Runtime native library to rewrite the computation graph, merging computation nodes and eliminating redundancies to improve runtime efficiency.
A known limitation of ONNX Runtime is that model optimization cannot produce an output model larger than 2 GB, so for large models this step must be skipped.
- ONNX shape inference.
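The 2 GB caveat above comes from the protobuf serialization format that ONNX models use, which caps a single message at 2^31 - 1 bytes. A minimal sketch of that back-of-the-envelope check (the model sizes and the helper function are hypothetical, for illustration only):

```python
# ONNX models are serialized as a single protobuf message, which is
# capped at 2**31 - 1 bytes; a model whose weights alone exceed this
# cannot be written back out after graph optimization.
PROTOBUF_LIMIT = 2**31 - 1  # ~2 GB

def exceeds_protobuf_limit(param_count: int, bytes_per_param: int = 4) -> bool:
    """Return True if the serialized weights alone would break the 2 GB cap.

    Assumes fp32 weights (4 bytes per parameter) by default; this is a
    rough lower bound, since the graph structure adds further bytes.
    """
    return param_count * bytes_per_param > PROTOBUF_LIMIT

# A 7B-parameter fp32 model is far over the cap; a 100M-parameter one is not.
print(exceeds_protobuf_limit(7_000_000_000))  # True
print(exceeds_protobuf_limit(100_000_000))    # False
```

For models over this limit, skipping the optimization step (and relying on symbolic and ONNX shape inference alone) avoids the serialization failure.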