The goal of these steps is to improve quantization quality.
- Symbolic shape inference. This is best suited for transformer models.
- Model optimization: This step uses the ONNX Runtime native library to rewrite the computation graph, merging computation nodes and eliminating redundancies to improve runtime efficiency.
A known limitation of ONNX Runtime is that model optimization cannot produce an output model larger than 2 GB, so for large models this step must be skipped.
- ONNX shape inference.
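The 2 GB caveat above comes from the protobuf serialization format that ONNX models use, which caps a single message at 2^31 - 1 bytes. A minimal sketch of that back-of-the-envelope check (the model sizes and the helper function are hypothetical, for illustration only):

```python
# ONNX models are serialized as a single protobuf message, which is
# capped at 2**31 - 1 bytes; a model whose weights alone exceed this
# cannot be written back out after graph optimization.
PROTOBUF_LIMIT = 2**31 - 1  # ~2 GB

def exceeds_protobuf_limit(param_count: int, bytes_per_param: int = 4) -> bool:
    """Return True if the serialized weights alone would break the 2 GB cap.

    Assumes fp32 weights (4 bytes per parameter) by default; this is a
    rough lower bound, since the graph structure adds further bytes.
    """
    return param_count * bytes_per_param > PROTOBUF_LIMIT

# A 7B-parameter fp32 model is far over the cap; a 100M-parameter one is not.
print(exceeds_protobuf_limit(7_000_000_000))  # True
print(exceeds_protobuf_limit(100_000_000))    # False
```

For models over this limit, skipping the optimization step (and relying on symbolic and ONNX shape inference alone) avoids the serialization failure.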