The benchmarks indicate that AWQ quantization is the fastest for text-generation inference and has the lowest peak memory during generation. However, AWQ has the largest forward latency at each batch size.
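
For context, a minimal sketch of running an AWQ-quantized checkpoint through 🤗 Transformers. This assumes `transformers` and `autoawq` are installed and a CUDA GPU is available; the model id is illustrative, not the benchmark's exact setup:

```python
# Sketch: generate text from an AWQ-quantized checkpoint with 🤗 Transformers.
# Assumes `autoawq` is installed; the checkpoint name is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-v0.1-AWQ"  # illustrative AWQ checkpoint on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization trades accuracy for", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```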

Model Quantization Algorithms
4-bit or 8-bit
The case for 4-bit precision: k-bit Inference Scaling Laws
Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for smaller memory footprints and inference latencies. However, the final model...
https://arxiv.org/abs/2212.09720
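
To make the 4-bit vs. 8-bit trade-off concrete, a minimal sketch of loading a model at either precision with bitsandbytes through 🤗 Transformers. The model id is illustrative; this assumes `bitsandbytes` is installed and a CUDA GPU is available:

```python
# Sketch: contrast 8-bit and 4-bit weight quantization via BitsAndBytesConfig.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # illustrative; any causal LM on the Hub

# 8-bit: LLM.int8()-style weight quantization
config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit: NF4 weights with bf16 compute, as popularized by QLoRA
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=config_4bit, device_map="auto"
)
print(model.get_memory_footprint())  # rough check of the memory saving
```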

Introduction to Quantization cooked in 🤗 with 💗🧑‍🍳
A blog post by Merve Noyan on Hugging Face
https://huggingface.co/blog/merve/quantization
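
The core idea the post walks through can be shown with toy absmax (symmetric) 8-bit quantization. This is a simplified sketch of the concept, not any library's exact implementation:

```python
# Toy absmax quantization: map the largest magnitude to int8's range, then invert.
import torch

def absmax_quantize(x: torch.Tensor):
    scale = 127 / x.abs().max()           # largest magnitude maps to 127
    q = (scale * x).round().to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) / scale    # approximate reconstruction

x = torch.randn(4)
q, scale = absmax_quantize(x)
print(x, dequantize(q, scale))            # small rounding error per element
```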
Overview of natively supported quantization schemes in 🤗 Transformers
https://huggingface.co/blog/overview-quantization-transformers

Seonglae Cho