The benchmarks indicate that AWQ quantization is the fastest for inference and text generation, and has the lowest peak memory for text generation. However, AWQ has the largest forward latency across batch sizes.
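To show how an AWQ checkpoint is consumed in practice, the sketch below loads a pre-quantized model through the transformers API. The checkpoint name is only an example, and the snippet assumes the autoawq package is installed alongside transformers and that a CUDA GPU is available.

```python
# Minimal sketch: loading a pre-quantized AWQ checkpoint with transformers.
# Assumes `pip install transformers autoawq` and a CUDA-capable GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example AWQ checkpoint; substitute any AWQ-quantized model from the Hub.
model_id = "TheBloke/Mistral-7B-v0.1-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The quantization config stored in the checkpoint tells transformers to use
# the AWQ kernels; no extra arguments are needed for a pre-quantized model.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization reduces memory by", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```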
Model Quantization Algorithms
[Comparison table: supported quantization algorithms at 4-bit or 8-bit precision]
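Because the algorithms above differ in whether they quantize to 4-bit or 8-bit, the sketch below shows one common way to request either precision at load time with bitsandbytes in transformers. The model name is an example, and the flags assume a recent transformers release with bitsandbytes and accelerate installed.

```python
# Minimal sketch: on-the-fly 4-bit or 8-bit quantization with bitsandbytes.
# Assumes `pip install transformers accelerate bitsandbytes` and a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "facebook/opt-350m"  # example model; substitute your own

# 8-bit loading: weights are stored as int8.
config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit loading: NF4 weight format with bfloat16 compute, a common configuration.
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=config_4bit,  # or config_8bit
    device_map="auto",
)
print(model.get_memory_footprint())  # rough check of the quantized model's size
```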