Model Quantization Algorithm

Created: 2023 Jul 5 8:02
Editor: Seonglae Cho
Creator: Seonglae Cho
Edited: 2024 Oct 18 22:30
Refs: Hessian Matrix
The benchmarks indicate that AWQ quantization is the fastest for inference and text generation, and has the lowest peak memory for text generation. However, AWQ has the largest forward latency per batch size.
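To try the inference side of this comparison, a pre-quantized AWQ checkpoint can be loaded directly through 🤗 Transformers. A minimal sketch, assuming transformers and autoawq are installed; the model id below is a hypothetical placeholder for any AWQ checkpoint on the Hub.

```python
# Minimal sketch: generate text with an AWQ-quantized checkpoint.
# Assumes `pip install transformers autoawq`; "some-org/some-model-AWQ"
# is a hypothetical placeholder, not a real checkpoint id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-model-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization trades accuracy for", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```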
Model Quantization Algorithms
GPTQ
AWQ
SparseGPT
LUT-GEMM
BCQ
SpQR
HAWQ
ZeroQuant
BitNet
NF4
VPTQ
SmoothQuant
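The algorithms above all refine the same baseline: round-to-nearest quantization with a scale factor. A minimal illustrative sketch of per-tensor absmax int8 quantization, which methods like GPTQ, AWQ, and SpQR improve on by using calibration data or Hessian information to choose scales and rounding:

```python
# Minimal sketch: per-tensor absmax round-to-nearest int8 quantization,
# the baseline these algorithms refine (illustrative, not any one method).
import torch

def quantize_absmax_int8(w: torch.Tensor):
    scale = w.abs().max() / 127.0                      # per-tensor scale
    q = (w / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(4, 4)
q, scale = quantize_absmax_int8(w)
print((w - dequantize(q, scale)).abs().max())          # quantization error
```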
 
 
 

4-bit or 8-bit

The case for 4-bit precision: k-bit Inference Scaling Laws
Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for smaller memory footprints and inference latencies. However, the final model...
https://arxiv.org/abs/2212.09720
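Following the paper's conclusion that 4-bit is usually the best accuracy-per-bit trade-off for inference, a minimal sketch of 4-bit NF4 loading with bitsandbytes via 🤗 Transformers; the model id is a hypothetical placeholder.

```python
# Minimal sketch: 4-bit NF4 loading via bitsandbytes.
# Assumes `pip install transformers bitsandbytes accelerate`.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls computed in bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-model",  # hypothetical placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```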
Introduction to Quantization cooked in 🤗 with 💗🧑‍🍳
A blog post by Merve Noyan on Hugging Face
https://huggingface.co/blog/merve/quantization
Overview of natively supported quantization schemes in 🤗 Transformers
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
https://huggingface.co/blog/overview-quantization-transformers
 
 
 


Copyright Seonglae Cho