GPTQ

Creator: Seonglae Cho
Created: 2023 Jun 7 16:08
Edited: 2023 Dec 9 6:43

One-shot weight quantization method

A post-training quantization technique in which each row of the weight matrix is quantized independently to find a version of the weights that minimizes quantization error. The weights are quantized to int4 but restored to fp16 on the fly during inference. This cuts memory usage by roughly 4x, because the int4 weights are dequantized in a fused kernel rather than in a GPU's global memory, and inference can also speed up because the lower bitwidth takes less time to communicate.
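As a rough illustration, the Hugging Face transformers integration (with the auto-gptq backend assumed installed) exposes this workflow through GPTQConfig; the model id and calibration dataset below are placeholder choices for the sketch, not a prescription:

```python
# Minimal sketch: GPTQ post-training quantization via transformers + auto-gptq.
# "facebook/opt-125m" and the "c4" calibration set are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit weights; a calibration dataset is used to minimize quantization error
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Weights are quantized to int4 once (post-training); at inference time they
# are dequantized back to fp16 on the fly inside fused kernels
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

# The quantized model can be saved and reloaded later without re-quantizing
model.save_pretrained("opt-125m-gptq")
```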
GPTQ Notion

GPTQ Usages

Recommendations