One-shot weight quantization method
A post-training quantization technique in which each row of the weight matrix is quantized independently, searching for quantized weights that minimize the output error relative to the full-precision layer. The weights are stored as int4 but restored to fp16 on the fly during inference. This reduces memory usage by roughly 4x, since the int4 weights are dequantized inside a fused kernel rather than in the GPU's global memory, and you can also expect an inference speedup because transferring the lower-bitwidth weights takes less time.
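The int4 round trip can be illustrated with a minimal round-to-nearest sketch (GPTQ itself goes further and redistributes the quantization error across the remaining weights in each row); the tensor shapes and function names below are illustrative, not part of any library API.

```python
import torch

def quantize_rows_int4(weight: torch.Tensor):
    # Symmetric per-row scale: map the largest-magnitude value in each row
    # onto the signed int4 range [-8, 7].
    w = weight.float()
    scales = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)  # int4 values, stored in int8 here
    return q, scales

def dequantize_rows(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # Restore fp16 weights from the int4 values and per-row scales.
    # In a real deployment this happens on the fly inside a fused kernel,
    # so the fp16 matrix never has to be materialized in global memory.
    return (q.float() * scales).to(torch.float16)

w = torch.randn(4096, 4096, dtype=torch.float16)
q, scales = quantize_rows_int4(w)
w_hat = dequantize_rows(q, scales)
print((w.float() - w_hat.float()).abs().mean())  # average round-trip quantization error
```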
GPTQ Notion
GPTQ Usage