One-shot weight quantization method
A post-training quantization technique where each row of the weight matrix is quantized independently, finding a quantized version of the weights that minimizes the output error relative to the full-precision layer. The weights are quantized to int4 but restored to fp16 on the fly during inference. This cuts memory usage by roughly 4x because the int4 weights are dequantized inside a fused kernel rather than in the GPU's global memory, and inference can also speed up because the lower bitwidth means less data to move between memory and compute.
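A minimal sketch of 4-bit GPTQ quantization through the transformers + AutoGPTQ integration covered in the Hugging Face blog linked below; the model id and calibration dataset here are illustrative choices, not fixed requirements.

```python
# Sketch: quantize a causal LM to int4 with GPTQ via transformers' AutoGPTQ integration.
# Requires optimum and auto-gptq to be installed; model id and dataset are example choices.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small model chosen only as an example
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ uses a small calibration set to estimate the per-layer quantization error it minimizes.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Weights are quantized to int4 at load time; during inference they are
# dequantized back to fp16 on the fly inside fused kernels.
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto",
)

# The quantized model is used like any other transformers model.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(quantized_model.device)
out = quantized_model.generate(**inputs)
print(tokenizer.decode(out[0]))
```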
GPTQ Notion
GPTQ Usages
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high...
https://arxiv.org/abs/2210.17323

gptq
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
https://pypi.org/project/gptq/

Making LLMs lighter with AutoGPTQ and transformers
https://huggingface.co/blog/gptq-integration

Seonglae Cho