The benchmarks indicate that AWQ quantization is the fastest for text-generation inference and has the lowest peak memory during generation. However, AWQ has the largest forward latency at each batch size.
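
For context, a minimal sketch of running an AWQ-quantized checkpoint through 🤗 Transformers. This assumes `transformers` and `autoawq` are installed and a CUDA GPU is available; the model id is illustrative, not the benchmark's exact setup:

```python
# Sketch: generate text from an AWQ-quantized checkpoint with 🤗 Transformers.
# Assumes `autoawq` is installed; the checkpoint name is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-v0.1-AWQ"  # illustrative AWQ checkpoint on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Quantization trades accuracy for", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```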

Model Quantization Algorithms
4-bit or 8-bit
The case for 4-bit precision: k-bit Inference Scaling Laws
Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for smaller memory footprints and inference latencies. However, the final model...
https://arxiv.org/abs/2212.09720
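
To make the 4-bit vs. 8-bit trade-off concrete, a minimal sketch of loading a model at either precision with bitsandbytes through 🤗 Transformers. The model id is illustrative; this assumes `bitsandbytes` is installed and a CUDA GPU is available:

```python
# Sketch: contrast 8-bit and 4-bit weight quantization via BitsAndBytesConfig.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "facebook/opt-1.3b"  # illustrative; any causal LM on the Hub

# 8-bit: LLM.int8()-style weight quantization
config_8bit = BitsAndBytesConfig(load_in_8bit=True)

# 4-bit: NF4 weights with bf16 compute, as popularized by QLoRA
config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=config_4bit, device_map="auto"
)
print(model.get_memory_footprint())  # rough check of the memory saving
```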

Introduction to Quantization cooked in 🤗 with 💗🧑‍🍳
A blog post by Merve Noyan on Hugging Face
https://huggingface.co/blog/merve/quantization
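
The core idea the post walks through can be shown with toy absmax (symmetric) 8-bit quantization. This is a simplified sketch of the concept, not any library's exact implementation:

```python
# Toy absmax quantization: map the largest magnitude to int8's range, then invert.
import torch

def absmax_quantize(x: torch.Tensor):
    scale = 127 / x.abs().max()           # largest magnitude maps to 127
    q = (scale * x).round().to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) / scale    # approximate reconstruction

x = torch.randn(4)
q, scale = absmax_quantize(x)
print(x, dequantize(q, scale))            # small rounding error per element
```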
Overview of natively supported quantization schemes in 🤗 Transformers
https://huggingface.co/blog/overview-quantization-transformers

Seonglae Cho