Model Quantization

Creator: Seonglae Cho
Created: 2023 Jun 7 16:06
Editor: Seonglae Cho
Edited: 2024 Dec 20 00:10
Refs: Vector Quantization, Model Compression

Mapping weights to low-precision bit representations reduces memory usage and model size and improves inference speed.

  • Not every layer can be quantized
  • Not every model reacts the same way to quantization
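The memory savings come from storing each weight in fewer bits. A minimal sketch of absmax int8 quantization (illustrative only, not any specific library's implementation): scale by the largest absolute weight so values fit the int8 range, then round.

```python
def quantize_int8(weights):
    # Absmax quantization: scale floats into the int8 range [-127, 127].
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    # Approximate reconstruction: each value is off by at most half a step.
    return [v * scale for v in q]

w = [0.5, -1.2, 0.03, 2.4]
q, s = quantize_int8(w)       # ints in [-127, 127] plus one float scale
w_hat = dequantize(q, s)
```

Each weight now occupies 1 byte instead of 4, at the cost of a bounded rounding error — which is why some layers and models tolerate quantization worse than others.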
  • Model Quantization Notion
  • Model Quantization Method
  • Model Quantization Type
  • Quantization Error
  • Double Quantization
  • Residual Vector Quantization
  • Model Quantization Usages
  • Model Quantization Algorithm
  • Model Quantization Tool
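Of the methods listed above, Residual Vector Quantization has a compact core idea: each stage quantizes the residual error left by the previous stage. A minimal sketch (scalar codebooks for clarity; real RVQ quantizes vectors against learned codebooks):

```python
def nearest(codebook, x):
    # Pick the codeword closest to x.
    return min(codebook, key=lambda c: abs(c - x))

def rvq_encode(x, codebooks):
    # Residual quantization: each stage encodes what the previous stage missed,
    # so the sum of the chosen codewords approximates x.
    codes, residual = [], x
    for cb in codebooks:
        c = nearest(cb, residual)
        codes.append(c)
        residual -= c
    return codes, residual

# Coarse-to-fine codebooks (hypothetical values for illustration)
codebooks = [[-4.0, 0.0, 4.0], [-1.0, 0.0, 1.0], [-0.25, 0.0, 0.25]]
codes, err = rvq_encode(3.3, codebooks)
```

Each extra stage shrinks the quantization error, which is why stacking small codebooks can match one much larger codebook at a fraction of the storage.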

GPU Memory with quantization

Calculating GPU memory for serving LLMs | Substratus.AI
How many GPUs do I need to be able to serve Llama 70B? In order to answer that, you need to know how much GPU memory will be required by the Large Language Model.
https://www.substratus.ai/blog/calculating-gpu-memory-for-llm
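A common rule of thumb for this estimate (as in the linked post) is M = (P × 4 bytes) / (32 / Q) × 1.2, where P is the parameter count in billions, Q the bits per weight, and 1.2 a serving-overhead factor (e.g. KV cache):

```python
def serving_memory_gb(params_billion, bits, overhead=1.2):
    # Rule-of-thumb estimate: full-precision size (4 bytes/param) scaled
    # down by the quantization ratio, plus ~20% serving overhead.
    return params_billion * 4 / (32 / bits) * overhead

fp16 = serving_memory_gb(70, 16)  # Llama 70B at 16-bit: about 168 GB
int4 = serving_memory_gb(70, 4)   # Llama 70B at 4-bit:  about 42 GB
```

The quantized model fits on far fewer GPUs, which is the main practical motivation for quantized serving.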
Quantization (Hugging Face Transformers docs)
https://huggingface.co/docs/transformers/main/en/quantization
Quantization (Hugging Face Optimum concept guide)
https://huggingface.co/docs/optimum/concept_guides/quantization

Copyright Seonglae Cho