Model Quantization

Creator: Seonglae Cho
Created: 2023 Jun 7 16:06
Editor: Seonglae Cho
Edited: 2024 Dec 20 00:10
Refs: Vector Quantization, Model Compression

Mapping weights to low-precision bit representations reduces memory usage and model size and improves inference speed.

  • Not every layer can be quantized
  • Not every model reacts the same way to quantization
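The memory savings come from storing each weight in fewer bits. A minimal sketch of absmax int8 quantization (illustrative only, not any specific library's implementation): scale by the largest absolute weight so values fit the int8 range, then round.

```python
def quantize_int8(weights):
    # Absmax quantization: scale floats into the int8 range [-127, 127].
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    # Approximate reconstruction: each value is off by at most half a step.
    return [v * scale for v in q]

w = [0.5, -1.2, 0.03, 2.4]
q, s = quantize_int8(w)       # ints in [-127, 127] plus one float scale
w_hat = dequantize(q, s)
```

Each weight now occupies 1 byte instead of 4, at the cost of a bounded rounding error — which is why some layers and models tolerate quantization worse than others.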
  • Model Quantization Notion
  • Model Quantization Method
  • Model Quantization Type
  • Quantization Error
  • Double Quantization
  • Residual Vector Quantization
  • Model Quantization Usages
  • Model Quantization Algorithm
  • Model Quantization Tool
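Of the methods listed above, Residual Vector Quantization has a compact core idea: each stage quantizes the residual error left by the previous stage. A minimal sketch (scalar codebooks for clarity; real RVQ quantizes vectors against learned codebooks):

```python
def nearest(codebook, x):
    # Pick the codeword closest to x.
    return min(codebook, key=lambda c: abs(c - x))

def rvq_encode(x, codebooks):
    # Residual quantization: each stage encodes what the previous stage missed,
    # so the sum of the chosen codewords approximates x.
    codes, residual = [], x
    for cb in codebooks:
        c = nearest(cb, residual)
        codes.append(c)
        residual -= c
    return codes, residual

# Coarse-to-fine codebooks (hypothetical values for illustration)
codebooks = [[-4.0, 0.0, 4.0], [-1.0, 0.0, 1.0], [-0.25, 0.0, 0.25]]
codes, err = rvq_encode(3.3, codebooks)
```

Each extra stage shrinks the quantization error, which is why stacking small codebooks can match one much larger codebook at a fraction of the storage.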

GPU Memory with quantization

Calculating GPU memory for serving LLMs | Substratus.AI
How many GPUs do I need to be able to serve Llama 70B? In order to answer that, you need to know how much GPU memory will be required by the Large Language Model.
https://www.substratus.ai/blog/calculating-gpu-memory-for-llm
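A common rule of thumb for this estimate (as in the linked post) is M = (P × 4 bytes) / (32 / Q) × 1.2, where P is the parameter count in billions, Q the bits per weight, and 1.2 a serving-overhead factor (e.g. KV cache):

```python
def serving_memory_gb(params_billion, bits, overhead=1.2):
    # Rule-of-thumb estimate: full-precision size (4 bytes/param) scaled
    # down by the quantization ratio, plus ~20% serving overhead.
    return params_billion * 4 / (32 / bits) * overhead

fp16 = serving_memory_gb(70, 16)  # Llama 70B at 16-bit: about 168 GB
int4 = serving_memory_gb(70, 4)   # Llama 70B at 4-bit:  about 42 GB
```

The quantized model fits on far fewer GPUs, which is the main practical motivation for quantized serving.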
Quantization (Hugging Face Transformers docs)
https://huggingface.co/docs/transformers/main/en/quantization
Quantization (Hugging Face Optimum concept guide)
https://huggingface.co/docs/optimum/concept_guides/quantization

Copyright Seonglae Cho