GGML

Creator

Created

2023 May 10 16:59

Editor

Edited

2025 Jan 23 21:43

Refs

ggerganov • Updated 2023 May 10 16:58

Model Quantization

CPU + GPU

GGML Models

GGML model files follow a naming convention: “q”, then the number of bits used to store each weight (the precision), then a variant suffix. The descriptions below are based on the model cards written by TheBloke.

q2_k: Uses Q4_K for the attention.wv and feed_forward.w2 tensors, else Q2_K

q3_k_l: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K

q3_k_m: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K

q3_k_s: Uses Q3_K for all tensors

q4_0: Original quant method, 4-bit.

q4_1: Higher accuracy than q4_0, but not as high as q5_0. Inference is faster than with the q5 models.

q4_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K

q4_k_s: Uses Q4_K for all tensors

q5_0: Higher accuracy than the q4 methods, at the cost of higher resource usage and slower inference.

q5_1: Even higher accuracy than q5_0, with correspondingly higher resource usage and slower inference.

q5_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K

q5_k_s: Uses Q5_K for all tensors

q6_k: Uses Q6_K (6-bit quantization) for all tensors. The model cards list this as Q8_K.

q8_0: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.
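
The naming convention above is regular enough to parse mechanically. Below is a minimal sketch, assuming only the name shapes that appear in this list (`qN_0`, `qN_1`, `qN_k`, `qN_k_s`, `qN_k_m`, `qN_k_l`). The `QuantName` type and `parse_quant_name` function are hypothetical helpers written for illustration; they are not part of GGML or llama.cpp.

```python
import re
from typing import NamedTuple, Optional


class QuantName(NamedTuple):
    """Parsed form of a quantization name (hypothetical helper type)."""
    bits: int               # nominal bits per weight, e.g. 4 for q4_0
    k_quant: bool           # True for the "_k" family, False for q4_0-style names
    variant: Optional[str]  # "s", "m", "l" for k-quants; "0" or "1" otherwise


# "q" + bits + either "_k" with an optional "_s"/"_m"/"_l", or "_0"/"_1".
_PATTERN = re.compile(r"^q(\d+)_(?:(k)(?:_([sml]))?|([01]))$", re.IGNORECASE)


def parse_quant_name(name: str) -> QuantName:
    """Split a name such as 'q4_k_m' into bits, family, and variant."""
    match = _PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognised quantization name: {name!r}")
    bits, k, size, legacy = match.groups()
    if k:
        # k-quant: the size suffix is optional (q2_k and q6_k have none).
        return QuantName(int(bits), True, size.lower() if size else None)
    # Non-k name: the trailing digit is the variant.
    return QuantName(int(bits), False, legacy)


if __name__ == "__main__":
    for example in ["q2_k", "q3_k_l", "q4_0", "q5_1", "q6_k", "q8_0"]:
        print(example, parse_quant_name(example))
```

The parser only recovers what the name encodes. The mixed tensor types listed above (for example Q6_K for part of a q4_k_m model) are not derivable from the name and would need a separate lookup table.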

https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html

ML Blog - Quantize Llama models with GGUF and llama.cpp

GGML vs. GPTQ vs. NF4
