Quantization Primer¶
What you'll learn
- int8 / int4 quantization — compressing fp16 weights into integers
- What symmetric vs asymmetric and per-tensor vs per-channel actually mean
- Post-training quantization (PTQ) in one pass — the simplest path
- Shrinking memory to 1/4–1/8 of the fp32 size (int8 and int4 respectively) while keeping accuracy loss to roughly 5–10% or less
Prerequisites
1. Concept — Fewer Bits¶
| Format | Bytes per weight | Representable range | Typical accuracy loss |
|---|---|---|---|
| fp32 | 4 | ±3.4×10³⁸ | 0 (baseline) |
| fp16 | 2 | ±6.5×10⁴ | <1% |
| int8 | 1 | -128 to 127 (256 values) | 2–5% |
| int4 | 0.5 | -8 to 7 (16 values) | 5–15% |
| int2 | 0.25 | -2 to 1 (4 values) | 30%+ (not practical) |
The 10M model from this book: fp16 = 20 MB → int4 = 5 MB. Very lightweight for mobile and laptops.
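The sizes quoted above follow directly from the bytes-per-weight column. A minimal sketch of that arithmetic, assuming weights only (no scales or other metadata) and 1 MB = 10^6 bytes; the helper name `weight_megabytes` is made up for illustration:

```python
# Bytes per weight for each format, taken from the table above.
BYTES_PER_WEIGHT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_megabytes(n_params: int, fmt: str) -> float:
    """Weight storage in MB (10^6 bytes), ignoring scales and other metadata."""
    return n_params * BYTES_PER_WEIGHT[fmt] / 1e6

# The book's 10M-parameter model in each format.
for fmt in BYTES_PER_WEIGHT:
    print(f"{fmt}: {weight_megabytes(10_000_000, fmt):.1f} MB")
```

Real quantized files come out slightly larger because each row or group also stores a scale.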
Quantization formula (symmetric)¶
- s (scale): ratio of the weight's absolute max to the integer max
- q: the integer value
- Dequantize with: x ≈ q × s
Loss happens at the rounding step. Narrower, more uniform distributions lose less.
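Written out, the symmetric scheme described above can be sketched as follows. The symbols x, s and q follow the text; the name q_max (127 for int8, 7 for int4) and the hat on the reconstructed value are notation added here for illustration.

```latex
\begin{aligned}
s &= \frac{\max_i |x_i|}{q_{\max}}, \\
q_i &= \operatorname{round}\!\left(\frac{x_i}{s}\right), \\
\hat{x}_i &= q_i \, s \approx x_i .
\end{aligned}
```

A worked example with made-up numbers: if the largest absolute weight is 2.54 and the target is int8, then s = 2.54 / 127 = 0.02. A weight x = 0.507 gives x / s = 25.35, which rounds to q = 25 and dequantizes to 0.50. The error of 0.007 comes entirely from the rounding step and can never exceed s / 2 = 0.01.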
2. Why It Matters — Memory, Speed, Power¶
| Device | fp16 model limit | int4 model limit |
|---|---|---|
| Mobile (4 GB) | 1B | 7B |
| Laptop (16 GB) | 7B | 30B |
| Colab T4 (16 GB) | 7B | 30B |
| A100 80 GB | 40B | 160B |
Without quantization, you can't fit large models on small devices. The 10M model in this book fits anywhere even without it, but when you move to 1B+ models in the capstone, quantization becomes essential.
Quantization also speeds up inference — int matmul is roughly 2× faster than fp16, when hardware supports it.
3. Where It's Used — Four Variants¶
3.1 Per-tensor vs Per-channel¶
- Per-tensor: one scale for the entire weight matrix. Simple, but loses more.
- Per-channel: separate scale per row (or column). More precise, more metadata.
The standard is per-channel.
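The difference is easiest to see on a matrix where one row is much larger than the rest. The following is a rough sketch in NumPy (used here as a stand-in for the book's PyTorch); the function name, matrix shape, and the factor of 50 are all assumptions chosen for illustration.

```python
import numpy as np

def quantize_dequantize(w: np.ndarray, axis=None, qmax: int = 127) -> np.ndarray:
    """Symmetric round trip.
    axis=None -> one scale for the whole tensor; axis=1 -> one scale per row."""
    absmax = np.max(np.abs(w), axis=axis, keepdims=True)
    scale = np.maximum(absmax, 1e-12) / qmax      # guard against all-zero input
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(8, 64))
w[0] *= 50.0                                      # one row with much larger values

# Mean absolute round-trip error for each row.
err_tensor = np.abs(w - quantize_dequantize(w, axis=None)).mean(axis=1)
err_channel = np.abs(w - quantize_dequantize(w, axis=1)).mean(axis=1)

print("per-tensor, small rows :", err_tensor[1:].mean())
print("per-channel, small rows:", err_channel[1:].mean())
```

With a single scale, the large row dictates the step size and the small rows land on only a few integer levels. With one scale per row, each row gets a step size that matches its own range.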
3.2 Symmetric vs Asymmetric¶
- Symmetric: scale only. Zero point = 0. Works when the weight distribution is centered at 0.
- Asymmetric: scale + zero point. Needed when activations (e.g., after ReLU) are skewed to one side.
The common pattern: weights use symmetric, activations use asymmetric.
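To see why skewed activations benefit from a zero point, the sketch below quantizes a ReLU-style tensor both ways. It uses NumPy rather than the book's PyTorch, and the function names and the int8 range of -128 to 127 for the asymmetric case are assumptions for illustration.

```python
import numpy as np

def symmetric_roundtrip(x: np.ndarray, qmax: int = 127) -> np.ndarray:
    """Scale only; zero point fixed at 0."""
    scale = max(float(np.abs(x).max()), 1e-12) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def asymmetric_roundtrip(x: np.ndarray, qmin: int = -128, qmax: int = 127) -> np.ndarray:
    """Scale plus zero point, so [min(x), max(x)] maps onto [qmin, qmax]."""
    lo, hi = float(x.min()), float(x.max())
    scale = max(hi - lo, 1e-12) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(0.0, 1.0, size=10_000), 0.0)  # ReLU output, all >= 0

err_sym = np.abs(acts - symmetric_roundtrip(acts)).mean()
err_asym = np.abs(acts - asymmetric_roundtrip(acts)).mean()
print(f"symmetric: {err_sym:.5f}  asymmetric: {err_asym:.5f}")
```

The symmetric version spends half of its integer range on negative values that never occur, so its step size is about twice as large.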
3.3 PTQ (Post-Training Quantization)¶
Apply quantization to a trained model with no additional training. Simplest approach. This is the path this book takes.
3.4 QAT (Quantization-Aware Training)¶
Simulate quantization during training so the model learns to tolerate the rounding. Lowest accuracy loss, but training costs more.
This book covers PTQ only — QLoRA in Part 7 is effectively PTQ + LoRA.
4. Minimal Example — int8 PTQ by Hand¶
- 127 is the int8 positive max. -128 is possible, but symmetric quantization normally uses ±127.
- Memory: weight (256·256·4 = 262 KB) → q (256·256·1 = 65 KB) + scale (256·2 = 512 B) = roughly 1/4.
Typical mean abs error: 0.0008 (under 1% of init weight magnitude). Slightly higher on trained models.
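The by-hand procedure these notes describe can be sketched as below. This is a NumPy stand-in, not the book's PyTorch listing: the random matrix with std 0.02 is an assumed substitute for a freshly initialized layer, so the printed error will not necessarily match the figure quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a freshly initialized 256x256 linear layer (assumed std = 0.02).
weight = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)

# One scale per row: the row's absolute max is mapped to 127.
absmax = np.abs(weight).max(axis=1, keepdims=True)
scale = np.maximum(absmax, 1e-8) / 127.0

# Quantize: divide, round, clamp to the symmetric range, store as int8.
q = np.clip(np.round(weight / scale), -127, 127).astype(np.int8)

# Dequantize and measure the round-trip error.
w_hat = q.astype(np.float32) * scale
mean_abs_err = float(np.abs(weight - w_hat).mean())

# Memory: fp32 weights vs int8 values plus one fp16 scale per row.
bytes_fp32 = weight.nbytes
bytes_int8 = q.nbytes + scale.astype(np.float16).nbytes

print(f"mean abs error: {mean_abs_err:.6f}")
print(f"memory: {bytes_fp32} B -> {bytes_int8} B ({bytes_int8 / bytes_fp32:.3f}x)")
```

The byte counts reproduce the arithmetic in the memory note: 262,144 B of fp32 becomes 65,536 B of int8 plus 512 B of scales.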
5. Real Example — int8/int4 on the Book's Model¶
- Skip the embedding layer — its impact is disproportionately large on small models.
- Simulation: real int8 matmul requires hardware support. The PyTorch approach is quantize → immediately dequantize → run in floating point. True int8 inference comes in the next chapter with GGUF.
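The two points above (skip the embedding, quantize then immediately dequantize) can be sketched as a pass over a dictionary of weights. This is an illustrative NumPy version, not the book's PyTorch code; the parameter names, shapes, and the `"embed"` keyword used to detect the embedding are all hypothetical.

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int = 8) -> np.ndarray:
    """Per-row symmetric quantize -> dequantize; the result stays floating point."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for int8, 7 for int4
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return (q * scale).astype(w.dtype)

def quantize_state(state: dict, bits: int = 8, skip=("embed",)) -> dict:
    """Fake-quantize every 2D weight whose name contains no skip keyword."""
    out = {}
    for name, w in state.items():
        is_matmul_weight = w.ndim == 2 and not any(k in name for k in skip)
        out[name] = fake_quantize(w, bits) if is_matmul_weight else w.copy()
    return out

rng = np.random.default_rng(0)
state = {  # hypothetical parameter names and shapes
    "embed.weight": rng.normal(0, 0.02, (100, 32)).astype(np.float32),
    "layers.0.attn.wq.weight": rng.normal(0, 0.02, (32, 32)).astype(np.float32),
    "layers.0.norm.gamma": np.ones(32, dtype=np.float32),
}
quantized = quantize_state(state, bits=8)

for name in state:
    changed = not np.array_equal(state[name], quantized[name])
    print(f"{name}: {'quantized' if changed else 'left as-is'}")
```

Because the output is floating point again, the model runs through the ordinary forward pass, which makes it possible to measure PPL before any integer kernels are involved.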
Results on the book's 10M model:
int8 is nearly lossless. int4 is still practical. Memory drops to 1/2 (int8) and 1/4 (int4) of the fp16 size.
6. int4 Quantization — Down to 16 Values¶
Half the bits of int8, which means 16 levels instead of 256. More loss, but still useful.
- PyTorch has no int4 dtype, so we store in int8 but only use values in the -8 to 7 range.
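Group-wise quantization — a separate scale every 128 elements. More precise than per-row, slightly more metadata. GGUF int4 formats use the same idea with smaller blocks (32 elements in llama.cpp's Q4_0); 128 is the usual group size in GPTQ and AWQ.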
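A minimal sketch of group-wise int4, assuming NumPy instead of the book's PyTorch and a symmetric range of -7 to 7 stored in an int8 array. The function names and the test matrix (one block of columns scaled up by 10) are made up for illustration.

```python
import numpy as np

def quantize_int4_groupwise(w: np.ndarray, group_size: int = 128):
    """Symmetric int4 with one scale per `group_size` consecutive elements of a row.
    Values are stored in an int8 array but stay within [-7, 7]."""
    rows, cols = w.shape
    assert cols % group_size == 0, "cols must be a multiple of group_size"
    groups = w.reshape(rows, cols // group_size, group_size)
    scale = np.maximum(np.abs(groups).max(axis=2, keepdims=True), 1e-8) / 7.0
    q = np.clip(np.round(groups / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_groupwise(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    """Multiply each group by its scale and restore the original 2D shape."""
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 512)).astype(np.float32)
w[:, :128] *= 10.0                       # one block of columns with larger values

q_g, s_g = quantize_int4_groupwise(w, group_size=128)
err_group = float(np.abs(w - dequantize_groupwise(q_g, s_g, w.shape)).mean())

# group_size equal to the row length is the same as a per-row scale.
q_r, s_r = quantize_int4_groupwise(w, group_size=512)
err_row = float(np.abs(w - dequantize_groupwise(q_r, s_r, w.shape)).mean())

print(f"group-wise (128): {err_group:.5f}   per-row: {err_row:.5f}")
```

The metadata cost is visible in the scale shapes: four scales per row for group size 128 versus one per row otherwise.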
7. Common Failure Points¶
1. Quantizing the embedding too — On a 10M model, embeddings make up ~30% of parameters. Quantizing them adds 5–10% extra PPL loss. Keep embeddings in fp16.
2. Quantizing RMSNorm gamma — It's a small 1D vector, so there's almost nothing to gain. Quantization targets 2D matmul weights only.
3. Using per-tensor only — When a matrix has both large and small values, the large ones set the scale and the small ones collapse onto a few integer levels. Per-channel / group-wise is the standard.
4. Applying asymmetric to weights — Weights follow a zero-centered distribution (RMSNorm + init). Asymmetric adds metadata with no benefit here.
5. Skipping evaluation after PTQ — If you don't re-measure PPL after int4 quantization, you won't know where things broke. Always compare PPL and generation samples before and after.
6. Forgetting KV cache quantization — For larger models with long contexts, KV cache memory can exceed weight memory, so the KV cache may need int8 quantization as well. llama.cpp in Ch 20 can quantize the cache too.
7. Confusing simulation with real inference — Dequantize → fp16 compute = fp16 speed (only memory is saved). Real int8 acceleration requires int8 kernels on the GPU/CPU — that's the next chapter with llama.cpp.
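The KV cache point in item 6 is easy to check with arithmetic. The sketch below uses a hypothetical 7B-class configuration (32 layers, 32 KV heads of dimension 128, no grouped-query attention) and a 32K context; these numbers are assumptions for illustration, not the book's model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2.0):
    """Keys and values: 2 tensors per layer, each batch x seq_len x n_kv_heads x head_dim."""
    return 2 * n_layers * batch * seq_len * n_kv_heads * head_dim * bytes_per_value

# Hypothetical 7B-class config, no grouped-query attention.
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)
weights_int4 = 7e9 * 0.5                          # about 3.5 GB of int4 weights

kv_fp16 = kv_cache_bytes(**cfg, seq_len=32_768, bytes_per_value=2.0)
kv_int8 = kv_cache_bytes(**cfg, seq_len=32_768, bytes_per_value=1.0)

print(f"int4 weights : {weights_int4 / 1e9:.1f} GB")
print(f"KV cache fp16: {kv_fp16 / 1e9:.1f} GB")
print(f"KV cache int8: {kv_int8 / 1e9:.1f} GB")
```

Under these assumptions the fp16 cache is several times larger than the int4 weights, which is why quantizing only the weights is not enough at long context lengths.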
8. Ops Checklist¶
Quantization decision gate:
- Measure baseline PPL (fp16)
- Apply int8 → compare PPL (within 5% is OK)
- Apply int4 → compare PPL (within 10% is OK)
- Confirm the embedding is excluded (quantize 2D matmul weights only)
- Use per-channel / group-wise
- Symmetric (weights) / asymmetric (activations, if needed)
- Compare 5 generated samples — check for capability loss beyond numbers
- Measure memory — confirm actual reduction ratio
- (Optional) Measure speed — check int quantization acceleration on your hardware
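The two PPL thresholds in this checklist can be turned into a small automated gate. A minimal sketch, with hypothetical function names and made-up perplexity values:

```python
def ppl_increase(baseline_ppl: float, quantized_ppl: float) -> float:
    """Relative perplexity increase over the fp16 baseline (0.05 means 5%)."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl

# Thresholds from the checklist above.
MAX_INCREASE = {"int8": 0.05, "int4": 0.10}

def passes_gate(fmt: str, baseline_ppl: float, quantized_ppl: float) -> bool:
    """True if the quantized model stays within the allowed PPL increase."""
    return ppl_increase(baseline_ppl, quantized_ppl) <= MAX_INCREASE[fmt]

# Made-up perplexities, for illustration only.
int8_ok = passes_gate("int8", baseline_ppl=20.0, quantized_ppl=20.6)  # 3% increase
int4_ok = passes_gate("int4", baseline_ppl=20.0, quantized_ppl=22.6)  # 13% increase
print(int8_ok, int4_ok)
```

A numeric gate covers only the PPL items; comparing generated samples still has to be done by reading them.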
9. Exercises¶
- Apply §5 int8 quantization to the book's 10M model. How does PPL change?
- Compare §6 int4 with group_size=128 vs group_size=64. A smaller group is more precise but uses more metadata.
- What's the PPL difference between quantizing the embedding vs. leaving it in fp16?
- Use the quantized model to regenerate the 5 fairytales from Ch 15. Can you see a difference?
- (Think about it) Same 1B model in int4 vs a 250M model in fp16 — similar memory footprint. Which performs better? Does the answer depend on the task?
References¶
- Dettmers et al. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv:2208.07339
- Frantar et al. (2022). GPTQ. arXiv:2210.17323
- Lin et al. (2023). AWQ. arXiv:2306.00978
- llama.cpp GGUF quantization specs — Q4_0, Q4_K_M, etc.
- HuggingFace bitsandbytes library docs