To run massive models (e.g., Llama 3 70B) on consumer hardware, we must use quantization: compressing model weights into lower-precision formats. In this lesson, we cover the technical trade-offs between quantization levels (Q2 to Q8) and how to select the right one for your agency's tasks.
| Level | Bits | Memory Savings (vs FP16) | Accuracy Loss | Best For |
|---|---|---|---|---|
| Q8_0 | 8-bit | 50% | Negligible | Critical reasoning, final drafting. |
| Q4_K_M | 4-bit | 75% | 1-2% | General discovery, lead scoring. |
| Q2_K | 2-bit | 85% | 10%+ | Hardware testing, simple summaries. |
VRAM needed (GB) ≈ (parameters in billions × bits per weight) / 8 + context buffer (GB). Dividing by 8 converts bits to bytes, so billions of parameters map directly to gigabytes.
Example: a 70B model at Q4 (4 bits per weight):
(70 × 4) / 8 = 35 GB for weights + 5 GB context buffer ≈ 40 GB VRAM total.
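The formula above is easy to script. A minimal sketch (the 5 GB context buffer is the example's assumption, not a fixed constant; real runtimes also add framework overhead):

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     context_gb: float = 5.0) -> float:
    """Estimate VRAM: weight storage plus a context/KV-cache buffer.

    Dividing bits by 8 converts to bytes, so billions of parameters
    map directly to gigabytes.
    """
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + context_gb

# 70B model at Q4 with a 5 GB context buffer
print(estimate_vram_gb(70, 4))  # 40.0
```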
Accuracy loss is measured by perplexity: how well the model predicts held-out text, where lower is better and a quantized model is compared against its full-precision baseline. Q4_K_M is the industry "sweet spot" because it cuts memory by roughly 75% while keeping the model's output quality almost indistinguishable from the full 16-bit version.
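Perplexity itself is just the exponential of the mean per-token negative log-likelihood over a test corpus. A sketch of the calculation (the input list of per-token losses is assumed to come from your evaluation harness):

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood).

    A quantization level is judged by how much this number rises
    relative to the FP16 baseline on the same corpus.
    """
    return math.exp(sum(token_nlls) / len(token_nlls))

baseline = perplexity([2.1, 1.9, 2.0])   # FP16 model
quantized = perplexity([2.2, 2.0, 2.1])  # Q4 model; small rise = small loss
print(quantized / baseline)
```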
Identify your current GPU's VRAM. Research 3 different models (8B, 14B, 32B) and determine the highest quantization level you can run for each while keeping a 4k token context window.