GPU VRAM vs. System RAM: The Inference Engine

In local LLM deployment, your GPU's VRAM (Video RAM) is the primary bottleneck for both response speed and context window size. In this lesson, we break down the hardware requirements for building a local "Laptop Server."

🏗️ The Memory Hierarchy

To run a model quickly, its entire set of weights must reside in the fastest memory available.

  1. VRAM (GPU): Roughly an order of magnitude more memory bandwidth than typical dual-channel system RAM. Essential for low-latency responses.
  2. Unified Memory (Apple Silicon): Shared between CPU/GPU. Allows for massive models (70B+) on a single chip.
  3. System RAM: The "Swap" space. If a model doesn't fit in VRAM, it spills over here, causing a 10x-50x speed drop.
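
Why does bandwidth dominate? During autoregressive decoding, every generated token must stream the full set of weights through the processor, so tokens-per-second is roughly memory bandwidth divided by model size. The sketch below is a back-of-envelope model (the bandwidth figures are typical spec-sheet values, not measurements):

```python
def estimate_decode_tps(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper-bound tokens/sec: how many times per second the weights
    can be streamed through memory."""
    return bandwidth_gb_s / model_size_gb

# A ~5 GB quantized 7B model on two memory tiers:
vram_tps = estimate_decode_tps(5.0, 450.0)  # mid-range GDDR6 GPU, ~450 GB/s
ram_tps = estimate_decode_tps(5.0, 60.0)    # dual-channel DDR5, ~60 GB/s
print(f"VRAM: ~{vram_tps:.0f} tok/s, system RAM: ~{ram_tps:.0f} tok/s")
```

Real throughput is lower than this ceiling, but the ratio explains the dramatic slowdown when a model spills out of VRAM.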

🛠️ Technical Snippet: VRAM Calculation for Quantized Models

A 7B-parameter model at 4-bit quantization (Q4_K_M averages closer to 5 bits per weight once format overhead is included, so budget ~0.7 bytes per weight) requires approximately: (7 billion parameters × 0.7 bytes per weight) + ~1 GB context buffer ≈ 6 GB VRAM.
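
The formula above is easy to wrap in a helper so you can size other models. This is my own sketch of the same arithmetic; the defaults encode the Q4_K_M assumptions from the text:

```python
def required_vram_gb(params_billion: float,
                     bytes_per_weight: float = 0.7,  # ~Q4_K_M incl. overhead
                     context_buffer_gb: float = 1.0) -> float:
    """Rough VRAM estimate: quantized weights plus a fixed context buffer."""
    weights_gb = params_billion * bytes_per_weight  # 1e9 params * bytes / 1e9
    return weights_gb + context_buffer_gb

print(required_vram_gb(7))    # ~5.9 GB for a 7B model
print(required_vram_gb(13))   # ~10.1 GB for a 13B model
```

Swap `bytes_per_weight` for ~2.0 to estimate an unquantized fp16 model, and the same function shows why a 7B fp16 model (~15 GB) won't fit on consumer cards.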


🔍 Nuance: The Context Buffer

As your chat gets longer, the "KV Cache" (the stored attention keys and values for every previous token) grows linearly. With 8GB of VRAM, a 7B model's weights and buffers consume roughly 6GB, leaving about 2GB for the KV Cache. That is enough for a context window of ~4k tokens before it overflows into slow system RAM.
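
The ~4k figure can be derived from the model's architecture. The sketch below assumes a Llama-2-7B-style shape (32 layers, 32 KV heads, head dimension 128, fp16 cache); newer models with grouped-query attention store far fewer KV heads, which shrinks the cache:

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """One K vector and one V vector per layer, per token (fp16 default)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def max_context_tokens(free_vram_gb: float, per_token_bytes: int) -> int:
    """How many tokens of KV cache fit in the leftover VRAM."""
    return int(free_vram_gb * 1e9 // per_token_bytes)

per_tok = kv_cache_bytes_per_token(32, 32, 128)  # 524288 B ≈ 0.5 MB/token
print(max_context_tokens(2.0, per_tok))          # ~3.8k tokens in 2 GB
```

At ~0.5 MB per token, 2 GB of free VRAM buys you just under 4k tokens of history, which matches the limit quoted above.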


⚡ Practice Lab: Hardware Benchmarking

  1. Identify: Open your Task Manager (Windows) or Activity Monitor (Mac). Find your "Dedicated GPU Memory." (On Apple Silicon, memory is unified, so note your total RAM instead.)
  2. Benchmark: Download LM Studio and load a Llama-3-8B-Q4_K_M model.
  3. Analyze: Run a long prompt and monitor the "Tokens Per Second" (TPS). Note when the speed drops as the context window fills.

📝 Homework: The PKR Economics of Compute

Compare the cost of an RTX 3060 (12GB VRAM) vs. an RTX 4060 (8GB VRAM) in the local Pakistani market. Which card is better for running a private "Lead Scoring" bot 24/7? Justify your choice in terms of VRAM capacity and the context window it allows.
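
For the 24/7 part of the homework, don't forget electricity. The wattages below are the cards' board-power specs; the PKR-per-kWh tariff is a placeholder assumption you must replace with your actual rate:

```python
def monthly_power_cost_pkr(card_watts: float, tariff_pkr_per_kwh: float,
                           hours: float = 24 * 30) -> float:
    """Electricity cost of running a card at full load for a month."""
    return card_watts / 1000 * hours * tariff_pkr_per_kwh

# TARIFF is a placeholder; look up your current residential/commercial rate.
TARIFF = 60  # PKR per kWh (assumed, not a quoted rate)
for name, watts in [("RTX 3060", 170), ("RTX 4060", 115)]:
    print(name, round(monthly_power_cost_pkr(watts, TARIFF)), "PKR/month")
```

Note that the 4060's lower power draw partially offsets the 3060's larger VRAM in a 24/7 deployment, so your justification should weigh both.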