
Hardware Benchmarking: Measuring Inference Performance

In local AI operations, raw clock speed matters less than Tokens Per Second (TPS): the rate at which a model produces output. In this lesson, we learn how to benchmark your local hardware and determine its capacity for industrial-scale automation.

🏗️ The 3 Primary Metrics

  1. Time to First Token (TTFT): The latency between sending a prompt and receiving the first token of the response. Critical for real-time voicebots.
  2. Tokens Per Second (TPS): The raw throughput. 5-10 TPS keeps pace with human reading; 50+ TPS is required for high-volume data processing.
  3. VRAM Utilization: How much of your GPU memory is consumed by the model weights vs. the KV cache that grows with the context window.
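The first two metrics can be measured against any streaming interface. The sketch below assumes only that your runtime exposes an iterator of tokens (Ollama, llama.cpp, and most local servers can stream this way); `token_iter` is a placeholder, not a specific API.

```python
import time

def benchmark_stream(token_iter):
    """Measure TTFT and TPS over any iterator that yields tokens.

    `token_iter` stands in for whatever streaming interface your
    runtime exposes; this is a minimal sketch, not a specific API.
    """
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in token_iter:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrived
        count += 1
    end = time.perf_counter()
    ttft = first_token_at - start if first_token_at else float("nan")
    tps = count / (end - start)
    return ttft, tps
```

Note that TPS measured this way includes TTFT in the denominator, so it will read slightly lower than the pure generation rate on short prompts.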

🛠️ Technical Snippet: TPS Calculation Logic

To measure performance in Ollama (LM Studio displays TPS directly in its UI):

# Inside an `ollama run <model>` session, enable per-request stats:
/set verbose
"Write a 500 word technical brief on RAG."
# After the response, check the printed stats for:
# eval count:    512 tokens
# eval duration: 10.2s
# TPS = eval count / eval duration = 512 / 10.2 ≈ 50 TPS
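The same stats are available programmatically. Ollama's REST API returns `eval_count` (tokens generated) and `eval_duration` (in nanoseconds) in its `/api/generate` response; the sketch below assumes a default Ollama install listening on localhost:11434.

```python
import json
import urllib.request

def tps_from_stats(stats: dict) -> float:
    """Compute TPS from Ollama's timing fields (eval_duration is in nanoseconds)."""
    return stats["eval_count"] / (stats["eval_duration"] / 1e9)

def ollama_tps(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    """Run one non-streaming generation and return its TPS."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return tps_from_stats(json.load(resp))
```

Plugging in the numbers from the snippet above, `tps_from_stats({"eval_count": 512, "eval_duration": 10_200_000_000})` reproduces the ~50 TPS figure.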

🔍 Nuance: Thermal Throttling

Unlike a typical gaming workload, LLM inference pins the GPU at near-full load for minutes or hours at a time. If your "laptop server" hits 90°C, the card will throttle and your TPS can drop by 50% or more. A professional setup requires active cooling (a cooling stand for laptops, proper airflow for server racks).
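You can watch for throttling during a long benchmark by polling the GPU temperature. The query below uses `nvidia-smi` (NVIDIA GPUs only); the 90°C threshold and 5-sample window are illustrative defaults, so adjust them to your card's spec sheet.

```python
import subprocess

THROTTLE_C = 90  # threshold from the lesson; adjust per your GPU's spec sheet

def read_gpu_temp() -> int:
    """Query the current GPU core temperature via nvidia-smi (NVIDIA only)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"])
    return int(out.split()[0])

def is_throttling(temps: list, threshold: int = THROTTLE_C, window: int = 5) -> bool:
    """Flag sustained heat: the last `window` samples are all at/above threshold."""
    recent = temps[-window:]
    return len(recent) == window and all(t >= threshold for t in recent)
```

Requiring several consecutive hot samples avoids flagging a brief spike during prompt processing as sustained throttling.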


⚡ Practice Lab: The Multi-Model Benchmark

  1. Load: Load Llama-3-8B at Q4 quantization. Record the TPS.
  2. Scale: Load the same model at Q8 (higher fidelity). Record the TPS drop.
  3. Analyze: Determine the "Fidelity vs. Speed" trade-off for your specific hardware.
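The lab's analysis step reduces to one calculation. The helper below summarizes the trade-off; the example numbers are placeholders, so plug in your own Q4 and Q8 measurements.

```python
def fidelity_speed_tradeoff(q4_tps: float, q8_tps: float) -> dict:
    """Summarize the Q4 vs. Q8 trade-off from two measured TPS values.

    Inputs are your own lab measurements; the values used in the
    usage note below are placeholders, not reference numbers.
    """
    drop_pct = (q4_tps - q8_tps) / q4_tps * 100
    return {"q4_tps": q4_tps, "q8_tps": q8_tps, "tps_drop_pct": round(drop_pct, 1)}
```

For example, measuring 50 TPS at Q4 and 30 TPS at Q8 means Q8's higher fidelity costs you 40% of your throughput on that hardware.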

📝 Homework: The TPS Target

Identify a task that requires processing 100,000 words. Based on your current TPS, calculate how many hours it would take to finish. Propose one hardware upgrade to cut that time in half.
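The homework math can be sketched directly. The 1.33 tokens-per-word ratio is a rough English-language average and varies by tokenizer, so treat the result as an estimate.

```python
TOKENS_PER_WORD = 1.33  # rough English average; varies by model tokenizer

def hours_for_job(words: int, tps: float) -> float:
    """Estimate wall-clock hours to process `words` at a given throughput."""
    tokens = words * TOKENS_PER_WORD
    return tokens / tps / 3600  # seconds of generation -> hours
```

At 10 TPS, the 100,000-word task comes out to roughly 3.7 hours; any upgrade that doubles your TPS halves that figure.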