LM Studio & Ollama Setup: Local Inference Deployment

Running private models requires a reliable deployment environment. In this lesson, we set up the two industry-standard tools for local inference: LM Studio (for discovery and interactive testing) and Ollama (for automation).

🏗️ The Tooling Comparison

| Feature | LM Studio | Ollama |
| --- | --- | --- |
| Interface | GUI (desktop app) | CLI (terminal) |
| Best for | Testing context and quantization | Running background bot services |
| API | Local server (OpenAI-compatible) | REST API / local server |
| GPU support | Auto-detect (Nvidia/Mac) | Auto-detect (CUDA/MLX/Vulkan) |

🛠️ Technical Snippet: The Ollama Command Line

Essential commands for managing your local "Scout" models:

# 1. Pull (download) a model
ollama pull llama3:8b

# 2. Check active models and VRAM usage
ollama ps

# 3. Run a lightweight model for technical scoring
ollama run phi3

🔍 Nuance: Model Quantization Levels

In LM Studio, look for models in the GGUF file format and always check the quantization level (e.g., Q4_K_M). Higher numbers mean higher fidelity but require more VRAM. For agency-scale lead scoring, Q4 is usually the best balance of speed and output quality.
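To make the VRAM trade-off concrete, you can estimate a GGUF file's size from its parameter count and quantization level. The bits-per-weight figures below are approximate averages for common llama.cpp quant types (real files vary because different tensors use mixed precision), so treat this as a back-of-envelope sketch:

```python
# Rough GGUF size / VRAM-floor estimator. Bits-per-weight values are
# approximate averages for common llama.cpp quantization types.
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5}

def approx_size_gb(params_billion: float, quant: str) -> float:
    """Approximate model file size (and minimum VRAM) in gigabytes."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

# An 8B model at Q4_K_M lands around 4-5 GB; at Q8_0 it roughly doubles.
```

Remember to leave headroom beyond the file size itself: the KV cache for your context window also lives in VRAM.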


⚡ Practice Lab: The CLI Execution

  1. Install: Set up Ollama on your machine.
  2. Execute: Run ollama run llama3 and ask it to "Extract all emails from this text: [Paste messy text]."
  3. Bench: Record the time taken. This is your baseline for private, zero-cost data extraction.
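Once the interactive run works, the same extraction call can be scripted against Ollama's REST API (served on http://localhost:11434 by default, with /api/generate as the generation endpoint). A minimal stdlib-only sketch; the model tag and prompt text are just examples:

```python
# Scripting the email-extraction exercise against Ollama's REST API.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the POST request; stream=False makes Ollama return one JSON object."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_request("llama3", "Extract all emails from this text: reach me at ana@example.com")
# With the Ollama daemon running, send it and read the reply:
#   with urllib.request.urlopen(req) as resp:
#       print(json.loads(resp.read())["response"])
```

Wrapping the send in time.perf_counter() calls gives you the baseline timing the lab asks for.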

📝 Homework: The Local API Connection

Enable the "Local Server" in LM Studio. Use a Python script (with the openai library) to send a request to http://localhost:1234/v1 and verify that your script receives responses from your local model.
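For reference, here is the raw HTTP request your script needs to produce: LM Studio's local server speaks the OpenAI chat-completions format, so a POST to /v1/chat/completions carries the same payload the openai client sends. This stdlib-only sketch shows the wire format; the model name is a placeholder, since LM Studio answers with whichever model is loaded:

```python
# Raw-HTTP view of the homework request against LM Studio's local server.
import json
import urllib.request

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat-completion request for the local server."""
    payload = json.dumps({
        "model": "local-model",  # placeholder; the loaded model responds
        "messages": [{"role": "user", "content": prompt}],
    })
    return urllib.request.Request(
        LMSTUDIO_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Reply with one word: pong")
# With the Local Server enabled in LM Studio:
#   with urllib.request.urlopen(req) as resp:
#       reply = json.loads(resp.read())["choices"][0]["message"]["content"]
```

With the official openai client, the equivalent is OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio") followed by client.chat.completions.create(...); the api_key value is ignored by LM Studio but required by the client.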