Running private models requires a reliable deployment environment. In this lesson, we set up the two de facto industry standards for local inference: LM Studio (a GUI for discovery and testing) and Ollama (a CLI for automation).
| Feature | LM Studio | Ollama |
|---|---|---|
| Interface | GUI (Desktop App) | CLI (Terminal) |
| Best For | Testing context and quantization. | Running background bot services. |
| API | Local server (OpenAI compatible). | REST API / Local server. |
| GPU Support | Auto-detect (Nvidia/Mac). | Auto-detect (CUDA/MLX/Vulkan). |
Essential commands for managing your local "Scout" models:
```shell
# 1. Pull a model without running it
ollama pull llama3:8b

# 2. Check active models and VRAM usage
ollama ps

# 3. Run a lightweight model for technical scoring
ollama run phi3
```
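Beyond the CLI, Ollama exposes a local REST API (by default at `http://localhost:11434`), which is what makes it suitable for background bot services. A minimal sketch using only the Python standard library; the model name (`phi3`) and the prompt are illustrative assumptions:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str) -> bytes:
    """Serialize a non-streaming generate request for Ollama's REST API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(payload).encode("utf-8")

if __name__ == "__main__":
    # Requires a running Ollama instance with the model already pulled.
    body = build_generate_request("phi3", "Score this lead from 1-10: ...")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])
```

With `stream` set to `False`, the server returns one JSON object whose `response` field holds the full completion, which is simpler to handle in a scoring script than the default streamed chunks.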
In LM Studio, look for models in the GGUF file format, and always check the quantization level (e.g., Q4_K_M). Higher quantization numbers mean more fidelity but require more VRAM. For agency-scale lead scoring, Q4 is a good balance of speed and reasoning quality.
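A rough way to sanity-check whether a quantized model fits your GPU is to estimate VRAM from the parameter count and bits per weight. This is a back-of-the-envelope sketch, not an exact figure: the 20% overhead factor (for KV cache and runtime buffers) and the ~4.5 bits/weight average for Q4_K_M are assumptions for illustration:

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight storage at the quantized bit width,
    inflated by an assumed overhead factor for KV cache and buffers."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# An 8B model at ~4.5 bits/weight (Q4_K_M) lands around 5-6 GB,
# while the same model at Q8 needs roughly twice the weight memory.
print(round(estimate_vram_gb(8, 4.5), 1))
```

If the estimate exceeds your available VRAM, either step down a quantization level or pick a smaller parameter count.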
To practice, try these two exercises:

1. Run `ollama run llama3` and ask it to "Extract all emails from this text: [Paste messy text]."
2. Enable the "Local Server" in LM Studio, then use a Python script (with the `openai` library) to send a request to `http://localhost:1234/v1`. Verify your script can receive AI responses from your local model.
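The second exercise can be sketched as follows. LM Studio's Local Server speaks the OpenAI chat-completions protocol, so the standard `openai` client works when pointed at `http://localhost:1234/v1`; the model name and API key below are placeholders (LM Studio ignores the key and serves whichever model you have loaded):

```python
def build_messages(prompt: str) -> list[dict]:
    """Build a chat-completions message list from a plain user prompt."""
    return [{"role": "user", "content": prompt}]

if __name__ == "__main__":
    # Requires `pip install openai` and LM Studio's Local Server running.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
    response = client.chat.completions.create(
        model="local-model",  # placeholder; LM Studio uses the loaded model
        messages=build_messages("Extract all emails from this text: ..."),
    )
    print(response.choices[0].message.content)
```

Because the endpoint is OpenAI-compatible, any script written against it can later be repointed at a hosted API (or vice versa) by changing only `base_url`.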