Running private models requires a reliable deployment environment. In this lesson, we set up the two de facto industry standards for local inference: LM Studio (a GUI for discovery and testing) and Ollama (a CLI for automation).
| Feature | LM Studio | Ollama |
|---|---|---|
| Interface | GUI (Desktop App) | CLI (Terminal) |
| Best For | Testing context and quantization. | Running background bot services. |
| API | Local server (OpenAI compatible). | REST API / Local server. |
| GPU Support | Auto-detect (Nvidia/Mac). | Auto-detect (CUDA/MLX/Vulkan). |
Essential commands for managing your local "Scout" models:
```shell
# 1. Pull a model without running it
ollama pull llama3:8b

# 2. Check active models and VRAM usage
ollama ps

# 3. Run a lightweight model for technical scoring
ollama run phi3
```
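Beyond the CLI, Ollama exposes a local REST API (by default at `http://localhost:11434`), which is what makes it suitable for background bot services. A minimal sketch using only the Python standard library; the model name (`phi3`) and the prompt are illustrative assumptions:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str) -> bytes:
    """Serialize a non-streaming generate request for Ollama's REST API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(payload).encode("utf-8")

if __name__ == "__main__":
    # Requires a running Ollama instance with the model already pulled.
    body = build_generate_request("phi3", "Score this lead from 1-10: ...")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["response"])
```

With `stream` set to `False`, the server returns one JSON object whose `response` field holds the full completion, which is simpler to handle in a scoring script than the default streamed chunks.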
In LM Studio, look for models in the GGUF file format, and always check the quantization level (e.g., Q4_K_M). Higher quantization numbers mean more fidelity but require more VRAM. For agency-scale lead scoring, Q4 is a good balance of speed and reasoning quality.
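A rough way to sanity-check whether a quantized model fits your GPU is to estimate VRAM from the parameter count and bits per weight. This is a back-of-the-envelope sketch, not an exact figure: the 20% overhead factor (for KV cache and runtime buffers) and the ~4.5 bits/weight average for Q4_K_M are assumptions for illustration:

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight storage at the quantized bit width,
    inflated by an assumed overhead factor for KV cache and buffers."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# An 8B model at ~4.5 bits/weight (Q4_K_M) lands around 5-6 GB,
# while the same model at Q8 needs roughly twice the weight memory.
print(round(estimate_vram_gb(8, 4.5), 1))
```

If the estimate exceeds your available VRAM, either step down a quantization level or pick a smaller parameter count.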
To practice, try these two exercises:

1. Run `ollama run llama3` and ask it to "Extract all emails from this text: [Paste messy text]."
2. Enable the "Local Server" in LM Studio, then use a Python script (with the `openai` library) to send a request to `http://localhost:1234/v1`. Verify your script can receive AI responses from your local model.
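The second exercise can be sketched as follows. LM Studio's Local Server speaks the OpenAI chat-completions protocol, so the standard `openai` client works when pointed at `http://localhost:1234/v1`; the model name and API key below are placeholders (LM Studio ignores the key and serves whichever model you have loaded):

```python
def build_messages(prompt: str) -> list[dict]:
    """Build a chat-completions message list from a plain user prompt."""
    return [{"role": "user", "content": prompt}]

if __name__ == "__main__":
    # Requires `pip install openai` and LM Studio's Local Server running.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
    response = client.chat.completions.create(
        model="local-model",  # placeholder; LM Studio uses the loaded model
        messages=build_messages("Extract all emails from this text: ..."),
    )
    print(response.choices[0].message.content)
```

Because the endpoint is OpenAI-compatible, any script written against it can later be repointed at a hosted API (or vice versa) by changing only `base_url`.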