Running state-of-the-art models locally is the ultimate flex for an automation agency. In this lesson, we deploy Llama 3 (70B) and DeepSeek-V3 on optimized local stacks, tuning for maximum tokens per second (TPS).
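Before choosing hardware, it helps to do the back-of-envelope VRAM math. A rough sketch, assuming a dense model at 4-bit quantization and ignoring KV-cache and activation overhead:

```shell
# Rough VRAM estimate: bytes ~= params * (bits / 8).
# Overhead for KV cache and activations adds roughly 10-20% on top.
params_b=70   # billions of parameters (Llama 3 70B)
bits=4        # q4 quantization
weights_gb=$((params_b * bits / 8))
echo "~${weights_gb} GB of VRAM for weights alone"   # prints ~35 GB
```

So a 4-bit 70B model needs roughly 35 GB for weights before any context is loaded, which is why a single 24 GB consumer card is not enough on its own.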
To run Llama 3 with optimized memory usage:
# Pull the quantized 70B version
ollama run llama3:70b-instruct-q4_K_M
# Increase the context window to 8k. The `run` command has no --context
# flag; set num_ctx from inside the interactive session instead:
ollama run llama3:70b-instruct-q4_K_M
>>> /set parameter num_ctx 8192
If your agency runs 10+ bots simultaneously, Ollama becomes a bottleneck. At that point we move to vLLM, whose PagedAttention memory manager lets a single GPU batch dozens of parallel requests without running out of memory.
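A minimal launch sketch for vLLM's OpenAI-compatible server. The GPU count and concurrency cap below are assumptions; adjust them to your hardware:

```shell
# Serve Llama 3 70B through vLLM's OpenAI-compatible API.
# --tensor-parallel-size shards the weights across 4 GPUs;
# --max-num-seqs caps how many sequences are batched concurrently.
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-num-seqs 64 \
  --port 8000
```

Your bots then talk to `http://localhost:8000/v1` with any OpenAI-compatible client, so no per-bot code changes are needed when you swap Ollama out.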
Exercise: Design a hardware/software stack for a private agency server. It must be capable of running a 70B model for final drafting and an 8B model for technical scouting simultaneously.
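One possible starting point, as a sketch rather than the answer: pin each model to its own GPUs and run two vLLM servers side by side. GPU indices, ports, and model ids here are assumptions:

```shell
# 70B drafting model sharded across GPUs 0-3; 8B scout on GPU 4.
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 --port 8000 &

CUDA_VISIBLE_DEVICES=4 vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --port 8001 &
```

Isolating the models on separate devices keeps the 8B scout's latency stable even when the 70B server is saturated with drafting requests.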