Run Llama Locally: Complete Guide to Local LLM Deployment
Deploy Llama and other open-source LLMs locally. Learn Ollama, llama.cpp, quantization, and build private AI applications without cloud APIs.
Moshiour Rahman
Why Run LLMs Locally?
Running Large Language Models locally offers privacy, cost savings, and offline capability. With tools like Ollama and llama.cpp, you can run powerful models on consumer hardware.
Benefits
| Cloud APIs | Local LLMs |
|---|---|
| Usage costs | One-time setup |
| Data leaves device | Complete privacy |
| Internet required | Works offline |
| Rate limits | Unlimited usage |
| Vendor lock-in | Full control |
Ollama: Easiest Setup
Installation
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows
# Download from https://ollama.com/download
# Start Ollama service
ollama serve
Basic Usage
# Pull and run a model
ollama run llama3.2
# List available models
ollama list
# Pull specific model
ollama pull llama3.2:3b
ollama pull codellama:13b
ollama pull mistral:7b
# Run with custom parameters
ollama run llama3.2 --verbose
# Remove model
ollama rm llama3.2
Popular Models
# General purpose
ollama pull llama3.2 # Meta's latest
ollama pull mistral # Fast and efficient
ollama pull gemma2 # Google's model
# Code generation
ollama pull codellama # Code-focused Llama
ollama pull deepseek-coder # Coding specialist
ollama pull starcoder2 # Code completion
# Small/Fast models
ollama pull phi3 # Microsoft's small model
ollama pull tinyllama # Very lightweight
ollama pull qwen2:0.5b # Ultra-small
# Large models (need more RAM)
ollama pull llama3.1:70b # Powerful, needs 48GB+
ollama pull mixtral # Mixture of experts
Python Integration
import ollama
# Simple generation
response = ollama.generate(
model='llama3.2',
prompt='What is machine learning?'
)
print(response['response'])
# Chat format
response = ollama.chat(
model='llama3.2',
messages=[
{'role': 'system', 'content': 'You are a helpful coding assistant.'},
{'role': 'user', 'content': 'Write a Python function to reverse a string.'}
]
)
print(response['message']['content'])
# Streaming
for chunk in ollama.chat(
model='llama3.2',
messages=[{'role': 'user', 'content': 'Explain quantum computing'}],
stream=True
):
    print(chunk['message']['content'], end='', flush=True)
REST API
import requests
# Generate endpoint
response = requests.post('http://localhost:11434/api/generate', json={
'model': 'llama3.2',
'prompt': 'Why is the sky blue?',
'stream': False
})
print(response.json()['response'])
# Chat endpoint
response = requests.post('http://localhost:11434/api/chat', json={
'model': 'llama3.2',
'messages': [
{'role': 'user', 'content': 'Hello!'}
],
'stream': False
})
print(response.json()['message']['content'])
# List models
response = requests.get('http://localhost:11434/api/tags')
for model in response.json()['models']:
print(f"{model['name']}: {model['size']}")
llama.cpp: Maximum Performance
Installation
# Clone repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build (CPU)
make
# Build with CUDA (NVIDIA GPU)
make LLAMA_CUDA=1
# Build with Metal (Apple Silicon)
make LLAMA_METAL=1
# Build with OpenBLAS
make LLAMA_OPENBLAS=1
# Note: recent llama.cpp releases build with CMake instead of make
# (cmake -B build && cmake --build build --config Release)
Download Models
# Download from Hugging Face
pip install huggingface-hub
# Download GGUF model
huggingface-cli download TheBloke/Llama-2-7B-GGUF \
llama-2-7b.Q4_K_M.gguf \
--local-dir ./models
# Or use Python
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
repo_id="TheBloke/Llama-2-7B-GGUF",
filename="llama-2-7b.Q4_K_M.gguf",
local_dir="./models"
)
Run Inference
# Basic inference (the main binary is named llama-cli in recent llama.cpp builds)
./main -m models/llama-2-7b.Q4_K_M.gguf \
-p "What is the capital of France?" \
-n 100
# Interactive mode
./main -m models/llama-2-7b.Q4_K_M.gguf \
--interactive \
--color \
-n 256
# With GPU layers
./main -m models/llama-2-7b.Q4_K_M.gguf \
-ngl 35 \
-p "Explain machine learning"
# Server mode (the binary is named llama-server in recent builds)
./server -m models/llama-2-7b.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 35
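Once the server is up, you can query it over HTTP. A minimal sketch against its /completion endpoint (recent builds also expose an OpenAI-compatible /v1/chat/completions route); the prompt and sampling values here are only examples:
import requests
# Query the llama.cpp server started above on port 8080
response = requests.post('http://localhost:8080/completion', json={
    'prompt': 'What is the capital of France?',
    'n_predict': 100,
    'temperature': 0.7
})
print(response.json()['content'])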
Python Bindings
from llama_cpp import Llama
# Load model
llm = Llama(
model_path="models/llama-2-7b.Q4_K_M.gguf",
n_ctx=2048, # Context window
n_gpu_layers=35, # GPU layers (0 for CPU only)
verbose=False
)
# Generate text
output = llm(
"What is Python programming?",
max_tokens=256,
temperature=0.7,
top_p=0.9,
stop=["User:", "\n\n"]
)
print(output['choices'][0]['text'])
# Chat completion
output = llm.create_chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write a haiku about coding."}
],
temperature=0.7
)
print(output['choices'][0]['message']['content'])
# Streaming
for chunk in llm(
"Explain neural networks",
max_tokens=256,
stream=True
):
    print(chunk['choices'][0]['text'], end='', flush=True)
Quantization
Understanding Quantization
Model Size vs Quality Trade-offs:
Q2_K - Smallest, lowest quality (not recommended)
Q3_K_S - Very small, low quality
Q3_K_M - Small, decent quality
Q4_0 - Small, good quality
Q4_K_S - Small, better quality
Q4_K_M - Medium, good balance ← Recommended
Q5_0 - Medium, high quality
Q5_K_S - Medium, higher quality
Q5_K_M - Medium-large, very good quality
Q6_K - Large, excellent quality
Q8_0 - Largest quantized, near-original quality
F16 - Half precision, original quality
F32 - Full precision, maximum quality
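A useful rule of thumb: the file size is roughly parameters × effective bits per weight / 8, plus a small amount of metadata. A back-of-the-envelope sketch (the effective bit counts below are approximations, not exact GGUF figures):
# Rough GGUF size estimate: params * effective bits per weight / 8 (metadata ignored)
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

for name, bits in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5), ("F16", 16.0)]:
    print(f"7B {name}: ~{approx_size_gb(7e9, bits):.1f} GB")
# 7B Q4_K_M: ~4.2 GB, Q5_K_M: ~5.0 GB, Q8_0: ~7.4 GB, F16: ~14.0 GB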
Convert Models
# Convert HF model to GGUF (recent llama.cpp ships this script as convert_hf_to_gguf.py)
python convert.py models/llama-2-7b-hf \
--outfile models/llama-2-7b.gguf \
--outtype f16
# Quantize model (the binary is named llama-quantize in recent builds)
./quantize models/llama-2-7b.gguf \
models/llama-2-7b.Q4_K_M.gguf \
Q4_K_M
LangChain Integration
from langchain_community.llms import Ollama
from langchain_community.chat_models import ChatOllama
from langchain.prompts import ChatPromptTemplate
# Note: newer LangChain versions move the Ollama integrations into the langchain-ollama package
# Basic Ollama LLM
llm = Ollama(model="llama3.2")
response = llm.invoke("What is Python?")
print(response)
# Chat model
chat = ChatOllama(model="llama3.2", temperature=0.7)
# With prompt template
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful {role}."),
("user", "{question}")
])
chain = prompt | chat
response = chain.invoke({
"role": "Python expert",
"question": "How do I use list comprehensions?"
})
print(response.content)
# RAG with local embeddings
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
embeddings = OllamaEmbeddings(model="nomic-embed-text")
# docs: a list of Document objects from your loader / text splitter
vectorstore = Chroma.from_documents(
documents=docs,
embedding=embeddings,
persist_directory="./chroma_db"
)
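To query the vector store, wire a retriever into a small LCEL chain. A sketch that reuses the chat model defined above; the prompt wording and the format_docs helper are just illustrative choices:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
retriever = vectorstore.as_retriever()
rag_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context."),
    ("user", "Context:\n{context}\n\nQuestion: {question}")
])
def format_docs(documents):
    # Concatenate retrieved chunks into a single context string
    return "\n\n".join(d.page_content for d in documents)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | chat
    | StrOutputParser()
)
print(rag_chain.invoke("What do the documents say about deployment?"))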
Build a Local Chatbot
import ollama
import gradio as gr
class LocalChatbot:
    def __init__(self, model: str = "llama3.2"):
        self.model = model
        self.history = []
    def chat(self, message: str) -> str:
        self.history.append({"role": "user", "content": message})
        response = ollama.chat(
            model=self.model,
            messages=self.history
        )
        assistant_message = response['message']['content']
        self.history.append({"role": "assistant", "content": assistant_message})
        return assistant_message
    def clear(self):
        self.history = []
# Gradio interface
chatbot = LocalChatbot()
def respond(message, history):
    response = chatbot.chat(message)
    # Clear the textbox and append the new (user, assistant) turn to the chat window
    return "", history + [(message, response)]
def clear_history():
    chatbot.clear()
    return []
with gr.Blocks() as demo:
    gr.Markdown("# Local LLM Chatbot")
    chatbot_ui = gr.Chatbot()
    msg = gr.Textbox(placeholder="Type your message...")
    clear = gr.Button("Clear")
    msg.submit(respond, [msg, chatbot_ui], [msg, chatbot_ui])
    clear.click(clear_history, [], [chatbot_ui])
demo.launch()
FastAPI Server
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import ollama
app = FastAPI()
class ChatRequest(BaseModel):
    model: str = "llama3.2"
    messages: list[dict]
    temperature: float = 0.7
    max_tokens: int = 500
class GenerateRequest(BaseModel):
    model: str = "llama3.2"
    prompt: str
    temperature: float = 0.7
    max_tokens: int = 500
@app.post("/chat")
async def chat(request: ChatRequest):
    try:
        response = ollama.chat(
            model=request.model,
            messages=request.messages,
            options={
                "temperature": request.temperature,
                "num_predict": request.max_tokens
            }
        )
        return {"response": response['message']['content']}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
@app.post("/generate")
async def generate(request: GenerateRequest):
    try:
        response = ollama.generate(
            model=request.model,
            prompt=request.prompt,
            options={
                "temperature": request.temperature,
                "num_predict": request.max_tokens
            }
        )
        return {"response": response['response']}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
@app.get("/models")
async def list_models():
    models = ollama.list()
    return {"models": [m['name'] for m in models['models']]}
# Run: uvicorn server:app --reload
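With the server running (uvicorn defaults to port 8000), any HTTP client can hit it. A quick sanity check using requests against the /chat route defined above:
import requests
r = requests.post('http://localhost:8000/chat', json={
    'messages': [{'role': 'user', 'content': 'Hello!'}]
})
print(r.json()['response'])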
Hardware Requirements
RAM Guidelines
| Model Size | Minimum RAM | Recommended |
|---|---|---|
| 7B Q4 | 8GB | 16GB |
| 13B Q4 | 16GB | 32GB |
| 30B Q4 | 32GB | 64GB |
| 70B Q4 | 48GB | 128GB |
GPU VRAM
For full GPU inference:
- 7B Q4: ~4GB VRAM
- 13B Q4: ~8GB VRAM
- 30B Q4: ~20GB VRAM
- 70B Q4: ~40GB VRAM
Partial offloading:
- Use n_gpu_layers (or -ngl) to control how many layers are offloaded to the GPU
- Higher values are faster but use more VRAM (see the sketch below)
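A rough way to pick n_gpu_layers when the whole model does not fit: divide the VRAM you can spare by the approximate size of one layer (model file size / layer count). A sketch of that heuristic; the layer counts and sizes are approximations, and KV-cache overhead is ignored:
# Heuristic: how many transformer layers fit in a given VRAM budget?
def layers_that_fit(model_size_gb: float, n_layers: int, vram_budget_gb: float) -> int:
    per_layer_gb = model_size_gb / n_layers  # ignores KV cache and activations
    return min(n_layers, int(vram_budget_gb / per_layer_gb))

# e.g. a ~4 GB 7B Q4 model (32 layers) on a GPU with 6 GB free: all layers fit
print(layers_that_fit(4.0, 32, 6.0))   # -> 32
# a ~26 GB 30B-class Q4 model (60 layers) with 12 GB free: offload ~27 layers
print(layers_that_fit(26.0, 60, 12.0)) # -> 27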
Summary
| Tool | Best For |
|---|---|
| Ollama | Easy setup, beginners |
| llama.cpp | Maximum performance |
| LangChain | Building applications |
| vLLM | Production serving |
Local LLMs provide privacy, cost savings, and full control over your AI applications.