Local LLMs with Ollama: Build AI Agents with Zero API Costs
Run AI agents 100% locally with Ollama. Learn to set up Llama 3.2, Mistral, and DeepSeek, then build production-ready agents that work offline with full privacy.
Moshiour Rahman
AI Agents Mastery Series
This is Part 3 of our comprehensive AI Agents series.
| Part | Topic | Level |
|---|---|---|
| 1 | Fundamentals - Build from Scratch | Beginner |
| 2 | LangGraph Deep Dive | Intermediate |
| 3 | Local LLMs with Ollama | Intermediate |
| 4 | Tool-Using Agents | Intermediate |
| 5 | Multi-Agent Systems | Advanced |
| 6 | Production Deployment | Advanced |
Why Local LLMs?
Cloud APIs are great, but they have downsides:
| Cloud APIs | Local LLMs |
|---|---|
| Pay per token ($$$) | Free after download |
| Data sent to third party | 100% private |
| Requires internet | Works offline |
| Rate limits | Unlimited requests |
| Vendor lock-in | Model freedom |
In 2025, local LLMs have become production-viable. Models like Llama 3.2, Mistral, and DeepSeek run efficiently on consumer hardware.
Ollama: Docker for LLMs
Ollama packages LLMs like Docker packages applications—everything in one command:
ollama run llama3.2
That’s it. Model downloaded, configured, and running.
System Requirements
| Model Size | RAM Required | GPU (Optional) | Example Models |
|---|---|---|---|
| 1-3B | 8GB | Not needed | Llama 3.2 1B, Phi-3 Mini |
| 7-8B | 16GB | RTX 3060+ | Llama 3.1 8B, Mistral 7B |
| 13B+ | 32GB | RTX 3080+ | Llama 2 13B, CodeLlama |
| 70B | 64GB+ | RTX 4090 / A100 | Llama 2 70B |
Installing Ollama
macOS
brew install ollama
Or download from ollama.ai
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
Download the installer from ollama.ai or use WSL2.
Verify Installation
ollama --version
# ollama version 0.4.x
Your First Local Model
# Download and run Llama 3.2 (3B - fast, good quality)
ollama run llama3.2
# You're now in an interactive chat!
>>> What is the capital of France?
The capital of France is Paris...
Type /bye or press Ctrl+D to exit.
Popular Models for Agents
| Model | Size | Best For | Command |
|---|---|---|---|
| llama3.2 | 3B | General, fast | ollama run llama3.2 |
| llama3.1:8b | 8B | Better reasoning | ollama run llama3.1:8b |
| mistral | 7B | Balanced performance | ollama run mistral |
| deepseek-r1 | 7B | Reasoning tasks | ollama run deepseek-r1 |
| codellama | 7B | Code generation | ollama run codellama |
| qwen2.5-coder | 7B | Code + chat | ollama run qwen2.5-coder |
Download Models
# Pull without running
ollama pull llama3.2
ollama pull mistral
ollama pull codellama
# List downloaded models
ollama list
# Remove a model
ollama rm llama3.2
Ollama API: OpenAI Compatible
Ollama exposes a local REST API, plus an OpenAI-compatible endpoint (shown below). Start the server if it isn't already running:
# Ollama runs as a service, but you can also start manually
ollama serve
The API is available at http://localhost:11434.
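Before wiring anything up, it's worth confirming the server is reachable. Here is a minimal sketch (the file name and helper are illustrative) that queries the /api/tags endpoint, which lists the models you have pulled locally:
# check_ollama.py
import requests
def ollama_is_up(base_url: str = "http://localhost:11434") -> bool:
    """Return True if the Ollama server responds, and print the pulled models."""
    try:
        response = requests.get(f"{base_url}/api/tags", timeout=5)
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"Ollama not reachable: {exc}")
        return False
    models = [m["name"] for m in response.json().get("models", [])]
    print("Local models:", ", ".join(models) if models else "(none pulled yet)")
    return True
if __name__ == "__main__":
    ollama_is_up()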
Direct API Usage
# ollama_api.py
import requests
def chat_with_ollama(prompt: str, model: str = "llama3.2") -> str:
"""Chat with Ollama using the REST API."""
response = requests.post(
"http://localhost:11434/api/generate",
json={
"model": model,
"prompt": prompt,
"stream": False
}
)
return response.json()["response"]
# Test it
print(chat_with_ollama("Explain recursion in programming in 2 sentences."))
OpenAI-Compatible Endpoint
Ollama also provides an OpenAI-compatible endpoint at /v1:
# ollama_openai_compat.py
from openai import OpenAI
# Point to Ollama's local server
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Required but not used
)
response = client.chat.completions.create(
model="llama3.2",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Python?"}
]
)
print(response.choices[0].message.content)
This means any code using OpenAI’s SDK works with Ollama just by changing the base URL!
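The compatibility layer also covers streaming, so the usual OpenAI SDK streaming pattern carries over unchanged. A minimal sketch (file name is illustrative):
# ollama_openai_stream.py
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a limerick about local LLMs"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta; content can be None on the final chunk
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()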
Python Ollama Library
For a native experience, use the official library:
pip install ollama
# ollama_native.py
import ollama
# Simple generation
response = ollama.generate(
model='llama3.2',
prompt='Why is the sky blue?'
)
print(response['response'])
# Chat format
response = ollama.chat(
model='llama3.2',
messages=[
{'role': 'user', 'content': 'What is the capital of Japan?'}
]
)
print(response['message']['content'])
# Streaming
for chunk in ollama.chat(
model='llama3.2',
messages=[{'role': 'user', 'content': 'Write a haiku about coding'}],
stream=True
):
print(chunk['message']['content'], end='', flush=True)
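The library also ships an async client, which is useful when an agent needs several generations in flight at once. A minimal sketch using ollama.AsyncClient (file name is illustrative):
# ollama_async.py
import asyncio
from ollama import AsyncClient
async def main():
    client = AsyncClient()
    response = await client.chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Name three uses for a local LLM'}]
    )
    print(response['message']['content'])
asyncio.run(main())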
Ollama + LangChain Integration
LangChain has built-in Ollama support:
pip install langchain-ollama
# ollama_langchain.py
from langchain_ollama import OllamaLLM, ChatOllama
# Basic LLM
llm = OllamaLLM(model="llama3.2")
response = llm.invoke("Explain quantum computing simply")
print(response)
# Chat model (recommended for agents)
chat = ChatOllama(model="llama3.2")
response = chat.invoke([
{"role": "system", "content": "You are a Python expert."},
{"role": "user", "content": "What's the difference between list and tuple?"}
])
print(response.content)
Build a Local AI Agent
Now let’s build a fully local agent using LangGraph + Ollama:
# local_agent.py
import os
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode, tools_condition
from langchain_ollama import ChatOllama
from langchain_core.tools import tool
from datetime import datetime
import subprocess
# Initialize local LLM
llm = ChatOllama(
model="llama3.2",
temperature=0 # More deterministic for tool calling
)
# Define tools
@tool
def get_current_time() -> str:
"""Get the current date and time."""
return datetime.now().strftime("%Y-%m-%d %H:%M:%S")
@tool
def run_python_code(code: str) -> str:
"""Execute Python code and return the output.
Use this for calculations or data processing."""
try:
# Create a safe execution environment
result = subprocess.run(
['python', '-c', code],
capture_output=True,
text=True,
timeout=10
)
if result.returncode == 0:
return result.stdout or "Code executed successfully (no output)"
return f"Error: {result.stderr}"
except subprocess.TimeoutExpired:
return "Error: Code execution timed out"
except Exception as e:
return f"Error: {str(e)}"
@tool
def read_file(filepath: str) -> str:
"""Read the contents of a file."""
try:
with open(filepath, 'r') as f:
content = f.read()
if len(content) > 2000:
return content[:2000] + "\n... (truncated)"
return content
except FileNotFoundError:
return f"Error: File '{filepath}' not found"
except Exception as e:
return f"Error reading file: {str(e)}"
@tool
def write_file(filepath: str, content: str) -> str:
"""Write content to a file."""
try:
with open(filepath, 'w') as f:
f.write(content)
return f"Successfully wrote to {filepath}"
except Exception as e:
return f"Error writing file: {str(e)}"
@tool
def list_directory(path: str = ".") -> str:
"""List files and directories in a path."""
try:
items = os.listdir(path)
return "\n".join(items) if items else "Directory is empty"
except Exception as e:
return f"Error: {str(e)}"
tools = [get_current_time, run_python_code, read_file, write_file, list_directory]
# Bind tools to LLM
llm_with_tools = llm.bind_tools(tools)
# State definition
class State(TypedDict):
messages: Annotated[list, add_messages]
# Agent node
def agent(state: State) -> State:
"""The agent decides what to do."""
response = llm_with_tools.invoke(state["messages"])
return {"messages": [response]}
# Build the graph
graph_builder = StateGraph(State)
graph_builder.add_node("agent", agent)
graph_builder.add_node("tools", ToolNode(tools=tools))
graph_builder.add_edge(START, "agent")
graph_builder.add_conditional_edges("agent", tools_condition)
graph_builder.add_edge("tools", "agent")
# Compile
local_agent = graph_builder.compile()
def run_local_agent(query: str) -> str:
"""Run a query through the local agent."""
print(f"\n{'='*60}")
print(f"🦙 LOCAL AGENT (Ollama + LangGraph)")
print(f"Query: {query}")
print('='*60)
result = local_agent.invoke({
"messages": [{"role": "user", "content": query}]
})
# Get final response
final_message = result["messages"][-1]
response = final_message.content if hasattr(final_message, 'content') else str(final_message)
print(f"\n📝 Response:\n{response}")
return response
if __name__ == "__main__":
# Test queries
run_local_agent("What time is it right now?")
run_local_agent("Calculate the factorial of 10 using Python code")
run_local_agent("List the files in the current directory")
Handling Tool Calling with Local Models
Not all local models support function/tool calling natively. Here’s a pattern that works with any model:
# universal_tool_agent.py
import json
import re
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
llm = ChatOllama(model="llama3.2")
TOOLS = {
"calculator": {
"description": "Performs math calculations",
"usage": 'calculator(expression="2+2")'
},
"get_time": {
"description": "Gets current time",
"usage": "get_time()"
}
}
def execute_tool(name: str, args: dict) -> str:
if name == "calculator":
try:
expr = args.get("expression", "0")
allowed = set('0123456789+-*/.() ')
if all(c in allowed for c in expr):
return str(eval(expr))
return "Invalid expression"
except:
return "Calculation error"
elif name == "get_time":
from datetime import datetime
return datetime.now().strftime("%Y-%m-%d %H:%M:%S")
return f"Unknown tool: {name}"
def create_prompt() -> str:
tool_list = "\n".join([
f"- {name}: {info['description']}. Usage: {info['usage']}"
for name, info in TOOLS.items()
])
return f"""You are an AI assistant with access to tools.
Available tools:
{tool_list}
When you need a tool, respond ONLY with this exact format:
TOOL: tool_name
ARGS: {{"param": "value"}}
When you have the final answer, respond normally without TOOL/ARGS.
Think step by step."""
def parse_tool_call(response: str) -> tuple:
"""Parse tool call from response."""
tool_match = re.search(r'TOOL:\s*(\w+)', response)
args_match = re.search(r'ARGS:\s*({.+?})', response, re.DOTALL)
if tool_match:
tool_name = tool_match.group(1)
args = {}
if args_match:
try:
args = json.loads(args_match.group(1))
except:
pass
return tool_name, args
return None, None
class UniversalToolAgent:
def __init__(self, max_iterations: int = 5):
self.max_iterations = max_iterations
def run(self, query: str) -> str:
messages = [
SystemMessage(content=create_prompt()),
HumanMessage(content=query)
]
for i in range(self.max_iterations):
response = llm.invoke(messages)
response_text = response.content
print(f"\n[Iteration {i+1}] Agent: {response_text[:200]}...")
tool_name, args = parse_tool_call(response_text)
if tool_name:
print(f" → Tool: {tool_name}({args})")
result = execute_tool(tool_name, args)
print(f" → Result: {result}")
messages.append(AIMessage(content=response_text))
messages.append(HumanMessage(content=f"Tool result: {result}"))
else:
# No tool call - this is the final answer
return response_text
return "Max iterations reached"
# Test
if __name__ == "__main__":
agent = UniversalToolAgent()
print(agent.run("What is 15 * 47 + 123?"))
print(agent.run("What time is it?"))
Performance Optimization
GPU Acceleration
If you have a supported GPU (NVIDIA, AMD via ROCm, or Apple Silicon), Ollama uses it automatically. Verify:
ollama run llama3.2 --verbose   # prints token-throughput stats after each response
# In another terminal, run "ollama ps"; the PROCESSOR column shows the GPU/CPU split
Model Quantization
Smaller quantized models run faster with minimal quality loss:
# 4-bit quantized (fastest, smallest)
ollama run llama3.2:3b-instruct-q4_0
# 8-bit quantized (balanced)
ollama run llama3.2:3b-instruct-q8_0
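To see what quantization buys you on your own hardware, time the same prompt against two tags. A rough sketch, assuming both tags below are already pulled (substitute whatever ollama list shows):
# quant_benchmark.py
import time
import ollama
PROMPT = "Summarize the benefits of quantization in two sentences."
TAGS = ["llama3.2:3b-instruct-q4_0", "llama3.2:3b-instruct-q8_0"]  # adjust to tags you have pulled
for tag in TAGS:
    start = time.time()
    response = ollama.generate(model=tag, prompt=PROMPT)
    elapsed = time.time() - start
    print(f"{tag}: {elapsed:.1f}s for {len(response['response'])} characters")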
Concurrent Requests
Ollama can serve multiple requests in parallel (tunable with the OLLAMA_NUM_PARALLEL environment variable):
# concurrent_requests.py
import asyncio
import aiohttp
import time
async def query_ollama(session, prompt, model="llama3.2"):
async with session.post(
"http://localhost:11434/api/generate",
json={"model": model, "prompt": prompt, "stream": False}
) as response:
result = await response.json()
return result["response"]
async def main():
queries = [
"What is Python?",
"Explain JavaScript",
"What is Rust?",
"Describe Go language"
]
start = time.time()
async with aiohttp.ClientSession() as session:
tasks = [query_ollama(session, q) for q in queries]
results = await asyncio.gather(*tasks)
elapsed = time.time() - start
print(f"Processed {len(queries)} queries in {elapsed:.2f}s")
for q, r in zip(queries, results):
print(f"\nQ: {q}\nA: {r[:100]}...")
asyncio.run(main())
Custom Models with Modelfiles
Create specialized models with custom system prompts:
# Modelfile
FROM llama3.2
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM """You are a senior Python developer. You write clean,
efficient, well-documented code. You always include type hints
and follow PEP 8 style guidelines. When explaining code, you
break it down step by step."""
Build and run:
ollama create python-expert -f Modelfile
ollama run python-expert
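Once created, the custom model behaves like any other tag. A minimal sketch calling it from the Python library shown earlier (file name is illustrative):
# use_custom_model.py
import ollama
response = ollama.chat(
    model='python-expert',  # the model created from the Modelfile above
    messages=[{'role': 'user', 'content': 'Write a function that deduplicates a list while preserving order.'}]
)
print(response['message']['content'])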
Comparing Models
Here’s a qualitative comparison of popular models for agent tasks:
| Model | Tool Calling | Reasoning | Speed | RAM |
|---|---|---|---|---|
| llama3.2 (3B) | Good | Good | Fast | 8GB |
| llama3.1 (8B) | Better | Better | Medium | 16GB |
| mistral (7B) | Good | Good | Fast | 16GB |
| deepseek-r1 (7B) | Excellent | Excellent | Medium | 16GB |
| qwen2.5-coder (7B) | Good | Good (code) | Fast | 16GB |
For most agent workloads on consumer hardware, llama3.2 (3B) offers the best balance of speed and capability; step up to a 7-8B model when reasoning or tool-calling quality matters more than latency.
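Rankings like this depend on your hardware and prompts, so it's worth a quick spot check of your own. A rough sketch that sends one agent-style prompt to each model and reports latency (the tags listed are examples; substitute whatever ollama list shows):
# compare_models.py
import time
import ollama
MODELS = ["llama3.2", "mistral", "qwen2.5-coder"]  # substitute tags you have pulled
PROMPT = "You may call a tool named get_time(). Decide whether you need it to answer: what day is it?"
for model in MODELS:
    start = time.time()
    response = ollama.chat(model=model, messages=[{"role": "user", "content": PROMPT}])
    elapsed = time.time() - start
    preview = response["message"]["content"].replace("\n", " ")[:80]
    print(f"{model:18s} {elapsed:5.1f}s  {preview}...")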
Common Issues & Solutions
| Issue | Cause | Solution |
|---|---|---|
| Slow responses | No GPU / small RAM | Use smaller model, add GPU |
| "Model not found" | Not pulled | ollama pull model-name |
| Connection refused | Ollama not running | ollama serve |
| Out of memory | Model too large | Use quantized version |
| Poor tool calling | Model limitation | Use structured prompts |
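Most of these failures can be caught before an agent starts. Here is a hedged preflight sketch (the helper below is illustrative, not part of Ollama) that verifies the server is up and the model is pulled:
# preflight.py
import requests
def preflight(model: str = "llama3.2", base_url: str = "http://localhost:11434") -> None:
    """Raise a readable error if Ollama or the requested model is unavailable."""
    try:
        tags = requests.get(f"{base_url}/api/tags", timeout=5).json()
    except requests.RequestException as exc:
        raise RuntimeError("Ollama is not running. Start it with 'ollama serve'.") from exc
    names = [m["name"] for m in tags.get("models", [])]
    if not any(n == model or n.startswith(f"{model}:") for n in names):
        raise RuntimeError(f"Model '{model}' is not pulled. Run 'ollama pull {model}'.")
    print(f"OK: Ollama is up and '{model}' is available")
if __name__ == "__main__":
    preflight()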
Summary
| What You Learned | Key Takeaway |
|---|---|
| Why local LLMs | Privacy, cost savings, offline capability |
| Ollama basics | Pull, run, and manage models |
| API usage | REST API + OpenAI compatibility |
| LangChain integration | ChatOllama for agents |
| Tool calling | Works with proper prompting |
| Optimization | GPU, quantization, concurrency |
What’s Next?
In Part 4, we’ll build agents with real-world tools—web search, code execution, file operations, and API integrations.
Continue to Part 4: Tool-Using Agents →
Full Code Repository
git clone https://github.com/Moshiour027/ai-agents-mastery.git
cd ai-agents-mastery/03-ollama
pip install -r requirements.txt
python local_agent.py