
Local LLMs with Ollama: Build AI Agents with Zero API Costs

Run AI agents 100% locally with Ollama. Learn to set up Llama 3.2, Mistral, and DeepSeek, then build production-ready agents that work offline with full privacy.


Moshiour Rahman


AI Agents Mastery Series

This is Part 3 of our comprehensive AI Agents series.

| Part | Topic | Level |
|------|-------|-------|
| 1 | Fundamentals - Build from Scratch | Beginner |
| 2 | LangGraph Deep Dive | Intermediate |
| 3 | Local LLMs with Ollama | Intermediate |
| 4 | Tool-Using Agents | Intermediate |
| 5 | Multi-Agent Systems | Advanced |
| 6 | Production Deployment | Advanced |

Why Local LLMs?

Cloud APIs are great, but they have downsides:

| Cloud APIs | Local LLMs |
|------------|------------|
| Pay per token ($$$) | Free after download |
| Data sent to third party | 100% private |
| Requires internet | Works offline |
| Rate limits | Unlimited requests |
| Vendor lock-in | Model freedom |

In 2025, local LLMs have become production-viable. Models like Llama 3.2, Mistral, and DeepSeek run efficiently on consumer hardware.

Ollama: Docker for LLMs

Ollama packages LLMs like Docker packages applications—everything in one command:

ollama run llama3.2

That’s it. Model downloaded, configured, and running.

System Requirements

| Model Size | RAM Required | GPU (Optional) | Example Models |
|------------|--------------|----------------|----------------|
| 1-3B | 8GB | Not needed | Llama 3.2 1B, Phi-3 Mini |
| 7-8B | 16GB | RTX 3060+ | Llama 3.1 8B, Mistral 7B |
| 13B+ | 32GB | RTX 3080+ | Llama 2 13B, CodeLlama |
| 70B | 64GB+ | RTX 4090 / A100 | Llama 2 70B |

Installing Ollama

macOS

brew install ollama

Or download from ollama.ai

Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the installer from ollama.ai or use WSL2.

Verify Installation

ollama --version
# ollama version 0.4.x

Your First Local Model

# Download and run Llama 3.2 (3B - fast, good quality)
ollama run llama3.2

# You're now in an interactive chat!
>>> What is the capital of France?
The capital of France is Paris...

Press Ctrl+D to exit.

Popular models to try:

| Model | Size | Best For | Command |
|-------|------|----------|---------|
| llama3.2 | 3B | General, fast | ollama run llama3.2 |
| llama3.1:8b | 8B | Better reasoning | ollama run llama3.1:8b |
| mistral | 7B | Balanced performance | ollama run mistral |
| deepseek-r1 | 7B | Reasoning tasks | ollama run deepseek-r1 |
| codellama | 7B | Code generation | ollama run codellama |
| qwen2.5-coder | 7B | Code + chat | ollama run qwen2.5-coder |

Download Models

# Pull without running
ollama pull llama3.2
ollama pull mistral
ollama pull codellama

# List downloaded models
ollama list

# Remove a model
ollama rm llama3.2

Ollama API: OpenAI Compatible

Ollama exposes an API that’s compatible with OpenAI’s format. Start the server:

# Ollama runs as a service, but you can also start manually
ollama serve

The API is available at http://localhost:11434.
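
To sanity-check the server from Python, you can query the /api/tags endpoint, which lists the models available locally (a minimal sketch using requests):

# check_ollama.py
import requests

# /api/tags returns the models you've pulled so far
resp = requests.get("http://localhost:11434/api/tags")
resp.raise_for_status()

for model in resp.json().get("models", []):
    print(model["name"])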

Direct API Usage

# ollama_api.py
import requests

def chat_with_ollama(prompt: str, model: str = "llama3.2") -> str:
    """Chat with Ollama using the REST API."""

    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        }
    )

    return response.json()["response"]

# Test it
print(chat_with_ollama("Explain recursion in programming in 2 sentences."))
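
To get tokens as they are generated, set "stream": True; the endpoint then emits one JSON object per line, each carrying a piece of the response plus a final "done" flag (a sketch of the same endpoint in streaming mode):

# ollama_api_stream.py
import json
import requests

def stream_from_ollama(prompt: str, model: str = "llama3.2") -> None:
    """Stream a response from the Ollama REST API, printing tokens as they arrive."""
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as response:
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):
                break
    print()

stream_from_ollama("Explain recursion in programming in 2 sentences.")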

OpenAI-Compatible Endpoint

Ollama also provides an OpenAI-compatible endpoint at /v1:

# ollama_openai_compat.py
from openai import OpenAI

# Point to Ollama's local server
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but not used
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"}
    ]
)

print(response.choices[0].message.content)

This means any code using OpenAI’s SDK works with Ollama just by changing the base URL!
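
Streaming works through the /v1 endpoint as well. Here's a brief sketch using the OpenAI SDK's stream=True option, with chunks arriving as deltas just like the hosted API:

# ollama_openai_stream.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Give me three tips for writing tests."}],
    stream=True,
)

# Each chunk carries an incremental delta of the assistant message
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()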

Python Ollama Library

For a native experience, use the official library:

pip install ollama
# ollama_native.py
import ollama

# Simple generation
response = ollama.generate(
    model='llama3.2',
    prompt='Why is the sky blue?'
)
print(response['response'])

# Chat format
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'What is the capital of Japan?'}
    ]
)
print(response['message']['content'])

# Streaming
for chunk in ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a haiku about coding'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)
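
The package also ships an async client, which fits naturally into async agents and web backends. A minimal sketch (assuming the library's AsyncClient, mirroring the sync calls above):

# ollama_async.py
import asyncio
from ollama import AsyncClient

async def main():
    # Same chat interface as above, just awaited
    response = await AsyncClient().chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Name three common uses for Python.'}]
    )
    print(response['message']['content'])

asyncio.run(main())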

Ollama + LangChain Integration

LangChain has built-in Ollama support:

pip install langchain-ollama
# ollama_langchain.py
from langchain_ollama import OllamaLLM, ChatOllama

# Basic LLM
llm = OllamaLLM(model="llama3.2")
response = llm.invoke("Explain quantum computing simply")
print(response)

# Chat model (recommended for agents)
chat = ChatOllama(model="llama3.2")
response = chat.invoke([
    {"role": "system", "content": "You are a Python expert."},
    {"role": "user", "content": "What's the difference between list and tuple?"}
])
print(response.content)
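
ChatOllama composes with the rest of LangChain as well; for example, a small prompt-template chain (a sketch using the prompt | model | parser pattern):

# ollama_chain.py
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise technical writer."),
    ("human", "Summarize {topic} in two sentences."),
])

# prompt -> local model -> plain string
chain = prompt | ChatOllama(model="llama3.2") | StrOutputParser()
print(chain.invoke({"topic": "vector databases"}))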

Build a Local AI Agent

Now let’s build a fully local agent using LangGraph + Ollama:

# local_agent.py
import os
from typing import Annotated, TypedDict, Literal
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode, tools_condition
from langchain_ollama import ChatOllama
from langchain_core.tools import tool
from datetime import datetime
import subprocess

# Initialize local LLM
llm = ChatOllama(
    model="llama3.2",
    temperature=0  # More deterministic for tool calling
)

# Define tools
@tool
def get_current_time() -> str:
    """Get the current date and time."""
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")

@tool
def run_python_code(code: str) -> str:
    """Execute Python code and return the output.
    Use this for calculations or data processing."""
    try:
        # Create a safe execution environment
        result = subprocess.run(
            ['python', '-c', code],
            capture_output=True,
            text=True,
            timeout=10
        )
        if result.returncode == 0:
            return result.stdout or "Code executed successfully (no output)"
        return f"Error: {result.stderr}"
    except subprocess.TimeoutExpired:
        return "Error: Code execution timed out"
    except Exception as e:
        return f"Error: {str(e)}"

@tool
def read_file(filepath: str) -> str:
    """Read the contents of a file."""
    try:
        with open(filepath, 'r') as f:
            content = f.read()
            if len(content) > 2000:
                return content[:2000] + "\n... (truncated)"
            return content
    except FileNotFoundError:
        return f"Error: File '{filepath}' not found"
    except Exception as e:
        return f"Error reading file: {str(e)}"

@tool
def write_file(filepath: str, content: str) -> str:
    """Write content to a file."""
    try:
        with open(filepath, 'w') as f:
            f.write(content)
        return f"Successfully wrote to {filepath}"
    except Exception as e:
        return f"Error writing file: {str(e)}"

@tool
def list_directory(path: str = ".") -> str:
    """List files and directories in a path."""
    try:
        items = os.listdir(path)
        return "\n".join(items) if items else "Directory is empty"
    except Exception as e:
        return f"Error: {str(e)}"

tools = [get_current_time, run_python_code, read_file, write_file, list_directory]

# Bind tools to LLM
llm_with_tools = llm.bind_tools(tools)

# State definition
class State(TypedDict):
    messages: Annotated[list, add_messages]

# Agent node
def agent(state: State) -> State:
    """The agent decides what to do."""
    response = llm_with_tools.invoke(state["messages"])
    return {"messages": [response]}

# Build the graph
graph_builder = StateGraph(State)

graph_builder.add_node("agent", agent)
graph_builder.add_node("tools", ToolNode(tools=tools))

graph_builder.add_edge(START, "agent")
graph_builder.add_conditional_edges("agent", tools_condition)
graph_builder.add_edge("tools", "agent")

# Compile
local_agent = graph_builder.compile()

def run_local_agent(query: str) -> str:
    """Run a query through the local agent."""
    print(f"\n{'='*60}")
    print(f"🦙 LOCAL AGENT (Ollama + LangGraph)")
    print(f"Query: {query}")
    print('='*60)

    result = local_agent.invoke({
        "messages": [{"role": "user", "content": query}]
    })

    # Get final response
    final_message = result["messages"][-1]
    response = final_message.content if hasattr(final_message, 'content') else str(final_message)

    print(f"\n📝 Response:\n{response}")
    return response

if __name__ == "__main__":
    # Test queries
    run_local_agent("What time is it right now?")

    run_local_agent("Calculate the factorial of 10 using Python code")

    run_local_agent("List the files in the current directory")

Handling Tool Calling with Local Models

Not all local models support function/tool calling natively. Here’s a pattern that works with any model:

# universal_tool_agent.py
import json
import re
from langchain_ollama import ChatOllama
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage

llm = ChatOllama(model="llama3.2")

TOOLS = {
    "calculator": {
        "description": "Performs math calculations",
        "usage": 'calculator(expression="2+2")'
    },
    "get_time": {
        "description": "Gets current time",
        "usage": "get_time()"
    }
}

def execute_tool(name: str, args: dict) -> str:
    if name == "calculator":
        try:
            expr = args.get("expression", "0")
            allowed = set('0123456789+-*/.() ')
            if all(c in allowed for c in expr):
                return str(eval(expr))
            return "Invalid expression"
        except Exception:
            return "Calculation error"
    elif name == "get_time":
        from datetime import datetime
        return datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    return f"Unknown tool: {name}"

def create_prompt() -> str:
    tool_list = "\n".join([
        f"- {name}: {info['description']}. Usage: {info['usage']}"
        for name, info in TOOLS.items()
    ])

    return f"""You are an AI assistant with access to tools.

Available tools:
{tool_list}

When you need a tool, respond ONLY with this exact format:
TOOL: tool_name
ARGS: {{"param": "value"}}

When you have the final answer, respond normally without TOOL/ARGS.

Think step by step."""

def parse_tool_call(response: str) -> tuple:
    """Parse tool call from response."""
    tool_match = re.search(r'TOOL:\s*(\w+)', response)
    args_match = re.search(r'ARGS:\s*({.+?})', response, re.DOTALL)

    if tool_match:
        tool_name = tool_match.group(1)
        args = {}
        if args_match:
            try:
                args = json.loads(args_match.group(1))
            except Exception:
                pass
        return tool_name, args

    return None, None

class UniversalToolAgent:
    def __init__(self, max_iterations: int = 5):
        self.max_iterations = max_iterations

    def run(self, query: str) -> str:
        messages = [
            SystemMessage(content=create_prompt()),
            HumanMessage(content=query)
        ]

        for i in range(self.max_iterations):
            response = llm.invoke(messages)
            response_text = response.content

            print(f"\n[Iteration {i+1}] Agent: {response_text[:200]}...")

            tool_name, args = parse_tool_call(response_text)

            if tool_name:
                print(f"  → Tool: {tool_name}({args})")
                result = execute_tool(tool_name, args)
                print(f"  → Result: {result}")

                messages.append(AIMessage(content=response_text))
                messages.append(HumanMessage(content=f"Tool result: {result}"))
            else:
                # No tool call - this is the final answer
                return response_text

        return "Max iterations reached"

# Test
if __name__ == "__main__":
    agent = UniversalToolAgent()
    print(agent.run("What is 15 * 47 + 123?"))
    print(agent.run("What time is it?"))

Performance Optimization

GPU Acceleration

If you have an NVIDIA GPU, Ollama uses it automatically. Verify:

# With a model loaded, check where it's running:
ollama ps
# The PROCESSOR column shows the CPU/GPU split (e.g. "100% GPU")

Model Quantization

Smaller quantized models run faster with minimal quality loss:

# 4-bit quantized (fastest, smallest)
ollama run llama3.2:3b-instruct-q4_0

# 8-bit quantized (balanced)
ollama run llama3.2:3b-instruct-q8_0
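
One way to see the difference is to time the same prompt against two tags (a rough comparison sketch; it assumes both tags have already been pulled, and note that the first request to each tag includes model load time):

# compare_quantization.py
import time
import ollama

PROMPT = "Summarize what a hash map is in two sentences."

for tag in ["llama3.2:3b-instruct-q4_0", "llama3.2:3b-instruct-q8_0"]:
    start = time.perf_counter()
    response = ollama.generate(model=tag, prompt=PROMPT)
    elapsed = time.perf_counter() - start
    print(f"{tag}: {elapsed:.1f}s")
    print(response['response'][:120], "...\n")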

Concurrent Requests

Ollama queues incoming requests automatically, and recent versions can serve several in parallel (tunable via the OLLAMA_NUM_PARALLEL environment variable):

# concurrent_requests.py
import asyncio
import aiohttp
import time

async def query_ollama(session, prompt, model="llama3.2"):
    async with session.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False}
    ) as response:
        result = await response.json()
        return result["response"]

async def main():
    queries = [
        "What is Python?",
        "Explain JavaScript",
        "What is Rust?",
        "Describe Go language"
    ]

    start = time.time()

    async with aiohttp.ClientSession() as session:
        tasks = [query_ollama(session, q) for q in queries]
        results = await asyncio.gather(*tasks)

    elapsed = time.time() - start
    print(f"Processed {len(queries)} queries in {elapsed:.2f}s")

    for q, r in zip(queries, results):
        print(f"\nQ: {q}\nA: {r[:100]}...")

asyncio.run(main())

Custom Models with Modelfiles

Create specialized models with custom system prompts:

# Modelfile
FROM llama3.2

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

SYSTEM """You are a senior Python developer. You write clean,
efficient, well-documented code. You always include type hints
and follow PEP 8 style guidelines. When explaining code, you
break it down step by step."""

Build and run:

ollama create python-expert -f Modelfile
ollama run python-expert
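
The custom model is addressed by the name passed to ollama create, so it works through any of the client paths shown earlier; for instance, with the Python library:

# use_custom_model.py
import ollama

# 'python-expert' is the model created from the Modelfile above
response = ollama.chat(
    model='python-expert',
    messages=[{'role': 'user', 'content': 'Write a function that reverses a string.'}]
)
print(response['message']['content'])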

Comparing Models

Here's how popular models compare on typical agent workloads:

| Model | Tool Calling | Reasoning | Speed | RAM |
|-------|--------------|-----------|-------|-----|
| llama3.2 (3B) | Good | Good | Fast | 8GB |
| llama3.1 (8B) | Better | Better | Medium | 16GB |
| mistral (7B) | Good | Good | Fast | 16GB |
| deepseek-r1 (7B) | Excellent | Excellent | Medium | 16GB |
| qwen2.5-coder (7B) | Good | Good (code) | Fast | 16GB |

For most agent tasks, llama3.2 (3B) offers the best balance of speed and capability.

Common Issues & Solutions

| Issue | Cause | Solution |
|-------|-------|----------|
| Slow responses | No GPU / small RAM | Use a smaller model, add a GPU |
| "Model not found" | Model not pulled | ollama pull model-name |
| Connection refused | Ollama not running | ollama serve |
| Out of memory | Model too large | Use a quantized version |
| Poor tool calling | Model limitation | Use structured prompts |

Summary

| What You Learned | Key Takeaway |
|------------------|--------------|
| Why local LLMs | Privacy, cost savings, offline capability |
| Ollama basics | Pull, run, and manage models |
| API usage | REST API + OpenAI compatibility |
| LangChain integration | ChatOllama for agents |
| Tool calling | Works with proper prompting |
| Optimization | GPU, quantization, concurrency |

What’s Next?

In Part 4, we’ll build agents with real-world tools—web search, code execution, file operations, and API integrations.

Continue to Part 4: Tool-Using Agents →

Full Code Repository

git clone https://github.com/Moshiour027/ai-agents-mastery.git
cd ai-agents-mastery/03-ollama
pip install -r requirements.txt
python local_agent.py

Moshiour Rahman

Software Architect & AI Engineer

Enterprise software architect with deep expertise in financial systems, distributed architecture, and AI-powered applications. Building large-scale systems at Fortune 500 companies. Specializing in LLM orchestration, multi-agent systems, and cloud-native solutions. I share battle-tested patterns from real enterprise projects.