
Run Llama Locally: Complete Guide to Local LLM Deployment

Deploy Llama and other open-source LLMs locally. Learn Ollama, llama.cpp, quantization, and build private AI applications without cloud APIs.

Moshiour Rahman

Why Run LLMs Locally?

Running Large Language Models locally offers privacy, cost savings, and offline capability. With tools like Ollama and llama.cpp, you can run powerful models on consumer hardware.

Benefits

| Cloud APIs | Local LLMs |
|---|---|
| Usage costs | One-time setup |
| Data leaves device | Complete privacy |
| Internet required | Works offline |
| Rate limits | Unlimited usage |
| Vendor lock-in | Full control |

Ollama: Easiest Setup

Installation

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

# Start Ollama service
ollama serve
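
Once the service is up, you can confirm it is reachable before pulling any models. A minimal check, assuming Ollama is listening on its default port 11434:

import requests

# Ollama listens on http://localhost:11434 by default
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
print("Ollama is running;", len(resp.json().get("models", [])), "model(s) installed")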

Basic Usage

# Pull and run a model
ollama run llama3.2

# List available models
ollama list

# Pull specific model
ollama pull llama3.1:8b     # 8B variant (Llama 3.2 itself ships as 1B/3B)
ollama pull codellama:13b
ollama pull mistral:7b

# Run with custom parameters
ollama run llama3.2 --verbose

# Remove model
ollama rm llama3.2

Popular Models

# General purpose
ollama pull llama3.2        # Meta's latest
ollama pull mistral         # Fast and efficient
ollama pull gemma2          # Google's model

# Code generation
ollama pull codellama       # Code-focused Llama
ollama pull deepseek-coder  # Coding specialist
ollama pull starcoder2      # Code completion

# Small/Fast models
ollama pull phi3            # Microsoft's small model
ollama pull tinyllama       # Very lightweight
ollama pull qwen2:0.5b      # Ultra-small

# Large models (need more RAM)
ollama pull llama3.1:70b    # Powerful, needs 48GB+
ollama pull mixtral         # Mixture of experts
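
Models can also be managed from Python instead of the CLI. A small sketch using the official ollama client, assuming a client version that returns plain dictionaries (as in the FastAPI example later in this guide):

import ollama

# Download a model programmatically (equivalent to `ollama pull phi3`)
ollama.pull('phi3')

# Show what is installed locally
for model in ollama.list()['models']:
    print(model['name'])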

Python Integration

import ollama

# Simple generation
response = ollama.generate(
    model='llama3.2',
    prompt='What is machine learning?'
)
print(response['response'])

# Chat format
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'system', 'content': 'You are a helpful coding assistant.'},
        {'role': 'user', 'content': 'Write a Python function to reverse a string.'}
    ]
)
print(response['message']['content'])

# Streaming
for chunk in ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Explain quantum computing'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)
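
Generation settings are passed through an options dictionary. A short sketch; the option names below (temperature, num_predict, num_ctx) are Ollama's standard model parameters:

response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Summarize what a vector database is.'}],
    options={
        'temperature': 0.2,   # lower = more deterministic
        'num_predict': 200,   # cap on generated tokens
        'num_ctx': 4096       # context window size
    }
)
print(response['message']['content'])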

REST API

import requests

# Generate endpoint
response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3.2',
    'prompt': 'Why is the sky blue?',
    'stream': False
})
print(response.json()['response'])

# Chat endpoint
response = requests.post('http://localhost:11434/api/chat', json={
    'model': 'llama3.2',
    'messages': [
        {'role': 'user', 'content': 'Hello!'}
    ],
    'stream': False
})
print(response.json()['message']['content'])

# List models
response = requests.get('http://localhost:11434/api/tags')
for model in response.json()['models']:
    print(f"{model['name']}: {model['size']}")

llama.cpp: Maximum Performance

Installation

# Clone repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build with CUDA (NVIDIA GPU)
make LLAMA_CUDA=1

# Build with Metal (Apple Silicon)
make LLAMA_METAL=1

# Build with OpenBLAS
make LLAMA_OPENBLAS=1

# Note: recent llama.cpp releases build with CMake instead of make
# (cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release)
# and name the binaries llama-cli, llama-server, and llama-quantize.

Download Models

# Download from Hugging Face
pip install huggingface-hub

# Download GGUF model
huggingface-cli download TheBloke/Llama-2-7B-GGUF \
    llama-2-7b.Q4_K_M.gguf \
    --local-dir ./models

# Or use Python
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",
    filename="llama-2-7b.Q4_K_M.gguf",
    local_dir="./models"
)

Run Inference

# Basic inference
./main -m models/llama-2-7b.Q4_K_M.gguf \
    -p "What is the capital of France?" \
    -n 100

# Interactive mode
./main -m models/llama-2-7b.Q4_K_M.gguf \
    --interactive \
    --color \
    -n 256

# With GPU layers
./main -m models/llama-2-7b.Q4_K_M.gguf \
    -ngl 35 \
    -p "Explain machine learning"

# Server mode
./server -m models/llama-2-7b.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35
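
The server exposes an OpenAI-compatible HTTP API, so any OpenAI-style client can talk to it. A sketch using plain requests against the server started above; since only one model is loaded, the model field is mostly informational:

import requests

resp = requests.post('http://localhost:8080/v1/chat/completions', json={
    'model': 'llama-2-7b',
    'messages': [
        {'role': 'user', 'content': 'Give me three uses for local LLMs.'}
    ],
    'max_tokens': 200
})
print(resp.json()['choices'][0]['message']['content'])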

Python Bindings

from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="models/llama-2-7b.Q4_K_M.gguf",
    n_ctx=2048,        # Context window
    n_gpu_layers=35,   # GPU layers (0 for CPU only)
    verbose=False
)

# Generate text
output = llm(
    "What is Python programming?",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
    stop=["User:", "\n\n"]
)
print(output['choices'][0]['text'])

# Chat completion
output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about coding."}
    ],
    temperature=0.7
)
print(output['choices'][0]['message']['content'])

# Streaming
for chunk in llm(
    "Explain neural networks",
    max_tokens=256,
    stream=True
):
    print(chunk['choices'][0]['text'], end='', flush=True)

Quantization

Understanding Quantization

Model Size vs Quality Trade-offs:

Q2_K  - Smallest, lowest quality (not recommended)
Q3_K_S - Very small, low quality
Q3_K_M - Small, decent quality
Q4_0  - Small, good quality
Q4_K_S - Small, better quality
Q4_K_M - Medium, good balance ← Recommended
Q5_0  - Medium, high quality
Q5_K_S - Medium, higher quality
Q5_K_M - Medium-large, very good quality
Q6_K  - Large, excellent quality
Q8_0  - Largest quantized, near-original quality
F16   - Half precision, original quality
F32   - Full precision, maximum quality
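
As a rough rule of thumb, the file size (and the RAM needed just to hold the weights) is parameter count × bits per weight ÷ 8, plus some overhead for the KV cache and runtime. A back-of-the-envelope sketch; the 1.15 overhead factor is an assumption, not an exact figure:

def approx_model_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.15) -> float:
    """Rough estimate of memory needed for a quantized model's weights."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

# 7B model at ~4.5 bits/weight (Q4_K_M) vs 16-bit
print(f"7B Q4_K_M ≈ {approx_model_gb(7, 4.5):.1f} GB")
print(f"7B F16    ≈ {approx_model_gb(7, 16):.1f} GB")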

Convert Models

# Convert HF model to GGUF (newer llama.cpp versions ship this as convert_hf_to_gguf.py)
python convert.py models/llama-2-7b-hf \
    --outfile models/llama-2-7b.gguf \
    --outtype f16

# Quantize model
./quantize models/llama-2-7b.gguf \
    models/llama-2-7b.Q4_K_M.gguf \
    Q4_K_M

LangChain Integration

from langchain_community.llms import Ollama
from langchain_community.chat_models import ChatOllama
from langchain.prompts import ChatPromptTemplate

# Basic Ollama LLM
llm = Ollama(model="llama3.2")
response = llm.invoke("What is Python?")
print(response)

# Chat model
chat = ChatOllama(model="llama3.2", temperature=0.7)

# With prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful {role}."),
    ("user", "{question}")
])

chain = prompt | chat

response = chain.invoke({
    "role": "Python expert",
    "question": "How do I use list comprehensions?"
})
print(response.content)

# RAG with local embeddings
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# `docs` should be a list of LangChain Document objects you have already loaded and split
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
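
From there, retrieval plus the local chat model closes the RAG loop. A minimal sketch that reuses the `chat` model defined above; the question string is just an example:

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

question = "What does the installation guide say about GPU support?"
context = "\n\n".join(doc.page_content for doc in retriever.invoke(question))

answer = chat.invoke(
    f"Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)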

Build a Local Chatbot

import ollama
import gradio as gr

class LocalChatbot:
    def __init__(self, model: str = "llama3.2"):
        self.model = model
        self.history = []

    def chat(self, message: str) -> str:
        self.history.append({"role": "user", "content": message})

        response = ollama.chat(
            model=self.model,
            messages=self.history
        )

        assistant_message = response['message']['content']
        self.history.append({"role": "assistant", "content": assistant_message})

        return assistant_message

    def clear(self):
        self.history = []

# Gradio interface
chatbot = LocalChatbot()

def respond(message, history):
    response = chatbot.chat(message)
    # Append the new exchange so the Chatbot component shows the full conversation
    history = history + [(message, response)]
    return "", history

def clear_history():
    chatbot.clear()
    return []

with gr.Blocks() as demo:
    gr.Markdown("# Local LLM Chatbot")

    chatbot_ui = gr.Chatbot()
    msg = gr.Textbox(placeholder="Type your message...")
    clear = gr.Button("Clear")

    # Clear the textbox and update the chat window on submit
    msg.submit(respond, [msg, chatbot_ui], [msg, chatbot_ui])
    clear.click(clear_history, [], [chatbot_ui])

demo.launch()

FastAPI Server

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import ollama

app = FastAPI()

class ChatRequest(BaseModel):
    model: str = "llama3.2"
    messages: list[dict]
    temperature: float = 0.7
    max_tokens: int = 500

class GenerateRequest(BaseModel):
    model: str = "llama3.2"
    prompt: str
    temperature: float = 0.7
    max_tokens: int = 500

@app.post("/chat")
async def chat(request: ChatRequest):
    try:
        response = ollama.chat(
            model=request.model,
            messages=request.messages,
            options={
                "temperature": request.temperature,
                "num_predict": request.max_tokens
            }
        )
        return {"response": response['message']['content']}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/generate")
async def generate(request: GenerateRequest):
    try:
        response = ollama.generate(
            model=request.model,
            prompt=request.prompt,
            options={
                "temperature": request.temperature,
                "num_predict": request.max_tokens
            }
        )
        return {"response": response['response']}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/models")
async def list_models():
    models = ollama.list()
    return {"models": [m['name'] for m in models['models']]}

# Run: uvicorn server:app --reload
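
With the server running, any HTTP client can hit it. A quick sketch exercising the /chat route defined above; the host and port match uvicorn's defaults:

import requests

resp = requests.post('http://localhost:8000/chat', json={
    'model': 'llama3.2',
    'messages': [{'role': 'user', 'content': 'Ping?'}],
    'temperature': 0.2,
    'max_tokens': 100
})
print(resp.json()['response'])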

Hardware Requirements

RAM Guidelines

| Model Size | Minimum RAM | Recommended |
|---|---|---|
| 7B Q4 | 8GB | 16GB |
| 13B Q4 | 16GB | 32GB |
| 30B Q4 | 32GB | 64GB |
| 70B Q4 | 48GB | 128GB |

GPU VRAM

For full GPU inference:
- 7B Q4: ~4GB VRAM
- 13B Q4: ~8GB VRAM
- 30B Q4: ~20GB VRAM
- 70B Q4: ~40GB VRAM

Partial offloading:
- Use n_gpu_layers to control how many layers are offloaded to the GPU
- Offloading more layers speeds up inference but uses more VRAM

Summary

| Tool | Best For |
|---|---|
| Ollama | Easy setup, beginners |
| llama.cpp | Maximum performance |
| LangChain | Building applications |
| vLLM | Production serving |

Local LLMs provide privacy, cost savings, and full control over your AI applications.
