
Run Llama Locally: Complete Guide to Local LLM Deployment

Deploy Llama and other open-source LLMs locally. Learn Ollama, llama.cpp, quantization, and build private AI applications without cloud APIs.

Moshiour Rahman

Why Run LLMs Locally?

Running Large Language Models locally offers privacy, cost savings, and offline capability. With tools like Ollama and llama.cpp, you can run powerful models on consumer hardware.

Benefits

| Cloud APIs | Local LLMs |
|---|---|
| Usage costs | One-time setup |
| Data leaves device | Complete privacy |
| Internet required | Works offline |
| Rate limits | Unlimited usage |
| Vendor lock-in | Full control |

Ollama: Easiest Setup

Installation

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
# Download from https://ollama.com/download

# Start Ollama service
ollama serve
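
Once the service is up, you can confirm it is reachable before pulling any models. A minimal check, assuming Ollama is listening on its default port 11434:

import requests

# Ollama listens on http://localhost:11434 by default
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
print("Ollama is running;", len(resp.json().get("models", [])), "model(s) installed")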

Basic Usage

# Pull and run a model
ollama run llama3.2

# List available models
ollama list

# Pull specific model
ollama pull llama3.1:8b     # 8B variant (Llama 3.2 itself ships as 1B/3B)
ollama pull codellama:13b
ollama pull mistral:7b

# Run with custom parameters
ollama run llama3.2 --verbose

# Remove model
ollama rm llama3.2

Popular Models

# General purpose
ollama pull llama3.2        # Meta's latest
ollama pull mistral         # Fast and efficient
ollama pull gemma2          # Google's model

# Code generation
ollama pull codellama       # Code-focused Llama
ollama pull deepseek-coder  # Coding specialist
ollama pull starcoder2      # Code completion

# Small/Fast models
ollama pull phi3            # Microsoft's small model
ollama pull tinyllama       # Very lightweight
ollama pull qwen2:0.5b      # Ultra-small

# Large models (need more RAM)
ollama pull llama3.1:70b    # Powerful, needs 48GB+
ollama pull mixtral         # Mixture of experts
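
Models can also be managed from Python instead of the CLI. A small sketch using the official ollama client, assuming a client version that returns plain dictionaries (as in the FastAPI example later in this guide):

import ollama

# Download a model programmatically (equivalent to `ollama pull phi3`)
ollama.pull('phi3')

# Show what is installed locally
for model in ollama.list()['models']:
    print(model['name'])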

Python Integration

import ollama

# Simple generation
response = ollama.generate(
    model='llama3.2',
    prompt='What is machine learning?'
)
print(response['response'])

# Chat format
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'system', 'content': 'You are a helpful coding assistant.'},
        {'role': 'user', 'content': 'Write a Python function to reverse a string.'}
    ]
)
print(response['message']['content'])

# Streaming
for chunk in ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Explain quantum computing'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)
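
Generation settings are passed through an options dictionary. A short sketch; the option names below (temperature, num_predict, num_ctx) are Ollama's standard model parameters:

response = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Summarize what a vector database is.'}],
    options={
        'temperature': 0.2,   # lower = more deterministic
        'num_predict': 200,   # cap on generated tokens
        'num_ctx': 4096       # context window size
    }
)
print(response['message']['content'])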

REST API

import requests

# Generate endpoint
response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3.2',
    'prompt': 'Why is the sky blue?',
    'stream': False
})
print(response.json()['response'])

# Chat endpoint
response = requests.post('http://localhost:11434/api/chat', json={
    'model': 'llama3.2',
    'messages': [
        {'role': 'user', 'content': 'Hello!'}
    ],
    'stream': False
})
print(response.json()['message']['content'])

# List models
response = requests.get('http://localhost:11434/api/tags')
for model in response.json()['models']:
    print(f"{model['name']}: {model['size']}")

llama.cpp: Maximum Performance

Installation

# Clone repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build with CUDA (NVIDIA GPU)
make LLAMA_CUDA=1

# Build with Metal (Apple Silicon)
make LLAMA_METAL=1

# Build with OpenBLAS
make LLAMA_OPENBLAS=1

# Note: recent llama.cpp releases build with CMake instead of make
# (cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release)
# and name the binaries llama-cli, llama-server, and llama-quantize.

Download Models

# Download from Hugging Face
pip install huggingface-hub

# Download GGUF model
huggingface-cli download TheBloke/Llama-2-7B-GGUF \
    llama-2-7b.Q4_K_M.gguf \
    --local-dir ./models

# Or use Python
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",
    filename="llama-2-7b.Q4_K_M.gguf",
    local_dir="./models"
)

Run Inference

# Basic inference
./main -m models/llama-2-7b.Q4_K_M.gguf \
    -p "What is the capital of France?" \
    -n 100

# Interactive mode
./main -m models/llama-2-7b.Q4_K_M.gguf \
    --interactive \
    --color \
    -n 256

# With GPU layers
./main -m models/llama-2-7b.Q4_K_M.gguf \
    -ngl 35 \
    -p "Explain machine learning"

# Server mode
./server -m models/llama-2-7b.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35
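
The server exposes an OpenAI-compatible HTTP API, so any OpenAI-style client can talk to it. A sketch using plain requests against the server started above; since only one model is loaded, the model field is mostly informational:

import requests

resp = requests.post('http://localhost:8080/v1/chat/completions', json={
    'model': 'llama-2-7b',
    'messages': [
        {'role': 'user', 'content': 'Give me three uses for local LLMs.'}
    ],
    'max_tokens': 200
})
print(resp.json()['choices'][0]['message']['content'])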

Python Bindings

from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="models/llama-2-7b.Q4_K_M.gguf",
    n_ctx=2048,        # Context window
    n_gpu_layers=35,   # GPU layers (0 for CPU only)
    verbose=False
)

# Generate text
output = llm(
    "What is Python programming?",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
    stop=["User:", "\n\n"]
)
print(output['choices'][0]['text'])

# Chat completion
output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about coding."}
    ],
    temperature=0.7
)
print(output['choices'][0]['message']['content'])

# Streaming
for chunk in llm(
    "Explain neural networks",
    max_tokens=256,
    stream=True
):
    print(chunk['choices'][0]['text'], end='', flush=True)

Quantization

Understanding Quantization

Model Size vs Quality Trade-offs:

Q2_K  - Smallest, lowest quality (not recommended)
Q3_K_S - Very small, low quality
Q3_K_M - Small, decent quality
Q4_0  - Small, good quality
Q4_K_S - Small, better quality
Q4_K_M - Medium, good balance ← Recommended
Q5_0  - Medium, high quality
Q5_K_S - Medium, higher quality
Q5_K_M - Medium-large, very good quality
Q6_K  - Large, excellent quality
Q8_0  - Largest quantized, near-original quality
F16   - Half precision, original quality
F32   - Full precision, maximum quality
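
As a rough rule of thumb, the file size (and the RAM needed just to hold the weights) is parameter count × bits per weight ÷ 8, plus some overhead for the KV cache and runtime. A back-of-the-envelope sketch; the 1.15 overhead factor is an assumption, not an exact figure:

def approx_model_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.15) -> float:
    """Rough estimate of memory needed for a quantized model's weights."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

# 7B model at ~4.5 bits/weight (Q4_K_M) vs 16-bit
print(f"7B Q4_K_M ≈ {approx_model_gb(7, 4.5):.1f} GB")
print(f"7B F16    ≈ {approx_model_gb(7, 16):.1f} GB")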

Convert Models

# Convert HF model to GGUF (newer llama.cpp versions ship this as convert_hf_to_gguf.py)
python convert.py models/llama-2-7b-hf \
    --outfile models/llama-2-7b.gguf \
    --outtype f16

# Quantize model
./quantize models/llama-2-7b.gguf \
    models/llama-2-7b.Q4_K_M.gguf \
    Q4_K_M

LangChain Integration

from langchain_community.llms import Ollama
from langchain_community.chat_models import ChatOllama
from langchain.prompts import ChatPromptTemplate

# Basic Ollama LLM
llm = Ollama(model="llama3.2")
response = llm.invoke("What is Python?")
print(response)

# Chat model
chat = ChatOllama(model="llama3.2", temperature=0.7)

# With prompt template
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful {role}."),
    ("user", "{question}")
])

chain = prompt | chat

response = chain.invoke({
    "role": "Python expert",
    "question": "How do I use list comprehensions?"
})
print(response.content)

# RAG with local embeddings
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")

# `docs` should be a list of LangChain Document objects you have already loaded and split
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
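
From there, retrieval plus the local chat model closes the RAG loop. A minimal sketch that reuses the `chat` model defined above; the question string is just an example:

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

question = "What does the installation guide say about GPU support?"
context = "\n\n".join(doc.page_content for doc in retriever.invoke(question))

answer = chat.invoke(
    f"Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)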

Build a Local Chatbot

import ollama
import gradio as gr

class LocalChatbot:
    def __init__(self, model: str = "llama3.2"):
        self.model = model
        self.history = []

    def chat(self, message: str) -> str:
        self.history.append({"role": "user", "content": message})

        response = ollama.chat(
            model=self.model,
            messages=self.history
        )

        assistant_message = response['message']['content']
        self.history.append({"role": "assistant", "content": assistant_message})

        return assistant_message

    def clear(self):
        self.history = []

# Gradio interface
chatbot = LocalChatbot()

def respond(message, history):
    response = chatbot.chat(message)
    # Append the new exchange so the Chatbot component shows the full conversation
    history = history + [(message, response)]
    return "", history

def clear_history():
    chatbot.clear()
    return []

with gr.Blocks() as demo:
    gr.Markdown("# Local LLM Chatbot")

    chatbot_ui = gr.Chatbot()
    msg = gr.Textbox(placeholder="Type your message...")
    clear = gr.Button("Clear")

    # Clear the textbox and update the chat window on submit
    msg.submit(respond, [msg, chatbot_ui], [msg, chatbot_ui])
    clear.click(clear_history, [], [chatbot_ui])

demo.launch()

FastAPI Server

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import ollama

app = FastAPI()

class ChatRequest(BaseModel):
    model: str = "llama3.2"
    messages: list[dict]
    temperature: float = 0.7
    max_tokens: int = 500

class GenerateRequest(BaseModel):
    model: str = "llama3.2"
    prompt: str
    temperature: float = 0.7
    max_tokens: int = 500

@app.post("/chat")
async def chat(request: ChatRequest):
    try:
        response = ollama.chat(
            model=request.model,
            messages=request.messages,
            options={
                "temperature": request.temperature,
                "num_predict": request.max_tokens
            }
        )
        return {"response": response['message']['content']}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/generate")
async def generate(request: GenerateRequest):
    try:
        response = ollama.generate(
            model=request.model,
            prompt=request.prompt,
            options={
                "temperature": request.temperature,
                "num_predict": request.max_tokens
            }
        )
        return {"response": response['response']}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/models")
async def list_models():
    models = ollama.list()
    return {"models": [m['name'] for m in models['models']]}

# Run: uvicorn server:app --reload
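
With the server running, any HTTP client can hit it. A quick sketch exercising the /chat route defined above; the host and port match uvicorn's defaults:

import requests

resp = requests.post('http://localhost:8000/chat', json={
    'model': 'llama3.2',
    'messages': [{'role': 'user', 'content': 'Ping?'}],
    'temperature': 0.2,
    'max_tokens': 100
})
print(resp.json()['response'])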

Hardware Requirements

RAM Guidelines

| Model Size | Minimum RAM | Recommended |
|---|---|---|
| 7B Q4 | 8GB | 16GB |
| 13B Q4 | 16GB | 32GB |
| 30B Q4 | 32GB | 64GB |
| 70B Q4 | 48GB | 128GB |

GPU VRAM

For full GPU inference:
- 7B Q4: ~4GB VRAM
- 13B Q4: ~8GB VRAM
- 30B Q4: ~20GB VRAM
- 70B Q4: ~40GB VRAM

Partial offloading:
- Use n_gpu_layers to control how many layers are offloaded to the GPU
- Offloading more layers speeds up inference but uses more VRAM

Summary

| Tool | Best For |
|---|---|
| Ollama | Easy setup, beginners |
| llama.cpp | Maximum performance |
| LangChain | Building applications |
| vLLM | Production serving |

Local LLMs provide privacy, cost savings, and full control over your AI applications.
