Production AI Agents: Observability, Evaluation & Deployment
Deploy AI agents to production with confidence. Learn monitoring with LangSmith, evaluation strategies, security best practices, and scalable deployment patterns.
Moshiour Rahman
AI Agents Mastery Series
This is Part 6 of our comprehensive AI Agents series—the final chapter.
| Part | Topic | Level |
|---|---|---|
| 1 | Fundamentals - Build from Scratch | Beginner |
| 2 | LangGraph Deep Dive | Intermediate |
| 3 | Local LLMs with Ollama | Intermediate |
| 4 | Tool-Using Agents | Intermediate |
| 5 | Multi-Agent Systems | Advanced |
| 6 | Production Deployment | Advanced |
The Production Challenge
Development agents work on your laptop. Production agents need:
| Development | Production |
|---|---|
| "It works!" | 99.9% uptime |
| print() debugging | Structured logging |
| Manual testing | Automated evaluation |
| Trust the LLM | Verify everything |
| Single user | Thousands concurrent |
| No security | Defense in depth |
This guide covers everything you need to deploy agents responsibly.
Observability with LangSmith
LangSmith is the observability platform built by the LangChain team and the most common choice for LangChain/LangGraph applications. It captures every step of your agent's execution.
Setup
pip install langsmith
# .env
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your-langsmith-key
LANGCHAIN_PROJECT=my-agent-production
Automatic Tracing
Once configured, LangChain/LangGraph automatically traces all operations:
# traced_agent.py
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, START, END
from typing import Annotated, TypedDict
from langgraph.graph.message import add_messages
load_dotenv()
# LangSmith automatically traces when LANGCHAIN_TRACING_V2=true
class State(TypedDict):
    messages: Annotated[list, add_messages]

llm = ChatOpenAI(model="gpt-4o-mini")

def agent(state: State) -> State:
    response = llm.invoke(state["messages"])
    return {"messages": [response]}
builder = StateGraph(State)
builder.add_node("agent", agent)
builder.add_edge(START, "agent")
builder.add_edge("agent", END)
graph = builder.compile()
# Every invocation is traced to LangSmith
result = graph.invoke({"messages": [{"role": "user", "content": "Hello"}]})
Custom Trace Metadata
Add context to your traces:
from langsmith import traceable
@traceable(
    name="customer_support_agent",
    tags=["production", "customer-facing"],
    metadata={"version": "1.0.0"}
)
def handle_customer_query(query: str, customer_id: str) -> str:
    """Handle a customer support query with full tracing."""
    result = graph.invoke({
        "messages": [{"role": "user", "content": query}]
    })
    return result["messages"][-1].content

# Usage
response = handle_customer_query(
    query="How do I reset my password?",
    customer_id="cust_123"
)
What LangSmith Captures
| Data | Why It Matters |
|---|---|
| Full message chain | Debug conversation flow |
| Token counts | Cost tracking |
| Latency per step | Performance optimization |
| Tool calls & results | Debug tool interactions |
| Errors & stack traces | Quick issue resolution |
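The same data can be pulled back programmatically for dashboards or offline analysis. A minimal sketch using the langsmith SDK's Client.list_runs; the exact Run fields available can vary by SDK version:
from langsmith import Client

client = Client()

# Pull recent runs from the production project and print basic stats
for run in client.list_runs(project_name="my-agent-production", limit=20):
    latency = (run.end_time - run.start_time).total_seconds() if run.end_time else None
    print(run.name, run.total_tokens, latency, run.error)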
Evaluation Strategies
Why Evaluate?
LLMs are non-deterministic. The same input can produce different outputs. Evaluation ensures consistent quality.
Types of Evaluation
| Type | What It Tests | Example |
|---|---|---|
| Correctness | Right answer? | Math problems, factual questions |
| Relevance | On topic? | Response addresses the query |
| Faithfulness | Grounded in facts? | Claims supported by context |
| Harmlessness | Safe output? | No harmful content |
| Tool Usage | Correct tool selection? | Uses right tool for task |
Building an Evaluation Pipeline
# evaluation.py
import json
from langchain_openai import ChatOpenAI
from langsmith import Client
from langsmith.evaluation import evaluate
client = Client()
eval_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Create a dataset of test cases
test_cases = [
    {
        "input": "What is 25 * 4?",
        "expected": "100"
    },
    {
        "input": "What's the capital of France?",
        "expected": "Paris"
    },
    {
        "input": "Summarize: AI agents are autonomous programs.",
        "expected_contains": ["autonomous", "programs"]
    }
]

# Create dataset in LangSmith
dataset_name = "agent-eval-v1"
dataset = client.create_dataset(dataset_name)

for case in test_cases:
    client.create_example(
        inputs={"query": case["input"]},
        outputs={"expected": case.get("expected", case.get("expected_contains"))},
        dataset_id=dataset.id
    )

# Define evaluators
def correctness_evaluator(run, example):
    """Check if the output contains the expected answer."""
    output = run.outputs.get("output", "")
    expected = example.outputs.get("expected")
    if isinstance(expected, list):
        # Check if all expected terms are present
        score = all(term.lower() in output.lower() for term in expected)
    else:
        score = expected.lower() in output.lower()
    return {"score": 1 if score else 0, "key": "correctness"}
def relevance_evaluator(run, example):
    """Use an LLM to judge relevance."""
    query = example.inputs.get("query", "")
    output = run.outputs.get("output", "")
    response = eval_llm.invoke([{
        "role": "user",
        "content": f"""Rate the relevance of this response to the query.
Query: {query}
Response: {output}
Score 1 if relevant, 0 if not. Respond with just the number."""
    }])
    try:
        score = int(response.content.strip())
    except ValueError:
        score = 0
    return {"score": score, "key": "relevance"}
# Run evaluation
def run_evaluation(agent_func):
    """Evaluate an agent against the test dataset."""
    results = evaluate(
        agent_func,
        data=dataset_name,
        evaluators=[correctness_evaluator, relevance_evaluator],
        experiment_prefix="agent-v1"
    )
    return results
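The agent_func passed to run_evaluation should accept a dict of example inputs and return a dict of outputs containing the "output" key the evaluators above read. A minimal adapter around the graph compiled earlier:
def agent_target(inputs: dict) -> dict:
    """Run the agent on one dataset example and shape the output for the evaluators."""
    result = graph.invoke({
        "messages": [{"role": "user", "content": inputs["query"]}]
    })
    return {"output": result["messages"][-1].content}

# Kick off an experiment against the LangSmith dataset
results = run_evaluation(agent_target)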
Continuous Evaluation
Run evaluations on every deployment:
# .github/workflows/evaluate.yml
name: Agent Evaluation
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
        run: python -m pytest tests/eval/ -v
      - name: Check threshold
        run: |
          # Fail if correctness < 90%
          python scripts/check_eval_threshold.py --metric correctness --threshold 0.9
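The check_eval_threshold.py script itself isn't shown in the workflow; one possible sketch, assuming the evaluation step writes an eval_results.json file mapping metric names to average scores (a hypothetical format):
# scripts/check_eval_threshold.py (illustrative sketch)
import argparse
import json
import sys

def main() -> None:
    parser = argparse.ArgumentParser(description="Fail CI if an eval metric falls below a threshold.")
    parser.add_argument("--metric", required=True)
    parser.add_argument("--threshold", type=float, required=True)
    parser.add_argument("--results", default="eval_results.json")  # hypothetical results file
    args = parser.parse_args()

    with open(args.results) as f:
        scores = json.load(f)  # expects e.g. {"correctness": 0.93, "relevance": 0.88}

    score = scores.get(args.metric, 0.0)
    print(f"{args.metric}: {score:.2f} (threshold {args.threshold})")
    sys.exit(0 if score >= args.threshold else 1)

if __name__ == "__main__":
    main()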
Security Best Practices
Input Validation
Never trust user input:
# security.py
import re
from typing import Optional
class InputValidator:
    """Validate and sanitize user inputs."""

    MAX_LENGTH = 10000
    BLOCKED_PATTERNS = [
        r'ignore.*previous.*instructions',
        r'system.*prompt',
        r'<script>',
        r'javascript:',
    ]

    @classmethod
    def validate(cls, text: str) -> tuple[bool, Optional[str]]:
        """Validate input text.

        Returns:
            (is_valid, error_message)
        """
        if not text or not text.strip():
            return False, "Empty input"
        if len(text) > cls.MAX_LENGTH:
            return False, f"Input exceeds {cls.MAX_LENGTH} characters"
        text_lower = text.lower()
        for pattern in cls.BLOCKED_PATTERNS:
            if re.search(pattern, text_lower):
                return False, "Input contains blocked content"
        return True, None

    @classmethod
    def sanitize(cls, text: str) -> str:
        """Sanitize input text."""
        # Remove null bytes
        text = text.replace('\x00', '')
        # Limit length
        text = text[:cls.MAX_LENGTH]
        # Remove control characters (except newlines, tabs)
        text = ''.join(char for char in text if char.isprintable() or char in '\n\t')
        return text.strip()

# Usage in agent
def secure_agent(user_input: str) -> str:
    is_valid, error = InputValidator.validate(user_input)
    if not is_valid:
        return f"Invalid input: {error}"
    sanitized = InputValidator.sanitize(user_input)
    return process_with_agent(sanitized)
Tool Execution Safety
# secure_tools.py
from langchain_core.tools import tool
import subprocess
import os
SAFE_DIRECTORIES = ["/tmp/agent_workspace", "/app/data"]
BLOCKED_COMMANDS = ["rm", "sudo", "chmod", "chown", "curl", "wget"]
@tool
def secure_file_read(filepath: str) -> str:
    """Securely read a file with path validation."""
    # Resolve to absolute path
    abs_path = os.path.abspath(filepath)
    # Check against allowed directories
    if not any(abs_path.startswith(safe_dir) for safe_dir in SAFE_DIRECTORIES):
        return f"Access denied: {filepath} is outside allowed directories"
    # Prevent path traversal
    if ".." in filepath:
        return "Access denied: Path traversal detected"
    try:
        with open(abs_path, 'r') as f:
            content = f.read(100000)  # Limit read size
        return content
    except Exception as e:
        return f"Error: {str(e)}"

@tool
def secure_shell(command: str) -> str:
    """Execute shell command with restrictions."""
    cmd_parts = command.split()
    if not cmd_parts:
        return "No command provided"
    # Check for blocked commands
    if cmd_parts[0] in BLOCKED_COMMANDS:
        return f"Command '{cmd_parts[0]}' is not allowed"
    # Check for shell injection patterns
    dangerous = ['|', ';', '&', '`', '$', '>', '<']
    if any(char in command for char in dangerous):
        return "Command contains disallowed characters"
    try:
        result = subprocess.run(
            cmd_parts,
            capture_output=True,
            text=True,
            timeout=30,
            cwd="/tmp/agent_workspace"
        )
        return result.stdout or result.stderr or "Command completed"
    except subprocess.TimeoutExpired:
        return "Command timed out"
    except Exception as e:
        return f"Error: {str(e)}"
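Because these are ordinary LangChain tools, you can sanity-check the guardrails directly before wiring them into an agent (assuming /tmp/agent_workspace exists):
# Allowed: runs inside the sandbox working directory
print(secure_shell.invoke({"command": "ls -la"}))

# Blocked: command is on the deny list
print(secure_shell.invoke({"command": "rm -rf /"}))

# Blocked: path is outside the allowed directories
print(secure_file_read.invoke({"filepath": "/etc/passwd"}))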
Rate Limiting
# rate_limiting.py
import time
from collections import defaultdict
from functools import wraps
class RateLimiter:
    """Simple in-memory rate limiter."""

    def __init__(self, requests_per_minute: int = 60):
        self.requests_per_minute = requests_per_minute
        self.requests = defaultdict(list)

    def is_allowed(self, user_id: str) -> bool:
        """Check if request is allowed."""
        now = time.time()
        minute_ago = now - 60
        # Clean old requests
        self.requests[user_id] = [
            ts for ts in self.requests[user_id]
            if ts > minute_ago
        ]
        # Check limit
        if len(self.requests[user_id]) >= self.requests_per_minute:
            return False
        self.requests[user_id].append(now)
        return True

rate_limiter = RateLimiter(requests_per_minute=30)

def rate_limited(func):
    """Decorator to add rate limiting."""
    @wraps(func)
    def wrapper(user_id: str, *args, **kwargs):
        if not rate_limiter.is_allowed(user_id):
            return {"error": "Rate limit exceeded. Please try again later."}
        return func(user_id, *args, **kwargs)
    return wrapper

@rate_limited
def agent_endpoint(user_id: str, query: str) -> dict:
    """Rate-limited agent endpoint."""
    result = graph.invoke({"messages": [{"role": "user", "content": query}]})
    return {"response": result["messages"][-1].content}
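Note that this limiter is per-process: with several replicas (as in the Kubernetes deployment below), each pod enforces its own quota. A shared store such as Redis is the usual fix; a minimal fixed-window sketch, assuming a reachable Redis instance and the redis-py client:
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def is_allowed_shared(user_id: str, limit: int = 30, window_seconds: int = 60) -> bool:
    """Fixed-window rate limit shared across replicas via a Redis counter."""
    key = f"ratelimit:{user_id}"
    count = r.incr(key)              # atomically increment the per-user counter
    if count == 1:
        r.expire(key, window_seconds)  # start the window on the first request
    return count <= limit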
Deployment Patterns
Docker Deployment
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# curl is needed for the health check below (not included in the slim base image)
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# Create non-root user
RUN useradd -m appuser && chown -R appuser:appuser /app
USER appuser

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
FastAPI Service
# main.py
from fastapi import FastAPI, HTTPException, Depends
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
import os
app = FastAPI(title="AI Agent API", version="1.0.0")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://yourdomain.com"],
    allow_methods=["POST"],
    allow_headers=["Authorization", "Content-Type"],
)

class AgentRequest(BaseModel):
    query: str
    user_id: str

class AgentResponse(BaseModel):
    response: str
    trace_id: str

@app.get("/health")
def health_check():
    return {"status": "healthy"}

@app.post("/agent", response_model=AgentResponse)
async def run_agent(request: AgentRequest):
    # Validate input
    is_valid, error = InputValidator.validate(request.query)
    if not is_valid:
        raise HTTPException(status_code=400, detail=error)
    # Rate limit
    if not rate_limiter.is_allowed(request.user_id):
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    try:
        result = graph.invoke({
            "messages": [{"role": "user", "content": request.query}]
        })
        return AgentResponse(
            response=result["messages"][-1].content,
            trace_id="trace_xxx"  # From LangSmith
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail="Agent error")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
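A quick smoke test against a locally running instance, using the requests library; the payload mirrors the AgentRequest model above:
import requests

resp = requests.post(
    "http://localhost:8000/agent",
    json={"query": "How do I reset my password?", "user_id": "cust_123"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["response"])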
Kubernetes Deployment
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-agent
  template:
    metadata:
      labels:
        app: ai-agent
    spec:
      containers:
        - name: agent
          image: your-registry/ai-agent:latest
          ports:
            - containerPort: 8000
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: agent-secrets
                  key: openai-key
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: ai-agent
spec:
  selector:
    app: ai-agent
  ports:
    - port: 80
      targetPort: 8000
  type: LoadBalancer
Cost Management
Token Tracking
# cost_tracking.py
from langchain_openai import ChatOpenAI
from langchain.callbacks import get_openai_callback
def tracked_agent_call(query: str) -> dict:
    """Run agent with cost tracking."""
    with get_openai_callback() as cb:
        result = graph.invoke({
            "messages": [{"role": "user", "content": query}]
        })
    return {
        "response": result["messages"][-1].content,
        "cost": {
            "total_tokens": cb.total_tokens,
            "prompt_tokens": cb.prompt_tokens,
            "completion_tokens": cb.completion_tokens,
            "total_cost": cb.total_cost
        }
    }
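Those per-call numbers can feed a simple budget guard so the "cost limits" item on the checklist below is enforced in code. A minimal in-memory sketch (a real deployment would persist spend in a database or metrics store and reset it daily):
class BudgetGuard:
    """Tracks cumulative spend and blocks calls once a daily budget is hit."""

    def __init__(self, daily_budget_usd: float = 50.0):
        self.daily_budget_usd = daily_budget_usd
        self.spent_today = 0.0

    def can_spend(self) -> bool:
        return self.spent_today < self.daily_budget_usd

    def record(self, cost_usd: float) -> None:
        self.spent_today += cost_usd

budget = BudgetGuard(daily_budget_usd=50.0)

def guarded_agent_call(query: str) -> dict:
    """Refuse new calls once today's budget is exhausted."""
    if not budget.can_spend():
        return {"error": "Daily budget exhausted"}
    result = tracked_agent_call(query)
    budget.record(result["cost"]["total_cost"])
    return result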
Cost Optimization Strategies
| Strategy | Savings | Trade-off |
|---|---|---|
| Use smaller models | 50-90% | Less capability |
| Cache responses | 20-50% | Stale data |
| Limit context | 30-60% | Less history |
| Batch requests | 10-20% | Added latency |
| Rate limiting | Variable | User friction |
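Of these, response caching is the easiest win for repeated queries. A minimal in-process sketch keyed on the normalized query text; production setups typically use Redis or a semantic cache instead:
import hashlib

_response_cache: dict[str, str] = {}

def cached_agent_call(query: str) -> str:
    """Return a cached response for repeated queries; otherwise call the agent."""
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key in _response_cache:
        return _response_cache[key]
    result = graph.invoke({"messages": [{"role": "user", "content": query}]})
    answer = result["messages"][-1].content
    _response_cache[key] = answer
    return answer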
Monitoring Dashboard
Track these metrics:
| Metric | Target | Alert Threshold |
|---|---|---|
| Response latency (p95) | < 5s | > 10s |
| Success rate | > 99% | < 95% |
| Cost per request | < $0.05 | > $0.10 |
| Tool call success | > 95% | < 90% |
| User satisfaction | > 4.0/5 | < 3.5/5 |
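Your observability stack normally computes these, but a rough in-process tracker is enough to get started. A sketch that keeps a sliding window of outcomes and flags breaches of the alert thresholds above:
import statistics

class MetricsTracker:
    """Keeps recent request outcomes and flags alert-threshold breaches."""

    def __init__(self, window: int = 500):
        self.window = window
        self.latencies: list[float] = []
        self.successes: list[bool] = []

    def record(self, latency_s: float, success: bool) -> None:
        self.latencies = (self.latencies + [latency_s])[-self.window:]
        self.successes = (self.successes + [success])[-self.window:]

    def p95_latency(self) -> float:
        # statistics.quantiles with n=20 yields the 95th percentile as the last cut point
        return statistics.quantiles(self.latencies, n=20)[-1] if len(self.latencies) >= 20 else 0.0

    def success_rate(self) -> float:
        return sum(self.successes) / len(self.successes) if self.successes else 1.0

    def alerts(self) -> list[str]:
        issues = []
        if self.p95_latency() > 10:
            issues.append("p95 latency above 10s")
        if self.success_rate() < 0.95:
            issues.append("success rate below 95%")
        return issues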
Production Checklist
Before deploying:
- Input validation on all user inputs
- Rate limiting configured
- Tool execution sandboxed
- Secrets in environment variables (not code)
- Logging configured (LangSmith or equivalent)
- Error handling for all tool failures
- Cost limits set
- Evaluation suite passing
- Health check endpoint working
- Monitoring alerts configured
- Rollback plan documented
Summary
| Area | Key Takeaway |
|---|---|
| Observability | Use LangSmith for tracing |
| Evaluation | Test before every deploy |
| Security | Never trust user input |
| Deployment | Docker + Kubernetes |
| Cost | Track and limit spending |
Series Complete!
Congratulations! You’ve completed the AI Agents Mastery Series. You now know how to:
- Build agents from scratch
- Use LangGraph for complex workflows
- Run agents locally with Ollama
- Integrate real-world tools
- Build multi-agent systems
- Deploy to production
What’s Next?
- Join the LangChain Discord community
- Explore LangGraph documentation
- Build your own agent project!
Full Code Repository
git clone https://github.com/Moshiour027/ai-agents-mastery.git
cd ai-agents-mastery/06-production
pip install -r requirements.txt
python main.py
This concludes the AI Agents Mastery Series. Go build something amazing!
Moshiour Rahman
Software Architect & AI Engineer
Enterprise software architect with deep expertise in financial systems, distributed architecture, and AI-powered applications. Building large-scale systems at Fortune 500 companies. Specializing in LLM orchestration, multi-agent systems, and cloud-native solutions. I share battle-tested patterns from real enterprise projects.