Production AI Agents: Observability, Evaluation & Deployment
Deploy AI agents to production with confidence. Learn monitoring with LangSmith, evaluation strategies, security best practices, and scalable deployment patterns.
Moshiour Rahman
AI Agents Mastery Series
This is Part 6 of our comprehensive AI Agents series—the final chapter.
| Part | Topic | Level |
|---|---|---|
| 1 | Fundamentals - Build from Scratch | Beginner |
| 2 | LangGraph Deep Dive | Intermediate |
| 3 | Local LLMs with Ollama | Intermediate |
| 4 | Tool-Using Agents | Intermediate |
| 5 | Multi-Agent Systems | Advanced |
| 6 | Production Deployment | Advanced |
The Production Challenge
Development agents work on your laptop. Production agents need:
| Development | Production |
|---|---|
| "It works!" | 99.9% uptime |
| print() debugging | Structured logging |
| Manual testing | Automated evaluation |
| Trust the LLM | Verify everything |
| Single user | Thousands concurrent |
| No security | Defense in depth |
This guide covers everything you need to deploy agents responsibly.
Observability with LangSmith
LangSmith is the observability platform built by the LangChain team and the most common choice for LangChain/LangGraph applications. It captures every step of your agent's execution.
Setup
pip install langsmith
# .env
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your-langsmith-key
LANGCHAIN_PROJECT=my-agent-production
Automatic Tracing
Once configured, LangChain/LangGraph automatically traces all operations:
# traced_agent.py
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, START, END
from typing import Annotated, TypedDict
from langgraph.graph.message import add_messages
load_dotenv()
# LangSmith automatically traces when LANGCHAIN_TRACING_V2=true
class State(TypedDict):
    messages: Annotated[list, add_messages]

llm = ChatOpenAI(model="gpt-4o-mini")

def agent(state: State) -> State:
    response = llm.invoke(state["messages"])
    return {"messages": [response]}
builder = StateGraph(State)
builder.add_node("agent", agent)
builder.add_edge(START, "agent")
builder.add_edge("agent", END)
graph = builder.compile()
# Every invocation is traced to LangSmith
result = graph.invoke({"messages": [{"role": "user", "content": "Hello"}]})
Custom Trace Metadata
Add context to your traces:
from langsmith import traceable
@traceable(
    name="customer_support_agent",
    tags=["production", "customer-facing"],
    metadata={"version": "1.0.0"}
)
def handle_customer_query(query: str, customer_id: str) -> str:
    """Handle a customer support query with full tracing."""
    result = graph.invoke({
        "messages": [{"role": "user", "content": query}]
    })
    return result["messages"][-1].content

# Usage
response = handle_customer_query(
    query="How do I reset my password?",
    customer_id="cust_123"
)
What LangSmith Captures
| Data | Why It Matters |
|---|---|
| Full message chain | Debug conversation flow |
| Token counts | Cost tracking |
| Latency per step | Performance optimization |
| Tool calls & results | Debug tool interactions |
| Errors & stack traces | Quick issue resolution |
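The same data can be pulled back programmatically for dashboards or offline analysis. A minimal sketch using the langsmith SDK's Client.list_runs; the exact Run fields available can vary by SDK version:
from langsmith import Client

client = Client()

# Pull recent runs from the production project and print basic stats
for run in client.list_runs(project_name="my-agent-production", limit=20):
    latency = (run.end_time - run.start_time).total_seconds() if run.end_time else None
    print(run.name, run.total_tokens, latency, run.error)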
Evaluation Strategies
Why Evaluate?
LLMs are non-deterministic. The same input can produce different outputs. Evaluation ensures consistent quality.
Types of Evaluation
| Type | What It Tests | Example |
|---|---|---|
| Correctness | Right answer? | Math problems, factual questions |
| Relevance | On topic? | Response addresses the query |
| Faithfulness | Grounded in facts? | Claims supported by context |
| Harmlessness | Safe output? | No harmful content |
| Tool Usage | Correct tool selection? | Uses right tool for task |
Building an Evaluation Pipeline
# evaluation.py
import json
from langchain_openai import ChatOpenAI
from langsmith import Client
from langsmith.evaluation import evaluate
client = Client()
eval_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Create a dataset of test cases
test_cases = [
    {
        "input": "What is 25 * 4?",
        "expected": "100"
    },
    {
        "input": "What's the capital of France?",
        "expected": "Paris"
    },
    {
        "input": "Summarize: AI agents are autonomous programs.",
        "expected_contains": ["autonomous", "programs"]
    }
]

# Create dataset in LangSmith
dataset_name = "agent-eval-v1"
dataset = client.create_dataset(dataset_name)

for case in test_cases:
    client.create_example(
        inputs={"query": case["input"]},
        outputs={"expected": case.get("expected", case.get("expected_contains"))},
        dataset_id=dataset.id
    )

# Define evaluators
def correctness_evaluator(run, example):
    """Check if the output contains the expected answer."""
    output = run.outputs.get("output", "")
    expected = example.outputs.get("expected")
    if isinstance(expected, list):
        # Check if all expected terms are present
        score = all(term.lower() in output.lower() for term in expected)
    else:
        score = expected.lower() in output.lower()
    return {"score": 1 if score else 0, "key": "correctness"}
def relevance_evaluator(run, example):
    """Use an LLM to judge relevance."""
    query = example.inputs.get("query", "")
    output = run.outputs.get("output", "")
    response = eval_llm.invoke([{
        "role": "user",
        "content": f"""Rate the relevance of this response to the query.
Query: {query}
Response: {output}
Score 1 if relevant, 0 if not. Respond with just the number."""
    }])
    try:
        score = int(response.content.strip())
    except ValueError:
        score = 0
    return {"score": score, "key": "relevance"}
# Run evaluation
def run_evaluation(agent_func):
    """Evaluate an agent against the test dataset."""
    results = evaluate(
        agent_func,
        data=dataset_name,
        evaluators=[correctness_evaluator, relevance_evaluator],
        experiment_prefix="agent-v1"
    )
    return results
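The agent_func passed to run_evaluation should accept a dict of example inputs and return a dict of outputs containing the "output" key the evaluators above read. A minimal adapter around the graph compiled earlier:
def agent_target(inputs: dict) -> dict:
    """Run the agent on one dataset example and shape the output for the evaluators."""
    result = graph.invoke({
        "messages": [{"role": "user", "content": inputs["query"]}]
    })
    return {"output": result["messages"][-1].content}

# Kick off an experiment against the LangSmith dataset
results = run_evaluation(agent_target)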
Continuous Evaluation
Run evaluations on every deployment:
# .github/workflows/evaluate.yml
name: Agent Evaluation
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
        run: python -m pytest tests/eval/ -v
      - name: Check threshold
        run: |
          # Fail if correctness < 90%
          python scripts/check_eval_threshold.py --metric correctness --threshold 0.9
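The check_eval_threshold.py script itself isn't shown in the workflow; one possible sketch, assuming the evaluation step writes an eval_results.json file mapping metric names to average scores (a hypothetical format):
# scripts/check_eval_threshold.py (illustrative sketch)
import argparse
import json
import sys

def main() -> None:
    parser = argparse.ArgumentParser(description="Fail CI if an eval metric falls below a threshold.")
    parser.add_argument("--metric", required=True)
    parser.add_argument("--threshold", type=float, required=True)
    parser.add_argument("--results", default="eval_results.json")  # hypothetical results file
    args = parser.parse_args()

    with open(args.results) as f:
        scores = json.load(f)  # expects e.g. {"correctness": 0.93, "relevance": 0.88}

    score = scores.get(args.metric, 0.0)
    print(f"{args.metric}: {score:.2f} (threshold {args.threshold})")
    sys.exit(0 if score >= args.threshold else 1)

if __name__ == "__main__":
    main()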
Security Best Practices
Input Validation
Never trust user input:
# security.py
import re
from typing import Optional
class InputValidator:
    """Validate and sanitize user inputs."""

    MAX_LENGTH = 10000
    BLOCKED_PATTERNS = [
        r'ignore.*previous.*instructions',
        r'system.*prompt',
        r'<script>',
        r'javascript:',
    ]

    @classmethod
    def validate(cls, text: str) -> tuple[bool, Optional[str]]:
        """Validate input text.

        Returns:
            (is_valid, error_message)
        """
        if not text or not text.strip():
            return False, "Empty input"
        if len(text) > cls.MAX_LENGTH:
            return False, f"Input exceeds {cls.MAX_LENGTH} characters"
        text_lower = text.lower()
        for pattern in cls.BLOCKED_PATTERNS:
            if re.search(pattern, text_lower):
                return False, "Input contains blocked content"
        return True, None

    @classmethod
    def sanitize(cls, text: str) -> str:
        """Sanitize input text."""
        # Remove null bytes
        text = text.replace('\x00', '')
        # Limit length
        text = text[:cls.MAX_LENGTH]
        # Remove control characters (except newlines, tabs)
        text = ''.join(char for char in text if char.isprintable() or char in '\n\t')
        return text.strip()

# Usage in agent
def secure_agent(user_input: str) -> str:
    is_valid, error = InputValidator.validate(user_input)
    if not is_valid:
        return f"Invalid input: {error}"
    sanitized = InputValidator.sanitize(user_input)
    return process_with_agent(sanitized)
Tool Execution Safety
# secure_tools.py
from langchain_core.tools import tool
import subprocess
import os
SAFE_DIRECTORIES = ["/tmp/agent_workspace", "/app/data"]
BLOCKED_COMMANDS = ["rm", "sudo", "chmod", "chown", "curl", "wget"]
@tool
def secure_file_read(filepath: str) -> str:
    """Securely read a file with path validation."""
    # Resolve to absolute path
    abs_path = os.path.abspath(filepath)
    # Check against allowed directories
    if not any(abs_path.startswith(safe_dir) for safe_dir in SAFE_DIRECTORIES):
        return f"Access denied: {filepath} is outside allowed directories"
    # Prevent path traversal
    if ".." in filepath:
        return "Access denied: Path traversal detected"
    try:
        with open(abs_path, 'r') as f:
            content = f.read(100000)  # Limit read size
        return content
    except Exception as e:
        return f"Error: {str(e)}"

@tool
def secure_shell(command: str) -> str:
    """Execute shell command with restrictions."""
    cmd_parts = command.split()
    if not cmd_parts:
        return "No command provided"
    # Check for blocked commands
    if cmd_parts[0] in BLOCKED_COMMANDS:
        return f"Command '{cmd_parts[0]}' is not allowed"
    # Check for shell injection patterns
    dangerous = ['|', ';', '&', '`', '$', '>', '<']
    if any(char in command for char in dangerous):
        return "Command contains disallowed characters"
    try:
        result = subprocess.run(
            cmd_parts,
            capture_output=True,
            text=True,
            timeout=30,
            cwd="/tmp/agent_workspace"
        )
        return result.stdout or result.stderr or "Command completed"
    except subprocess.TimeoutExpired:
        return "Command timed out"
    except Exception as e:
        return f"Error: {str(e)}"
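Because these are ordinary LangChain tools, you can sanity-check the guardrails directly before wiring them into an agent (assuming /tmp/agent_workspace exists):
# Allowed: runs inside the sandbox working directory
print(secure_shell.invoke({"command": "ls -la"}))

# Blocked: command is on the deny list
print(secure_shell.invoke({"command": "rm -rf /"}))

# Blocked: path is outside the allowed directories
print(secure_file_read.invoke({"filepath": "/etc/passwd"}))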
Rate Limiting
# rate_limiting.py
import time
from collections import defaultdict
from functools import wraps
class RateLimiter:
    """Simple in-memory rate limiter."""

    def __init__(self, requests_per_minute: int = 60):
        self.requests_per_minute = requests_per_minute
        self.requests = defaultdict(list)

    def is_allowed(self, user_id: str) -> bool:
        """Check if request is allowed."""
        now = time.time()
        minute_ago = now - 60
        # Clean old requests
        self.requests[user_id] = [
            ts for ts in self.requests[user_id]
            if ts > minute_ago
        ]
        # Check limit
        if len(self.requests[user_id]) >= self.requests_per_minute:
            return False
        self.requests[user_id].append(now)
        return True

rate_limiter = RateLimiter(requests_per_minute=30)

def rate_limited(func):
    """Decorator to add rate limiting."""
    @wraps(func)
    def wrapper(user_id: str, *args, **kwargs):
        if not rate_limiter.is_allowed(user_id):
            return {"error": "Rate limit exceeded. Please try again later."}
        return func(user_id, *args, **kwargs)
    return wrapper

@rate_limited
def agent_endpoint(user_id: str, query: str) -> dict:
    """Rate-limited agent endpoint."""
    result = graph.invoke({"messages": [{"role": "user", "content": query}]})
    return {"response": result["messages"][-1].content}
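Note that this limiter is per-process: with several replicas (as in the Kubernetes deployment below), each pod enforces its own quota. A shared store such as Redis is the usual fix; a minimal fixed-window sketch, assuming a reachable Redis instance and the redis-py client:
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def is_allowed_shared(user_id: str, limit: int = 30, window_seconds: int = 60) -> bool:
    """Fixed-window rate limit shared across replicas via a Redis counter."""
    key = f"ratelimit:{user_id}"
    count = r.incr(key)              # atomically increment the per-user counter
    if count == 1:
        r.expire(key, window_seconds)  # start the window on the first request
    return count <= limit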
Deployment Patterns
Docker Deployment
# Dockerfile
FROM python:3.11-slim

WORKDIR /app

# curl is needed for the health check below (not included in the slim base image)
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# Create non-root user
RUN useradd -m appuser && chown -R appuser:appuser /app
USER appuser

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
FastAPI Service
# main.py
from fastapi import FastAPI, HTTPException, Depends
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
import os
app = FastAPI(title="AI Agent API", version="1.0.0")

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://yourdomain.com"],
    allow_methods=["POST"],
    allow_headers=["Authorization", "Content-Type"],
)

class AgentRequest(BaseModel):
    query: str
    user_id: str

class AgentResponse(BaseModel):
    response: str
    trace_id: str

@app.get("/health")
def health_check():
    return {"status": "healthy"}

@app.post("/agent", response_model=AgentResponse)
async def run_agent(request: AgentRequest):
    # Validate input
    is_valid, error = InputValidator.validate(request.query)
    if not is_valid:
        raise HTTPException(status_code=400, detail=error)
    # Rate limit
    if not rate_limiter.is_allowed(request.user_id):
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    try:
        result = graph.invoke({
            "messages": [{"role": "user", "content": request.query}]
        })
        return AgentResponse(
            response=result["messages"][-1].content,
            trace_id="trace_xxx"  # From LangSmith
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail="Agent error")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
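A quick smoke test against a locally running instance, using the requests library; the payload mirrors the AgentRequest model above:
import requests

resp = requests.post(
    "http://localhost:8000/agent",
    json={"query": "How do I reset my password?", "user_id": "cust_123"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["response"])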
Kubernetes Deployment
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-agent
  template:
    metadata:
      labels:
        app: ai-agent
    spec:
      containers:
        - name: agent
          image: your-registry/ai-agent:latest
          ports:
            - containerPort: 8000
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: agent-secrets
                  key: openai-key
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: ai-agent
spec:
  selector:
    app: ai-agent
  ports:
    - port: 80
      targetPort: 8000
  type: LoadBalancer
Cost Management
Token Tracking
# cost_tracking.py
from langchain_openai import ChatOpenAI
from langchain.callbacks import get_openai_callback
def tracked_agent_call(query: str) -> dict:
    """Run agent with cost tracking."""
    with get_openai_callback() as cb:
        result = graph.invoke({
            "messages": [{"role": "user", "content": query}]
        })
    return {
        "response": result["messages"][-1].content,
        "cost": {
            "total_tokens": cb.total_tokens,
            "prompt_tokens": cb.prompt_tokens,
            "completion_tokens": cb.completion_tokens,
            "total_cost": cb.total_cost
        }
    }
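Those per-call numbers can feed a simple budget guard so the "cost limits" item on the checklist below is enforced in code. A minimal in-memory sketch (a real deployment would persist spend in a database or metrics store and reset it daily):
class BudgetGuard:
    """Tracks cumulative spend and blocks calls once a daily budget is hit."""

    def __init__(self, daily_budget_usd: float = 50.0):
        self.daily_budget_usd = daily_budget_usd
        self.spent_today = 0.0

    def can_spend(self) -> bool:
        return self.spent_today < self.daily_budget_usd

    def record(self, cost_usd: float) -> None:
        self.spent_today += cost_usd

budget = BudgetGuard(daily_budget_usd=50.0)

def guarded_agent_call(query: str) -> dict:
    """Refuse new calls once today's budget is exhausted."""
    if not budget.can_spend():
        return {"error": "Daily budget exhausted"}
    result = tracked_agent_call(query)
    budget.record(result["cost"]["total_cost"])
    return result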
Cost Optimization Strategies
| Strategy | Savings | Trade-off |
|---|---|---|
| Use smaller models | 50-90% | Less capability |
| Cache responses | 20-50% | Stale data |
| Limit context | 30-60% | Less history |
| Batch requests | 10-20% | Added latency |
| Rate limiting | Variable | User friction |
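Of these, response caching is the easiest win for repeated queries. A minimal in-process sketch keyed on the normalized query text; production setups typically use Redis or a semantic cache instead:
import hashlib

_response_cache: dict[str, str] = {}

def cached_agent_call(query: str) -> str:
    """Return a cached response for repeated queries; otherwise call the agent."""
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key in _response_cache:
        return _response_cache[key]
    result = graph.invoke({"messages": [{"role": "user", "content": query}]})
    answer = result["messages"][-1].content
    _response_cache[key] = answer
    return answer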
Monitoring Dashboard
Track these metrics:
| Metric | Target | Alert Threshold |
|---|---|---|
| Response latency (p95) | < 5s | > 10s |
| Success rate | > 99% | < 95% |
| Cost per request | < $0.05 | > $0.10 |
| Tool call success | > 95% | < 90% |
| User satisfaction | > 4.0/5 | < 3.5/5 |
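Your observability stack normally computes these, but a rough in-process tracker is enough to get started. A sketch that keeps a sliding window of outcomes and flags breaches of the alert thresholds above:
import statistics

class MetricsTracker:
    """Keeps recent request outcomes and flags alert-threshold breaches."""

    def __init__(self, window: int = 500):
        self.window = window
        self.latencies: list[float] = []
        self.successes: list[bool] = []

    def record(self, latency_s: float, success: bool) -> None:
        self.latencies = (self.latencies + [latency_s])[-self.window:]
        self.successes = (self.successes + [success])[-self.window:]

    def p95_latency(self) -> float:
        # statistics.quantiles with n=20 yields the 95th percentile as the last cut point
        return statistics.quantiles(self.latencies, n=20)[-1] if len(self.latencies) >= 20 else 0.0

    def success_rate(self) -> float:
        return sum(self.successes) / len(self.successes) if self.successes else 1.0

    def alerts(self) -> list[str]:
        issues = []
        if self.p95_latency() > 10:
            issues.append("p95 latency above 10s")
        if self.success_rate() < 0.95:
            issues.append("success rate below 95%")
        return issues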
Production Checklist
Before deploying:
- Input validation on all user inputs
- Rate limiting configured
- Tool execution sandboxed
- Secrets in environment variables (not code)
- Logging configured (LangSmith or equivalent)
- Error handling for all tool failures
- Cost limits set
- Evaluation suite passing
- Health check endpoint working
- Monitoring alerts configured
- Rollback plan documented
Summary
| Area | Key Takeaway |
|---|---|
| Observability | Use LangSmith for tracing |
| Evaluation | Test before every deploy |
| Security | Never trust user input |
| Deployment | Docker + Kubernetes |
| Cost | Track and limit spending |
Series Complete!
Congratulations! You’ve completed the AI Agents Mastery Series. You now know how to:
- Build agents from scratch
- Use LangGraph for complex workflows
- Run agents locally with Ollama
- Integrate real-world tools
- Build multi-agent systems
- Deploy to production
What’s Next?
- Join the LangChain Discord community
- Explore LangGraph documentation
- Build your own agent project!
Full Code Repository
git clone https://github.com/Moshiour027/ai-agents-mastery.git
cd ai-agents-mastery/06-production
pip install -r requirements.txt
python main.py
This concludes the AI Agents Mastery Series. Go build something amazing!
Moshiour Rahman
Software Architect & AI Engineer
Enterprise software architect with deep expertise in financial systems, distributed architecture, and AI-powered applications. Building large-scale systems at Fortune 500 companies. Specializing in LLM orchestration, multi-agent systems, and cloud-native solutions. I share battle-tested patterns from real enterprise projects.