Prometheus and Grafana: Complete Monitoring Stack Tutorial
Set up production monitoring with Prometheus and Grafana. Learn metrics collection, alerting, dashboards, and best practices for observability.
Moshiour Rahman
What is Prometheus?
Prometheus is an open-source monitoring system that scrapes metrics over HTTP from configured targets (a pull model), stores them in its own time-series database, and lets you query them with PromQL and define alerts on them. Grafana sits on top as the visualization layer, turning those metrics into dashboards.
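Each target simply exposes its current metric values as plain text (usually on a /metrics endpoint), and Prometheus scrapes that page on a schedule. As an illustration, a scrape response looks roughly like this (the metric names here are just examples):

# HELP http_requests_total Total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="POST",status="500"} 3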
Monitoring Stack Components
| Component | Purpose |
|---|---|
| Prometheus | Metrics collection and storage |
| Grafana | Visualization and dashboards |
| Alertmanager | Alert routing and notifications |
| Exporters | Expose metrics from systems |
Setup with Docker Compose
Complete Stack
# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts:/etc/prometheus/alerts
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.enable-lifecycle'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
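To bring the stack up, run `docker compose up -d` from the directory containing this file. Prometheus is then reachable at http://localhost:9090, Grafana at http://localhost:3000 (log in with the admin credentials above and change them for anything beyond a local test), and Alertmanager at http://localhost:9093.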
Prometheus Configuration
prometheus.yml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/alerts/*.yml

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # Docker containers
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  # Your application
  - job_name: 'myapp'
    static_configs:
      - targets: ['app:8080']
    metrics_path: '/actuator/prometheus'

  # Multiple targets with labels
  - job_name: 'web-servers'
    static_configs:
      - targets: ['server1:9100', 'server2:9100']
        labels:
          env: 'production'
      - targets: ['server3:9100']
        labels:
          env: 'staging'
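Because the Prometheus container is started with `--web.enable-lifecycle`, changes to this file can be applied without restarting the container by sending `curl -X POST http://localhost:9090/-/reload`.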
Service Discovery
# Kubernetes service discovery
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

  # EC2 service discovery
  - job_name: 'ec2-instances'
    ec2_sd_configs:
      - region: us-east-1
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name
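For the Kubernetes job above to actually scrape a pod, the pod has to carry the annotations that the relabel rules look for. A minimal sketch of the relevant pod metadata (the annotation keys mirror the __meta_kubernetes_pod_annotation_prometheus_io_* labels referenced above; they are a convention, not something Kubernetes defines):

# Pod template metadata (illustrative)
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/actuator/prometheus"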
PromQL Queries
Basic Queries
# Current CPU usage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
# Disk usage
100 - ((node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"})
# Network traffic (bytes/sec)
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])
Application Metrics
# HTTP request rate
rate(http_requests_total[5m])
# Request rate by status code
sum by(status) (rate(http_requests_total[5m]))
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Average response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
Aggregations
# Sum across all instances
sum(rate(http_requests_total[5m]))
# Average by label
avg by(instance) (rate(cpu_usage[5m]))
# Max value
max(node_memory_MemAvailable_bytes)
# Count number of targets
count(up)
# Top 5 by memory
topk(5, node_memory_MemAvailable_bytes)
Alerting Rules
prometheus/alerts/alerts.yml
groups:
  - name: node_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current: {{ $value }}%)"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current: {{ $value }}%)"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 15% (current: {{ $value }}%)"

      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"

  - name: application_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100 > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 5% (current: {{ $value }}%)"

      - alert: HighResponseTime
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time"
          description: "95th percentile response time is above 1s (current: {{ $value }}s)"
Alertmanager Configuration
alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'app-password'
  slack_api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: true
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@example.com'
  - name: 'critical-alerts'
    slack_configs:
      - channel: '#alerts-critical'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
  - name: 'warning-alerts'
    slack_configs:
      - channel: '#alerts-warning'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
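The SMTP password, Slack webhook URL, and PagerDuty service key above are placeholders; substitute real credentials, ideally injected as secrets rather than committed to the file. Alertmanager also ships an amtool CLI that can validate this file (e.g. `amtool check-config alertmanager.yml`) before you restart the container.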
Spring Boot Integration
Dependencies
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
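No <version> elements are needed here as long as the project inherits the Spring Boot parent (or imports its BOM), which manages the Micrometer and Actuator versions.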
Configuration
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
  metrics:
    export:
      prometheus:
        enabled: true
    tags:
      application: ${spring.application.name}
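With this in place the application serves its metrics at /actuator/prometheus, which is exactly the metrics_path the myapp scrape job above points at. Note that on Spring Boot 3.x the export flag has moved to management.prometheus.metrics.export.enabled; the layout shown here matches Spring Boot 2.x.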
Custom Metrics
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import jakarta.annotation.PostConstruct; // javax.annotation.PostConstruct on Spring Boot 2.x
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Component;

@Component
@RequiredArgsConstructor
public class OrderMetrics {

    private final MeterRegistry meterRegistry;

    private Counter orderCounter;
    private Timer orderProcessingTimer;

    @PostConstruct
    public void init() {
        orderCounter = Counter.builder("orders_total")
                .description("Total number of orders")
                .tag("type", "all")
                .register(meterRegistry);

        orderProcessingTimer = Timer.builder("order_processing_time")
                .description("Order processing time")
                .register(meterRegistry);
    }

    public void recordOrder(String status) {
        // register() returns the existing meter for an already-registered
        // name/tag combination, so calling it per request is cheap.
        Counter.builder("orders_total")
                .tag("status", status)
                .register(meterRegistry)
                .increment();
    }

    public void recordProcessingTime(Runnable task) {
        orderProcessingTimer.record(task);
    }
}
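Once the application is scraped, these custom meters can be queried like any built-in metric, for example `sum by(status) (rate(orders_total[5m]))` for order throughput per status. The exact exposed names may differ slightly, since Micrometer applies Prometheus naming conventions such as adding unit suffixes to timers.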
Grafana Dashboards
Provisioning Dashboards
# grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    options:
      path: /etc/grafana/provisioning/dashboards
Data Source Provisioning
# grafana/provisioning/datasources/datasource.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
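With these two files mounted through the ./grafana/provisioning volume in the Compose file, Grafana registers the Prometheus data source and loads any dashboard JSON it finds under the configured path at startup, so nothing has to be clicked together in the UI.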
Node Exporter Dashboard JSON
{
  "dashboard": {
    "title": "Node Exporter",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "{{ instance }}"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "gauge",
        "targets": [
          {
            "expr": "100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))"
          }
        ]
      }
    ]
  }
}
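This JSON is a deliberately trimmed-down sketch showing only the panel structure and queries; in practice most teams start from a community dashboard (for example the widely used "Node Exporter Full" dashboard on grafana.com), import it by ID, and customize from there.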
Best Practices
Naming Conventions
# Metric naming
<namespace>_<name>_<unit>_total # Counter
<namespace>_<name>_<unit> # Gauge
<namespace>_<name>_<unit>_bucket # Histogram (also exposes _sum and _count)
# Examples
http_requests_total
http_request_duration_seconds
node_memory_MemAvailable_bytes
Labels Best Practices
# Good labels (low cardinality)
http_requests_total{method="GET", status="200", endpoint="/api/users"}
# Bad labels (high cardinality - avoid!)
http_requests_total{user_id="12345"} # Don't use unique IDs
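The reason cardinality matters: every distinct combination of label values becomes its own time series that Prometheus has to store and index, so an unbounded label such as a user ID or request ID can quietly turn one metric into millions of series.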
Summary
| Component | Purpose |
|---|---|
| Prometheus | Pull-based metrics collection |
| Grafana | Dashboards and visualization |
| Alertmanager | Alert routing and silencing |
| Node Exporter | Host-level metrics |
| cAdvisor | Container metrics |
Prometheus and Grafana provide a powerful, scalable monitoring solution for modern applications.