Monitoring Systems
Build comprehensive monitoring systems for CI/CD pipelines, applications, and infrastructure with modern observability tools.
Monitoring Architecture Overview
A well-designed monitoring architecture provides end-to-end visibility across applications, infrastructure, and business metrics.
Modern Monitoring Stack
A typical modern stack combines Prometheus for metrics collection, Alertmanager for alert routing, Grafana for dashboards, and complementary backends such as Loki for logs and Jaeger for traces. The sections below build this stack step by step.
Three Pillars of Observability
1. Metrics
- Quantitative Measurements: Numerical data points over time
- Use Cases: Performance tracking, capacity planning, alerting
- Examples: CPU usage, request rate, response time, error rate
2. Logs
- Event Records: Detailed event information and context
- Use Cases: Debugging, audit trails, troubleshooting
- Examples: Application logs, access logs, error logs
3. Traces
- Request Flows: End-to-end request journey across services
- Use Cases: Performance analysis, bottleneck identification
- Examples: Distributed traces, span data, service dependencies
Prometheus Monitoring
Prometheus is a powerful metrics collection and monitoring system designed for reliability and scalability.
Prometheus Setup
1. Prometheus Configuration
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'production'
region: 'us-west-2'
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- 'alertmanager:9093'
# Load alerting rules
rule_files:
- "alerts/*.yml"
# Scrape configurations
scrape_configs:
# Prometheus self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node Exporter (system metrics)
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: '([^:]+):.*'
replacement: '${1}'
# Application metrics
- job_name: 'myapp'
static_configs:
- targets: ['myapp:8080']
metrics_path: '/actuator/prometheus'
scrape_interval: 5s
# Kubernetes pods
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
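For the kubernetes-pods job above to pick anything up, the target pods must carry the prometheus.io/* annotations referenced in the relabel rules. A minimal sketch of a Deployment with those annotations (the name, image, and port are illustrative; the path matches the /actuator/prometheus endpoint used elsewhere in this section):
# myapp-deployment.yaml (illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        prometheus.io/scrape: "true"                 # keep rule: only annotated pods are scraped
        prometheus.io/path: "/actuator/prometheus"   # overrides __metrics_path__
        prometheus.io/port: "8080"                   # rewritten into __address__
    spec:
      containers:
        - name: myapp
          image: example/myapp:latest
          ports:
            - containerPort: 8080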
2. Docker Compose Setup
# docker-compose.monitoring.yml
version: '3.8'
services:
prometheus:
image: prom/prometheus:latest
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- ./alerts:/etc/prometheus/alerts
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
- '--web.console.templates=/usr/share/prometheus/consoles'
- '--web.enable-lifecycle'
restart: unless-stopped
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
restart: unless-stopped
grafana:
image: grafana/grafana:latest
container_name: grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_USER=admin
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-simple-json-datasource
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
restart: unless-stopped
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
- alertmanager_data:/alertmanager
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
restart: unless-stopped
volumes:
prometheus_data:
grafana_data:
alertmanager_data:
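The Grafana provisioning shown later also registers Loki and Jaeger data sources (loki:3100 and jaeger:16686). If you want those backends in the same stack, here is a sketch of two extra services to merge under services: in the compose file above; the images, ports, and config path are the commonly documented defaults, so adjust them to your versions. Both run with ephemeral storage here, which is fine for local experimentation only.
# Additional services for docker-compose.monitoring.yml (merge under services:)
  loki:
    image: grafana/loki:latest
    container_name: loki
    ports:
      - "3100:3100"
    command:
      - '-config.file=/etc/loki/local-config.yaml'   # default config shipped in the image
    restart: unless-stopped

  jaeger:
    image: jaegertracing/all-in-one:latest
    container_name: jaeger
    environment:
      - COLLECTOR_OTLP_ENABLED=true    # accept OTLP traces from instrumented apps
    ports:
      - "16686:16686"   # Jaeger UI (matches the Grafana data source URL)
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP
    restart: unless-stopped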
Application Instrumentation
1. Node.js Application Metrics
// metrics.js - Prometheus metrics for Node.js
const promClient = require('prom-client');
const express = require('express');
// Create a Registry to register metrics
const register = new promClient.Registry();
// Add default metrics
promClient.collectDefaultMetrics({ register });
// Custom metrics
const httpRequestDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.1, 0.5, 1, 2, 5]
});
const httpRequestTotal = new promClient.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code']
});
const activeConnections = new promClient.Gauge({
name: 'active_connections',
help: 'Number of active connections'
});
const businessMetrics = {
ordersCreated: new promClient.Counter({
name: 'orders_created_total',
help: 'Total number of orders created',
labelNames: ['status']
}),
orderValue: new promClient.Histogram({
name: 'order_value_dollars',
help: 'Value of orders in dollars',
buckets: [10, 50, 100, 500, 1000, 5000]
}),
activeUsers: new promClient.Gauge({
name: 'active_users',
help: 'Number of currently active users'
})
};
// Register custom metrics
register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
register.registerMetric(activeConnections);
Object.values(businessMetrics).forEach(metric => register.registerMetric(metric));
// Middleware to collect HTTP metrics
function metricsMiddleware(req, res, next) {
const start = Date.now();
activeConnections.inc();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
const route = req.route ? req.route.path : req.path;
httpRequestDuration
.labels(req.method, route, res.statusCode)
.observe(duration);
httpRequestTotal
.labels(req.method, route, res.statusCode)
.inc();
activeConnections.dec();
});
next();
}
// Metrics endpoint (register.metrics() returns a Promise in prom-client v13+)
async function metricsEndpoint(req, res) {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
}
module.exports = {
register,
metricsMiddleware,
metricsEndpoint,
businessMetrics
};
2. Application Integration
// app.js
const express = require('express');
const { metricsMiddleware, metricsEndpoint, businessMetrics } = require('./metrics');
const app = express();
// Add metrics middleware
app.use(metricsMiddleware);
// Metrics endpoint for Prometheus
app.get('/metrics', metricsEndpoint);
// Business logic with metrics
app.post('/orders', async (req, res) => {
try {
const order = await createOrder(req.body);
// Track business metrics
businessMetrics.ordersCreated.labels('success').inc();
businessMetrics.orderValue.observe(order.total);
res.json({ success: true, order });
} catch (error) {
businessMetrics.ordersCreated.labels('failed').inc();
res.status(500).json({ error: error.message });
}
});
// Health check endpoint
app.get('/health', (req, res) => {
res.json({
status: 'healthy',
timestamp: new Date().toISOString(),
uptime: process.uptime()
});
});
app.listen(8080, () => {
console.log('Server started on port 8080');
});
3. Java Spring Boot Metrics
// MetricsConfiguration.java
package com.example.myapp.config;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
@Configuration
public class MetricsConfiguration {
@Bean
public Counter orderCreatedCounter(MeterRegistry registry) {
return Counter.builder("orders.created")
.description("Total number of orders created")
.tags("status", "success")
.register(registry);
}
@Bean
public Timer orderProcessingTimer(MeterRegistry registry) {
return Timer.builder("orders.processing.time")
.description("Time taken to process orders")
.register(registry);
}
}
// OrderService.java
package com.example.myapp.service;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;
@Service
public class OrderService {
    private final Counter orderCreatedCounter;
    private final Timer orderProcessingTimer;
    // MeterRegistry is needed to register the "failed" counter variant below
    private final MeterRegistry meterRegistry;

    public OrderService(Counter orderCreatedCounter, Timer orderProcessingTimer,
                        MeterRegistry meterRegistry) {
        this.orderCreatedCounter = orderCreatedCounter;
        this.orderProcessingTimer = orderProcessingTimer;
        this.meterRegistry = meterRegistry;
    }
public Order createOrder(OrderRequest request) {
return orderProcessingTimer.record(() -> {
try {
Order order = processOrder(request);
orderCreatedCounter.increment();
return order;
} catch (Exception e) {
// Track failed orders
Counter.builder("orders.created")
.tag("status", "failed")
.register(meterRegistry)
.increment();
throw e;
}
});
}
private Order processOrder(OrderRequest request) {
// Business logic here
return new Order();
}
}
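For the Prometheus scrape job shown earlier (metrics_path: '/actuator/prometheus') to find these metrics, the Spring Boot application also needs spring-boot-starter-actuator and micrometer-registry-prometheus on the classpath and must expose the prometheus endpoint. A minimal application.yml sketch; the property names are standard Spring Boot, while the application tag value is an assumption:
# application.yml
# Requires spring-boot-starter-actuator and micrometer-registry-prometheus dependencies
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  metrics:
    tags:
      application: myapp   # adds an application label to every metric (illustrative name)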
Alert Rules
1. Prometheus Alert Rules
# alerts/application-alerts.yml
groups:
- name: application_alerts
interval: 30s
rules:
# High error rate alert
- alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
component: application
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
runbook_url: "https://docs.example.com/runbooks/high-error-rate"
# Slow response time alert
- alert: SlowResponseTime
expr: |
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
) > 2
for: 10m
labels:
severity: warning
component: application
annotations:
summary: "Slow response time detected"
description: "95th percentile response time is {{ $value }}s for {{ $labels.route }}"
      # High memory usage alert
      # NOTE: this divides a per-process metric by a node-level metric; the ratio only
      # matches when both series share the same instance label (adjust with on()/group_left
      # or alert on each metric separately).
- alert: HighMemoryUsage
expr: |
process_resident_memory_bytes / node_memory_MemTotal_bytes > 0.85
for: 5m
labels:
severity: warning
component: infrastructure
annotations:
summary: "High memory usage"
description: "Memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
# Pod crash loop alert
- alert: PodCrashLooping
expr: |
rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 15m
labels:
severity: critical
component: kubernetes
annotations:
summary: "Pod is crash looping"
description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is crash looping"
# Database connection pool exhaustion
- alert: DatabaseConnectionPoolExhaustion
expr: |
hikaricp_connections_active / hikaricp_connections_max > 0.9
for: 5m
labels:
severity: critical
component: database
annotations:
summary: "Database connection pool near exhaustion"
description: "Connection pool usage is {{ $value | humanizePercentage }}"
- name: business_alerts
interval: 60s
rules:
# Sudden drop in orders
- alert: OrdersDropped
expr: |
rate(orders_created_total[5m]) < 0.5 * rate(orders_created_total[1h] offset 1h)
for: 10m
labels:
severity: warning
component: business
annotations:
summary: "Sudden drop in order rate"
description: "Order rate has dropped by more than 50% compared to last hour"
# High order failure rate
- alert: HighOrderFailureRate
expr: |
rate(orders_created_total{status="failed"}[5m]) /
rate(orders_created_total[5m]) > 0.1
for: 5m
labels:
severity: critical
component: business
annotations:
summary: "High order failure rate"
description: "Order failure rate is {{ $value | humanizePercentage }}"
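The same expressions are often precomputed with recording rules so dashboards and alerts stay cheap and consistent. A sketch following the RED pattern, using the HTTP metrics defined earlier; the rule names follow a level:metric:operation convention of my own choosing, and the file lands in the alerts/ directory already matched by rule_files:
# alerts/recording-rules.yml
groups:
  - name: red_recording_rules
    interval: 30s
    rules:
      # Rate: requests per second, per job
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Errors: 5xx responses as a fraction of all requests
      - record: job:http_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
      # Duration: 95th percentile latency
      - record: job:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))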
2. Alertmanager Configuration
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
# Templates for notifications
templates:
- '/etc/alertmanager/templates/*.tmpl'
# Route configuration
route:
receiver: 'default'
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
routes:
# Critical alerts go to PagerDuty
- match:
severity: critical
receiver: 'pagerduty'
continue: true
    # All alerts also go to Slack (continue so the component routes below still match)
    - match_re:
        severity: ^(critical|warning)$
      receiver: 'slack'
      continue: true
# Business alerts
- match:
component: business
receiver: 'business-team'
# Infrastructure alerts
- match:
component: infrastructure
receiver: 'infrastructure-team'
# Inhibit rules
inhibit_rules:
# Inhibit warning if critical is firing
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'cluster', 'service']
# Receivers
receivers:
- name: 'default'
slack_configs:
- channel: '#alerts'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
send_resolved: true
- name: 'pagerduty'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
description: '{{ .GroupLabels.alertname }}'
details:
severity: '{{ .CommonLabels.severity }}'
component: '{{ .CommonLabels.component }}'
- name: 'slack'
slack_configs:
- channel: '#alerts'
color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Component:* {{ .Labels.component }}
{{ if .Annotations.runbook_url }}*Runbook:* {{ .Annotations.runbook_url }}{{ end }}
{{ end }}
send_resolved: true
- name: 'business-team'
email_configs:
- to: '[email protected]'
from: '[email protected]'
smarthost: 'smtp.example.com:587'
auth_username: '[email protected]'
auth_password: 'password'
headers:
Subject: 'Business Alert: {{ .GroupLabels.alertname }}'
- name: 'infrastructure-team'
email_configs:
- to: '[email protected]'
from: '[email protected]'
smarthost: 'smtp.example.com:587'
Grafana Dashboards
Grafana provides powerful visualization capabilities for metrics, logs, and traces.
Dashboard Configuration
1. Grafana Data Source Provisioning
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: true
jsonData:
timeInterval: 15s
- name: Loki
type: loki
access: proxy
url: http://loki:3100
editable: true
- name: Jaeger
type: jaeger
access: proxy
url: http://jaeger:16686
editable: true
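Dashboards can be provisioned the same way, so the JSON definitions below are loaded automatically instead of being imported by hand. A sketch, assuming the dashboard JSON files are mounted at /var/lib/grafana/dashboards (add that mount to the grafana service in the compose file):
# grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards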
2. Application Dashboard JSON
{
"dashboard": {
"title": "Application Performance Dashboard",
"tags": ["application", "performance"],
"timezone": "browser",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"targets": [
{
"expr": "rate(http_requests_total[5m])",
"legendFormat": "{{method}} {{route}}",
"refId": "A"
}
],
"yaxes": [
{
"format": "reqps",
"label": "Requests/sec"
}
]
},
{
"title": "Error Rate",
"type": "graph",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"targets": [
{
"expr": "rate(http_requests_total{status_code=~\"5..\"}[5m])",
"legendFormat": "{{method}} {{route}}",
"refId": "A"
}
],
"alert": {
"conditions": [
{
"evaluator": {
"params": [0.05],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": ["A", "5m", "now"]
},
"reducer": {
"params": [],
"type": "avg"
},
"type": "query"
}
]
}
},
{
"title": "Response Time (p95)",
"type": "graph",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"targets": [
{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "{{method}} {{route}}",
"refId": "A"
}
],
"yaxes": [
{
"format": "s",
"label": "Duration"
}
]
},
{
"title": "Active Connections",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"targets": [
{
"expr": "active_connections",
"refId": "A"
}
],
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"values": false,
"calcs": ["lastNotNull"]
}
}
},
{
"title": "Memory Usage",
"type": "graph",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
},
"targets": [
{
"expr": "process_resident_memory_bytes / 1024 / 1024",
"legendFormat": "Memory (MB)",
"refId": "A"
}
],
"yaxes": [
{
"format": "decmbytes",
"label": "Memory"
}
]
},
{
"title": "CPU Usage",
"type": "graph",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
},
"targets": [
{
"expr": "rate(process_cpu_seconds_total[5m]) * 100",
"legendFormat": "CPU %",
"refId": "A"
}
],
"yaxes": [
{
"format": "percent",
"label": "CPU"
}
]
}
],
"refresh": "30s",
"time": {
"from": "now-6h",
"to": "now"
}
}
}
3. Business Metrics Dashboard
{
"dashboard": {
"title": "Business Metrics Dashboard",
"tags": ["business", "metrics"],
"panels": [
{
"title": "Orders per Minute",
"type": "graph",
"targets": [
{
"expr": "rate(orders_created_total[1m]) * 60",
"legendFormat": "Orders/min"
}
]
},
{
"title": "Revenue per Hour",
"type": "graph",
"targets": [
{
"expr": "rate(order_value_dollars_sum[1h]) * 3600",
"legendFormat": "Revenue/hour"
}
]
},
{
"title": "Active Users",
"type": "stat",
"targets": [
{
"expr": "active_users"
}
]
},
{
"title": "Order Success Rate",
"type": "gauge",
"targets": [
{
"expr": "rate(orders_created_total{status=\"success\"}[5m]) / rate(orders_created_total[5m]) * 100"
}
],
"options": {
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": 0 },
{ "color": "yellow", "value": 90 },
{ "color": "green", "value": 95 }
]
}
}
}
}
]
}
}
Infrastructure Monitoring
Monitor infrastructure health, resource utilization, and system performance.
Kubernetes Monitoring
1. kube-state-metrics Deployment
# kube-state-metrics.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: kube-state-metrics
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: kube-state-metrics
template:
metadata:
labels:
app: kube-state-metrics
spec:
serviceAccountName: kube-state-metrics
containers:
- name: kube-state-metrics
image: k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.5.0
ports:
- containerPort: 8080
name: http-metrics
- containerPort: 8081
name: telemetry
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
timeoutSeconds: 5
readinessProbe:
httpGet:
path: /
port: 8081
initialDelaySeconds: 5
timeoutSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
name: kube-state-metrics
namespace: monitoring
labels:
app: kube-state-metrics
spec:
ports:
- name: http-metrics
port: 8080
targetPort: http-metrics
- name: telemetry
port: 8081
targetPort: telemetry
selector:
app: kube-state-metrics
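The Deployment above references serviceAccountName: kube-state-metrics, so that account and its cluster-wide read permissions must also exist. A minimal sketch of the ServiceAccount and binding; the ClusterRole itself (read-only access to pods, nodes, deployments, and the other resources kube-state-metrics exports) is long, so in practice take it from the upstream kube-state-metrics manifests:
# kube-state-metrics-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics   # use the ClusterRole from the upstream project manifests
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: monitoring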
2. Kubernetes Metrics Queries
# kubernetes-alerts.yml
groups:
- name: kubernetes_alerts
rules:
# Pod not ready
- alert: PodNotReady
        expr: |
          kube_pod_status_phase{phase=~"Pending|Unknown|Failed"} == 1
for: 15m
labels:
severity: warning
annotations:
summary: "Pod not ready"
description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is not ready"
# Node not ready
- alert: NodeNotReady
expr: |
kube_node_status_condition{condition="Ready",status="true"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Node not ready"
description: "Node {{ $labels.node }} is not ready"
# High node CPU usage
- alert: HighNodeCPUUsage
expr: |
(1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) * 100 > 80
for: 10m
labels:
severity: warning
annotations:
summary: "High node CPU usage"
description: "Node {{ $labels.instance }} CPU usage is {{ $value }}%"
# High node memory usage
- alert: HighNodeMemoryUsage
expr: |
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 10m
labels:
severity: warning
annotations:
summary: "High node memory usage"
description: "Node {{ $labels.instance }} memory usage is {{ $value }}%"
# Persistent Volume usage
- alert: PersistentVolumeFillingUp
expr: |
kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10
for: 5m
labels:
severity: warning
annotations:
summary: "Persistent volume filling up"
description: "PV {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} is {{ $value }}% full"
Key Takeaways
Monitoring Best Practices
- Golden Signals: Monitor latency, traffic, errors, and saturation
- Actionable Alerts: Create alerts that require action, not just information
- Context Rich: Provide sufficient context in alerts for quick resolution
- SLO-Based: Align monitoring with Service Level Objectives
- Continuous Improvement: Regularly review and refine monitoring
Implementation Strategy
- Start Simple: Begin with basic metrics and expand gradually
- Standardize: Use consistent naming and labeling conventions
- Document: Maintain runbooks for common alerts
- Test Alerts: Regularly test alerting pipeline
- Optimize: Balance between alert noise and coverage
Common Patterns
- RED Metrics: Rate, Errors, Duration for services
- USE Metrics: Utilization, Saturation, Errors for resources
- Four Golden Signals: Latency, traffic, errors, saturation
- Business Metrics: Track key business indicators
- SLI/SLO/SLA: Define and monitor service levels (see the burn-rate sketch after this list)
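To make the SLO-based guidance concrete, a common pattern is a multi-window burn-rate alert on the error-rate SLI. A sketch assuming a 99.9% availability target; the 14.4 factor is the standard fast-burn multiplier, and the windows and target should be tuned to your own SLO:
# alerts/slo-alerts.yml
groups:
  - name: slo_alerts
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status_code=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          component: slo
        annotations:
          summary: "Error budget burning too fast against the 99.9% availability SLO"
          description: "Short- and long-window error ratios both exceed the fast-burn threshold"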
Next Steps: Ready to implement logging systems? Continue to Section 6.2: Logging Systems to learn about centralized logging with the ELK stack.
Comprehensive monitoring is essential for maintaining reliable systems. In the next section, we'll explore logging systems for debugging and troubleshooting.