Monitoring Systems
Build comprehensive monitoring systems for CI/CD pipelines, applications, and infrastructure with modern observability tools.
Monitoring Architecture Overview
A well-designed monitoring architecture provides end-to-end visibility across applications, infrastructure, and business metrics.
Modern Monitoring Stack
A typical modern stack combines Prometheus for metrics collection, Alertmanager for alert routing, Grafana for dashboards, and complementary backends such as Loki for logs and Jaeger for traces. The sections below build this stack step by step.
Three Pillars of Observability
1. Metrics
- Quantitative Measurements: Numerical data points over time
- Use Cases: Performance tracking, capacity planning, alerting
- Examples: CPU usage, request rate, response time, error rate
2. Logs
- Event Records: Detailed event information and context
- Use Cases: Debugging, audit trails, troubleshooting
- Examples: Application logs, access logs, error logs
3. Traces
- Request Flows: End-to-end request journey across services
- Use Cases: Performance analysis, bottleneck identification
- Examples: Distributed traces, span data, service dependencies
Prometheus Monitoring
Prometheus is a powerful metrics collection and monitoring system designed for reliability and scalability.
Prometheus Setup
1. Prometheus Configuration
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'production'
region: 'us-west-2'
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- 'alertmanager:9093'
# Load alerting rules
rule_files:
- "alerts/*.yml"
# Scrape configurations
scrape_configs:
# Prometheus self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node Exporter (system metrics)
- job_name: 'node'
static_configs:
- targets: ['node-exporter:9100']
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: '([^:]+):.*'
replacement: '${1}'
# Application metrics
- job_name: 'myapp'
static_configs:
- targets: ['myapp:8080']
metrics_path: '/actuator/prometheus'
scrape_interval: 5s
# Kubernetes pods
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
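For the kubernetes-pods job above to pick anything up, the target pods must carry the prometheus.io/* annotations referenced in the relabel rules. A minimal sketch of a Deployment with those annotations (the name, image, and port are illustrative; the path matches the /actuator/prometheus endpoint used elsewhere in this section):
# myapp-deployment.yaml (illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        prometheus.io/scrape: "true"                 # keep rule: only annotated pods are scraped
        prometheus.io/path: "/actuator/prometheus"   # overrides __metrics_path__
        prometheus.io/port: "8080"                   # rewritten into __address__
    spec:
      containers:
        - name: myapp
          image: example/myapp:latest
          ports:
            - containerPort: 8080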
2. Docker Compose Setup
# docker-compose.monitoring.yml
version: '3.8'
services:
prometheus:
image: prom/prometheus:latest
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- ./alerts:/etc/prometheus/alerts
- prometheus_data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
- '--web.console.templates=/usr/share/prometheus/consoles'
- '--web.enable-lifecycle'
restart: unless-stopped
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
restart: unless-stopped
grafana:
image: grafana/grafana:latest
container_name: grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_USER=admin
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-simple-json-datasource
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
restart: unless-stopped
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
- alertmanager_data:/alertmanager
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
restart: unless-stopped
volumes:
prometheus_data:
grafana_data:
alertmanager_data:
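The Grafana provisioning shown later also registers Loki and Jaeger data sources (loki:3100 and jaeger:16686). If you want those backends in the same stack, here is a sketch of two extra services to merge under services: in the compose file above; the images, ports, and config path are the commonly documented defaults, so adjust them to your versions. Both run with ephemeral storage here, which is fine for local experimentation only.
# Additional services for docker-compose.monitoring.yml (merge under services:)
  loki:
    image: grafana/loki:latest
    container_name: loki
    ports:
      - "3100:3100"
    command:
      - '-config.file=/etc/loki/local-config.yaml'   # default config shipped in the image
    restart: unless-stopped

  jaeger:
    image: jaegertracing/all-in-one:latest
    container_name: jaeger
    environment:
      - COLLECTOR_OTLP_ENABLED=true    # accept OTLP traces from instrumented apps
    ports:
      - "16686:16686"   # Jaeger UI (matches the Grafana data source URL)
      - "4317:4317"     # OTLP gRPC
      - "4318:4318"     # OTLP HTTP
    restart: unless-stopped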
Application Instrumentation
1. Node.js Application Metrics
// metrics.js - Prometheus metrics for Node.js
const promClient = require('prom-client');
const express = require('express');
// Create a Registry to register metrics
const register = new promClient.Registry();
// Add default metrics
promClient.collectDefaultMetrics({ register });
// Custom metrics
const httpRequestDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.1, 0.5, 1, 2, 5]
});
const httpRequestTotal = new promClient.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code']
});
const activeConnections = new promClient.Gauge({
name: 'active_connections',
help: 'Number of active connections'
});
const businessMetrics = {
ordersCreated: new promClient.Counter({
name: 'orders_created_total',
help: 'Total number of orders created',
labelNames: ['status']
}),
orderValue: new promClient.Histogram({
name: 'order_value_dollars',
help: 'Value of orders in dollars',
buckets: [10, 50, 100, 500, 1000, 5000]
}),
activeUsers: new promClient.Gauge({
name: 'active_users',
help: 'Number of currently active users'
})
};
// Register custom metrics
register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);
register.registerMetric(activeConnections);
Object.values(businessMetrics).forEach(metric => register.registerMetric(metric));
// Middleware to collect HTTP metrics
function metricsMiddleware(req, res, next) {
const start = Date.now();
activeConnections.inc();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
const route = req.route ? req.route.path : req.path;
httpRequestDuration
.labels(req.method, route, res.statusCode)
.observe(duration);
httpRequestTotal
.labels(req.method, route, res.statusCode)
.inc();
activeConnections.dec();
});
next();
}
// Metrics endpoint (register.metrics() returns a Promise in prom-client v13+)
async function metricsEndpoint(req, res) {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
}
module.exports = {
register,
metricsMiddleware,
metricsEndpoint,
businessMetrics
};
2. Application Integration
// app.js
const express = require('express');
const { metricsMiddleware, metricsEndpoint, businessMetrics } = require('./metrics');
const app = express();
// Add metrics middleware
app.use(metricsMiddleware);
// Metrics endpoint for Prometheus
app.get('/metrics', metricsEndpoint);
// Business logic with metrics
app.post('/orders', async (req, res) => {
try {
const order = await createOrder(req.body);
// Track business metrics
businessMetrics.ordersCreated.labels('success').inc();
businessMetrics.orderValue.observe(order.total);
res.json({ success: true, order });
} catch (error) {
businessMetrics.ordersCreated.labels('failed').inc();
res.status(500).json({ error: error.message });
}
});
// Health check endpoint
app.get('/health', (req, res) => {
res.json({
status: 'healthy',
timestamp: new Date().toISOString(),
uptime: process.uptime()
});
});
app.listen(8080, () => {
console.log('Server started on port 8080');
});
3. Java Spring Boot Metrics
// MetricsConfiguration.java
package com.example.myapp.config;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
@Configuration
public class MetricsConfiguration {
@Bean
public Counter orderCreatedCounter(MeterRegistry registry) {
return Counter.builder("orders.created")
.description("Total number of orders created")
.tags("status", "success")
.register(registry);
}
@Bean
public Timer orderProcessingTimer(MeterRegistry registry) {
return Timer.builder("orders.processing.time")
.description("Time taken to process orders")
.register(registry);
}
}
// OrderService.java
package com.example.myapp.service;
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;
@Service
public class OrderService {
    private final Counter orderCreatedCounter;
    private final Timer orderProcessingTimer;
    // MeterRegistry is needed to register the "failed" counter variant below
    private final MeterRegistry meterRegistry;

    public OrderService(Counter orderCreatedCounter, Timer orderProcessingTimer,
                        MeterRegistry meterRegistry) {
        this.orderCreatedCounter = orderCreatedCounter;
        this.orderProcessingTimer = orderProcessingTimer;
        this.meterRegistry = meterRegistry;
    }
public Order createOrder(OrderRequest request) {
return orderProcessingTimer.record(() -> {
try {
Order order = processOrder(request);
orderCreatedCounter.increment();
return order;
} catch (Exception e) {
// Track failed orders
Counter.builder("orders.created")
.tag("status", "failed")
.register(meterRegistry)
.increment();
throw e;
}
});
}
private Order processOrder(OrderRequest request) {
// Business logic here
return new Order();
}
}
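For the Prometheus scrape job shown earlier (metrics_path: '/actuator/prometheus') to find these metrics, the Spring Boot application also needs spring-boot-starter-actuator and micrometer-registry-prometheus on the classpath and must expose the prometheus endpoint. A minimal application.yml sketch; the property names are standard Spring Boot, while the application tag value is an assumption:
# application.yml
# Requires spring-boot-starter-actuator and micrometer-registry-prometheus dependencies
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
  metrics:
    tags:
      application: myapp   # adds an application label to every metric (illustrative name)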
Alert Rules
1. Prometheus Alert Rules
# alerts/application-alerts.yml
groups:
- name: application_alerts
interval: 30s
rules:
# High error rate alert
- alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
component: application
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
runbook_url: "https://docs.example.com/runbooks/high-error-rate"
# Slow response time alert
- alert: SlowResponseTime
expr: |
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
) > 2
for: 10m
labels:
severity: warning
component: application
annotations:
summary: "Slow response time detected"
description: "95th percentile response time is {{ $value }}s for {{ $labels.route }}"
      # High memory usage alert
      # NOTE: this divides a per-process metric by a node-level metric; the ratio only
      # matches when both series share the same instance label (adjust with on()/group_left
      # or alert on each metric separately).
- alert: HighMemoryUsage
expr: |
process_resident_memory_bytes / node_memory_MemTotal_bytes > 0.85
for: 5m
labels:
severity: warning
component: infrastructure
annotations:
summary: "High memory usage"
description: "Memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
# Pod crash loop alert
- alert: PodCrashLooping
expr: |
rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 15m
labels:
severity: critical
component: kubernetes
annotations:
summary: "Pod is crash looping"
description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is crash looping"
# Database connection pool exhaustion
- alert: DatabaseConnectionPoolExhaustion
expr: |
hikaricp_connections_active / hikaricp_connections_max > 0.9
for: 5m
labels:
severity: critical
component: database
annotations:
summary: "Database connection pool near exhaustion"
description: "Connection pool usage is {{ $value | humanizePercentage }}"
- name: business_alerts
interval: 60s
rules:
# Sudden drop in orders
- alert: OrdersDropped
expr: |
rate(orders_created_total[5m]) < 0.5 * rate(orders_created_total[1h] offset 1h)
for: 10m
labels:
severity: warning
component: business
annotations:
summary: "Sudden drop in order rate"
description: "Order rate has dropped by more than 50% compared to last hour"
# High order failure rate
- alert: HighOrderFailureRate
expr: |
rate(orders_created_total{status="failed"}[5m]) /
rate(orders_created_total[5m]) > 0.1
for: 5m
labels:
severity: critical
component: business
annotations:
summary: "High order failure rate"
description: "Order failure rate is {{ $value | humanizePercentage }}"
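The same expressions are often precomputed with recording rules so dashboards and alerts stay cheap and consistent. A sketch following the RED pattern, using the HTTP metrics defined earlier; the rule names follow a level:metric:operation convention of my own choosing, and the file lands in the alerts/ directory already matched by rule_files:
# alerts/recording-rules.yml
groups:
  - name: red_recording_rules
    interval: 30s
    rules:
      # Rate: requests per second, per job
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Errors: 5xx responses as a fraction of all requests
      - record: job:http_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
      # Duration: 95th percentile latency
      - record: job:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))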
2. Alertmanager Configuration
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
# Templates for notifications
templates:
- '/etc/alertmanager/templates/*.tmpl'
# Route configuration
route:
receiver: 'default'
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
routes:
# Critical alerts go to PagerDuty
- match:
severity: critical
receiver: 'pagerduty'
continue: true
    # All alerts also go to Slack (continue so the component routes below still match)
    - match_re:
        severity: ^(critical|warning)$
      receiver: 'slack'
      continue: true
# Business alerts
- match:
component: business
receiver: 'business-team'
# Infrastructure alerts
- match:
component: infrastructure
receiver: 'infrastructure-team'
# Inhibit rules
inhibit_rules:
# Inhibit warning if critical is firing
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'cluster', 'service']
# Receivers
receivers:
- name: 'default'
slack_configs:
- channel: '#alerts'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
send_resolved: true
- name: 'pagerduty'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
description: '{{ .GroupLabels.alertname }}'
details:
severity: '{{ .CommonLabels.severity }}'
component: '{{ .CommonLabels.component }}'
- name: 'slack'
slack_configs:
- channel: '#alerts'
color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
text: |
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Component:* {{ .Labels.component }}
{{ if .Annotations.runbook_url }}*Runbook:* {{ .Annotations.runbook_url }}{{ end }}
{{ end }}
send_resolved: true
- name: 'business-team'
email_configs:
- to: '[email protected]'
from: '[email protected]'
smarthost: 'smtp.example.com:587'
auth_username: '[email protected]'
auth_password: 'password'
headers:
Subject: 'Business Alert: {{ .GroupLabels.alertname }}'
- name: 'infrastructure-team'
email_configs:
- to: '[email protected]'
from: '[email protected]'
smarthost: 'smtp.example.com:587'
Grafana Dashboards
Grafana provides powerful visualization capabilities for metrics, logs, and traces.
Dashboard Configuration
1. Grafana Data Source Provisioning
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: true
jsonData:
timeInterval: 15s
- name: Loki
type: loki
access: proxy
url: http://loki:3100
editable: true
- name: Jaeger
type: jaeger
access: proxy
url: http://jaeger:16686
editable: true
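Dashboards can be provisioned the same way, so the JSON definitions below are loaded automatically instead of being imported by hand. A sketch, assuming the dashboard JSON files are mounted at /var/lib/grafana/dashboards (add that mount to the grafana service in the compose file):
# grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards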
2. Application Dashboard JSON
{
"dashboard": {
"title": "Application Performance Dashboard",
"tags": ["application", "performance"],
"timezone": "browser",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"targets": [
{
"expr": "rate(http_requests_total[5m])",
"legendFormat": "{{method}} {{route}}",
"refId": "A"
}
],
"yaxes": [
{
"format": "reqps",
"label": "Requests/sec"
}
]
},
{
"title": "Error Rate",
"type": "graph",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 0
},
"targets": [
{
"expr": "rate(http_requests_total{status_code=~\"5..\"}[5m])",
"legendFormat": "{{method}} {{route}}",
"refId": "A"
}
],
"alert": {
"conditions": [
{
"evaluator": {
"params": [0.05],
"type": "gt"
},
"operator": {
"type": "and"
},
"query": {
"params": ["A", "5m", "now"]
},
"reducer": {
"params": [],
"type": "avg"
},
"type": "query"
}
]
}
},
{
"title": "Response Time (p95)",
"type": "graph",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 8
},
"targets": [
{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "{{method}} {{route}}",
"refId": "A"
}
],
"yaxes": [
{
"format": "s",
"label": "Duration"
}
]
},
{
"title": "Active Connections",
"type": "stat",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 8
},
"targets": [
{
"expr": "active_connections",
"refId": "A"
}
],
"options": {
"colorMode": "value",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "auto",
"reduceOptions": {
"values": false,
"calcs": ["lastNotNull"]
}
}
},
{
"title": "Memory Usage",
"type": "graph",
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 16
},
"targets": [
{
"expr": "process_resident_memory_bytes / 1024 / 1024",
"legendFormat": "Memory (MB)",
"refId": "A"
}
],
"yaxes": [
{
"format": "decmbytes",
"label": "Memory"
}
]
},
{
"title": "CPU Usage",
"type": "graph",
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
"y": 16
},
"targets": [
{
"expr": "rate(process_cpu_seconds_total[5m]) * 100",
"legendFormat": "CPU %",
"refId": "A"
}
],
"yaxes": [
{
"format": "percent",
"label": "CPU"
}
]
}
],
"refresh": "30s",
"time": {
"from": "now-6h",
"to": "now"
}
}
}
3. Business Metrics Dashboard
{
"dashboard": {
"title": "Business Metrics Dashboard",
"tags": ["business", "metrics"],
"panels": [
{
"title": "Orders per Minute",
"type": "graph",
"targets": [
{
"expr": "rate(orders_created_total[1m]) * 60",
"legendFormat": "Orders/min"
}
]
},
{
"title": "Revenue per Hour",
"type": "graph",
"targets": [
{
"expr": "rate(order_value_dollars_sum[1h]) * 3600",
"legendFormat": "Revenue/hour"
}
]
},
{
"title": "Active Users",
"type": "stat",
"targets": [
{
"expr": "active_users"
}
]
},
{
"title": "Order Success Rate",
"type": "gauge",
"targets": [
{
"expr": "rate(orders_created_total{status=\"success\"}[5m]) / rate(orders_created_total[5m]) * 100"
}
],
"options": {
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "red", "value": 0 },
{ "color": "yellow", "value": 90 },
{ "color": "green", "value": 95 }
]
}
}
}
}
]
}
}
Infrastructure Monitoring
Monitor infrastructure health, resource utilization, and system performance.
Kubernetes Monitoring
1. kube-state-metrics Deployment
# kube-state-metrics.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: kube-state-metrics
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: kube-state-metrics
template:
metadata:
labels:
app: kube-state-metrics
spec:
serviceAccountName: kube-state-metrics
containers:
- name: kube-state-metrics
image: k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.5.0
ports:
- containerPort: 8080
name: http-metrics
- containerPort: 8081
name: telemetry
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
timeoutSeconds: 5
readinessProbe:
httpGet:
path: /
port: 8081
initialDelaySeconds: 5
timeoutSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
name: kube-state-metrics
namespace: monitoring
labels:
app: kube-state-metrics
spec:
ports:
- name: http-metrics
port: 8080
targetPort: http-metrics
- name: telemetry
port: 8081
targetPort: telemetry
selector:
app: kube-state-metrics
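The Deployment above references serviceAccountName: kube-state-metrics, so that account and its cluster-wide read permissions must also exist. A minimal sketch of the ServiceAccount and binding; the ClusterRole itself (read-only access to pods, nodes, deployments, and the other resources kube-state-metrics exports) is long, so in practice take it from the upstream kube-state-metrics manifests:
# kube-state-metrics-rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics   # use the ClusterRole from the upstream project manifests
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: monitoring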
2. Kubernetes Metrics Queries
# kubernetes-alerts.yml
groups:
- name: kubernetes_alerts
rules:
# Pod not ready
- alert: PodNotReady
        expr: |
          kube_pod_status_phase{phase=~"Pending|Unknown|Failed"} == 1
for: 15m
labels:
severity: warning
annotations:
summary: "Pod not ready"
description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is not ready"
# Node not ready
- alert: NodeNotReady
expr: |
kube_node_status_condition{condition="Ready",status="true"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Node not ready"
description: "Node {{ $labels.node }} is not ready"
# High node CPU usage
- alert: HighNodeCPUUsage
expr: |
(1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) * 100 > 80
for: 10m
labels:
severity: warning
annotations:
summary: "High node CPU usage"
description: "Node {{ $labels.instance }} CPU usage is {{ $value }}%"
# High node memory usage
- alert: HighNodeMemoryUsage
expr: |
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 10m
labels:
severity: warning
annotations:
summary: "High node memory usage"
description: "Node {{ $labels.instance }} memory usage is {{ $value }}%"
# Persistent Volume usage
- alert: PersistentVolumeFillingUp
expr: |
kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes * 100 < 10
for: 5m
labels:
severity: warning
annotations:
summary: "Persistent volume filling up"
description: "PV {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} is {{ $value }}% full"
Key Takeaways
Monitoring Best Practices
- Golden Signals: Monitor latency, traffic, errors, and saturation
- Actionable Alerts: Create alerts that require action, not just information
- Context Rich: Provide sufficient context in alerts for quick resolution
- SLO-Based: Align monitoring with Service Level Objectives
- Continuous Improvement: Regularly review and refine monitoring
Implementation Strategy
- Start Simple: Begin with basic metrics and expand gradually
- Standardize: Use consistent naming and labeling conventions
- Document: Maintain runbooks for common alerts
- Test Alerts: Regularly test alerting pipeline
- Optimize: Balance between alert noise and coverage
Common Patterns
- RED Metrics: Rate, Errors, Duration for services
- USE Metrics: Utilization, Saturation, Errors for resources
- Four Golden Signals: Latency, traffic, errors, saturation
- Business Metrics: Track key business indicators
- SLI/SLO/SLA: Define and monitor service levels (see the burn-rate sketch after this list)
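To make the SLO-based guidance concrete, a common pattern is a multi-window burn-rate alert on the error-rate SLI. A sketch assuming a 99.9% availability target; the 14.4 factor is the standard fast-burn multiplier, and the windows and target should be tuned to your own SLO:
# alerts/slo-alerts.yml
groups:
  - name: slo_alerts
    rules:
      - alert: ErrorBudgetFastBurn
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status_code=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          component: slo
        annotations:
          summary: "Error budget burning too fast against the 99.9% availability SLO"
          description: "Short- and long-window error ratios both exceed the fast-burn threshold"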
Next Steps: Ready to implement logging systems? Continue to Section 6.2: Logging Systems to learn about centralized logging with the ELK stack.
Comprehensive monitoring is essential for maintaining reliable systems. In the next section, we'll explore logging systems for debugging and troubleshooting.