Monitoring · Alerting · Observability · DevOps

Monitoring and Alerting Strategies for Production Systems

Build effective monitoring that catches issues before users do. From metrics to logs to alerts that matter.

Bootspring Team
Engineering
September 28, 2024
5 min read

Good monitoring tells you when something's wrong before users complain. Great monitoring tells you why and helps you fix it fast.

The Three Pillars of Observability#

- Metrics: numeric measurements over time (CPU usage, request count, error rate)
- Logs: discrete events with context (error messages, request details, audit trails)
- Traces: request flow across services (latency breakdown, service dependencies)
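Traces get less coverage than metrics and logs in the examples that follow, so as a rough illustration of what a trace records, here is a minimal, dependency-free sketch. In practice you would use a tracing library such as OpenTelemetry; `Span`, `startSpan`, and `endSpan` here are illustrative names, not a real API.

```typescript
// A span is a named, timed unit of work with a parent reference,
// so spans from one request can be assembled into a tree.
// Sketch only: real tracing libraries also handle context
// propagation across services, sampling, and export.
interface Span {
  traceId: string;
  spanId: string;
  parentId?: string;
  name: string;
  startMs: number;
  endMs?: number;
}

let nextId = 0;
const newId = () => `id-${nextId++}`;

function startSpan(name: string, parent?: Span): Span {
  return {
    traceId: parent ? parent.traceId : newId(), // children share the trace
    spanId: newId(),
    parentId: parent?.spanId,
    name,
    startMs: Date.now(),
  };
}

function endSpan(span: Span): void {
  span.endMs = Date.now();
}

// Usage: one request span with a child span for a DB call
const request = startSpan('GET /orders');
const dbCall = startSpan('db.query', request);
endSpan(dbCall);
endSpan(request);
```

The key property is visible even in this sketch: every span carries the same `traceId`, so a latency breakdown for one request can be reconstructed across services.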

Metrics#

Key Metrics (RED Method)#

- Rate: requests per second
- Errors: error rate percentage
- Duration: request latency (p50, p95, p99)

```promql
# Prometheus queries

# Rate
rate(http_requests_total[5m])

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Latency p99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```
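The p50/p95/p99 figures above are quantiles over observed latencies. As a rough sketch of what that means (note that Prometheus's `histogram_quantile` estimates quantiles from bucket counts rather than raw samples, so its answer is an interpolated estimate):

```typescript
// Nearest-rank quantile over raw samples: sort, then pick the
// sample at rank ceil(q * n). Prometheus interpolates within
// histogram buckets instead, trading exactness for cheap storage.
function quantile(samples: number[], q: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil(q * sorted.length) - 1;
  return sorted[Math.max(0, Math.min(sorted.length - 1, rank))];
}

// Usage: latencies in ms, 1..100
const latencies = Array.from({ length: 100 }, (_, i) => i + 1);
quantile(latencies, 0.5);  // 50
quantile(latencies, 0.99); // 99
```

This also shows why p99 matters: a single slow outlier barely moves the p50 but shows up immediately in the tail.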

Infrastructure Metrics (USE Method)#

- Utilization: % of time the resource is busy
- Saturation: queue depth, waiting work
- Errors: error count

```promql
# CPU
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Disk I/O saturation
rate(node_disk_io_time_weighted_seconds_total[5m])
```
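All of the queries above lean on `rate()`, which turns a monotonically increasing counter into a per-second rate while tolerating counter resets (e.g. a process restart). A simplified two-sample sketch of that idea (real `rate()` extrapolates over every sample in the window, not just the endpoints):

```typescript
// Per-second rate from two counter samples taken `seconds` apart.
// If the counter went backwards, the process restarted and the
// counter reset to zero, so the current value IS the increase --
// the same reset handling Prometheus's rate() applies.
function counterRate(prev: number, curr: number, seconds: number): number {
  const increase = curr >= prev ? curr - prev : curr;
  return increase / seconds;
}

// Usage over a 5-minute (300s) window
counterRate(1000, 1600, 300); // 2 requests/sec
counterRate(1000, 40, 300);   // counter reset: 40 / 300 requests/sec
```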

Custom Business Metrics#

```typescript
import { Counter, Histogram, Gauge } from 'prom-client';

// Business metrics
const ordersPlaced = new Counter({
  name: 'orders_placed_total',
  help: 'Total orders placed',
  labelNames: ['payment_method', 'country'],
});

const orderValue = new Histogram({
  name: 'order_value_dollars',
  help: 'Order value in dollars',
  buckets: [10, 50, 100, 500, 1000, 5000],
});

const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Currently active users',
});

// Usage
ordersPlaced.labels('credit_card', 'US').inc();
orderValue.observe(order.total);
activeUsers.set(await countActiveUsers());
```

Logging#

Structured Logging#

```typescript
import pino from 'pino';
import { randomUUID as uuid } from 'node:crypto';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
});

// Structured log
logger.info({
  event: 'order_placed',
  orderId: order.id,
  userId: user.id,
  total: order.total,
  items: order.items.length,
  duration: performance.now() - startTime,
});

// Request logging middleware
app.use((req, res, next) => {
  const requestId = req.headers['x-request-id'] || uuid();
  req.log = logger.child({ requestId });

  const start = Date.now();
  res.on('finish', () => {
    req.log.info({
      method: req.method,
      path: req.path,
      status: res.statusCode,
      duration: Date.now() - start,
    });
  });

  next();
});
```

Log Levels#

| Level | Usage |
|-------|-------|
| error | Something failed, needs attention |
| warn  | Degraded but still working |
| info  | Normal operations, business events |
| debug | Detailed diagnostic information |
| trace | Very verbose, development only |
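For request logs, one practical convention is to derive the level from the response status, so server errors surface as `error` and client errors as `warn`. A sketch, assuming the usual 4xx/5xx thresholds (a common choice, not a standard):

```typescript
type LogLevel = 'error' | 'warn' | 'info';

// Map an HTTP status code to a log level: 5xx responses are
// actionable failures, 4xx are expected client-side noise,
// and everything else is normal traffic.
function levelForStatus(status: number): LogLevel {
  if (status >= 500) return 'error';
  if (status >= 400) return 'warn';
  return 'info';
}

// Usage
levelForStatus(503); // 'error'
levelForStatus(404); // 'warn'
levelForStatus(200); // 'info'
```

Keeping this mapping in one function means log-based error dashboards stay consistent across every route.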

Correlation IDs#

```typescript
import { randomUUID as uuid } from 'node:crypto';

// Pass the correlation ID through all services
async function handleRequest(req: Request) {
  const correlationId = req.headers['x-correlation-id'] || uuid();

  // Include it in all logs
  const logger = baseLogger.child({ correlationId });

  // Pass it to downstream services
  const response = await fetch('http://other-service/api', {
    headers: {
      'x-correlation-id': correlationId,
    },
  });

  logger.info({ event: 'downstream_call', status: response.status });
}
```

Alerting#

Alert Design Principles#

Good alerts:
- ✓ Actionable (someone needs to do something)
- ✓ Timely (catch issues before users notice)
- ✓ Accurate (low false-positive rate)
- ✓ Clear (obvious what is wrong and how to fix it)

Bad alerts:
- ✗ Alerting on every error (alert fatigue)
- ✗ No runbook or context
- ✗ Alerting on internal causes instead of user-facing symptoms
- ✗ Too many recipients

Alert Examples#

```yaml
# Prometheus alerting rules
groups:
  - name: application
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"
          description: "Error rate above 5% for 5 minutes"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      # Slow response times
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
          > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency (p95: {{ $value | humanizeDuration }})"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
```

Alerting Tiers#

Page (wake someone up):
- Service completely down
- Data loss occurring
- Security incident
- Error rate > 50%

Urgent (respond same day):
- Error rate > 5%
- High latency
- Degraded functionality
- Disk space > 80%

Warning (review next business day):
- Error rate > 1%
- Certificate expiring in 14 days
- Dependency deprecation
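The error-rate cutoffs in these tiers can be encoded directly in routing logic. A sketch using the thresholds above (the tier names and cutoffs mirror this list; tune them to your own service's tolerances):

```typescript
type Tier = 'page' | 'urgent' | 'warning' | 'none';

// Classify an error rate (0..1) into an alerting tier using the
// thresholds from the list above: >50% pages someone, >5% is
// urgent, >1% is a next-business-day warning.
function tierForErrorRate(errorRate: number): Tier {
  if (errorRate > 0.5) return 'page';
  if (errorRate > 0.05) return 'urgent';
  if (errorRate > 0.01) return 'warning';
  return 'none';
}

// Usage
tierForErrorRate(0.6);   // 'page'
tierForErrorRate(0.07);  // 'urgent'
tierForErrorRate(0.02);  // 'warning'
tierForErrorRate(0.001); // 'none'
```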

Dashboards#

Dashboard Design#

Executive dashboard:
- Business metrics (revenue, users, conversions)
- High-level health (green/yellow/red)
- Trends (week over week)

Operations dashboard:
- Request rate and latency
- Error rates by service
- Resource utilization
- Active alerts

Service dashboard:
- Detailed metrics for one service
- Dependency health
- Recent deployments
- Log snippets

Grafana Dashboard Example#

```json
{
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (service)",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
        }
      ],
      "thresholds": {
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 1, "color": "yellow" },
          { "value": 5, "color": "red" }
        ]
      }
    }
  ]
}
```

Incident Response#

Runbooks#

```markdown
# High Error Rate Runbook

## Symptoms
- Error rate above 5%
- Users reporting failures

## Immediate Actions
1. Check recent deployments: `kubectl rollout history deployment/api`
2. Check dependency health: Grafana → Dependencies Dashboard
3. Check error logs: `kubectl logs -l app=api --tail=100 | grep ERROR`

## Common Causes
- Bad deployment → Rollback: `kubectl rollout undo deployment/api`
- Database issues → Check DB dashboard
- External API failure → Check status page

## Escalation
- After 15 minutes: Page backend lead
- After 30 minutes: Page engineering manager
```

Conclusion#

Effective monitoring starts with the right metrics and logs, continues with meaningful alerts, and culminates in fast incident response.

Start with RED metrics for services, USE metrics for infrastructure, and business metrics for outcomes. Alert on symptoms but provide context to find causes. Build runbooks as you resolve incidents.

Remember: the goal is to detect and fix issues before users notice them.
