Good monitoring tells you when something's wrong before users complain. Great monitoring tells you why and helps you fix it fast.
The Three Pillars of Observability#
Metrics: Numeric measurements over time
CPU usage, request count, error rate
Logs: Discrete events with context
Error messages, request details, audit trails
Traces: Request flow across services
Latency breakdown, service dependencies
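Metrics and logs each get worked examples later in this post, but traces don't, so here is a minimal sketch of the idea behind them: a trace ID that stays constant across services, carried in the W3C `traceparent` header, with each hop minting a new span ID. The header format is the real standard; the helper function names are hypothetical.

```typescript
import { randomBytes } from 'crypto';

// W3C trace context header: version-traceId-spanId-flags.
// The trace ID identifies the whole request; the span ID identifies
// one unit of work within it.
function newTraceparent(): string {
  const traceId = randomBytes(16).toString('hex'); // 32 hex chars
  const spanId = randomBytes(8).toString('hex');   // 16 hex chars
  return `00-${traceId}-${spanId}-01`;
}

// Called by a downstream service: keep the trace ID so the whole
// request can be stitched together, but mint a fresh span ID.
function childTraceparent(incoming: string): string {
  const [version, traceId, , flags] = incoming.split('-');
  const spanId = randomBytes(8).toString('hex');
  return `${version}-${traceId}-${spanId}-${flags}`;
}

const parent = newTraceparent();
const child = childTraceparent(parent);
console.log(parent.split('-')[1] === child.split('-')[1]); // → true: same trace ID
```

In practice a library such as OpenTelemetry handles this propagation for you; the point is only that the shared trace ID is what lets a tracing backend reconstruct the request flow across services.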
Metrics#
Key Metrics (RED Method)#
Rate: Requests per second
Errors: Error rate percentage
Duration: Request latency (p50, p95, p99)
# Prometheus queries
# Rate
rate(http_requests_total[5m])
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# Latency p99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Infrastructure Metrics (USE Method)#
Utilization: % time resource is busy
Saturation: Queue depth, waiting work
Errors: Error count
# CPU
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
# Disk I/O saturation
rate(node_disk_io_time_weighted_seconds_total[5m])
Custom Business Metrics#
import { Counter, Histogram, Gauge } from 'prom-client';

// Business metrics
const ordersPlaced = new Counter({
  name: 'orders_placed_total',
  help: 'Total orders placed',
  labelNames: ['payment_method', 'country'],
});

const orderValue = new Histogram({
  name: 'order_value_dollars',
  help: 'Order value in dollars',
  buckets: [10, 50, 100, 500, 1000, 5000],
});

const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Currently active users',
});

// Usage
ordersPlaced.labels('credit_card', 'US').inc();
orderValue.observe(order.total);
activeUsers.set(await countActiveUsers());

Logging#
Structured Logging#
import pino from 'pino';
import { randomUUID as uuid } from 'crypto';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
});

// Structured log
logger.info({
  event: 'order_placed',
  orderId: order.id,
  userId: user.id,
  total: order.total,
  items: order.items.length,
  duration: performance.now() - startTime,
});

// Request logging middleware
app.use((req, res, next) => {
  const requestId = req.headers['x-request-id'] || uuid();
  req.log = logger.child({ requestId });

  const start = Date.now();
  res.on('finish', () => {
    req.log.info({
      method: req.method,
      path: req.path,
      status: res.statusCode,
      duration: Date.now() - start,
    });
  });

  next();
});

Log Levels#
Level Usage
──────────────────────────────────────
error Something failed, needs attention
warn Degraded but still working
info Normal operations, business events
debug Detailed diagnostic information
trace Very verbose, development only
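The table above implies a strict ordering: a logger configured at `info` emits `info`, `warn`, and `error` but suppresses `debug` and `trace`. A minimal sketch of that filtering logic (the function name is hypothetical; libraries like pino implement the same idea internally):

```typescript
// Levels in ascending order of severity.
const LEVELS = ['trace', 'debug', 'info', 'warn', 'error'] as const;
type Level = (typeof LEVELS)[number];

// A message is emitted only if its level is at or above the configured one.
function shouldLog(configured: Level, message: Level): boolean {
  return LEVELS.indexOf(message) >= LEVELS.indexOf(configured);
}

console.log(shouldLog('info', 'error')); // → true: failures always surface
console.log(shouldLog('info', 'debug')); // → false: diagnostics suppressed
```

This is why setting `LOG_LEVEL=debug` in the pino example above turns on the verbose output: it lowers the threshold rather than switching to a different logger.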
Correlation IDs#
import { randomUUID as uuid } from 'crypto';

// Pass correlation ID through all services
async function handleRequest(req: Request) {
  const correlationId = req.headers.get('x-correlation-id') || uuid();

  // Include in all logs
  const logger = baseLogger.child({ correlationId });

  // Pass to downstream services
  const response = await fetch('http://other-service/api', {
    headers: {
      'x-correlation-id': correlationId,
    },
  });

  logger.info({ event: 'downstream_call', status: response.status });
}

Alerting#
Alert Design Principles#
Good alerts:
✓ Actionable (someone needs to do something)
✓ Timely (catch issues before users notice)
✓ Accurate (low false positive rate)
✓ Clear (obvious what's wrong and how to fix)
Bad alerts:
✗ Alerting on every error (alert fatigue)
✗ No runbook or context
✗ Alerting on internal causes instead of user-visible symptoms
✗ Too many recipients
Alert Examples#
# Prometheus alerting rules
groups:
  - name: application
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"
          description: "Error rate above 5% for 5 minutes"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      # Slow response times
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
          > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency (p95: {{ $value | humanizeDuration }})"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"

Alerting Tiers#
Page (wake someone up):
- Service completely down
- Data loss occurring
- Security incident
- Error rate > 50%
Urgent (respond same day):
- Error rate > 5%
- High latency
- Degraded functionality
- Disk usage > 80%
Warning (review next business day):
- Error rate > 1%
- Certificate expiring in 14 days
- Dependency deprecation
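Tiers like these are usually enforced by routing logic: the alert's severity decides which channel fires. A sketch of that mapping, using the error-rate thresholds above (channel names are illustrative; in practice this lives in Alertmanager routes or your paging tool's configuration):

```typescript
type Severity = 'page' | 'urgent' | 'warning';

// Classify an error rate (as a fraction) into a tier, mirroring the
// thresholds above: >50% pages someone, >5% is urgent, >1% is a warning.
function tierForErrorRate(rate: number): Severity | null {
  if (rate > 0.5) return 'page';
  if (rate > 0.05) return 'urgent';
  if (rate > 0.01) return 'warning';
  return null; // below every threshold: no alert
}

// Route each tier to a notification channel (illustrative names).
const channels: Record<Severity, string> = {
  page: 'pagerduty',
  urgent: 'slack-oncall',
  warning: 'email-digest',
};

console.log(tierForErrorRate(0.6));  // → page
console.log(tierForErrorRate(0.03)); // → warning
```

Keeping the classification in one place makes the tier boundaries easy to audit and adjust, instead of scattering thresholds across individual alert rules.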
Dashboards#
Dashboard Design#
Executive dashboard:
- Business metrics (revenue, users, conversions)
- High-level health (green/yellow/red)
- Trends (week over week)
Operations dashboard:
- Request rate and latency
- Error rates by service
- Resource utilization
- Active alerts
Service dashboard:
- Detailed metrics for one service
- Dependencies health
- Recent deployments
- Log snippets
Grafana Dashboard Example#
{
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (service)",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
        }
      ],
      "thresholds": {
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 1, "color": "yellow" },
          { "value": 5, "color": "red" }
        ]
      }
    }
  ]
}

Incident Response#
Runbooks#
# High Error Rate Runbook

## Symptoms
- Error rate above 5%
- Users reporting failures

## Immediate Actions
1. Check recent deployments: `kubectl rollout history deployment/api`
2. Check dependency health: Grafana → Dependencies Dashboard
3. Check error logs: `kubectl logs -l app=api --tail=100 | grep ERROR`

## Common Causes
- Bad deployment → Rollback: `kubectl rollout undo deployment/api`
- Database issues → Check DB dashboard
- External API failure → Check status page

## Escalation
- After 15 minutes: Page backend lead
- After 30 minutes: Page engineering manager

Conclusion#
Effective monitoring starts with the right metrics and logs, continues with meaningful alerts, and culminates in fast incident response.
Start with RED metrics for services, USE metrics for infrastructure, and business metrics for outcomes. Alert on symptoms but provide context to find causes. Build runbooks as you resolve incidents.
Remember: the goal is to detect and fix issues before users notice them.