Good monitoring tells you when something's wrong before users complain. Great monitoring tells you why and helps you fix it fast.
The Three Pillars of Observability#
Metrics: Numeric measurements over time
CPU usage, request count, error rate
Logs: Discrete events with context
Error messages, request details, audit trails
Traces: Request flow across services
Latency breakdown, service dependencies
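Metrics and logs each get worked examples later in this post, but traces don't, so here is a minimal sketch of the idea behind them: a trace ID that stays constant across services, carried in the W3C `traceparent` header, with each hop minting a new span ID. The header format is the real standard; the helper function names are hypothetical.

```typescript
import { randomBytes } from 'crypto';

// W3C trace context header: version-traceId-spanId-flags.
// The trace ID identifies the whole request; the span ID identifies
// one unit of work within it.
function newTraceparent(): string {
  const traceId = randomBytes(16).toString('hex'); // 32 hex chars
  const spanId = randomBytes(8).toString('hex');   // 16 hex chars
  return `00-${traceId}-${spanId}-01`;
}

// Called by a downstream service: keep the trace ID so the whole
// request can be stitched together, but mint a fresh span ID.
function childTraceparent(incoming: string): string {
  const [version, traceId, , flags] = incoming.split('-');
  const spanId = randomBytes(8).toString('hex');
  return `${version}-${traceId}-${spanId}-${flags}`;
}

const parent = newTraceparent();
const child = childTraceparent(parent);
console.log(parent.split('-')[1] === child.split('-')[1]); // → true: same trace ID
```

In practice a library such as OpenTelemetry handles this propagation for you; the point is only that the shared trace ID is what lets a tracing backend reconstruct the request flow across services.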
Metrics#
Key Metrics (RED Method)#
Rate: Requests per second
Errors: Error rate percentage
Duration: Request latency (p50, p95, p99)
# Prometheus queries
# Rate
rate(http_requests_total[5m])
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# Latency p99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Infrastructure Metrics (USE Method)#
Utilization: % time resource is busy
Saturation: Queue depth, waiting work
Errors: Error count
# CPU
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
# Disk I/O saturation
rate(node_disk_io_time_weighted_seconds_total[5m])
Custom Business Metrics#
import { Counter, Histogram, Gauge } from 'prom-client';

// Business metrics
const ordersPlaced = new Counter({
  name: 'orders_placed_total',
  help: 'Total orders placed',
  labelNames: ['payment_method', 'country'],
});

const orderValue = new Histogram({
  name: 'order_value_dollars',
  help: 'Order value in dollars',
  buckets: [10, 50, 100, 500, 1000, 5000],
});

const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Currently active users',
});

// Usage
ordersPlaced.labels('credit_card', 'US').inc();
orderValue.observe(order.total);
activeUsers.set(await countActiveUsers());

Logging#
Structured Logging#
import pino from 'pino';
import { randomUUID as uuid } from 'crypto';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
});

// Structured log
logger.info({
  event: 'order_placed',
  orderId: order.id,
  userId: user.id,
  total: order.total,
  items: order.items.length,
  duration: performance.now() - startTime,
});

// Request logging middleware
app.use((req, res, next) => {
  const requestId = req.headers['x-request-id'] || uuid();
  req.log = logger.child({ requestId });

  const start = Date.now();
  res.on('finish', () => {
    req.log.info({
      method: req.method,
      path: req.path,
      status: res.statusCode,
      duration: Date.now() - start,
    });
  });

  next();
});

Log Levels#
Level Usage
──────────────────────────────────────
error Something failed, needs attention
warn Degraded but still working
info Normal operations, business events
debug Detailed diagnostic information
trace Very verbose, development only
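The table above implies a strict ordering: a logger configured at `info` emits `info`, `warn`, and `error` but suppresses `debug` and `trace`. A minimal sketch of that filtering logic (the function name is hypothetical; libraries like pino implement the same idea internally):

```typescript
// Levels in ascending order of severity.
const LEVELS = ['trace', 'debug', 'info', 'warn', 'error'] as const;
type Level = (typeof LEVELS)[number];

// A message is emitted only if its level is at or above the configured one.
function shouldLog(configured: Level, message: Level): boolean {
  return LEVELS.indexOf(message) >= LEVELS.indexOf(configured);
}

console.log(shouldLog('info', 'error')); // → true: failures always surface
console.log(shouldLog('info', 'debug')); // → false: diagnostics suppressed
```

This is why setting `LOG_LEVEL=debug` in the pino example above turns on the verbose output: it lowers the threshold rather than switching to a different logger.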
Correlation IDs#
import { randomUUID as uuid } from 'crypto';

// Pass correlation ID through all services
async function handleRequest(req: Request) {
  const correlationId = req.headers.get('x-correlation-id') || uuid();

  // Include in all logs
  const logger = baseLogger.child({ correlationId });

  // Pass to downstream services
  const response = await fetch('http://other-service/api', {
    headers: {
      'x-correlation-id': correlationId,
    },
  });

  logger.info({ event: 'downstream_call', status: response.status });
}

Alerting#
Alert Design Principles#
Good alerts:
✓ Actionable (someone needs to do something)
✓ Timely (catch issues before users notice)
✓ Accurate (low false positive rate)
✓ Clear (obvious what's wrong and how to fix)
Bad alerts:
✗ Alerting on every error (alert fatigue)
✗ No runbook or context
✗ Alerting on internal causes instead of user-visible symptoms
✗ Too many recipients
Alert Examples#
# Prometheus alerting rules
groups:
  - name: application
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"
          description: "Error rate above 5% for 5 minutes"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"

      # Slow response times
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
          > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency (p95: {{ $value | humanizeDuration }})"

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"

Alerting Tiers#
Page (wake someone up):
- Service completely down
- Data loss occurring
- Security incident
- Error rate > 50%
Urgent (respond same day):
- Error rate > 5%
- High latency
- Degraded functionality
- Disk usage > 80%
Warning (review next business day):
- Error rate > 1%
- Certificate expiring in 14 days
- Dependency deprecation
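Tiers like these are usually enforced by routing logic: the alert's severity decides which channel fires. A sketch of that mapping, using the error-rate thresholds above (channel names are illustrative; in practice this lives in Alertmanager routes or your paging tool's configuration):

```typescript
type Severity = 'page' | 'urgent' | 'warning';

// Classify an error rate (as a fraction) into a tier, mirroring the
// thresholds above: >50% pages someone, >5% is urgent, >1% is a warning.
function tierForErrorRate(rate: number): Severity | null {
  if (rate > 0.5) return 'page';
  if (rate > 0.05) return 'urgent';
  if (rate > 0.01) return 'warning';
  return null; // below every threshold: no alert
}

// Route each tier to a notification channel (illustrative names).
const channels: Record<Severity, string> = {
  page: 'pagerduty',
  urgent: 'slack-oncall',
  warning: 'email-digest',
};

console.log(tierForErrorRate(0.6));  // → page
console.log(tierForErrorRate(0.03)); // → warning
```

Keeping the classification in one place makes the tier boundaries easy to audit and adjust, instead of scattering thresholds across individual alert rules.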
Dashboards#
Dashboard Design#
Executive dashboard:
- Business metrics (revenue, users, conversions)
- High-level health (green/yellow/red)
- Trends (week over week)
Operations dashboard:
- Request rate and latency
- Error rates by service
- Resource utilization
- Active alerts
Service dashboard:
- Detailed metrics for one service
- Dependencies health
- Recent deployments
- Log snippets
Grafana Dashboard Example#
{
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (service)",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
        }
      ],
      "thresholds": {
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 1, "color": "yellow" },
          { "value": 5, "color": "red" }
        ]
      }
    }
  ]
}

Incident Response#
Runbooks#
# High Error Rate Runbook

## Symptoms
- Error rate above 5%
- Users reporting failures

## Immediate Actions
1. Check recent deployments: `kubectl rollout history deployment/api`
2. Check dependency health: Grafana → Dependencies Dashboard
3. Check error logs: `kubectl logs -l app=api --tail=100 | grep ERROR`

## Common Causes
- Bad deployment → Rollback: `kubectl rollout undo deployment/api`
- Database issues → Check DB dashboard
- External API failure → Check status page

## Escalation
- After 15 minutes: Page backend lead
- After 30 minutes: Page engineering manager

Conclusion#
Effective monitoring starts with the right metrics and logs, continues with meaningful alerts, and culminates in fast incident response.
Start with RED metrics for services, USE metrics for infrastructure, and business metrics for outcomes. Alert on symptoms but provide context to find causes. Build runbooks as you resolve incidents.
Remember: the goal is to detect and fix issues before users notice them.