Good monitoring tells you when something's wrong before users complain. Great monitoring tells you why and helps you fix it fast.
The Three Pillars of Observability#
- Metrics: numeric measurements over time (CPU usage, request count, error rate)
- Logs: discrete events with context (error messages, request details, audit trails)
- Traces: request flow across services (latency breakdown, service dependencies)
Metrics#
Key Metrics (RED Method)#
- Rate: requests per second
- Errors: error rate percentage
- Duration: request latency (p50, p95, p99)
# Prometheus queries
# Rate
rate(http_requests_total[5m])
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) * 100
# Latency p99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Infrastructure Metrics (USE Method)#
- Utilization: % of time the resource is busy
- Saturation: queue depth, work waiting for the resource
- Errors: error count
# CPU
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
# Disk I/O saturation
rate(node_disk_io_time_weighted_seconds_total[5m])
Custom Business Metrics#
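Beyond RED and USE, track the events your business actually cares about: orders placed, signups, checkout failures. Here is a minimal in-process sketch using only the standard library; the metric names (`orders_placed`, `checkout_failures`) are illustrative, and in production you would typically export these through a client library such as prometheus_client instead.

```python
# Illustrative in-process business-metric counters (stdlib only).
# In production, export these via a client library such as prometheus_client.
import threading
from collections import Counter

class BusinessMetrics:
    """Thread-safe counters for business events (orders, signups, ...)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = Counter()

    def inc(self, name: str, amount: int = 1) -> None:
        """Increment a named business counter."""
        with self._lock:
            self._counters[name] += amount

    def snapshot(self) -> dict:
        """Return a point-in-time copy, e.g. for a /metrics scrape."""
        with self._lock:
            return dict(self._counters)

metrics = BusinessMetrics()
metrics.inc("orders_placed")
metrics.inc("orders_placed")
metrics.inc("checkout_failures")
print(metrics.snapshot())  # counts by event name
```

Counting in-process and exposing a snapshot endpoint keeps the hot path cheap; the monitoring system pulls aggregates on its own schedule.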
Logging#
Structured Logging#
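Structured logs are machine-parseable key/value records rather than free-form strings, which makes them searchable and aggregatable. A minimal sketch with Python's standard `logging` module is below; real services often reach for a library such as structlog or python-json-logger, and the field names here are one common shape, not a standard.

```python
# Minimal JSON log formatter using only the standard library.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge structured fields passed via `extra={"context": {...}}`.
        if hasattr(record, "context"):
            entry.update(record.context)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Emits one JSON object per line, with order_id as a queryable field.
log.info("order placed", extra={"context": {"order_id": "ord_123"}})
```

Because each line is valid JSON, a log pipeline can filter on `order_id` or `level` directly instead of regex-matching message strings.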
Log Levels#
Level Usage
──────────────────────────────────────
error Something failed, needs attention
warn Degraded but still working
info Normal operations, business events
debug Detailed diagnostic information
trace Very verbose, development only
Correlation IDs#
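A correlation ID ties together every log line produced while handling one request, across functions and services. One way to propagate it without threading a parameter everywhere is `contextvars`; in this sketch the header name and helper functions are illustrative — real middleware would read the ID from an incoming header such as X-Request-ID and forward it on outbound calls.

```python
# Propagating a correlation ID via contextvars (works across async tasks).
import contextvars
import json
import uuid

correlation_id = contextvars.ContextVar("correlation_id", default=None)

def start_request(incoming_id=None):
    """Set the correlation ID for the current request context.
    Middleware would take incoming_id from a header like X-Request-ID."""
    cid = incoming_id or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid

def log(message):
    # Every log line in the same request automatically carries the ID.
    print(json.dumps({"correlation_id": correlation_id.get(),
                      "message": message}))

start_request("req-abc123")
log("fetching user")   # both lines share correlation_id "req-abc123"
log("user fetched")
```

With the ID on every line, "show me everything that happened during this request" becomes a single log query.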
Alerting#
Alert Design Principles#
Good alerts:
✓ Actionable (someone needs to do something)
✓ Timely (catch issues before users notice)
✓ Accurate (low false positive rate)
✓ Clear (obvious what's wrong and how to fix)
Bad alerts:
✗ Alerting on every error (alert fatigue)
✗ No runbook or context
✗ Alerting on internal causes instead of user-visible symptoms
✗ Too many recipients
Alert Examples#
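A sketch of Prometheus alerting rules built from the RED queries above. The alert names, thresholds, severity labels, and runbook URL are placeholders to adapt, not a recommended standard.

```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        # Fire when >5% of requests fail, sustained for 10 minutes.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: urgent
        annotations:
          summary: "Error rate above 5% for 10 minutes"
          runbook: "https://wiki.example.com/runbooks/high-error-rate"  # placeholder
      - alert: HighLatencyP99
        # Fire when p99 latency exceeds 1s, sustained for 10 minutes.
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 10m
        labels:
          severity: urgent
        annotations:
          summary: "p99 latency above 1s for 10 minutes"
```

The `for:` clause is what keeps these actionable: a brief spike self-resolves before anyone is notified.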
Alerting Tiers#
Page (wake someone up):
- Service completely down
- Data loss occurring
- Security incident
- Error rate > 50%
Urgent (respond same day):
- Error rate > 5%
- High latency
- Degraded functionality
- Disk usage > 80%
Warning (review next business day):
- Error rate > 1%
- Certificate expiring in 14 days
- Dependency deprecation
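These tiers map naturally onto notification routing. A hedged Alertmanager sketch, routing on a `severity` label — the receiver names are placeholders and the receiver configurations (PagerDuty keys, Slack webhooks) are omitted:

```yaml
route:
  receiver: ticket-queue            # default: review next business day
  routes:
    - matchers: ['severity="page"']
      receiver: pagerduty-oncall    # wake someone up
    - matchers: ['severity="urgent"']
      receiver: team-slack          # respond same day
receivers:
  - name: pagerduty-oncall
  - name: team-slack
  - name: ticket-queue
```

Keeping the tier decision in one routing config means individual alert rules only have to choose a severity label.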
Dashboards#
Dashboard Design#
Executive dashboard:
- Business metrics (revenue, users, conversions)
- High-level health (green/yellow/red)
- Trends (week over week)
Operations dashboard:
- Request rate and latency
- Error rates by service
- Resource utilization
- Active alerts
Service dashboard:
- Detailed metrics for one service
- Dependencies health
- Recent deployments
- Log snippets
Grafana Dashboard Example#
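A trimmed fragment of a Grafana dashboard JSON model, showing how a panel binds a PromQL query to a visualization. Most required fields (datasource, layout grid, IDs) are omitted for brevity, so treat this as a shape sketch rather than an importable dashboard.

```json
{
  "title": "Service Overview",
  "refresh": "30s",
  "panels": [
    {
      "title": "Request rate",
      "type": "timeseries",
      "targets": [
        { "expr": "sum(rate(http_requests_total[5m]))", "legendFormat": "req/s" }
      ]
    },
    {
      "title": "Error rate (%)",
      "type": "timeseries",
      "targets": [
        { "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100" }
      ]
    }
  ]
}
```

In practice, dashboards like this are kept in version control and provisioned rather than hand-edited in the UI.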
Incident Response#
Runbooks#
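A runbook gives the responder diagnosis and mitigation steps at 3 a.m., when nobody is thinking clearly. One common skeleton is sketched below; the alert name, timings, and escalation policy are placeholders to replace with your own.

```markdown
# Runbook: High Error Rate

## Symptoms
- Error-rate alert firing; users may be seeing 5xx responses

## Impact
- Requests failing; potentially revenue-affecting

## Diagnosis
1. Open the service dashboard for the affected service
2. Correlate the error onset with recent deployments
3. Pull the top error message from logs, then filter by correlation ID

## Mitigation
- Roll back the most recent deployment if timing correlates
- Scale out if saturation metrics are elevated

## Escalation
- If not resolved in 30 minutes, page the service owner
```

Link the runbook from the alert annotation so the responder lands on it in one click.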
Conclusion#
Effective monitoring starts with the right metrics and logs, continues with meaningful alerts, and culminates in fast incident response.
Start with RED metrics for services, USE metrics for infrastructure, and business metrics for outcomes. Alert on symptoms but provide context to find causes. Build runbooks as you resolve incidents.
Remember: the goal is to detect and fix issues before users notice them.