Observability is the ability to understand a system's internal state from its external outputs. Metrics, logs, and traces are commonly called the three pillars of observability.
The Three Pillars#
Metrics:
- Numerical measurements over time
- CPU, memory, request counts, latencies
- Aggregated and sampled
Logs:
- Discrete events with context
- Errors, requests, business events
- Detailed but voluminous
Traces:
- Request flow across services
- Timing and dependencies
- End-to-end visibility
Metrics with Prometheus#
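Prometheus scrapes metrics as plain text, one sample per line, with labels in braces. As a rough illustration of that model, the sketch below uses only the Python standard library rather than an official client. The `Counter` class is a hypothetical stand-in written for this example. The metric name `http_requests_total` and the `path` and `status` labels are taken from the dashboard queries later in this article.

```python
from collections import defaultdict


class Counter:
    """A tiny in-process counter with labels (illustrative, not an official client)."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # Store label values in the declared label order so keys are stable.
        key = tuple(labels[name] for name in self.label_names)
        self._values[key] += amount

    def render(self):
        """Render all samples in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            # Note: this sketch does not escape quotes or backslashes in label values.
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines) + "\n"


http_requests_total = Counter(
    "http_requests_total", "Total HTTP requests.", ["path", "status"]
)
http_requests_total.inc(path="/orders", status="200")
http_requests_total.inc(path="/orders", status="200")
http_requests_total.inc(path="/orders", status="500")
print(http_requests_total.render())
```

Each distinct combination of label values becomes its own time series, which is why label cardinality needs watching. In a real service a Prometheus client library would typically handle registration, escaping, and the `/metrics` endpoint.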
Business Metrics#
Structured Logging#
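Structured logs record each event as machine-readable fields instead of free text. The sketch below is one possible approach using Python's standard `logging` and `json` modules. The `JsonFormatter` and `build_logger` names and the `correlation_id`, `path`, and `status` fields are assumptions made for this example, and an in-memory stream stands in for stdout so the output is easy to inspect.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object."""

    CONTEXT_FIELDS = ("correlation_id", "path", "status")  # hypothetical field names

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name, stream):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


stream = io.StringIO()
logger = build_logger("orders", stream)
logger.info(
    "order created",
    extra={"correlation_id": "req-123", "path": "/orders", "status": 201},
)
print(stream.getvalue(), end="")
```

Because every entry carries the same `correlation_id` for a given request, a log backend can filter on that field to reconstruct what happened to a single request across services.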
Distributed Tracing#
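Tracing depends on every service passing trace context to the next one. The sketch below shows that idea with the W3C Trace Context `traceparent` header, using only the standard library. The `start_span` and `outgoing_headers` helpers and the span names are hypothetical. A tracing library such as OpenTelemetry would normally do this work, along with recording timings and exporting spans.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags in lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def start_span(name, incoming_headers):
    """Continue the caller's trace if a valid header is present, else start a new one.

    Assumes header names have already been lowercased by the web framework.
    """
    match = TRACEPARENT_RE.match(incoming_headers.get("traceparent", ""))
    # All-zero trace or parent IDs are invalid and are treated as missing.
    if match and set(match.group(1)) != {"0"} and set(match.group(2)) != {"0"}:
        trace_id, parent_id, flags = match.groups()
    else:
        trace_id, parent_id, flags = secrets.token_hex(16), None, "01"
    return {
        "name": name,
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_id": parent_id,
        "flags": flags,
    }


def outgoing_headers(span):
    """Build the header to attach to requests made while this span is active."""
    return {
        "traceparent": f"00-{span['trace_id']}-{span['span_id']}-{span['flags']}"
    }


gateway_span = start_span("GET /orders", {})
downstream_span = start_span("load orders", outgoing_headers(gateway_span))
print(outgoing_headers(downstream_span)["traceparent"])
```

Both spans share one `trace_id`, and the downstream span's `parent_id` points at the gateway span. That parent link is what lets a tracing backend assemble the request flow across services.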
Error Tracking#
Health Checks#
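A health check usually runs a few cheap dependency probes and reports an overall status. The following is a minimal sketch under assumed names: `run_health_checks`, `check_database`, and `check_cache` are hypothetical, and the convention of returning HTTP 200 when healthy and 503 otherwise is a common choice rather than a requirement.

```python
import time


def run_health_checks(checks):
    """Run each named check and return (http_status, response_body).

    A check is any callable that returns normally when healthy and raises otherwise.
    """
    results = {}
    healthy = True
    for name, check in checks.items():
        started = time.perf_counter()
        try:
            check()
            status, error = "ok", None
        except Exception as exc:  # any failure marks the dependency unhealthy
            status, error = "fail", str(exc)
            healthy = False
        results[name] = {
            "status": status,
            "error": error,
            "duration_ms": round((time.perf_counter() - started) * 1000, 2),
        }
    body = {"status": "ok" if healthy else "degraded", "checks": results}
    return (200 if healthy else 503), body


def check_database():
    """Hypothetical probe; a real one might run a trivial query."""


def check_cache():
    """Hypothetical probe that simulates an unreachable cache."""
    raise ConnectionError("cache unreachable")


status_code, body = run_health_checks(
    {"database": check_database, "cache": check_cache}
)
print(status_code, body["status"])
```

Reporting each dependency separately, with its duration, tells an on-call engineer which component failed instead of only that something did.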
Alerting Rules#
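Prometheus alerting rules are normally written as YAML, with a PromQL expression and a `for` duration that the condition must hold before the alert fires. To show that logic in isolation, the sketch below models it in plain Python. The `AlertRule` class, the 5% threshold, and the runbook URL are assumptions for illustration, and `for_evaluations` counts evaluation cycles rather than wall-clock time.

```python
from dataclasses import dataclass, field


@dataclass
class AlertRule:
    """Fire only after the condition holds for several consecutive evaluations."""

    name: str
    threshold: float      # maximum acceptable error ratio (a symptom users feel)
    for_evaluations: int  # consecutive breaches required before firing
    runbook_url: str      # where the responder finds the next steps
    _breaches: int = field(default=0, init=False, repr=False)

    def evaluate(self, error_ratio):
        """Update state from the latest error ratio and return the alert state."""
        if error_ratio > self.threshold:
            self._breaches += 1
        else:
            self._breaches = 0
        if self._breaches == 0:
            return "inactive"
        if self._breaches < self.for_evaluations:
            return "pending"
        return "firing"


rule = AlertRule(
    name="HighErrorRate",
    threshold=0.05,
    for_evaluations=3,
    runbook_url="https://example.com/runbooks/high-error-rate",  # placeholder URL
)
states = [rule.evaluate(ratio) for ratio in [0.01, 0.08, 0.09, 0.10, 0.02]]
print(states)
```

Requiring several consecutive breaches filters out brief spikes, which is one way to reduce alert fatigue. The cost is a delay before a real incident pages someone.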
Dashboard Queries#
# Request rate
sum(rate(http_requests_total[5m])) by (path)
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# P50, P95, P99 latency
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Active connections
active_connections
# Memory usage
process_resident_memory_bytes
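The latency queries above rely on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below approximates that calculation for classic histograms. The bucket values are made-up sample data, and the function is a simplified stand-in rather than Prometheus's implementation: it assumes at least one finite bucket and only handles 0 < q <= 1.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket, like a `_bucket` series.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up data: 100 requests, cumulative counts per latency bound in seconds.
sample_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p50 = histogram_quantile(0.50, sample_buckets)
p95 = histogram_quantile(0.95, sample_buckets)
p99 = histogram_quantile(0.99, sample_buckets)
print(p50, p95, p99)
```

With this data the 95th percentile lands in the 0.5 to 1.0 second bucket and is estimated as 0.5 + 0.5 × 5/9, roughly 0.78 seconds. The estimate can never be more precise than the bucket it falls in, so bucket boundaries should sit close to the latencies you care about.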
Best Practices#
Metrics:
✓ Use consistent naming conventions
✓ Add relevant labels
✓ Set histogram buckets that bracket your latency targets
✓ Monitor label cardinality
Logging:
✓ Use structured logging
✓ Include correlation IDs
✓ Log at appropriate levels
✓ Keep sensitive data out of logs
Tracing:
✓ Propagate context across services
✓ Add meaningful span names
✓ Include relevant attributes
✓ Sample appropriately
Alerting:
✓ Alert on symptoms, not causes
✓ Include runbooks
✓ Avoid alert fatigue
✓ Test alerts regularly
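One of the logging items above, keeping sensitive data out of logs, is easy to get wrong when events are built from request payloads. A possible safeguard is to redact known sensitive keys before an event reaches the logger. In the sketch below, the `redact` helper and the key names in `SENSITIVE_KEYS` are assumptions for illustration, and a real list would depend on the application.

```python
SENSITIVE_KEYS = {"password", "token", "authorization", "credit_card"}  # assumed names


def redact(event):
    """Return a copy of a log event with sensitive values masked.

    Nested dictionaries are handled recursively; other containers are left as-is.
    """
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


event = {
    "message": "login attempt",
    "user": "alice",
    "password": "hunter2",
    "headers": {"Authorization": "Bearer abc123", "accept": "application/json"},
}
safe_event = redact(event)
print(safe_event)
```

Matching on key names only catches fields that are labeled as sensitive. Secrets embedded in free-text messages would need a different approach.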
Conclusion#
Observability requires metrics, logs, and traces working together. Start with basic health checks and metrics, add structured logging, then implement tracing for complex systems. Good observability reduces mean time to resolution and improves system reliability.