Observability is the ability to understand a system's internal behavior from its external outputs. Metrics, logs, and traces are commonly called the three pillars of observability.
The Three Pillars
Metrics:
- Numerical measurements over time
- CPU, memory, request counts, latencies
- Aggregated and sampled at intervals, so cheap to store and query
Logs:
- Discrete events with context
- Errors, requests, business events
- Detailed but voluminous
Traces:
- Request flow across services
- Timing and dependencies
- End-to-end visibility
Metrics with Prometheus
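As a rough illustration of the data model behind this section, the sketch below keeps a request counter and a latency histogram in memory and renders them in the Prometheus text exposition format. It uses only the Python standard library; in a real service you would normally use an official client library instead. The metric names match the dashboard queries later in this article, while the `RequestMetrics` class, the bucket bounds, and the label values are assumptions made for the example.

```python
from bisect import bisect_left
from collections import defaultdict

# Bucket upper bounds in seconds (an assumption; tune to your latency profile).
BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0, 2.5)


class RequestMetrics:
    """Hypothetical in-memory store for one counter and one histogram."""

    def __init__(self):
        self.requests = defaultdict(int)                 # (path, status) -> count
        self.bucket_counts = [0] * (len(BUCKETS) + 1)    # last slot is the +Inf bucket
        self.duration_sum = 0.0

    def observe(self, path, status, seconds):
        self.requests[(path, status)] += 1
        # First bucket whose upper bound is >= the observed value.
        self.bucket_counts[bisect_left(BUCKETS, seconds)] += 1
        self.duration_sum += seconds

    def render(self):
        # Note: label values are not escaped here; a real exporter must do that.
        lines = ["# TYPE http_requests_total counter"]
        for (path, status), count in sorted(self.requests.items()):
            lines.append(
                f'http_requests_total{{path="{path}",status="{status}"}} {count}'
            )
        lines.append("# TYPE http_request_duration_seconds histogram")
        cumulative = 0
        bounds = [str(b) for b in BUCKETS] + ["+Inf"]
        for le, count in zip(bounds, self.bucket_counts):
            cumulative += count  # histogram buckets are cumulative
            lines.append(f'http_request_duration_seconds_bucket{{le="{le}"}} {cumulative}')
        lines.append(f"http_request_duration_seconds_sum {self.duration_sum}")
        lines.append(f"http_request_duration_seconds_count {cumulative}")
        return "\n".join(lines) + "\n"


metrics = RequestMetrics()
metrics.observe("/api", "200", 0.03)
metrics.observe("/api", "500", 0.7)
metrics.observe("/api", "200", 0.1)
print(metrics.render())
```

Serving the output of `render()` at a `/metrics` endpoint is the usual way to let Prometheus scrape it. Buckets are cumulative, which is why each `le` line includes every observation at or below that bound.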
Business Metrics
Structured Logging
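One way to get structured logs from Python's standard `logging` module is a formatter that emits one JSON object per record. The sketch below is a minimal version of that idea. The logger name `checkout`, the field names, and the `correlation_id` value are assumptions chosen for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Passed by callers through the `extra` argument; None if absent.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream):
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


buffer = io.StringIO()  # stands in for stdout or a log shipper
log = build_logger(buffer)
log.info("payment accepted", extra={"correlation_id": "req-123"})
print(buffer.getvalue())
```

Because every line is valid JSON, a log pipeline can filter on `level` or `correlation_id` without parsing free-form text.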
Distributed Tracing
Error Tracking
Health Checks
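A health endpoint usually runs a handful of dependency probes and reports an overall status. The sketch below shows one possible shape for that logic, independent of any web framework. The probe functions `check_database` and `check_cache` are hypothetical stand-ins, and returning 503 on failure is a common convention assumed here rather than a requirement.

```python
import json


def run_health_checks(checks):
    """Run each named check; a check passes if it returns without raising."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # report the failure instead of crashing the endpoint
            results[name] = f"failed: {exc}"
    healthy = all(status == "ok" for status in results.values())
    body = {"status": "healthy" if healthy else "unhealthy", "checks": results}
    return (200 if healthy else 503), json.dumps(body)


# Hypothetical dependency probes, stand-ins for real connectivity checks.
def check_database():
    pass  # e.g. run "SELECT 1" against the primary


def check_cache():
    raise ConnectionError("cache unreachable")


status_code, payload = run_health_checks(
    {"database": check_database, "cache": check_cache}
)
print(status_code, payload)
```

Reporting each dependency separately makes the response useful for debugging as well as for load balancer or orchestrator probes.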
Alerting Rules
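Alerting rules are normally written in the alerting system's own configuration language. To illustrate the underlying logic only, the Python sketch below models a threshold alert that must stay breached for a sustained period before it fires, loosely modeled on the pending and firing states of a Prometheus rule with a `for` duration. The `ThresholdAlert` class, the 5% error-ratio threshold, and the 300-second window are assumptions for the example.

```python
class ThresholdAlert:
    """Fire only after the value has exceeded the threshold continuously."""

    def __init__(self, threshold, for_seconds):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.breach_started = None  # timestamp of the first breaching evaluation

    def evaluate(self, value, now):
        if value <= self.threshold:
            # Condition cleared: reset, so a later breach starts a new window.
            self.breach_started = None
            return "inactive"
        if self.breach_started is None:
            self.breach_started = now
        if now - self.breach_started >= self.for_seconds:
            return "firing"
        return "pending"


# Error ratio above 5% for 5 minutes (hypothetical values).
alert = ThresholdAlert(threshold=0.05, for_seconds=300)
states = [
    alert.evaluate(0.10, now=0),
    alert.evaluate(0.12, now=300),
    alert.evaluate(0.01, now=360),
    alert.evaluate(0.09, now=420),
]
print(states)
```

Requiring a sustained breach filters out brief spikes, which helps with alert fatigue.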
Dashboard Queries
# Request rate
sum(rate(http_requests_total[5m])) by (path)
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
# P50, P95, P99 latency
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Active connections
active_connections
# Memory usage
process_resident_memory_bytes
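The latency queries above rely on `histogram_quantile`, which estimates a quantile from cumulative bucket counts rather than from raw observations. The Python sketch below is a simplified model of that estimate using linear interpolation inside the matching bucket. It is meant to build intuition and does not reproduce every edge case of the real function. The bucket values are made up for the example.

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0  # assume the first bucket starts at 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Quantile falls in the +Inf bucket: best guess is the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count


# Hypothetical cumulative counts for http_request_duration_seconds_bucket.
example = [(0.1, 50), (0.5, 90), (1.0, 99), (float("inf"), 100)]
p50 = histogram_quantile(0.50, example)
p95 = histogram_quantile(0.95, example)
p999 = histogram_quantile(0.999, example)
print(p50, p95, p999)
```

The estimate can never be more precise than the bucket boundaries allow, which is why bucket choice matters for percentile dashboards.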
Best Practices
Metrics:
✓ Use consistent naming conventions
✓ Add relevant labels
✓ Set appropriate buckets
✓ Monitor cardinality
Logging:
✓ Use structured logging
✓ Include correlation IDs
✓ Log at appropriate levels
✓ Don't log sensitive data
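As one way to keep sensitive data out of logs, fields can be scrubbed before they reach the logger. The sketch below redacts values by key name, including inside nested structures. The `SENSITIVE_KEYS` set and the `redact` helper are hypothetical and would need to match your own field names.

```python
# Hypothetical list of field names treated as sensitive (compared case-insensitively).
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number"}


def redact(value):
    """Return a copy of `value` with sensitive dictionary values masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


event = {
    "user": "ana",
    "Password": "hunter2",
    "payment": {"card_number": "4111111111111111", "amount": 42},
}
safe_event = redact(event)
print(safe_event)
```

Key-based redaction only catches known field names, so it complements, rather than replaces, reviewing what gets logged in the first place.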
Tracing:
✓ Propagate context across services
✓ Add meaningful span names
✓ Include relevant attributes
✓ Sample appropriately
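Sampling decisions work best when every service reaches the same verdict for the same trace. A minimal sketch of one approach, assuming a 32-character hex trace ID, is to derive the decision from the trace ID itself. The `should_sample` function is an invention for this example and is similar in spirit to ratio-based samplers, not a copy of any library's algorithm.

```python
def should_sample(trace_id_hex, ratio):
    """Deterministically keep roughly `ratio` of traces, based on the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Compare the low 64 bits of the trace ID against a proportional cutoff.
    cutoff = int(ratio * (1 << 64))
    return int(trace_id_hex[-16:], 16) < cutoff


low_id = "0" * 31 + "1"
high_id = "f" * 32
print(should_sample(low_id, 0.5), should_sample(high_id, 0.5))
```

Because the result depends only on the trace ID and the ratio, two services configured with the same ratio keep or drop the same traces, which avoids traces with missing spans.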
Alerting:
✓ Alert on symptoms, not causes
✓ Include runbooks
✓ Avoid alert fatigue
✓ Test alerts regularly
Conclusion
Observability works best when metrics, logs, and traces are used together: metrics show that something is wrong, traces show where, and logs show why. Start with basic health checks and metrics, add structured logging, then implement tracing once requests span multiple services. Good observability reduces mean time to resolution and improves system reliability.