
Monitoring and Alerting Strategies for Production Systems

Build effective monitoring that catches issues before users do. From metrics to logs to alerts that matter.

Bootspring Team
Engineering
September 28, 2024
5 min read

Good monitoring tells you when something's wrong before users complain. Great monitoring tells you why and helps you fix it fast.

The Three Pillars of Observability

- **Metrics**: numeric measurements over time (CPU usage, request count, error rate)
- **Logs**: discrete events with context (error messages, request details, audit trails)
- **Traces**: request flow across services (latency breakdown, service dependencies)

Metrics

Key Metrics (RED Method)

- **Rate**: requests per second
- **Errors**: error rate as a percentage of requests
- **Duration**: request latency (p50, p95, p99)

```
# Prometheus queries

# Rate
rate(http_requests_total[5m])

# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Latency p99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```

Infrastructure Metrics (USE Method)

- **Utilization**: percentage of time the resource is busy
- **Saturation**: queue depth, amount of waiting work
- **Errors**: error count

```
# CPU utilization (%)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory in use (bytes)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Disk I/O saturation
rate(node_disk_io_time_weighted_seconds_total[5m])
```

Custom Business Metrics

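As a sketch of what emitting business metrics can look like, here is a minimal in-process recorder built only on the standard library. The metric names (`orders_created_total`, `checkout_duration_seconds`) are illustrative, and a production service would use a real client library such as `prometheus_client` instead:

```python
from collections import defaultdict
from statistics import quantiles

# Minimal in-process metrics sketch (stdlib only); metric names are
# illustrative placeholders, not from a real service.
class Metrics:
    def __init__(self):
        self.counters = defaultdict(float)
        self.observations = defaultdict(list)

    def inc(self, name, value=1.0, **labels):
        # Counters only go up; labels become part of the series key
        self.counters[(name, tuple(sorted(labels.items())))] += value

    def observe(self, name, value):
        # Record a raw observation for later percentile queries
        self.observations[name].append(value)

    def p99(self, name):
        # 99th percentile of recorded observations (needs >= 2 samples)
        return quantiles(self.observations[name], n=100)[98]

metrics = Metrics()

# Business events emitted from application code
metrics.inc("orders_created_total", status="completed")
metrics.inc("revenue_usd_total", value=49.99)
metrics.observe("checkout_duration_seconds", 0.42)
```

The point is that business metrics (orders, revenue, signups) are emitted the same way as technical ones, so they can be graphed and alerted on with the same tooling.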

Logging

Structured Logging

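Here is one way to emit structured JSON logs with nothing but the standard library; the field names are illustrative, and libraries such as `structlog` or `python-json-logger` provide this with less ceremony:

```python
import json
import logging

# A sketch of a JSON log formatter using only the stdlib.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
            "timestamp": self.formatTime(record),
        }
        # Merge structured context passed via the `extra` argument
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Key-value context makes the log line searchable, not just readable
logger.info("order created",
            extra={"context": {"order_id": "ord_123", "amount": 49.99}})
```

Because every line is a JSON object, log aggregators can filter on fields like `order_id` instead of grepping free-form text.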

Log Levels

| Level | Usage |
|-------|-------|
| error | Something failed, needs attention |
| warn  | Degraded but still working |
| info  | Normal operations, business events |
| debug | Detailed diagnostic information |
| trace | Very verbose, development only |

Correlation IDs

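A sketch of correlation-ID propagation using Python's `contextvars`; in a real service the ID would be read from an incoming header (for example `X-Request-ID`) by middleware, and the function names here are hypothetical:

```python
import uuid
from contextvars import ContextVar

# Holds the current request's correlation ID for the duration of the request
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

def handle_request(incoming_id=None):
    # Reuse the caller's ID if one was sent, otherwise mint a new one
    token = correlation_id.set(incoming_id or str(uuid.uuid4()))
    try:
        return process_order()
    finally:
        correlation_id.reset(token)

def process_order():
    # Any log line deep in the call stack can tag itself with the same ID,
    # so all lines for one request can be grouped in the log aggregator
    return f"[{correlation_id.get()}] order processed"
```

Passing the same ID to downstream services (via an outgoing header) lets you stitch together one request's logs across the whole system.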

Alerting

Alert Design Principles

Good alerts are:

- ✓ Actionable: someone needs to do something
- ✓ Timely: they catch issues before users notice
- ✓ Accurate: low false-positive rate
- ✓ Clear: it is obvious what is wrong and how to fix it

Bad alerts:

- ✗ Fire on every error (alert fatigue)
- ✗ Have no runbook or context
- ✗ Report symptoms with no context pointing to causes
- ✗ Go to too many recipients

Alert Examples

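An alert for the error-rate case might look like the following Prometheus alerting rule. This is a sketch: the threshold, labels, and `runbook_url` are placeholders to adapt to your own setup:

```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        # Fraction of requests returning 5xx over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        # Must stay above the threshold for 5 minutes before firing
        for: 5m
        labels:
          severity: urgent
        annotations:
          summary: "Error rate above 5% for 5 minutes"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
```

The `for` clause suppresses brief spikes, and the `runbook_url` annotation gives the responder context the moment the alert arrives.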

Alerting Tiers

**Page (wake someone up):**

- Service completely down
- Data loss occurring
- Security incident
- Error rate > 50%

**Urgent (respond same day):**

- Error rate > 5%
- High latency
- Degraded functionality
- Disk space > 80% used

**Warning (review next business day):**

- Error rate > 1%
- Certificate expiring within 14 days
- Dependency deprecation
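These tiers can be wired to different notification channels. As one possible sketch, an Alertmanager routing tree keyed on a `severity` label (the receiver names are placeholders):

```yaml
route:
  receiver: default
  routes:
    - match:
        severity: page
      receiver: pagerduty        # wakes the on-call engineer
    - match:
        severity: urgent
      receiver: slack-oncall     # same-day response channel
    - match:
        severity: warning
      receiver: email-digest     # reviewed next business day
receivers:
  - name: default
  - name: pagerduty
  - name: slack-oncall
  - name: email-digest
```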

Dashboards

Dashboard Design

**Executive dashboard:**

- Business metrics (revenue, users, conversions)
- High-level health (green/yellow/red)
- Trends (week over week)

**Operations dashboard:**

- Request rate and latency
- Error rates by service
- Resource utilization
- Active alerts

**Service dashboard:**

- Detailed metrics for one service
- Health of its dependencies
- Recent deployments
- Log snippets

Grafana Dashboard Example

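Grafana dashboards are defined as JSON. The sketch below is heavily abbreviated (a real export includes many more fields, such as datasource references, layout, and IDs) and simply pairs panels with the RED queries from earlier:

```json
{
  "title": "Checkout Service",
  "panels": [
    {
      "type": "timeseries",
      "title": "Request rate",
      "targets": [
        { "expr": "sum(rate(http_requests_total[5m]))" }
      ]
    },
    {
      "type": "timeseries",
      "title": "p99 latency",
      "targets": [
        { "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))" }
      ]
    }
  ]
}
```

Keeping dashboards in version control alongside the service makes them reviewable and reproducible, just like code.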

Incident Response

Runbooks

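A runbook can be a short page linked directly from the alert. The template below is illustrative; the names, thresholds, and steps are placeholders:

```markdown
# Runbook: High Error Rate

**Alert:** HighErrorRate (error rate > 5% for 5 minutes)
**Severity:** urgent

## Diagnose
1. Open the service dashboard and identify which endpoints are failing
2. Check recent deployments; a new release is the most common cause
3. Check dependency health (database, cache, downstream services)

## Mitigate
- If a recent deploy correlates with the spike: roll back
- If a dependency is down: enable the fallback or shed load

## Escalate
- If unresolved after 15 minutes, page the service owner
```

Writing the runbook while the incident is fresh, as part of the postmortem, keeps it accurate for the next responder.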

Conclusion

Effective monitoring starts with the right metrics and logs, continues with meaningful alerts, and culminates in fast incident response.

Start with RED metrics for services, USE metrics for infrastructure, and business metrics for outcomes. Alert on symptoms but provide context to find causes. Build runbooks as you resolve incidents.

Remember: the goal is to detect and fix issues before users notice them.
