When something goes wrong in production, observability is the difference between quick resolution and hours of guessing. Good logging, tracing, and metrics let you understand system behavior, debug issues, and prevent problems before users notice.
The Three Pillars of Observability
Logs
Discrete events with context. "User 123 placed order 456 at 14:32:05"
Metrics
Aggregated measurements over time. "Orders per minute: 150"
Traces
Request flow across services. "Request X took 500ms: API (50ms) → DB (300ms) → Cache (150ms)"
Structured Logging
Why Structure Matters
Structured logs are:
- Searchable: userId:123 AND level:error
- Aggregatable: "Count errors by productId"
- Parseable: Machines can process them
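To make these three properties concrete, here is a minimal sketch in Python. The field names (userId, productId, level) follow the examples in the list; the log lines themselves are made-up sample data, not output from a real system.

```python
import json
from collections import Counter

# Hypothetical sample log lines, one JSON object per line.
RAW_LINES = [
    '{"level": "error", "userId": 123, "productId": "p-1", "message": "payment failed"}',
    '{"level": "info", "userId": 123, "productId": "p-2", "message": "order placed"}',
    '{"level": "error", "userId": 456, "productId": "p-1", "message": "out of stock"}',
]

# Parseable: each line decodes into a dict, no regular expressions needed.
events = [json.loads(line) for line in RAW_LINES]

# Searchable: the equivalent of `userId:123 AND level:error`.
matches = [e for e in events if e["userId"] == 123 and e["level"] == "error"]

# Aggregatable: "Count errors by productId".
errors_by_product = Counter(e["productId"] for e in events if e["level"] == "error")

print(matches)
print(dict(errors_by_product))
```

With free-text log lines, each of these steps would need a custom parser that breaks whenever the message wording changes.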
Logger Configuration
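One possible configuration, sketched with Python's standard logging module. The logger name "order-api" is borrowed from the query examples later in this article; the JsonFormatter class, the configure_logger helper, and the convention of passing fields through extra={"fields": ...} are assumptions made for this illustration, not a prescribed setup.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the event.
        payload.update(getattr(record, "fields", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def configure_logger(name: str = "order-api", level: int = logging.INFO) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()      # avoid duplicate handlers when called twice
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False     # keep the root logger from printing a second copy
    return logger


logger = configure_logger()
logger.info("order placed", extra={"fields": {"userId": 123, "orderId": 456}})
```

Writing one JSON object per line to stdout keeps the application unaware of where logs end up; a collector can pick them up from there.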
Log Levels
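A short sketch of how levels might be used and selected per environment. The descriptions in LEVEL_GUIDE are one common convention rather than a rule, and the LOG_LEVEL environment variable and resolve_level helper are hypothetical names chosen for this example.

```python
import logging
import os

# Assumed convention for what each level is for; adjust to your team's needs.
LEVEL_GUIDE = {
    "DEBUG": "detailed diagnostics, usually off in production",
    "INFO": "normal business events, e.g. an order was placed",
    "WARNING": "unexpected but handled, e.g. a retry succeeded",
    "ERROR": "an operation failed and needs attention",
    "CRITICAL": "the service cannot continue",
}


def resolve_level(default: str = "INFO") -> int:
    """Read the threshold from the (hypothetical) LOG_LEVEL environment variable."""
    name = os.environ.get("LOG_LEVEL", default).upper()
    if name not in LEVEL_GUIDE:
        name = default           # fall back rather than crash on a typo
    return getattr(logging, name)


logger = logging.getLogger("order-api.levels-demo")
logger.setLevel(resolve_level())
logger.debug("cart contents: %s", ["sku-1"])   # dropped unless LOG_LEVEL=DEBUG
logger.error("payment provider timed out")     # emitted at every threshold up to ERROR
```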
Contextual Logging
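The idea is to attach request-scoped fields (request ID, user ID) once and have every log line carry them automatically. The sketch below does this with contextvars and a logging.Filter from the standard library; the names request_context, ContextFilter, build_logger, and handle_request are assumptions for illustration.

```python
import contextvars
import io
import logging

# Per-request context; contextvars keeps values separate across threads and asyncio tasks.
request_context = contextvars.ContextVar("request_context", default=None)


class ContextFilter(logging.Filter):
    """Copy the current request context onto every record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = request_context.get() or {}
        record.requestId = ctx.get("requestId", "-")
        record.userId = ctx.get("userId", "-")
        return True              # never drop the record, only enrich it


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("order-api.context-demo")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "level=%(levelname)s requestId=%(requestId)s userId=%(userId)s msg=%(message)s"))
    handler.addFilter(ContextFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, request_id, user_id):
    token = request_context.set({"requestId": request_id, "userId": user_id})
    try:
        logger.info("validating cart")   # no IDs passed explicitly
        logger.info("order placed")
    finally:
        request_context.reset(token)     # do not leak context into the next request


buffer = io.StringIO()
demo_logger = build_logger(buffer)
handle_request(demo_logger, "req-1", 123)
print(buffer.getvalue(), end="")
```

Business code stays free of logging plumbing: functions deep in the call stack log a plain message and still produce lines that can be filtered by request or user.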
Distributed Tracing
OpenTelemetry Setup
Manual Spans
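To show what a manual span records, here is a deliberately tiny in-memory tracer written with the standard library only. It is not the OpenTelemetry API; ToyTracer, Span, and the span names are hypothetical stand-ins that mirror the concepts (a shared trace ID, parent and child spans, attributes, durations) a real SDK provides.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent_name: Optional[str] = None
    attributes: dict = field(default_factory=dict)
    duration_ms: float = 0.0


class ToyTracer:
    """Records nested spans in memory. Single-threaded sketch, not production code."""

    def __init__(self):
        self.finished = []   # spans in the order they ended
        self._stack = []     # currently open spans, innermost last

    @contextmanager
    def start_span(self, name, **attributes):
        parent = self._stack[-1] if self._stack else None
        trace_id = parent.trace_id if parent else uuid.uuid4().hex
        span = Span(name, trace_id, parent.name if parent else None, dict(attributes))
        self._stack.append(span)
        start = time.perf_counter()
        try:
            yield span
        except Exception as exc:
            span.attributes["error"] = repr(exc)   # mark the span as failed
            raise
        finally:
            span.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.finished.append(span)


tracer = ToyTracer()
with tracer.start_span("POST /orders", userId=123) as root:
    with tracer.start_span("db.insert_order"):
        time.sleep(0.01)
    with tracer.start_span("cache.invalidate"):
        time.sleep(0.005)

for s in tracer.finished:
    print(f"{s.trace_id[:8]} {s.name} parent={s.parent_name} {s.duration_ms:.1f}ms")
```

The printed breakdown has the same shape as the "API → DB → Cache" example from the Traces definition: one trace ID, with each child span accounting for part of the parent's duration.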
Metrics
Key Metrics Types
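The three metric types most systems expose are counters, gauges, and histograms. The classes below are a simplified sketch of their semantics, not the API of any particular metrics library; the bucket boundaries are arbitrary example values.

```python
import bisect


class Counter:
    """Monotonically increasing value, e.g. total requests served."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


class Gauge:
    """Value that can go up and down, e.g. requests currently in flight."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Sorts observations into buckets, e.g. request latency in seconds."""

    def __init__(self, bounds=(0.05, 0.1, 0.5, 1.0)):
        self.bounds = sorted(bounds)
        self.bucket_counts = [0] * (len(self.bounds) + 1)   # last slot: above all bounds
        self.count = 0
        self.sum = 0.0

    def observe(self, value):
        # Index of the first bound that is >= value (upper-inclusive buckets).
        idx = bisect.bisect_left(self.bounds, value)
        self.bucket_counts[idx] += 1
        self.count += 1
        self.sum += value


requests_total = Counter()
in_flight = Gauge()
latency = Histogram()

for duration in (0.03, 0.07, 0.4, 2.0):
    in_flight.inc()
    requests_total.inc()
    latency.observe(duration)
    in_flight.dec()

print(requests_total.value, in_flight.value, latency.bucket_counts)
```

Use a counter when only the rate of change matters, a gauge for a current level, and a histogram when you need percentiles such as P95 latency.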
Request Metrics Middleware
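A middleware is a natural place to record request count and duration once, for every endpoint. The following is a minimal WSGI sketch using only the standard library; MetricsMiddleware and demo_app are hypothetical names, and storing raw durations in a list is a simplification of what a histogram would do.

```python
import time
from collections import defaultdict
from wsgiref.util import setup_testing_defaults


class MetricsMiddleware:
    """Counts requests and records durations per (method, path, status)."""

    def __init__(self, app):
        self.app = app
        self.request_count = defaultdict(int)
        self.duration_seconds = defaultdict(list)

    def __call__(self, environ, start_response):
        start = time.perf_counter()
        captured = {"status": "500"}   # assume failure until the app reports a status

        def recording_start_response(status, headers, exc_info=None):
            captured["status"] = status.split(" ", 1)[0]
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, recording_start_response)
        finally:
            key = (environ["REQUEST_METHOD"], environ.get("PATH_INFO", ""), captured["status"])
            self.request_count[key] += 1
            self.duration_seconds[key].append(time.perf_counter() - start)


def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


app = MetricsMiddleware(demo_app)
environ = {}
setup_testing_defaults(environ)        # fills in a minimal GET request
environ["PATH_INFO"] = "/products"
body = app(environ, lambda status, headers, exc_info=None: None)
print(dict(app.request_count))
```

Two caveats for real use: the timer stops when the app returns its response iterable, not when the body has been fully sent, and raw paths such as /products/42 should be replaced with route templates to keep the number of label combinations bounded.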
Business Metrics
Alerting Strategy
Alert Definition
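An alert definition usually combines a condition, a duration the condition must hold, and a severity. The sketch below models that in plain Python; AlertRule, is_firing, and the 1% threshold are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold: float      # fire when the metric exceeds this value...
    for_samples: int      # ...for this many consecutive samples
    severity: str

    def __post_init__(self):
        if self.for_samples < 1:
            raise ValueError("for_samples must be at least 1")


def is_firing(rule: AlertRule, samples: list) -> bool:
    """True if the most recent `for_samples` values all exceed the threshold."""
    if len(samples) < rule.for_samples:
        return False      # not enough data yet
    recent = samples[-rule.for_samples:]
    return all(value > rule.threshold for value in recent)


# Hypothetical rule: page someone when the error rate stays above 1% for 5 samples.
high_error_rate = AlertRule("HighErrorRate", threshold=0.01, for_samples=5, severity="page")

print(is_firing(high_error_rate, [0.02, 0.03, 0.02, 0.04, 0.02]))    # sustained breach
print(is_firing(high_error_rate, [0.02, 0.03, 0.005, 0.04, 0.02]))   # brief dip resets it
```

Requiring the condition to hold for several samples trades a slightly slower alert for far fewer pages caused by momentary spikes.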
Runbook Links
Log Aggregation
Shipping Logs
Query Patterns
# Find errors for a specific user
service:order-api AND level:error AND userId:123
# Find slow requests
service:order-api AND duration:>1000
# Trace a request across services
traceId:abc123
# Error patterns in last hour
service:* AND level:error | stats count by message | sort -count | head 10
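The exact query syntax depends on your log platform, but the semantics are simple. As a rough illustration, here is what the trace lookup and the error-pattern pipeline above compute, expressed in Python over already-parsed events. The sample events and the helper names trace_timeline and top_error_messages are made up for this sketch.

```python
from collections import Counter

# Hypothetical events, already parsed from JSON log lines.
EVENTS = [
    {"timestamp": 3, "service": "payment-api", "level": "error",
     "traceId": "abc123", "message": "payment timeout"},
    {"timestamp": 1, "service": "order-api", "level": "info",
     "traceId": "abc123", "message": "order received"},
    {"timestamp": 2, "service": "order-api", "level": "error",
     "traceId": "def456", "message": "payment timeout"},
    {"timestamp": 4, "service": "order-api", "level": "error",
     "traceId": "def456", "message": "product not found"},
]


def trace_timeline(events, trace_id):
    """Like `traceId:<id>`: every event for one request, oldest first."""
    selected = (e for e in events if e.get("traceId") == trace_id)
    return sorted(selected, key=lambda e: e["timestamp"])


def top_error_messages(events, limit=10):
    """Like `level:error | stats count by message | sort -count | head <limit>`."""
    counts = Counter(e["message"] for e in events if e.get("level") == "error")
    return counts.most_common(limit)


timeline = trace_timeline(EVENTS, "abc123")
top_errors = top_error_messages(EVENTS)

print([(e["service"], e["message"]) for e in timeline])
print(top_errors)
```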
Dashboard Design
Key Dashboards
Request to AI:
Design observability dashboards for an e-commerce API:
Dashboards needed:
1. Overview (health at a glance)
2. Request performance
3. Business metrics
4. Infrastructure
5. Errors and debugging
For each dashboard:
- Key metrics to display
- Visualization types
- Time ranges
- Alert thresholds
Example Overview Dashboard
┌─────────────────────────────────────────────────────────────┐
│  Request Rate        Error Rate          P95 Latency        │
│  [ 1,234 req/s ]     [ 0.3% ]            [ 145ms ]          │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│  Requests per Second (last 6 hours)                         │
│  ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁                │
└─────────────────────────────────────────────────────────────┘
┌──────────────────────────┬──────────────────────────────────┐
│  Top Endpoints           │  Recent Errors                   │
│  GET /products     45%   │  • PaymentError: timeout         │
│  GET /users        25%   │  • ValidationError: email        │
│  POST /orders      15%   │  • NotFoundError: product        │
│  Other             15%   │                                  │
└──────────────────────────┴──────────────────────────────────┘
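The three headline numbers on this dashboard can be derived from raw request records. The sketch below shows one way to compute them; treating status codes of 500 and above as errors and using a nearest-rank P95 are assumptions of this example, and the sample data is synthetic.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("p95 of an empty list is undefined")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # integer ceil(0.95 * n)
    return ordered[rank - 1]


def overview(samples, window_seconds):
    """Request rate, error rate, and P95 latency for one time window."""
    total = len(samples)
    errors = sum(1 for r in samples if r["status"] >= 500)   # assumed error definition
    return {
        "request_rate": total / window_seconds,
        "error_rate": errors / total if total else 0.0,
        "p95_latency_ms": p95([r["duration_ms"] for r in samples]) if total else None,
    }


# Synthetic window: 20 requests in 10 seconds, latencies 10ms..200ms, one failure.
samples = [{"status": 200, "duration_ms": 10 * i} for i in range(1, 21)]
samples[4]["status"] = 503

stats = overview(samples, window_seconds=10)
print(stats)
```

In practice a metrics backend computes these from counters and histograms rather than raw records, but the definitions behind the dashboard tiles stay the same.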
Conclusion
Observability isn't a feature—it's a capability that enables everything else. Without visibility into system behavior, you're flying blind.
Start with structured logging. Add metrics for key operations. Implement tracing for distributed systems. Build dashboards that answer questions before they're asked. Set up alerts that catch problems before users do.
AI can help you implement these patterns, from logger configuration to alert thresholds. The investment in observability pays dividends every time you need to debug a production issue, which is always sooner than you expect.