When something goes wrong in production, observability is the difference between quick resolution and hours of guessing. Good logging, tracing, and metrics let you understand system behavior, debug issues, and prevent problems before users notice.
The Three Pillars of Observability#
Logs#
Discrete events with context. "User 123 placed order 456 at 14:32:05"
Metrics#
Aggregated measurements over time. "Orders per minute: 150"
Traces#
Request flow across services. "Request X took 500ms: API (50ms) → DB (300ms) → Cache (150ms)"
Structured Logging#
Why Structure Matters#
```typescript
// ❌ Unstructured logging
console.log('User ' + userId + ' ordered ' + productId + ' for $' + price);
// Output: "User 123 ordered abc for $50"

// ✅ Structured logging
logger.info('Order placed', {
  userId: '123',
  productId: 'abc',
  price: 50,
  currency: 'USD'
});
// Output: {"level":"info","message":"Order placed","userId":"123","productId":"abc","price":50,"currency":"USD","timestamp":"2024-01-15T14:32:05Z"}
```

Structured logs are:
- Searchable: `userId:123 AND level:error`
- Aggregatable: "Count errors by productId"
- Parseable: machines can process them
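The searchable and aggregatable properties can be demonstrated in a few lines. This is a sketch that queries newline-delimited JSON logs in memory; in practice a log backend (Elasticsearch, Loki, etc.) does this at scale, and the function names here are illustrative.

```typescript
// Sketch: querying newline-delimited JSON logs in memory.
type LogLine = { level: string; message: string; [key: string]: unknown };

function parseLogs(ndjson: string): LogLine[] {
  return ndjson.trim().split('\n').map((line) => JSON.parse(line));
}

// Equivalent of "userId:123 AND level:error"
function search(logs: LogLine[], query: Partial<LogLine>): LogLine[] {
  return logs.filter((log) =>
    Object.entries(query).every(([key, value]) => log[key] === value)
  );
}

// Equivalent of "Count errors by productId"
function countBy(logs: LogLine[], field: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const log of logs) {
    const key = String(log[field] ?? 'unknown');
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}
```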
Logger Configuration#
```typescript
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
    bindings: (bindings) => ({
      pid: bindings.pid,
      host: bindings.hostname,
      service: 'my-api',
      version: process.env.APP_VERSION
    })
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  redact: {
    paths: ['password', 'token', 'authorization', '*.password', '*.token'],
    censor: '[REDACTED]'
  }
});

export default logger;
```

Log Levels#
```typescript
// TRACE: Very detailed debugging (rarely enabled in production)
logger.trace({ query, params }, 'Executing SQL query');

// DEBUG: Detailed information useful for debugging
logger.debug({ userId, sessionId }, 'User session validated');

// INFO: Normal operational events
logger.info({ orderId, total }, 'Order completed successfully');

// WARN: Unexpected but handled situations
logger.warn({ userId, attempts }, 'Rate limit approaching for user');

// ERROR: Errors that need attention
logger.error({ err, orderId }, 'Payment processing failed');

// FATAL: System is unusable
logger.fatal({ err }, 'Database connection lost, shutting down');
```

Contextual Logging#
```typescript
import { AsyncLocalStorage } from 'async_hooks';

interface LogContext {
  requestId: string;
  userId?: string;
  sessionId?: string;
  traceId?: string;
}

const asyncLocalStorage = new AsyncLocalStorage<LogContext>();

// Middleware to set request context
app.use((req, res, next) => {
  const context: LogContext = {
    requestId: req.headers['x-request-id'] || generateId(),
    userId: req.user?.id,
    sessionId: req.session?.id,
    traceId: req.headers['x-trace-id']
  };

  asyncLocalStorage.run(context, () => next());
});

// Logger that includes context
function getLogger() {
  const context = asyncLocalStorage.getStore() || {};
  return logger.child(context);
}

// Usage anywhere in request handling
const log = getLogger();
log.info('Processing order'); // Automatically includes requestId, userId, etc.
```

Distributed Tracing#
OpenTelemetry Setup#
```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT
  }),
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start();
```

Manual Spans#
```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total', order.total);

      // Child span for payment (end in finally so the span closes even on error)
      await tracer.startActiveSpan('processPayment', async (paymentSpan) => {
        try {
          paymentSpan.setAttribute('payment.method', order.paymentMethod);
          await paymentService.charge(order);
        } finally {
          paymentSpan.end();
        }
      });

      // Child span for inventory
      await tracer.startActiveSpan('updateInventory', async (inventorySpan) => {
        try {
          await inventoryService.reserve(order.items);
        } finally {
          inventorySpan.end();
        }
      });

      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}
```

Metrics#
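Before reaching for a metrics library, it helps to see the three metric types as plain semantics. This sketch is deliberately simplified and is not prom-client's API: a counter only goes up, a gauge moves freely, and a histogram sorts observations into cumulative-style buckets.

```typescript
// Simplified metric semantics -- the concepts, not prom-client's API.
class SimpleCounter {
  private value = 0;
  inc(by = 1): void {
    if (by < 0) throw new Error('counters only go up');
    this.value += by;
  }
  get(): number { return this.value; }
}

class SimpleGauge {
  private value = 0;
  set(v: number): void { this.value = v; } // can go up or down
  get(): number { return this.value; }
}

class SimpleHistogram {
  readonly counts: number[];
  constructor(private buckets: number[]) {
    this.counts = new Array(buckets.length + 1).fill(0); // last slot = +Inf
  }
  observe(v: number): void {
    const i = this.buckets.findIndex((b) => v <= b);
    this.counts[i === -1 ? this.buckets.length : i] += 1;
  }
}
```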
Key Metrics Types#
```typescript
import { Counter, Histogram, Gauge, Registry } from 'prom-client';

const registry = new Registry();

// Counter: Things that only go up
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [registry]
});

// Histogram: Distribution of values
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [registry]
});

// Gauge: Value that can go up or down
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
  registers: [registry]
});
```

Request Metrics Middleware#
```typescript
app.use((req, res, next) => {
  const start = process.hrtime.bigint();

  res.on('finish', () => {
    const duration = Number(process.hrtime.bigint() - start) / 1e9;

    httpRequestsTotal.inc({
      method: req.method,
      path: req.route?.path || 'unknown',
      status: res.statusCode
    });

    httpRequestDuration.observe(
      { method: req.method, path: req.route?.path || 'unknown' },
      duration
    );
  });

  next();
});
```

Business Metrics#
```typescript
// Track business-relevant events (registered on the same registry as above)
const ordersCreated = new Counter({
  name: 'orders_created_total',
  help: 'Total orders created',
  labelNames: ['payment_method', 'country'],
  registers: [registry]
});

const orderValue = new Histogram({
  name: 'order_value_dollars',
  help: 'Order value in dollars',
  buckets: [10, 50, 100, 500, 1000, 5000],
  registers: [registry]
});

async function createOrder(order: Order) {
  // ... create order logic

  ordersCreated.inc({
    payment_method: order.paymentMethod,
    country: order.shippingCountry
  });

  orderValue.observe(order.total);
}
```

Alerting Strategy#
Alert Definition#
```yaml
# Prometheus alerting rules
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: SlowResponses
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
          > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow response times"
          description: "P95 latency is {{ $value }}s"

      - alert: LowOrderRate
        expr: rate(orders_created_total[1h]) < 10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Order rate is unusually low"
```

Runbook Links#
```yaml
- alert: DatabaseConnectionsExhausted
  expr: pg_stat_activity_count >= pg_settings_max_connections * 0.9
  annotations:
    runbook: "https://runbooks.company.com/database/connection-exhaustion"
```

Log Aggregation#
Shipping Logs#
```typescript
// Using Winston with multiple transports
import winston from 'winston';

const logger = winston.createLogger({
  transports: [
    // Console for development
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(),
        winston.format.simple()
      )
    }),

    // File for local debugging
    new winston.transports.File({
      filename: 'logs/error.log',
      level: 'error'
    }),

    // HTTP transport for log aggregation service
    new winston.transports.Http({
      host: 'logs.company.com',
      path: '/ingest',
      ssl: true
    })
  ]
});
```

Query Patterns#
```
# Find errors for a specific user
service:order-api AND level:error AND userId:123

# Find slow requests
service:order-api AND duration:>1000

# Trace a request across services
traceId:abc123

# Error patterns in last hour
service:* AND level:error | stats count by message | sort -count | head 10
```
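The last pattern (count by message, sort descending, take the top 10) is a classic aggregation pipeline. For intuition, here is the same pipeline as a small function over in-memory log objects; the function name is illustrative.

```typescript
// "stats count by message | sort -count | head 10" as code:
// group by message, sort by count descending, keep the top N.
function topMessages(logs: { message: string }[], limit = 10): [string, number][] {
  const counts = new Map<string, number>();
  for (const log of logs) {
    counts.set(log.message, (counts.get(log.message) ?? 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit);
}
```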
Dashboard Design#
Key Dashboards#
Request to AI:

```
Design observability dashboards for an e-commerce API:

Dashboards needed:
1. Overview (health at a glance)
2. Request performance
3. Business metrics
4. Infrastructure
5. Errors and debugging

For each dashboard:
- Key metrics to display
- Visualization types
- Time ranges
- Alert thresholds
```
Example Overview Dashboard#
```
┌─────────────────────────────────────────────────────────────┐
│ Request Rate        Error Rate        P95 Latency           │
│ [ 1,234 req/s ]     [ 0.3% ]          [ 145ms ]             │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Requests per Second (last 6 hours)                          │
│ ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁                │
└─────────────────────────────────────────────────────────────┘
┌──────────────────────────┬──────────────────────────────────┐
│ Top Endpoints            │ Recent Errors                    │
│ GET /products     45%    │ • PaymentError: timeout          │
│ GET /users        25%    │ • ValidationError: email         │
│ POST /orders      15%    │ • NotFoundError: product         │
│ Other             15%    │                                  │
└──────────────────────────┴──────────────────────────────────┘
```
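The headline numbers on an overview dashboard like this come from simple aggregations over raw request data. Here is a sketch of those aggregations; the record shape is hypothetical, and a real dashboard would query them from the metrics backend rather than compute them in the app.

```typescript
// Headline stats for an overview dashboard, from raw request records.
interface RequestRecord { durationMs: number; status: number }

// Nearest-rank P95: the value below which 95% of observations fall.
function p95(durations: number[]): number {
  const sorted = [...durations].sort((a, b) => a - b);
  const idx = Math.ceil(0.95 * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

function overviewStats(records: RequestRecord[], windowSeconds: number) {
  const errors = records.filter((r) => r.status >= 500).length;
  return {
    requestRate: records.length / windowSeconds,              // req/s
    errorRate: records.length ? errors / records.length : 0,  // fraction
    p95LatencyMs: p95(records.map((r) => r.durationMs)),
  };
}
```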
Conclusion#
Observability isn't a feature—it's a capability that enables everything else. Without visibility into system behavior, you're flying blind.
Start with structured logging. Add metrics for key operations. Implement tracing for distributed systems. Build dashboards that answer questions before they're asked. Set up alerts that catch problems before users do.
AI helps implement these patterns correctly, from logger configuration to alert thresholds. The investment in observability pays dividends every time you need to debug production issues—which is always sooner than you expect.