Logging · Observability · Monitoring · DevOps

Logging and Observability Best Practices

Build observable systems with structured logging, distributed tracing, and metrics. Know what's happening in production at all times.

Bootspring Team
Engineering
August 18, 2025
6 min read

When something goes wrong in production, observability is the difference between quick resolution and hours of guessing. Good logging, tracing, and metrics let you understand system behavior, debug issues, and prevent problems before users notice.

The Three Pillars of Observability#

Logs#

Discrete events with context. "User 123 placed order 456 at 14:32:05"

Metrics#

Aggregated measurements over time. "Orders per minute: 150"

Traces#

Request flow across services. "Request X took 500ms: API (50ms) → DB (300ms) → Cache (150ms)"
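As a rough conceptual sketch (plain objects only — a real system would use a logging library, a metrics client, and a tracing SDK), the same order event surfaces differently through each pillar:

```typescript
// Conceptual sketch only: how one order shows up in each pillar.

// Log: one discrete event with full context
const logEvent = {
  level: 'info',
  message: 'Order placed',
  userId: '123',
  orderId: '456',
  timestamp: '2025-08-18T14:32:05Z',
};

// Metric: an aggregated measurement, no per-event detail
const ordersPerMinute = { name: 'orders_per_minute', value: 150 };

// Trace: one request broken into timed spans across services
const requestTrace = {
  traceId: 'X',
  totalMs: 500,
  spans: [
    { name: 'API', durationMs: 50 },
    { name: 'DB', durationMs: 300 },
    { name: 'Cache', durationMs: 150 },
  ],
};

// The span durations account for the full request time
const spanTotal = requestTrace.spans.reduce((sum, s) => sum + s.durationMs, 0);
console.log(spanTotal === requestTrace.totalMs); // true
```

Each pillar trades detail for scale: logs carry the most context, metrics the least, and traces tie individual requests together across service boundaries.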

Structured Logging#

Why Structure Matters#

```typescript
// ❌ Unstructured logging
console.log('User ' + userId + ' ordered ' + productId + ' for $' + price);
// Output: "User 123 ordered abc for $50"

// ✅ Structured logging
logger.info('Order placed', {
  userId: '123',
  productId: 'abc',
  price: 50,
  currency: 'USD'
});
// Output: {"level":"info","message":"Order placed","userId":"123","productId":"abc","price":50,"currency":"USD","timestamp":"2024-01-15T14:32:05Z"}
```

Structured logs are:

  • Searchable: userId:123 AND level:error
  • Aggregatable: "Count errors by productId"
  • Parseable: Machines can process them

Logger Configuration#

```typescript
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
    bindings: (bindings) => ({
      pid: bindings.pid,
      host: bindings.hostname,
      service: 'my-api',
      version: process.env.APP_VERSION
    })
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  redact: {
    paths: ['password', 'token', 'authorization', '*.password', '*.token'],
    censor: '[REDACTED]'
  }
});

export default logger;
```

Log Levels#

```typescript
// TRACE: Very detailed debugging (rarely enabled in production)
logger.trace({ query, params }, 'Executing SQL query');

// DEBUG: Detailed information useful for debugging
logger.debug({ userId, sessionId }, 'User session validated');

// INFO: Normal operational events
logger.info({ orderId, total }, 'Order completed successfully');

// WARN: Unexpected but handled situations
logger.warn({ userId, attempts }, 'Rate limit approaching for user');

// ERROR: Errors that need attention
logger.error({ err, orderId }, 'Payment processing failed');

// FATAL: System is unusable
logger.fatal({ err }, 'Database connection lost, shutting down');
```

Contextual Logging#

```typescript
import { AsyncLocalStorage } from 'async_hooks';

interface LogContext {
  requestId: string;
  userId?: string;
  sessionId?: string;
  traceId?: string;
}

const asyncLocalStorage = new AsyncLocalStorage<LogContext>();

// Middleware to set request context
app.use((req, res, next) => {
  const context: LogContext = {
    requestId: req.headers['x-request-id'] || generateId(),
    userId: req.user?.id,
    sessionId: req.session?.id,
    traceId: req.headers['x-trace-id']
  };

  asyncLocalStorage.run(context, () => next());
});

// Logger that includes context
function getLogger() {
  const context = asyncLocalStorage.getStore() || {};
  return logger.child(context);
}

// Usage anywhere in request handling
const log = getLogger();
log.info('Processing order'); // Automatically includes requestId, userId, etc.
```

Distributed Tracing#

OpenTelemetry Setup#

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT
  }),
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start();
```

Manual Spans#

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total', order.total);

      // Child span for payment; end it even if the charge throws
      await tracer.startActiveSpan('processPayment', async (paymentSpan) => {
        try {
          paymentSpan.setAttribute('payment.method', order.paymentMethod);
          await paymentService.charge(order);
        } finally {
          paymentSpan.end();
        }
      });

      // Child span for inventory
      await tracer.startActiveSpan('updateInventory', async (inventorySpan) => {
        try {
          await inventoryService.reserve(order.items);
        } finally {
          inventorySpan.end();
        }
      });

      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error.message
      });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}
```

Metrics#

Key Metrics Types#

```typescript
import { Counter, Histogram, Gauge, Registry } from 'prom-client';

const registry = new Registry();

// Counter: Things that only go up
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'],
  registers: [registry]
});

// Histogram: Distribution of values
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
  registers: [registry]
});

// Gauge: Value that can go up or down
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
  registers: [registry]
});
```

Request Metrics Middleware#

```typescript
app.use((req, res, next) => {
  const start = process.hrtime.bigint();

  res.on('finish', () => {
    const duration = Number(process.hrtime.bigint() - start) / 1e9;

    httpRequestsTotal.inc({
      method: req.method,
      path: req.route?.path || 'unknown',
      status: res.statusCode
    });

    httpRequestDuration.observe(
      { method: req.method, path: req.route?.path || 'unknown' },
      duration
    );
  });

  next();
});
```

Business Metrics#

```typescript
// Track business-relevant events
const ordersCreated = new Counter({
  name: 'orders_created_total',
  help: 'Total orders created',
  labelNames: ['payment_method', 'country']
});

const orderValue = new Histogram({
  name: 'order_value_dollars',
  help: 'Order value in dollars',
  buckets: [10, 50, 100, 500, 1000, 5000]
});

async function createOrder(order: Order) {
  // ... create order logic

  ordersCreated.inc({
    payment_method: order.paymentMethod,
    country: order.shippingCountry
  });

  orderValue.observe(order.total);
}
```

Alerting Strategy#

Alert Definition#

```yaml
# Prometheus alerting rules
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: SlowResponses
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
          > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow response times"
          description: "P95 latency is {{ $value }}s"

      - alert: LowOrderRate
        expr: rate(orders_created_total[1h]) < 10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Order rate is unusually low"

      - alert: DatabaseConnectionsExhausted
        expr: pg_stat_activity_count >= pg_settings_max_connections * 0.9
        annotations:
          runbook: "https://runbooks.company.com/database/connection-exhaustion"
```

Log Aggregation#

Shipping Logs#

```typescript
// Using Winston with multiple transports
import winston from 'winston';

const logger = winston.createLogger({
  transports: [
    // Console for development
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(),
        winston.format.simple()
      )
    }),

    // File for local debugging
    new winston.transports.File({
      filename: 'logs/error.log',
      level: 'error'
    }),

    // HTTP transport for log aggregation service
    new winston.transports.Http({
      host: 'logs.company.com',
      path: '/ingest',
      ssl: true
    })
  ]
});
```

Query Patterns#

```
# Find errors for a specific user
service:order-api AND level:error AND userId:123

# Find slow requests
service:order-api AND duration:>1000

# Trace a request across services
traceId:abc123

# Error patterns in last hour
service:* AND level:error | stats count by message | sort -count | head 10
```

Dashboard Design#

Key Dashboards#

Request to AI:

```
Design observability dashboards for an e-commerce API:

Dashboards needed:
1. Overview (health at a glance)
2. Request performance
3. Business metrics
4. Infrastructure
5. Errors and debugging

For each dashboard:
- Key metrics to display
- Visualization types
- Time ranges
- Alert thresholds
```

Example Overview Dashboard#

```
┌─────────────────────────────────────────────────────────────┐
│  Request Rate         Error Rate         P95 Latency        │
│  [ 1,234 req/s ]      [ 0.3% ]           [ 145ms ]          │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│  Requests per Second (last 6 hours)                         │
│  ▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁▂▃▄▅▆▇█▇▆▅▄▃▂▁                │
└─────────────────────────────────────────────────────────────┘
┌──────────────────────────┬──────────────────────────────────┐
│ Top Endpoints            │ Recent Errors                    │
│ GET /products     45%    │ • PaymentError: timeout          │
│ GET /users        25%    │ • ValidationError: email         │
│ POST /orders      15%    │ • NotFoundError: product         │
│ Other             15%    │                                  │
└──────────────────────────┴──────────────────────────────────┘
```

Conclusion#

Observability isn't a feature—it's a capability that enables everything else. Without visibility into system behavior, you're flying blind.

Start with structured logging. Add metrics for key operations. Implement tracing for distributed systems. Build dashboards that answer questions before they're asked. Set up alerts that catch problems before users do.

AI helps implement these patterns correctly, from logger configuration to alert thresholds. The investment in observability pays dividends every time you need to debug production issues—which is always sooner than you expect.
