
Debugging Production Issues: A Systematic Approach

Debug production problems effectively. From log analysis to tracing to root cause analysis techniques.

Bootspring Team
Engineering
January 20, 2024
5 min read

Production bugs are stressful. A systematic approach helps you find and fix issues quickly while minimizing impact on users.

The Debugging Process

1. **Assess**: What's the impact? How many users are affected? Is it getting worse?
2. **Stabilize**: Can we mitigate immediately? Rollback, feature flag, or scale up?
3. **Investigate**: Gather evidence, form hypotheses, test systematically.
4. **Fix**: Implement the solution, verify it works, deploy carefully.
5. **Learn**: Document the root cause, prevent recurrence, share the knowledge.
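The five phases can be sketched as a small typed checklist; this is a minimal illustration (the `Incident` shape and names are assumptions, not a prescribed data model):

```typescript
// The five debugging phases, in order.
type Phase = 'assess' | 'stabilize' | 'investigate' | 'fix' | 'learn';

const PHASES: Phase[] = ['assess', 'stabilize', 'investigate', 'fix', 'learn'];

interface Incident {
  id: string;
  completed: Phase[];
}

// Returns the next phase to work on, or null once the incident is closed out.
function nextPhase(incident: Incident): Phase | null {
  return PHASES.find((p) => !incident.completed.includes(p)) ?? null;
}
```

Tracking the current phase explicitly keeps an incident channel focused: nobody jumps to fixing before the blast radius is assessed and mitigated.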

Initial Assessment

```typescript
// Quick health check script
async function assessSystem() {
  const checks = await Promise.all([
    checkDatabase(),
    checkCache(),
    checkExternalAPIs(),
    checkErrorRates(),
    checkResponseTimes(),
  ]);

  return {
    database: checks[0],
    cache: checks[1],
    externalAPIs: checks[2],
    errorRate: checks[3],
    responseTime: checks[4],
    timestamp: new Date().toISOString(),
  };
}

// Error rate spike detection: compare the last 5 minutes
// against the hourly average scaled to a 5-minute window.
async function checkErrorRates() {
  const recent = await getErrorCount({ minutes: 5 });
  const baseline = (await getErrorCount({ minutes: 60 })) / 12;

  return {
    current: recent,
    baseline,
    ratio: recent / baseline,
    alert: recent > baseline * 3,
  };
}
```

Log Analysis

```typescript
// Structured logging for debugging
const logger = {
  error: (message: string, context: object) => {
    console.error(JSON.stringify({
      level: 'error',
      message,
      ...context,
      timestamp: new Date().toISOString(),
      requestId: getRequestId(),
      userId: getCurrentUserId(),
      traceId: getTraceId(),
    }));
  },
};

// Log patterns for common issues
logger.error('Database query failed', {
  query: 'SELECT * FROM users',
  duration: 5000,
  error: error.message,
  stack: error.stack,
});
```

Search logs effectively:

```bash
# Find errors for a specific user
grep 'userId.*user123' /var/log/app/*.log | grep error

# Find slow requests
jq 'select(.duration > 1000)' /var/log/app/access.log
```

Distributed Tracing

```typescript
// OpenTelemetry tracing
import { trace, context, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-service');

async function handleRequest(req: Request) {
  const span = tracer.startSpan('handleRequest', {
    attributes: {
      'http.method': req.method,
      'http.url': req.url,
      'user.id': req.userId,
    },
  });

  try {
    // Database query with a child span, parented by passing
    // a context that carries the request span
    const dbSpan = tracer.startSpan(
      'database.query',
      undefined,
      trace.setSpan(context.active(), span),
    );

    const result = await database.query('SELECT...');

    dbSpan.setAttribute('db.rows', result.length);
    dbSpan.end();

    return result;
  } catch (error) {
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    span.recordException(error);
    throw error;
  } finally {
    span.end();
  }
}
```

Common Issue Patterns

**Memory Leak**

Symptoms:
- Memory usage increases over time
- Periodic crashes/restarts
- Slow response degradation

Debug:
- Take heap snapshots over time
- Compare object retention
- Check for growing arrays/maps

**Connection Exhaustion**

Symptoms:
- "Too many connections" errors
- Timeouts on database calls
- Works after restart

Debug:
- Check connection pool metrics
- Look for unreleased connections
- Check for connection leaks in error paths

**Race Condition**

Symptoms:
- Intermittent failures
- Works on retry
- Hard to reproduce

Debug:
- Add detailed timing logs
- Check for shared mutable state
- Look for missing locks/transactions

**External Service Degradation**

Symptoms:
- Increased latency
- Timeout errors
- Partial failures

Debug:
- Check external service status
- Review timeout configurations
- Check circuit breaker state
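Before reaching for heap snapshots, the memory-leak pattern can be pre-screened with a cheap heuristic over periodic `process.memoryUsage().rss` samples. A sketch, with an illustrative threshold, not a definitive detector:

```typescript
// Rough leak heuristic: flag when sampled memory readings (in MB)
// rise monotonically by a meaningful total amount.
// The 50 MB default threshold is illustrative; tune it per service.
function looksLikeLeak(samplesMb: number[], minGrowthMb = 50): boolean {
  if (samplesMb.length < 2) return false;
  const monotonic = samplesMb.every((v, i) => i === 0 || v >= samplesMb[i - 1]);
  const growth = samplesMb[samplesMb.length - 1] - samplesMb[0];
  return monotonic && growth >= minGrowthMb;
}
```

A healthy process sawtooths as the garbage collector reclaims memory, so sustained monotonic growth across many samples is a stronger signal than any single high reading.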

Reproduction Strategies

```typescript
// Capture request for replay
function captureRequest(req: Request) {
  const captured = {
    method: req.method,
    url: req.url,
    headers: sanitizeHeaders(req.headers),
    body: req.body,
    timestamp: Date.now(),
  };

  logger.debug('Captured request', captured);
  return captured;
}

// Replay in a test environment
async function replayRequest(captured: CapturedRequest) {
  const response = await fetch(TEST_ENV_URL + captured.url, {
    method: captured.method,
    headers: captured.headers,
    body: captured.body,
  });

  return {
    status: response.status,
    body: await response.json(),
  };
}
```
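The capture snippet calls a `sanitizeHeaders` helper without defining it. One possible sketch (the redaction list is an assumption; extend it for your stack) that strips credentials before headers reach logs or a replay environment:

```typescript
// Headers that must never be logged or replayed verbatim (illustrative list).
const SENSITIVE_HEADERS = ['authorization', 'cookie', 'set-cookie', 'x-api-key'];

function sanitizeHeaders(headers: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(headers)) {
    // Header names are case-insensitive, so compare lowercased.
    out[key] = SENSITIVE_HEADERS.includes(key.toLowerCase()) ? '[REDACTED]' : value;
  }
  return out;
}
```

Redacting rather than dropping keeps the captured request structurally replayable: the test environment can substitute its own credentials for the `[REDACTED]` placeholders.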

Root Cause Analysis

**5 Whys Technique**

Problem: Users can't log in

1. Why? The authentication service returns 500
2. Why? A database query times out
3. Why? A table lock is held too long
4. Why? A long-running migration query
5. Why? The migration ran during peak hours

Root cause: Missing deployment procedure for migrations

Fix: Run migrations during a maintenance window
Prevention: Add migration timing to the deployment checklist

Quick Mitigations

```bash
# Scale up temporarily
kubectl scale deployment/api --replicas=10

# Restart problematic pods
kubectl rollout restart deployment/api

# Disable a problematic feature via feature flag
curl -X POST https://api.featureflags.com/flags/new-checkout \
  -d '{"enabled": false}'

# Roll back a deployment
kubectl rollout undo deployment/api

# Rate limit aggressive traffic
iptables -A INPUT -p tcp --dport 80 -m limit --limit 100/s -j ACCEPT
```

Post-Incident Process

Incident report template:

```markdown
## Summary
Brief description of what happened

## Timeline
- HH:MM - Issue detected
- HH:MM - Team alerted
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Fix deployed
- HH:MM - Issue resolved

## Impact
- Duration: X hours
- Users affected: N
- Revenue impact: $X

## Root Cause
Detailed explanation of why it happened

## Resolution
What was done to fix it

## Prevention
- [ ] Action item 1
- [ ] Action item 2
- [ ] Action item 3

## Lessons Learned
What we learned from this incident
```

Debugging Toolkit

Essential tools:
- Logs: grep, jq, Kibana, CloudWatch
- Metrics: Grafana, Datadog, Prometheus
- Tracing: Jaeger, Zipkin, X-Ray
- Profiling: `node --inspect`, Chrome DevTools
- Database: EXPLAIN, pg_stat_statements
- Network: tcpdump, Wireshark, curl

Handy commands:

```bash
# Watch logs in real time
tail -f /var/log/app.log | jq

# Check memory usage
ps aux --sort=-%mem | head

# Check established connections
netstat -an | grep ESTABLISHED | wc -l

# Check disk usage
df -h && du -sh /*
```

Best Practices

DO:
- ✓ Stay calm and methodical
- ✓ Communicate status regularly
- ✓ Document as you go
- ✓ Verify fixes before celebrating
- ✓ Conduct blameless post-mortems
- ✓ Share learnings with the team

DON'T:
- ✗ Make untested changes in production
- ✗ Debug alone for too long
- ✗ Skip the post-mortem
- ✗ Blame individuals
- ✗ Ignore warning signs

Conclusion

Production debugging is a skill that improves with practice. Build observability into your systems, develop systematic approaches, and always conduct post-mortems.

The best debugging is preventing bugs in the first place—but when they happen, be ready.
