Production bugs are stressful. A systematic approach helps you find and fix issues quickly while minimizing impact on users.
The Debugging Process#
1. ASSESS
   - What's the impact?
   - How many users affected?
   - Is it getting worse?
2. STABILIZE
   - Can we mitigate immediately?
   - Rollback? Feature flag? Scale up?
3. INVESTIGATE
   - Gather evidence
   - Form hypotheses
   - Test systematically
4. FIX
   - Implement solution
   - Verify fix works
   - Deploy carefully
5. LEARN
   - Document root cause
   - Prevent recurrence
   - Share knowledge
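For teams that track incidents in tooling, the five phases above can be captured as a small typed state machine that enforces the order; a minimal sketch (the `Incident` shape and `advance` helper are illustrative, not from any specific incident-management library):

```typescript
// Incident phases from the process above, in order
const PHASES = ['assess', 'stabilize', 'investigate', 'fix', 'learn'] as const;
type Phase = typeof PHASES[number];

interface Incident {
  id: string;
  phase: Phase;
  notes: string[];
}

// Advance to the next phase, recording what was done
function advance(incident: Incident, note: string): Incident {
  const i = PHASES.indexOf(incident.phase);
  if (i === PHASES.length - 1) {
    throw new Error('Incident already in final phase');
  }
  return {
    ...incident,
    phase: PHASES[i + 1],
    notes: [...incident.notes, note],
  };
}
```

The point of the enforced order is cultural as much as technical: you cannot mark an incident "fixed" without having passed through stabilization and investigation first.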
Initial Assessment#
```typescript
// Quick health check script
async function assessSystem() {
  const checks = await Promise.all([
    checkDatabase(),
    checkCache(),
    checkExternalAPIs(),
    checkErrorRates(),
    checkResponseTimes(),
  ]);

  return {
    database: checks[0],
    cache: checks[1],
    externalAPIs: checks[2],
    errorRate: checks[3],
    responseTime: checks[4],
    timestamp: new Date().toISOString(),
  };
}

// Error rate spike detection
async function checkErrorRates() {
  const recent = await getErrorCount({ minutes: 5 });
  // Per-5-minute baseline derived from the last hour
  const baseline = (await getErrorCount({ minutes: 60 })) / 12;

  return {
    current: recent,
    baseline,
    ratio: recent / baseline,
    alert: recent > baseline * 3,
  };
}
```

Log Analysis#
```typescript
// Structured logging for debugging
const logger = {
  error: (message: string, context: object) => {
    console.error(JSON.stringify({
      level: 'error',
      message,
      ...context,
      timestamp: new Date().toISOString(),
      requestId: getRequestId(),
      userId: getCurrentUserId(),
      traceId: getTraceId(),
    }));
  },
};

// Log patterns for common issues
logger.error('Database query failed', {
  query: 'SELECT * FROM users',
  duration: 5000,
  error: error.message,
  stack: error.stack,
});

// Search logs effectively
// Find errors for a specific user:
//   grep 'userId.*user123' /var/log/app/*.log | grep error
// Find slow requests (structured JSON logs):
//   jq 'select(.duration > 1000)' /var/log/app/access.log
```

Distributed Tracing#
```typescript
// OpenTelemetry tracing
import { trace, context, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-service');

async function handleRequest(req: Request) {
  const span = tracer.startSpan('handleRequest', {
    attributes: {
      'http.method': req.method,
      'http.url': req.url,
      'user.id': req.userId,
    },
  });

  try {
    // Database query with a child span; in the current OpenTelemetry API
    // the parent is passed via a Context, not a `parent` option
    const dbSpan = tracer.startSpan(
      'database.query',
      {},
      trace.setSpan(context.active(), span),
    );

    const result = await database.query('SELECT...');

    dbSpan.setAttribute('db.rows', result.length);
    dbSpan.end();

    return result;
  } catch (error) {
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    span.recordException(error);
    throw error;
  } finally {
    span.end();
  }
}
```

Common Issue Patterns#
## Memory Leak
Symptoms:
- Memory usage increases over time
- Periodic crashes/restarts
- Slow response degradation

Debug:
- Take heap snapshots over time
- Compare object retention
- Check for growing arrays/maps

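One cheap way to confirm the first symptom is to sample heap usage on an interval and flag sustained growth before taking full snapshots; a hedged sketch (the check is deliberately crude, and the sampler line assumes Node's `process.memoryUsage()`):

```typescript
// Flag a series of heap samples that only ever grows: crude, but a
// strong leak signal when samples are taken minutes apart
function isSteadilyGrowing(samples: number[], minGrowthBytes = 0): boolean {
  if (samples.length < 3) return false; // not enough evidence yet
  for (let i = 1; i < samples.length; i++) {
    if (samples[i] - samples[i - 1] <= minGrowthBytes) return false;
  }
  return true;
}

// Example sampler (Node): record heapUsed every 30 seconds
// setInterval(() => samples.push(process.memoryUsage().heapUsed), 30_000);
```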
## Connection Exhaustion
Symptoms:
- "Too many connections" errors
- Timeouts on database calls
- Works after restart

Debug:
- Check connection pool metrics
- Look for unreleased connections
- Check for connection leaks in error paths

## Race Condition
Symptoms:
- Intermittent failures
- Works on retry
- Hard to reproduce

Debug:
- Add detailed timing logs
- Check for shared mutable state
- Look for missing locks/transactions

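Even single-threaded Node code can race when an async read-modify-write sequence yields between the read and the write. When the fix is "add the missing lock", a promise-chain mutex is often enough; a hand-rolled sketch (libraries such as async-mutex provide a production version):

```typescript
// Serialize async critical sections by chaining them onto one promise
class Mutex {
  private last: Promise<void> = Promise.resolve();

  runExclusive<T>(fn: () => Promise<T>): Promise<T> {
    const result = this.last.then(fn);
    // Keep the chain alive whether fn resolved or rejected
    this.last = result.then(() => undefined, () => undefined);
    return result;
  }
}
```

Without the mutex, two concurrent read-modify-write calls can both read the same starting value and one update is silently lost; with it, the sections run strictly one after another.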
## External Service Degradation
Symptoms:
- Increased latency
- Timeout errors
- Partial failures

Debug:
- Check external service status
- Review timeout configurations
- Check circuit breaker state

Reproduction Strategies#
```typescript
// Shape of a captured request (type added for completeness)
interface CapturedRequest {
  method: string;
  url: string;
  headers: Record<string, string>;
  body: unknown;
  timestamp: number;
}

// Capture request for replay
function captureRequest(req: Request): CapturedRequest {
  const captured = {
    method: req.method,
    url: req.url,
    headers: sanitizeHeaders(req.headers),
    body: req.body,
    timestamp: Date.now(),
  };

  logger.debug('Captured request', captured);
  return captured;
}

// Replay in a test environment
async function replayRequest(captured: CapturedRequest) {
  const response = await fetch(TEST_ENV_URL + captured.url, {
    method: captured.method,
    headers: captured.headers,
    body: JSON.stringify(captured.body), // fetch needs a serialized body
  });

  return {
    status: response.status,
    body: await response.json(),
  };
}
```

Root Cause Analysis#
## 5 Whys Technique

Problem: Users can't log in

1. Why? Authentication service returns 500
2. Why? Database query times out
3. Why? Table lock held too long
4. Why? Long-running migration query
5. Why? Migration ran during peak hours

Root Cause: Missing deployment procedure for migrations

Fix: Run migrations during maintenance window
Prevention: Add migration timing to deployment checklist

Quick Mitigations#
```bash
# Scale up temporarily
kubectl scale deployment/api --replicas=10

# Restart problematic pods
kubectl rollout restart deployment/api

# Disable a problematic feature via feature flag
curl -X POST https://api.featureflags.com/flags/new-checkout \
  -d '{"enabled": false}'

# Rollback deployment
kubectl rollout undo deployment/api

# Rate limit aggressive traffic: accept up to the limit, then drop
# the excess (the ACCEPT rule alone limits nothing without the DROP)
iptables -A INPUT -p tcp --dport 80 -m limit --limit 100/s -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j DROP
```

Post-Incident Process#
```markdown
## Incident Report Template

### Summary
Brief description of what happened

### Timeline
- HH:MM - Issue detected
- HH:MM - Team alerted
- HH:MM - Investigation started
- HH:MM - Root cause identified
- HH:MM - Fix deployed
- HH:MM - Issue resolved

### Impact
- Duration: X hours
- Users affected: N
- Revenue impact: $X

### Root Cause
Detailed explanation of why it happened

### Resolution
What was done to fix it

### Prevention
- [ ] Action item 1
- [ ] Action item 2
- [ ] Action item 3

### Lessons Learned
What we learned from this incident
```

Debugging Toolkit#
Essential Tools:
- Logs: grep, jq, Kibana, CloudWatch
- Metrics: Grafana, Datadog, Prometheus
- Tracing: Jaeger, Zipkin, X-Ray
- Profiling: Node --inspect, Chrome DevTools
- Database: EXPLAIN, pg_stat_statements
- Network: tcpdump, Wireshark, curl
Commands:
```bash
# Watch logs in real-time (pretty-print JSON lines)
tail -f /var/log/app.log | jq .

# Check memory usage
ps aux --sort=-%mem | head

# Check connections
netstat -an | grep ESTABLISHED | wc -l

# Check disk usage
df -h && du -sh /*
```
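When jq isn't installed on a box, the same slow-request filter takes a few lines of Node; a sketch over JSON-lines logs (the `duration` field name is an assumption about your log format):

```typescript
// Filter JSON-lines log entries whose duration exceeds a threshold
function slowRequests(logLines: string[], thresholdMs: number): object[] {
  const hits: object[] = [];
  for (const line of logLines) {
    try {
      const entry = JSON.parse(line);
      if (typeof entry.duration === 'number' && entry.duration > thresholdMs) {
        hits.push(entry);
      }
    } catch {
      // Skip non-JSON lines (startup banners, stack traces)
    }
  }
  return hits;
}
```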
Best Practices#
DO:
✓ Stay calm and methodical
✓ Communicate status regularly
✓ Document as you go
✓ Verify fixes before celebrating
✓ Conduct blameless post-mortems
✓ Share learnings with team
DON'T:
✗ Make untested changes in production
✗ Debug alone for too long
✗ Skip the post-mortem
✗ Blame individuals
✗ Ignore warning signs
Conclusion#
Production debugging is a skill that improves with practice. Build observability into your systems, develop systematic approaches, and always conduct post-mortems.
The best debugging is preventing bugs in the first place—but when they happen, be ready.