Data loss can destroy a business. A solid backup strategy protects against hardware failures, human errors, and disasters. Here's how to implement bulletproof backups.
Backup Types#
Full Backup:
- Complete copy of database
- Longest to create
- Fastest to restore
- Use for: Weekly/monthly backups
Incremental Backup:
- Only changes since last backup
- Fastest to create
- Requires all increments to restore
- Use for: Daily/hourly backups
Differential Backup:
- Changes since last full backup
- Medium creation time
- Only needs full + latest differential
- Use for: Daily backups
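To make the trade-off concrete, here's a hypothetical TypeScript sketch (the `Backup` type and `restoreChain` helper are illustrative, not from any library) that computes which files a restore needs: a full plus every incremental after it, versus a full plus only the latest differential.

```typescript
// Illustrative types: a backup is a file with a kind and a timestamp.
type BackupKind = 'full' | 'incremental' | 'differential';

interface Backup {
  id: string;
  kind: BackupKind;
  takenAt: number; // epoch millis
}

// Restore chain for a target point in time: the latest full backup at or
// before `target`, plus either every incremental taken after it, or only
// the latest differential taken after it.
function restoreChain(history: Backup[], target: number): Backup[] {
  const sorted = [...history]
    .filter((b) => b.takenAt <= target)
    .sort((a, b) => a.takenAt - b.takenAt);

  const lastFullIdx = sorted.map((b) => b.kind).lastIndexOf('full');
  if (lastFullIdx === -1) throw new Error('no full backup before target');

  const after = sorted.slice(lastFullIdx + 1);
  const incrementals = after.filter((b) => b.kind === 'incremental');
  const differentials = after.filter((b) => b.kind === 'differential');

  // A differential already contains all changes since the full,
  // so when one exists only the latest differential is needed.
  if (differentials.length > 0) {
    return [sorted[lastFullIdx], differentials[differentials.length - 1]];
  }
  return [sorted[lastFullIdx], ...incrementals];
}
```

The incremental chain grows with every backup taken, which is why a corrupted increment anywhere in the chain can break the restore; the differential chain stays at two files.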
PostgreSQL Backup#
pg_dump (Logical Backup)#
# Full database backup (custom format)
pg_dump -h localhost -U postgres -d mydb -F c -f backup.dump

# Specific tables
pg_dump -h localhost -U postgres -d mydb -t users -t orders -F c -f partial.dump

# Schema only
pg_dump -h localhost -U postgres -d mydb --schema-only -f schema.sql

# Data only
pg_dump -h localhost -U postgres -d mydb --data-only -f data.sql

# Compressed plain-text backup
pg_dump -h localhost -U postgres -d mydb | gzip > backup.sql.gz

# Restore (the target database must already exist)
pg_restore -h localhost -U postgres -d mydb backup.dump

# Restore with options
pg_restore -h localhost -U postgres -d mydb \
  --clean --if-exists \
  --no-owner --no-privileges \
  backup.dump
pg_basebackup (Physical Backup)#
# Physical backup for point-in-time recovery
pg_basebackup -h localhost -U replication -D /backup/base \
  -Fp -Xs -P -R

# With compression (tar format)
pg_basebackup -h localhost -U replication -D /backup/base \
  -Ft -z -Xs -P
Continuous Archiving (WAL)#
# postgresql.conf
archive_mode = on
archive_command = 'cp %p /archive/wal/%f'
wal_level = replica

# Point-in-time recovery (PostgreSQL 12+: set these in postgresql.conf
# and create an empty recovery.signal file in the data directory)
restore_command = 'cp /archive/wal/%f %p'
recovery_target_time = '2024-01-15 14:30:00'
Automated Backup Script#
#!/bin/bash
# backup.sh

set -e

# Configuration
DB_HOST="${DB_HOST:-localhost}"
DB_NAME="${DB_NAME:-mydb}"
DB_USER="${DB_USER:-postgres}"
BACKUP_DIR="/backups"
RETENTION_DAYS=30
S3_BUCKET="my-backups"
DATE=$(date +%Y%m%d_%H%M%S)

# Create backup
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_${DATE}.dump"
echo "Creating backup: ${BACKUP_FILE}"

pg_dump -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" \
  -F c -f "$BACKUP_FILE"

# Compress
gzip "$BACKUP_FILE"
BACKUP_FILE="${BACKUP_FILE}.gz"

# Upload to S3
echo "Uploading to S3..."
aws s3 cp "$BACKUP_FILE" "s3://${S3_BUCKET}/postgres/${DB_NAME}/"

# Cleanup old local backups
echo "Cleaning up old backups..."
find "$BACKUP_DIR" -name "*.dump.gz" -mtime +"$RETENTION_DAYS" -delete

# Cleanup old S3 backups (date -d requires GNU date)
aws s3 ls "s3://${S3_BUCKET}/postgres/${DB_NAME}/" | \
  while read -r line; do
    createDate=$(echo "$line" | awk '{print $1" "$2}')
    createDate=$(date -d "$createDate" +%s)
    olderThan=$(date -d "-${RETENTION_DAYS} days" +%s)
    if [[ $createDate -lt $olderThan ]]; then
      fileName=$(echo "$line" | awk '{print $4}')
      aws s3 rm "s3://${S3_BUCKET}/postgres/${DB_NAME}/$fileName"
    fi
  done

echo "Backup complete: ${BACKUP_FILE}"
Kubernetes CronJob#
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "0 2 * * *" # Daily at 2 AM
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              # Note: the stock postgres image does not include the AWS CLI;
              # use an image with both the postgres client tools and awscli.
              image: postgres:15
              command:
                - /bin/bash
                - -c
                - |
                  pg_dump -h "$DB_HOST" -U "$DB_USER" -d "$DB_NAME" -F c | \
                    gzip | \
                    aws s3 cp - "s3://$S3_BUCKET/backups/$(date +%Y%m%d).dump.gz"
              env:
                - name: DB_HOST
                  value: postgres-service
                - name: DB_NAME
                  value: mydb
                - name: S3_BUCKET
                  value: my-backups
                - name: DB_USER
                  valueFrom:
                    secretKeyRef:
                      name: postgres-secret
                      key: username
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: postgres-secret
                      key: password
                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: aws-secret
                      key: access-key
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: aws-secret
                      key: secret-key
          restartPolicy: OnFailure
Recovery Procedures#
// Recovery runbook
const recoveryProcedure = {
  steps: [
    {
      name: 'Assess the situation',
      actions: [
        'Identify what data was lost',
        'Determine recovery point needed',
        'Notify stakeholders',
      ],
    },
    {
      name: 'Prepare recovery environment',
      actions: [
        'Provision new database server if needed',
        'Download backup from S3',
        'Verify backup integrity',
      ],
    },
    {
      name: 'Restore database',
      commands: [
        'createdb -h localhost -U postgres mydb_restored',
        'pg_restore -h localhost -U postgres -d mydb_restored backup.dump',
      ],
    },
    {
      name: 'Verify data',
      actions: [
        'Run data integrity checks',
        'Compare row counts',
        'Test application connectivity',
      ],
    },
    {
      name: 'Switch traffic',
      actions: [
        'Update connection strings',
        'Restart application servers',
        'Monitor for errors',
      ],
    },
  ],
};
Backup Testing#
// Automated backup verification
import { exec as execCb } from 'child_process';
import { promisify } from 'util';

const exec = promisify(execCb);

interface CheckResult {
  passed: boolean;
  details: unknown[];
}

async function verifyBackup(backupPath: string): Promise<boolean> {
  const tempDb = `verify_${Date.now()}`;

  try {
    // Create temp database
    await exec(`createdb ${tempDb}`);

    // Restore backup
    await exec(`pg_restore -d ${tempDb} ${backupPath}`);

    // Run integrity checks (verifyConstraints, verifyIndexes, and
    // runSampleQueries are app-specific helpers like verifyTableCounts below)
    const checks = await Promise.all([
      verifyTableCounts(tempDb),
      verifyConstraints(tempDb),
      verifyIndexes(tempDb),
      runSampleQueries(tempDb),
    ]);

    return checks.every((c) => c.passed);
  } finally {
    // Cleanup
    await exec(`dropdb ${tempDb}`);
  }
}

async function verifyTableCounts(db: string): Promise<CheckResult> {
  const result = await query(db, `
    SELECT schemaname, relname, n_live_tup
    FROM pg_stat_user_tables
    ORDER BY n_live_tup DESC
  `);

  // Compare with expected counts (n_live_tup is an estimate, so allow drift)
  const expected = await getExpectedCounts();
  const mismatches = result.filter(
    (row) => Math.abs(row.n_live_tup - expected[row.relname]) > 100
  );

  return {
    passed: mismatches.length === 0,
    details: mismatches,
  };
}
Disaster Recovery Plan#
## DR Checklist

### RTO (Recovery Time Objective): 4 hours
### RPO (Recovery Point Objective): 1 hour

### Primary Failure
1. [ ] Detect failure (monitoring alerts)
2. [ ] Assess impact
3. [ ] Initiate failover to replica
4. [ ] Update DNS/load balancer
5. [ ] Verify application connectivity
6. [ ] Notify stakeholders

### Regional Failure
1. [ ] Activate DR region
2. [ ] Restore from cross-region backup
3. [ ] Update global DNS
4. [ ] Scale infrastructure
5. [ ] Verify all services operational

### Data Corruption
1. [ ] Identify affected tables/timeframe
2. [ ] Stop writes to affected tables
3. [ ] Restore from point-in-time backup
4. [ ] Reconcile any lost transactions
5. [ ] Resume normal operations

### Contact List
- DBA On-Call: +1-xxx-xxx-xxxx
- Infrastructure: +1-xxx-xxx-xxxx
- CTO: +1-xxx-xxx-xxxx
Best Practices#
DO:
✓ Test backups regularly (monthly)
✓ Store backups in multiple regions
✓ Encrypt backups at rest
✓ Monitor backup job success
✓ Document recovery procedures
✓ Practice recovery drills
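Monitoring backup job success pairs naturally with watching your RPO. A minimal sketch, assuming you can fetch the epoch-millis timestamps of completed backups (the function names and the one-hour budget are illustrative, not from any library):

```typescript
// RPO budget: maximum tolerable data loss, here 1 hour (illustrative).
const RPO_MS = 60 * 60 * 1000;

// If the newest completed backup is older than the RPO window,
// a failure right now would lose more data than the budget allows.
function rpoViolated(backupTimestamps: number[], now: number): boolean {
  if (backupTimestamps.length === 0) return true;
  const newest = Math.max(...backupTimestamps);
  return now - newest > RPO_MS;
}
```

Alert on this condition rather than on individual job failures alone: a job that silently stops being scheduled violates RPO without ever "failing".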
DON'T:
✗ Store backups only on same server
✗ Skip backup verification
✗ Ignore backup failures
✗ Keep backups indefinitely (storage costs)
✗ Forget to backup WAL for PITR
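The retention point can be expressed as a tiny policy function, mirroring the age-based cleanup in the bash script earlier (the names and the 30-day window are illustrative assumptions):

```typescript
// Age-based retention: keep everything newer than the window, expire the rest.
const RETENTION_DAYS = 30; // illustrative window
const DAY_MS = 24 * 60 * 60 * 1000;

function partitionByRetention(timestamps: number[], now: number) {
  const cutoff = now - RETENTION_DAYS * DAY_MS;
  return {
    keep: timestamps.filter((t) => t >= cutoff),
    expire: timestamps.filter((t) => t < cutoff),
  };
}
```

Separating the decision (which backups to expire) from the deletion itself makes the policy easy to unit-test and to dry-run before pointing it at real storage.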
Conclusion#
Backups are insurance: you hope never to need them, but when you do, they're invaluable. Automate backups, test recovery regularly, and document procedures for when disaster strikes.
The time to prepare for data loss is before it happens.