Back to Blog
DatabaseReplicationHigh AvailabilityArchitecture

Database Replication Strategies for High Availability

Build resilient database architectures. From read replicas to multi-region setups to failover strategies.

B
Bootspring Team
Engineering
September 5, 2024
4 min read

Database replication ensures your data survives failures and scales to handle read-heavy workloads. Here's how to implement replication effectively.

Replication Patterns#

Primary-Replica (Master-Slave)#

┌──────────────┐ │ Primary │ ← All writes │ (Master) │ └──────┬───────┘ │ Replication ├──────────────┬──────────────┐ ▼ ▼ ▼ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ Replica 1 │ │ Replica 2 │ │ Replica 3 │ │ (Read) │ │ (Read) │ │ (Read) │ └──────────────┘ └──────────────┘ └──────────────┘ Benefits: - Read scaling - Backup without impacting primary - Geographic distribution Drawbacks: - Write bottleneck - Replication lag - Failover complexity

Multi-Primary (Master-Master)#

┌──────────────┐ ┌──────────────┐ │ Primary 1 │◄───►│ Primary 2 │ │ (R/W) │ │ (R/W) │ └──────────────┘ └──────────────┘ Benefits: - Write scaling - No single point of failure Drawbacks: - Conflict resolution needed - More complex - Higher latency for consistency

PostgreSQL Replication#

Streaming Replication Setup#

1# Primary configuration (postgresql.conf) 2wal_level = replica 3max_wal_senders = 10 4wal_keep_size = 1GB 5synchronous_commit = on 6 7# pg_hba.conf - allow replication connections 8host replication replicator replica_ip/32 scram-sha-256
1# Replica setup 2# Stop PostgreSQL, remove data directory 3pg_basebackup -h primary_host -D /var/lib/postgresql/data -U replicator -P -R 4 5# The -R flag creates standby.signal and configures recovery 6# Start PostgreSQL - it will begin replicating

Application Configuration#

1// Read/write splitting with Prisma 2import { PrismaClient } from '@prisma/client'; 3 4const writeClient = new PrismaClient({ 5 datasources: { db: { url: process.env.DATABASE_URL_PRIMARY } }, 6}); 7 8const readClient = new PrismaClient({ 9 datasources: { db: { url: process.env.DATABASE_URL_REPLICA } }, 10}); 11 12// Use appropriate client 13async function getUser(id: string) { 14 return readClient.user.findUnique({ where: { id } }); 15} 16 17async function createUser(data: CreateUserInput) { 18 return writeClient.user.create({ data }); 19} 20 21// Or use a single client with read replicas 22const prisma = new PrismaClient().$extends({ 23 query: { 24 $allModels: { 25 async $allOperations({ operation, model, args, query }) { 26 const readOperations = ['findUnique', 'findFirst', 'findMany', 'count', 'aggregate']; 27 if (readOperations.includes(operation)) { 28 // Route to replica 29 } 30 return query(args); 31 }, 32 }, 33 }, 34});

Synchronous vs Asynchronous#

Asynchronous Replication#

Primary commits → Returns to client → Replicates later Pros: - Lower latency - Primary doesn't wait Cons: - Data loss possible - Replication lag

Synchronous Replication#

1-- PostgreSQL synchronous replication 2-- postgresql.conf on primary 3synchronous_standby_names = 'replica1,replica2' 4synchronous_commit = on 5 6-- Check replication status 7SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn 8FROM pg_stat_replication;
Primary commits → Waits for replica → Returns to client Pros: - Zero data loss - Strong consistency Cons: - Higher latency - Primary blocked if replica fails

Failover Strategies#

Automatic Failover with Patroni#

1# patroni.yml 2scope: postgres-cluster 3name: node1 4 5restapi: 6 listen: 0.0.0.0:8008 7 8etcd: 9 hosts: etcd1:2379,etcd2:2379,etcd3:2379 10 11bootstrap: 12 dcs: 13 ttl: 30 14 loop_wait: 10 15 retry_timeout: 10 16 maximum_lag_on_failover: 1048576 17 postgresql: 18 use_pg_rewind: true 19 parameters: 20 wal_level: replica 21 hot_standby: on 22 max_wal_senders: 10 23 24postgresql: 25 listen: 0.0.0.0:5432 26 data_dir: /var/lib/postgresql/data 27 authentication: 28 replication: 29 username: replicator 30 password: secret

Connection Pooling with PgBouncer#

1; pgbouncer.ini 2[databases] 3mydb = host=primary_host port=5432 dbname=mydb 4mydb_ro = host=replica_host port=5432 dbname=mydb 5 6[pgbouncer] 7listen_addr = 0.0.0.0 8listen_port = 6432 9auth_type = scram-sha-256 10pool_mode = transaction 11max_client_conn = 1000 12default_pool_size = 25

Multi-Region Replication#

US-East (Primary) │ ├──► US-West (Async Replica) │ └──► EU-West (Async Replica) Considerations: - Network latency (50-200ms cross-region) - Conflict resolution strategy - Data sovereignty requirements - Failover and failback procedures

AWS RDS Multi-AZ#

1// Automatic failover with RDS 2// Connection string points to endpoint that routes to current primary 3const connectionString = `postgres://user:pass@mydb.cluster-xxx.us-east-1.rds.amazonaws.com:5432/mydb`; 4 5// Read replicas for scaling reads 6const readReplicaString = `postgres://user:pass@mydb.cluster-ro-xxx.us-east-1.rds.amazonaws.com:5432/mydb`;

Monitoring Replication#

1-- PostgreSQL replication lag 2SELECT 3 client_addr, 4 state, 5 pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes, 6 pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 1024 / 1024 AS lag_mb 7FROM pg_stat_replication; 8 9-- On replica: check lag 10SELECT 11 CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() 12 THEN 0 13 ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()) 14 END AS replication_lag_seconds;
1# Prometheus alerting 2- alert: ReplicationLagHigh 3 expr: pg_replication_lag_seconds > 30 4 for: 5m 5 labels: 6 severity: warning 7 annotations: 8 summary: "Database replication lag is {{ $value }}s"

Conclusion#

Database replication is essential for high availability and read scaling. Start with primary-replica for most cases, add synchronous replication for zero data loss requirements, and implement automatic failover for production reliability.

Monitor replication lag continuously and test failover procedures regularly. The best replication setup is one you've practiced recovering from.

Share this article

Help spread the word about Bootspring