Database Replication Strategies for High Availability

Database replication keeps copies of your data on multiple servers, so the data survives the failure of any one server and read-heavy workloads can be spread across replicas. Here's how to implement replication effectively.

Replication Patterns#

Primary-Replica (Master-Slave)#

┌──────────────┐
│   Primary    │  ← All writes
│   (Master)   │
└──────┬───────┘
       │ Replication
       ├──────────────┬──────────────┐
       ▼              ▼              ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│   Replica 1  │ │   Replica 2  │ │   Replica 3  │
│   (Read)     │ │   (Read)     │ │   (Read)     │
└──────────────┘ └──────────────┘ └──────────────┘

Benefits:
- Read scaling
- Backup without impacting primary
- Geographic distribution

Drawbacks:
- Write bottleneck
- Replication lag
- Failover complexity
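
Read scaling and replication lag pull against each other: the more reads go to replicas, the more it matters how stale each replica is. A minimal sketch of one way to handle this, assuming the application already measures lag per replica. The `ReplicaStatus` shape, `createReplicaPicker`, and the 5-second threshold are hypothetical and not part of any library:

```typescript
// Hypothetical shape: how the application tracks each replica's health.
interface ReplicaStatus {
  name: string;
  lagSeconds: number; // most recently measured replication lag
}

// Returns a picker that rotates over replicas whose lag is within maxLagSeconds.
// The picker returns null when every replica is too stale, so the caller can
// fall back to the primary.
function createReplicaPicker(maxLagSeconds: number) {
  let next = 0;
  return (replicas: ReplicaStatus[]): ReplicaStatus | null => {
    const fresh = replicas.filter((r) => r.lagSeconds <= maxLagSeconds);
    if (fresh.length === 0) return null;
    const chosen = fresh[next % fresh.length] ?? null;
    next += 1;
    return chosen;
  };
}

const pick = createReplicaPicker(5);
const replicaPool: ReplicaStatus[] = [
  { name: 'replica1', lagSeconds: 0.4 },
  { name: 'replica2', lagSeconds: 42 }, // too far behind, skipped
  { name: 'replica3', lagSeconds: 1.2 },
];
const firstPick = pick(replicaPool);  // replica1
const secondPick = pick(replicaPool); // replica3
const thirdPick = pick(replicaPool);  // replica1 again
```

Falling back to the primary when no replica is fresh trades some read scaling for correctness, which is usually the safer default.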

Multi-Primary (Master-Master)#

┌──────────────┐     ┌──────────────┐
│   Primary 1  │◄───►│   Primary 2  │
│   (R/W)      │     │   (R/W)      │
└──────────────┘     └──────────────┘

Benefits:
- Write scaling (limited, since every node still has to apply every write)
- No single point of failure

Drawbacks:
- Conflict resolution needed
- More complex
- Higher write latency when nodes must agree before committing
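
Conflict resolution can take many forms. One of the simplest is last-write-wins, sketched below under the assumption that each row carries a timestamp and the ID of the node that accepted the write. `VersionedRow` and `resolveLastWriteWins` are illustrative names, not the API of any particular database:

```typescript
// Hypothetical versioned row: each primary stamps its own writes.
interface VersionedRow<T> {
  value: T;
  updatedAtMs: number; // wall-clock timestamp of the write
  nodeId: string;      // which primary accepted the write
}

// Last-write-wins: the newer timestamp wins. Ties are broken by nodeId so that
// every node reaches the same answer regardless of arrival order.
function resolveLastWriteWins<T>(a: VersionedRow<T>, b: VersionedRow<T>): VersionedRow<T> {
  if (a.updatedAtMs !== b.updatedAtMs) {
    return a.updatedAtMs > b.updatedAtMs ? a : b;
  }
  return a.nodeId >= b.nodeId ? a : b;
}

const fromPrimary1: VersionedRow<string> = { value: 'alice@old.example', updatedAtMs: 1000, nodeId: 'primary1' };
const fromPrimary2: VersionedRow<string> = { value: 'alice@new.example', updatedAtMs: 1250, nodeId: 'primary2' };

// Both nodes pick the same winner, whichever order they see the rows in.
const winnerOnNode1 = resolveLastWriteWins(fromPrimary1, fromPrimary2);
const winnerOnNode2 = resolveLastWriteWins(fromPrimary2, fromPrimary1);
```

Last-write-wins silently discards the losing write and depends on reasonably synchronized clocks, so it suits data where an occasional lost update is acceptable.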

PostgreSQL Replication#

Streaming Replication Setup#

# Primary configuration (postgresql.conf)
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1GB
synchronous_commit = on

# pg_hba.conf - allow replication connections
host replication replicator replica_ip/32 scram-sha-256

# Replica setup
# Stop PostgreSQL on the replica and empty its data directory first
pg_basebackup -h primary_host -D /var/lib/postgresql/data -U replicator -P -R

# The -R flag creates standby.signal and writes primary_conninfo to
# postgresql.auto.conf (PostgreSQL 12+)
# Start PostgreSQL on the replica - it will connect to the primary and begin replicating

Application Configuration#

// Read/write splitting with Prisma
import { Prisma, PrismaClient } from '@prisma/client';

// Writes go to the primary
const writeClient = new PrismaClient({
  datasources: { db: { url: process.env.DATABASE_URL_PRIMARY } },
});

// Reads go to a replica and may be slightly stale
const readClient = new PrismaClient({
  datasources: { db: { url: process.env.DATABASE_URL_REPLICA } },
});

// Use appropriate client
async function getUser(id: string) {
  return readClient.user.findUnique({ where: { id } });
}

// Prisma.UserCreateInput is generated from the User model in schema.prisma
async function createUser(data: Prisma.UserCreateInput) {
  return writeClient.user.create({ data });
}

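// Or wrap the write client so reads are routed to the replica automatically
const prisma = writeClient.$extends({
  query: {
    $allModels: {
      async $allOperations({ operation, model, args, query }) {
        const readOperations = ['findUnique', 'findFirst', 'findMany', 'count', 'aggregate'];
        if (readOperations.includes(operation)) {
          // Route to replica: call the same operation on readClient.
          // Prisma exposes the model `User` as the client property `user`.
          const modelKey = model.charAt(0).toLowerCase() + model.slice(1);
          return (readClient as any)[modelKey][operation](args);
        }
        // Everything else (writes) runs on the primary.
        // Note: reads issued inside a transaction are rerouted too; use
        // writeClient directly when a read must see the latest write.
        return query(args);
      },
    },
  },
});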
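
Because replicas can lag, a user who has just written data may not see it on the next read if that read goes to a replica. One common mitigation is to pin a user's reads to the primary for a short window after each write. The sketch below is a minimal illustration of that idea. `ReadYourWritesRouter`, the per-user key, and the 2-second window are assumptions, not Prisma features:

```typescript
type Target = 'primary' | 'replica';

// Remembers when each user last wrote and sends that user's reads to the
// primary until stickyMs has elapsed.
class ReadYourWritesRouter {
  private lastWriteAt = new Map<string, number>();
  private stickyMs: number;
  private now: () => number;

  // `now` is injectable so the behavior can be tested with a fake clock.
  constructor(stickyMs: number, now: () => number = Date.now) {
    this.stickyMs = stickyMs;
    this.now = now;
  }

  recordWrite(userId: string): void {
    this.lastWriteAt.set(userId, this.now());
  }

  targetForRead(userId: string): Target {
    const last = this.lastWriteAt.get(userId);
    if (last !== undefined && this.now() - last < this.stickyMs) return 'primary';
    return 'replica';
  }
}

let fakeClockMs = 1000;
const router = new ReadYourWritesRouter(2000, () => fakeClockMs);

const beforeWrite = router.targetForRead('u1');     // 'replica'
router.recordWrite('u1');
const rightAfterWrite = router.targetForRead('u1'); // 'primary'
fakeClockMs += 2500;
const afterWindow = router.targetForRead('u1');     // 'replica'
```

In an application, the result of `targetForRead` would decide whether a query uses `writeClient` or `readClient`. The window should comfortably exceed the replication lag you normally observe.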

Synchronous vs Asynchronous#

Asynchronous Replication#

Primary commits → Returns to client → Replicates later

Pros:
- Lower latency
- Primary doesn't wait

Cons:
- Data loss possible: commits not yet replicated are lost if the primary fails
- Replication lag

Synchronous Replication#

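# PostgreSQL synchronous replication
# postgresql.conf on primary
# Names must match each standby's application_name. A plain list means
# FIRST 1: replica1 is synchronous, replica2 takes over if replica1 disconnects.
synchronous_standby_names = 'replica1,replica2'
synchronous_commit = on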

-- Check replication status
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn
FROM pg_stat_replication;

Primary commits → Waits for replica → Returns to client

Pros:
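- No loss of committed transactions if the primary fails (a synchronous standby has already flushed them)
- Stronger consistency, although reads on a replica can still lag briefly unless synchronous_commit = remote_apply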

Cons:
- Higher latency
- Commits on the primary block when no synchronous standby is available
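
To make the blocking behavior concrete, here is a simplified model of when a commit can be acknowledged under PostgreSQL's two `synchronous_standby_names` modes, `FIRST n (...)` and `ANY n (...)`. It illustrates the rules only and is not how PostgreSQL is implemented. `SyncConfig` and `canAcknowledgeCommit` are hypothetical names:

```typescript
type SyncMode = 'FIRST' | 'ANY';

interface SyncConfig {
  mode: SyncMode;
  numSync: number;
  standbys: string[]; // in priority order, as listed in synchronous_standby_names
}

// Returns true when a commit could be acknowledged to the client.
// connected: standbys currently streaming; acked: standbys that confirmed the commit.
function canAcknowledgeCommit(cfg: SyncConfig, connected: Set<string>, acked: Set<string>): boolean {
  const candidates = cfg.standbys.filter((s) => connected.has(s));
  if (candidates.length < cfg.numSync) return false; // too few standbys: the commit waits
  if (cfg.mode === 'FIRST') {
    // Priority-based: the first numSync connected standbys must all confirm.
    return candidates.slice(0, cfg.numSync).every((s) => acked.has(s));
  }
  // Quorum-based: confirmations from any numSync connected standbys are enough.
  return candidates.filter((s) => acked.has(s)).length >= cfg.numSync;
}

// 'replica1,replica2' behaves like FIRST 1 (replica1, replica2).
const firstOne: SyncConfig = { mode: 'FIRST', numSync: 1, standbys: ['replica1', 'replica2'] };
const anyOne: SyncConfig = { ...firstOne, mode: 'ANY' };
const bothConnected = new Set(['replica1', 'replica2']);
const onlyReplica2Acked = new Set(['replica2']);

const firstModeWaits = canAcknowledgeCommit(firstOne, bothConnected, onlyReplica2Acked); // false
const anyModeReturns = canAcknowledgeCommit(anyOne, bothConnected, onlyReplica2Acked);   // true
const noStandbys = canAcknowledgeCommit(firstOne, new Set<string>(), new Set<string>()); // false
```

The last case is the "commits block" drawback listed above: with no synchronous standby connected, the function never returns true.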

Failover Strategies#

Automatic Failover with Patroni#

# patroni.yml
scope: postgres-cluster
name: node1

restapi:
  listen: 0.0.0.0:8008

etcd:
  hosts: etcd1:2379,etcd2:2379,etcd3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        hot_standby: on
        max_wal_senders: 10

postgresql:
  listen: 0.0.0.0:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replicator
      password: secret
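
Patroni's REST API (port 8008 in the config above) can be used to find out which node is currently the primary: `GET /primary` answers 200 only on the leader. The sketch below assumes that behavior and takes the HTTP function as a parameter so it can be tested without a cluster. `findPrimary`, `HttpGet`, and the node host names other than `node1` are hypothetical:

```typescript
// Minimal HTTP shape; the global fetch() in Node 18+ satisfies it.
type HttpGet = (url: string) => Promise<{ status: number }>;

// Asks each node's Patroni REST API whether it is the current primary.
// Assumes GET /primary returns 200 only on the leader node.
async function findPrimary(nodes: string[], httpGet: HttpGet): Promise<string | null> {
  for (const node of nodes) {
    try {
      const res = await httpGet(`http://${node}:8008/primary`);
      if (res.status === 200) return node;
    } catch {
      // Node unreachable: treat it as "not primary" and keep looking.
    }
  }
  return null; // no leader found, for example while a failover is in progress
}
```

A call such as `findPrimary(['node1', 'node2', 'node3'], fetch)` would return the host name of the leader, or `null` when none answers. Load balancers such as HAProxy are often pointed at the same endpoint as a health check, so applications do not need to do this lookup themselves.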

Connection Pooling with PgBouncer#

; pgbouncer.ini
[databases]
mydb = host=primary_host port=5432 dbname=mydb
mydb_ro = host=replica_host port=5432 dbname=mydb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
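
On the application side, connections then go to PgBouncer on port 6432 and use the `mydb` or `mydb_ro` alias rather than a PostgreSQL host directly. A minimal sketch of building such URLs follows. `PoolTarget`, `pgbouncerUrl`, and the host name are hypothetical. The `pgbouncer=true` parameter is Prisma's flag for transaction-mode pooling; whether it is still needed depends on your Prisma and PgBouncer versions, so treat it as an assumption to verify:

```typescript
interface PoolTarget {
  host: string;     // where PgBouncer listens (hypothetical host name)
  database: string; // an alias from the [databases] section
  user: string;
  password: string;
}

// Builds a connection URL that goes through PgBouncer on port 6432.
// Credentials are percent-encoded so characters like '@' do not break the URL.
function pgbouncerUrl(t: PoolTarget): string {
  const user = encodeURIComponent(t.user);
  const password = encodeURIComponent(t.password);
  return `postgresql://${user}:${password}@${t.host}:6432/${t.database}?pgbouncer=true`;
}

const base = { host: 'pgbouncer.internal', user: 'app', password: 'p@ss' };
const writeUrl = pgbouncerUrl({ ...base, database: 'mydb' });
const readUrl = pgbouncerUrl({ ...base, database: 'mydb_ro' });
// readUrl: postgresql://app:p%40ss@pgbouncer.internal:6432/mydb_ro?pgbouncer=true
```

These two URLs could serve as `DATABASE_URL_PRIMARY` and `DATABASE_URL_REPLICA` for the read/write split shown earlier.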

Multi-Region Replication#

US-East (Primary)
       │
       ├──► US-West (Async Replica)
       │
       └──► EU-West (Async Replica)

Considerations:
- Network latency (50-200ms cross-region)
- Conflict resolution strategy (only needed if more than one region accepts writes)
- Data sovereignty requirements
- Failover and failback procedures

AWS RDS Multi-AZ#

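// Automatic failover with RDS
// The cluster (writer) endpoint always resolves to the current primary, so the
// application reaches the new primary after a failover by reconnecting
const connectionString = `postgres://user:pass@mydb.cluster-xxx.us-east-1.rds.amazonaws.com:5432/mydb`;

// The reader (cluster-ro) endpoint spreads connections across readable replicas.
// Note: the standby of a classic Multi-AZ DB instance is not readable; cluster
// endpoints like these come with Multi-AZ DB clusters and Aurora.
const readReplicaString = `postgres://user:pass@mydb.cluster-ro-xxx.us-east-1.rds.amazonaws.com:5432/mydb`;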

Monitoring Replication#

-- PostgreSQL replication lag
SELECT
  client_addr,
  state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 1024 / 1024 AS lag_mb
FROM pg_stat_replication;

-- On replica: check lag
SELECT
  CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
    THEN 0
    ELSE EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
  END AS replication_lag_seconds;
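
The byte lag in the first query is a subtraction of two log sequence numbers (LSNs). An LSN is printed as two hexadecimal numbers, the high and low 32 bits of a 64-bit position in the WAL. The sketch below mirrors what `pg_wal_lsn_diff` computes, which helps when an application only has the raw LSN strings. `parseLsn` and `lagBytes` are hypothetical helper names:

```typescript
// Parses a PostgreSQL LSN such as '16/B374D848' into a byte position.
// The text form is two hexadecimal numbers: high 32 bits / low 32 bits.
function parseLsn(lsn: string): bigint {
  const match = /^([0-9A-Fa-f]{1,8})\/([0-9A-Fa-f]{1,8})$/.exec(lsn);
  if (!match) throw new Error(`Invalid LSN: ${lsn}`);
  const high = BigInt(`0x${match[1]}`);
  const low = BigInt(`0x${match[2]}`);
  return (high << BigInt(32)) + low;
}

// Bytes of WAL that the replica has not replayed yet.
function lagBytes(currentLsn: string, replayLsn: string): bigint {
  return parseLsn(currentLsn) - parseLsn(replayLsn);
}

const smallLag = lagBytes('0/3000060', '0/3000000'); // 96 bytes (0x60)
const acrossBoundary = lagBytes('1/0', '0/FFFFFFFF'); // 1 byte
```

`bigint` is used because LSNs are 64-bit values and can exceed what a JavaScript `number` represents exactly.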

# Prometheus alerting
- alert: ReplicationLagHigh
  expr: pg_replication_lag_seconds > 30
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Database replication lag is {{ $value }}s"
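
The `for: 5m` clause means the alert fires only when the lag stays above the threshold for the whole five minutes, so short spikes are ignored. The sketch below is a simplified model of that rule over a list of samples, not a description of Prometheus internals. `LagSample` and `shouldFire` are hypothetical names:

```typescript
interface LagSample {
  timestampSec: number;
  lagSeconds: number;
}

// Returns true if lag has stayed above thresholdSec for at least holdSec,
// counting back from the latest sample. Samples must be sorted by time.
function shouldFire(samples: LagSample[], thresholdSec: number, holdSec: number): boolean {
  let latest: number | null = null;
  let breachStart: number | null = null;
  for (const s of [...samples].reverse()) {
    if (latest === null) latest = s.timestampSec;
    if (s.lagSeconds <= thresholdSec) break; // the breach streak ends here
    breachStart = s.timestampSec;
  }
  return latest !== null && breachStart !== null && latest - breachStart >= holdSec;
}

// One sample per minute; the lag exceeds 30s from t=60 onward.
const sustained: LagSample[] = [
  { timestampSec: 0, lagSeconds: 10 },
  { timestampSec: 60, lagSeconds: 40 },
  { timestampSec: 120, lagSeconds: 45 },
  { timestampSec: 180, lagSeconds: 41 },
  { timestampSec: 240, lagSeconds: 60 },
  { timestampSec: 300, lagSeconds: 38 },
  { timestampSec: 360, lagSeconds: 50 },
];
// Same series, but the lag recovers briefly at t=180.
const withDip = sustained.map((s) => (s.timestampSec === 180 ? { ...s, lagSeconds: 5 } : s));

const firesOnSustained = shouldFire(sustained, 30, 300); // true
const firesOnDip = shouldFire(withDip, 30, 300);         // false
```

The second series shows why a brief recovery resets the timer: the breach must be continuous for the full hold period.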

Database replication is essential for high availability and read scaling. Start with primary-replica for most cases, add synchronous replication for zero data loss requirements, and implement automatic failover for production reliability.

Monitor replication lag continuously and test failover procedures regularly. The best replication setup is one you've practiced recovering from.