High Availability Architecture

This page describes the high-availability design of each infrastructure component. Each component has a different failure mode and a different mitigation strategy.


Octane — Web Server HA

Failure mode: An Octane process crash takes down all HTTP on that machine.

Mitigation: Multiple machines behind a load balancer. supervisord auto-restarts Octane on each machine if it crashes. The load balancer health-checks each machine and removes unhealthy nodes from rotation.

Key points:

  • Minimum 2 web machines. The load balancer detects a failed machine within one health-check interval (typically 5 s) and stops sending traffic to it.
  • supervisord restarts a crashed Octane process within ~1 s. Combined with LB health-checks, a single-machine crash causes <10 s of partial degradation.
  • Sessions are stored in Redis (not on-disk), so any web machine can serve any session. No sticky sessions needed for Octane.
  • The GET /up endpoint (built into Laravel 11) returns HTTP 200 whenever the application boots. Out of the box it does not check the database; it dispatches a DiagnosingHealth event, so a listener can add a DB ping, as sketched below.
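
A minimal sketch of that wiring, assuming Laravel 11's bootstrap/app.php layout; the DB ping is this example's definition of "healthy", not a framework default:

    <?php
    // bootstrap/app.php: Laravel 11 registers the health route here.
    use Illuminate\Foundation\Application;

    return Application::configure(basePath: dirname(__DIR__))
        ->withRouting(
            web: __DIR__.'/../routes/web.php',
            commands: __DIR__.'/../routes/console.php',
            health: '/up', // the endpoint the load balancer polls
        )
        ->create();

    // app/Providers/AppServiceProvider.php: the /up route dispatches
    // DiagnosingHealth before returning 200. An exception thrown from a
    // listener turns the response into a 500, so the load balancer pulls
    // the machine from rotation.
    use Illuminate\Foundation\Events\DiagnosingHealth;
    use Illuminate\Support\Facades\DB;
    use Illuminate\Support\Facades\Event;

    public function boot(): void
    {
        Event::listen(function (DiagnosingHealth $event) {
            DB::connection()->getPdo(); // throws if PostgreSQL is unreachable
        });
    }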

Redis — Cache / Session / Queue HA

Failure mode: Redis goes down → cache misses degrade performance, sessions are lost, and the queue stops accepting new jobs.

Mitigation: Redis Sentinel — three Sentinel processes monitor a primary/replica pair. If the primary fails, the Sentinels elect a new primary automatically. Laravel discovers the current primary through the Sentinel API rather than a hard-coded host (the predis client supports this natively; phpredis requires additional wiring).

Key points:

  • Three Sentinels are the minimum: a failover must be authorized by a majority of all Sentinels (2 of 3). With only two, losing the machine that hosts one of them (often the failed primary's own host) leaves no majority, so no failover can happen.
  • Failover time is typically 15–30 s (configurable via down-after-milliseconds). During this window, writes fail. Laravel's retry logic in queue workers and cache operations absorbs short gaps.
  • Use separate Redis databases (or separate instances) for cache (db 0), sessions (db 1), and queues (db 2). This prevents a cache eviction burst from dropping session keys.
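
A minimal config/database.php sketch of the Sentinel setup, assuming the predis client; the sentinel-N host names and the mymaster service name are placeholders:

    // The connection lists the Sentinel endpoints, not the Redis primary.
    // predis asks Sentinel for the current primary on connect, so a
    // failover requires no configuration change on the app side.
    'redis' => [
        'client' => 'predis',

        'options' => [
            'replication' => 'sentinel',
            'service'     => env('REDIS_SENTINEL_SERVICE', 'mymaster'),
            // Applied to the primary/replica connections Sentinel returns.
            'parameters' => [
                'password' => env('REDIS_PASSWORD'),
                'database' => 0, // cache
            ],
        ],

        'default' => [
            'tcp://sentinel-1:26379',
            'tcp://sentinel-2:26379',
            'tcp://sentinel-3:26379',
        ],
    ],

The session and queue connections repeat the same shape with database 1 and 2; SESSION_CONNECTION and the queue connection name in config/queue.php point Laravel at them.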

Reverb — WebSocket HA

Failure mode: A Reverb process crash drops all open WebSocket connections. Clients lose real-time notifications until they reconnect.

Mitigation: Multiple Reverb instances behind a load balancer with sticky sessions (so a client's WebSocket is pinned to one instance). All instances subscribe to the same Redis Pub/Sub channel, so a broadcast() call from any Laravel app node reaches every connected client regardless of which Reverb instance they are on.

Key points:

  • Sticky sessions are required because a WebSocket is a persistent connection — the load balancer must always route the same client to the same Reverb instance. Use ip_hash or cookie-based affinity in nginx.
  • Redis Pub/Sub decouples publishing from delivery. The Laravel app does not know which Reverb instance holds which clients — it just publishes to Redis (see the sketch after this list).
  • Client-side reconnect (built into Laravel Echo) handles a Reverb restart transparently. A brief disconnection (<5 s) typically goes unnoticed by end users.
  • Reverb is stateless with respect to the business domain. Its only state is the open socket connections, which are ephemeral by nature.
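
A sketch of the publishing side; the event and channel names are invented for this example, and setting REVERB_SCALING_ENABLED=true is what switches multiple Reverb instances to relaying through Redis Pub/Sub:

    <?php
    // app/Events/InvoiceGenerated.php (illustrative event)
    namespace App\Events;

    use Illuminate\Broadcasting\InteractsWithSockets;
    use Illuminate\Broadcasting\PrivateChannel;
    use Illuminate\Contracts\Broadcasting\ShouldBroadcast;
    use Illuminate\Foundation\Events\Dispatchable;
    use Illuminate\Queue\SerializesModels;

    class InvoiceGenerated implements ShouldBroadcast
    {
        use Dispatchable, InteractsWithSockets, SerializesModels;

        public function __construct(public int $invoiceId) {}

        public function broadcastOn(): PrivateChannel
        {
            return new PrivateChannel('invoices.'.$this->invoiceId);
        }
    }

    // Anywhere in the app, on any web or worker node. The publisher never
    // needs to know which Reverb instance holds the subscriber's socket:
    broadcast(new InvoiceGenerated(42));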

Horizon — Queue Worker HA

Failure mode: Queue workers crash → background jobs stop processing. Invoice PDFs are not generated, ERP exports stall, notification emails are delayed.

Mitigation: Run Horizon workers on dedicated machines separate from the web tier. Web machines and worker machines both connect to the same Redis queue. supervisord on each worker machine keeps Horizon running. If a worker machine dies, jobs remain safely in Redis and are processed by other worker machines.

Key points:

  • Separating the worker tier from the web tier prevents a CPU-heavy job (invoice PDF via headless Chromium) from starving web request workers.
  • The billing queue uses dedicated workers and higher priority: Horizon takes its queue list from the supervisor config in config/horizon.php ('queue' => ['billing', 'default']) rather than from a --queue flag (see the sketch after this list). Financial jobs are never delayed by a backlog of notification emails.
  • If Worker Machine A goes down, Worker Machine B continues processing all queues. Jobs are not lost — they remain in Redis until a worker claims them.
  • supervisord on each worker machine auto-restarts Horizon within ~1 s of a crash.
  • Failed jobs are preserved in the failed_jobs database table (not in Redis), so they survive Redis restarts and can be retried from the Horizon dashboard.
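
A config/horizon.php sketch of that split; supervisor names and process counts are illustrative:

    // Within a supervisor, queues are drained in the order listed, so
    // `billing` always wins over `default`.
    'environments' => [
        'production' => [
            'supervisor-billing' => [
                'connection'   => 'redis',
                'queue'        => ['billing', 'default'],
                'balance'      => 'auto',
                'minProcesses' => 2,
                'maxProcesses' => 8,
                'tries'        => 3,
            ],
            'supervisor-default' => [
                'connection'   => 'redis',
                'queue'        => ['default'],
                'balance'      => 'auto',
                'minProcesses' => 1,
                'maxProcesses' => 4,
                'tries'        => 3,
            ],
        ],
    ],

Both worker machines run the same supervisors against the same Redis connection; queue pops are atomic in Redis, so a job is only ever claimed by one worker.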

PostgreSQL — Database HA

Failure mode: The primary database goes down → all writes fail and the application is fully degraded.

Mitigation: Streaming replication to a hot standby replica. A failover agent (Patroni or pg_auto_failover) promotes the replica automatically when the primary fails. PgBouncer pools connections from all Octane workers, preventing connection exhaustion on the database server.

Key points:

  • PgBouncer (transaction mode): Each Octane worker holds a persistent PHP connection to PgBouncer, not directly to PostgreSQL. PgBouncer multiplexes these into a small pool of real PostgreSQL connections (e.g., 20 connections serving 64 Octane workers). At the current scale of 2 machines × 8 workers = 16 direct connections this is manageable without pooling, but PgBouncer becomes essential as the fleet grows. See the connection sketch at the end of this section.
  • Synchronous vs async replication: Synchronous replication (a standby listed in synchronous_standby_names, with synchronous_commit = on) guarantees zero data loss on failover but adds ~1–5 ms of latency per write. Async replication is faster but may lose the last few transactions if the primary fails mid-flight. For a law firm handling financial data, synchronous is preferred.
  • Failover time: Patroni detects a primary failure in ~10 s and promotes the replica in ~15–30 s total. During this window, writes return errors. Laravel's queue retry logic handles transient DB errors for background jobs.
  • Read replica use: The replica serves read-heavy operations — Meilisearch re-indexing jobs, billing report generation, and analytics queries — keeping write traffic isolated on the primary.
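
Two Laravel-side details are worth sketching; the host names, port, and the pgsql_replica connection name are placeholders. First, PgBouncer in transaction mode cannot track server-side prepared statements, so PDO has to emulate them. Second, exposing the replica as a separate named connection keeps lag-sensitive reads on the primary by default:

    // config/database.php
    'pgsql' => [
        'driver'   => 'pgsql',
        'host'     => 'pgbouncer.internal', // PgBouncer, not PostgreSQL
        'port'     => 6432,
        'database' => env('DB_DATABASE'),
        'username' => env('DB_USERNAME'),
        'password' => env('DB_PASSWORD'),
        // Required for transaction pooling: no server-side prepares.
        'options'  => [PDO::ATTR_EMULATE_PREPARES => true],
    ],

    // Hot standby, used explicitly by read-heavy jobs only.
    'pgsql_replica' => [
        'driver'   => 'pgsql',
        'host'     => 'pg-replica.internal',
        'port'     => 5432,
        'database' => env('DB_DATABASE'),
        'username' => env('DB_USERNAME'),
        'password' => env('DB_PASSWORD'),
    ],

    // In a billing-report job, for example: the query runs on the
    // replica and the primary never sees it.
    $rows = DB::connection('pgsql_replica')
        ->table('invoices')
        ->whereBetween('issued_at', [$from, $to])
        ->get();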