Disaster Recovery Architecture

The infrastructure supports automated disaster recovery through a primary/secondary cluster pair with DNS-based failover.

Backend parameters are supplied via .hcl files in backend-configs/. The DR architecture comprises the following components:

| Component | Type | Deployed On | Purpose |
|---|---|---|---|
| Global DNS Zone | OCI DNS | Primary (active) | Central DNS zone for failover steering |
| DNS Steering Policy | OCI DNS | Primary (active) | Routes traffic based on health check status |
| Primary Health Check | OCI Health Check | Primary (active) | Monitors primary cluster availability |
| Secondary Health Check | OCI Health Check | Primary (active) | TCP readiness check for standby cluster |
| DR Failover Function | OCI Functions | Secondary (standby) | Automated failover: scales operator, updates DNS |
| Failover Alarm | OCI Monitoring | Secondary (standby) | Triggers function when primary health check fails |
| Cert-Manager Webhook | Kubernetes | Both (optional) | DNS-01 certificates when HTTP-01 is unreachable |
| External DNS | Kubernetes | Both (optional) | Automatic DNS record management per cluster |
| Remote State | Terraform | Both | Cross-cluster discovery of encryption keys, LB IPs, health check IDs |
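As a sketch of what one of the backend-configs/ files might contain (the bucket name, state key, and region below are placeholders, not values from this repository):

```hcl
# Hypothetical backend-configs/primary.hcl, passed to `terraform init -backend-config=...`
# All values are illustrative placeholders.
bucket = "example-tfstate-bucket"
key    = "dr/primary/terraform.tfstate"
region = "us-ashburn-1"
```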

Automated Failover Workflow

When the primary cluster becomes unavailable, the following automated sequence occurs:

  1. Detection: OCI Alarm detects primary health check failure (configurable pending duration, e.g., PT3M)

  2. Alarm Fires: Alarm transitions to FIRING state and publishes to OCI Notification Topic

  3. Function Invocation: DR failover function is triggered on the secondary cluster

  4. Pre-Failover Validation: Function validates the primary has been down for the configured threshold (default 180 seconds) to prevent false positives

  5. Operator Scaling: Function scales the humio-operator deployment from 0 to 1 replica via the Kubernetes API

  6. Log Collector Startup: Humio operator brings up Log Collector pods, recovering from the primary's object storage bucket

  7. Pod Readiness (optional): Function waits for target pod count to reach ready state

  8. DNS Update: Function updates the primary health check's is_disabled flag, causing the DNS steering policy to route traffic to the secondary

  9. Cooldown: Function persists cooldown state in Object Storage to prevent repeated failover attempts
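The detection side of this workflow (steps 1–3) can be sketched in Terraform as an alarm publishing to a notification topic that invokes the function. The resource names, metric query, and variables below are illustrative assumptions, not taken from this repository:

```hcl
# Hypothetical alarm → topic → function wiring (names and query are illustrative)
resource "oci_ons_notification_topic" "dr_failover" {
  compartment_id = var.compartment_id
  name           = "dr-failover"
}

resource "oci_ons_subscription" "invoke_failover_function" {
  compartment_id = var.compartment_id
  topic_id       = oci_ons_notification_topic.dr_failover.id
  protocol       = "ORACLE_FUNCTIONS"
  endpoint       = var.failover_function_ocid # OCID of the DR failover function
}

resource "oci_monitoring_alarm" "primary_unhealthy" {
  compartment_id        = var.compartment_id
  metric_compartment_id = var.compartment_id
  display_name          = "primary-health-check-failed"
  namespace             = "oci_healthchecks"            # assumed metric namespace
  query                 = "HTTP.isHealthy[1m].mean() < 1" # illustrative query
  severity              = "CRITICAL"
  pending_duration      = "PT3M" # must fail for 3 minutes before FIRING (step 1)
  is_enabled            = true
  destinations          = [oci_ons_notification_topic.dr_failover.id]
}
```

The `pending_duration` here mirrors the configurable window from step 1, and the function's own 180-second validation in step 4 acts as a second, independent guard against transient blips.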

Two-Phase Promotion

After a failover event, the secondary cluster can be promoted to full active status through a two-phase process.

| Phase | Variable Setting | Behavior | Purpose |
|---|---|---|---|
| Phase 1 | `dr = "active"`, `dr_use_dedicated_routing = false` | Generic Kubernetes service selectors route all traffic to available digest pods. UI and Ingest node pools begin scaling up. | Zero-downtime transition; traffic flows immediately |
| Phase 2 | `dr = "active"`, `dr_use_dedicated_routing = true` | Pool-specific Kubernetes service selectors: UI traffic routes to UI pods, Ingest traffic routes to Ingest pods | Optimal load distribution after all pools are ready |
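In practice, promotion is driven by flipping just these two variables between applies. A sketch of the corresponding variable settings (the variable names are the ones used above; the file layout is illustrative):

```hcl
# Phase 1: generic routing, zero-downtime cutover
dr                       = "active"
dr_use_dedicated_routing = false

# Phase 2, applied once the UI and Ingest pools are ready:
# dr                       = "active"
# dr_use_dedicated_routing = true
```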

Remote State Configuration

DR deployments use Terraform remote state to share configuration between the primary and secondary clusters:

| Direction | Data Shared | Purpose |
|---|---|---|
| Primary to Secondary | Storage encryption key, Ingest LB IP, health check IDs, steering policy ID | Standby cluster needs the primary's encryption key for bucket access and health check IDs for failover automation |
| Secondary to Primary | Ingest LB IP, health check IDs | Primary creates health check monitors for the secondary endpoint |
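The cross-cluster sharing above uses Terraform's `terraform_remote_state` data source. A minimal sketch from the secondary cluster's side, assuming an S3-compatible backend against OCI Object Storage (the backend settings and output names are illustrative, not taken from this repository):

```hcl
# Hypothetical: secondary cluster reading the primary's published outputs.
data "terraform_remote_state" "primary" {
  backend = "s3" # assumption: OCI Object Storage via its S3-compatible API
  config = {
    bucket = "example-tfstate-bucket"
    key    = "dr/primary/terraform.tfstate"
    region = "us-ashburn-1"
  }
}

locals {
  # Output names below are placeholders for whatever the primary stack exports.
  primary_encryption_key_id = data.terraform_remote_state.primary.outputs.storage_encryption_key_id
  primary_health_check_id   = data.terraform_remote_state.primary.outputs.health_check_id
}
```

For this to work, the primary stack must declare matching `output` blocks; anything not exported as an output is invisible to the consuming cluster.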