# Disaster Recovery Architecture
The infrastructure supports automated disaster recovery through a primary/secondary cluster pair with DNS-based failover.
Backend parameters are supplied via `.hcl` files in `backend-configs/`. The architecture comprises the following components:
| Component | Type | Deployed On | Purpose |
|---|---|---|---|
| Global DNS Zone | OCI DNS | Primary (active) | Central DNS zone for failover steering |
| DNS Steering Policy | OCI DNS | Primary (active) | Routes traffic based on health check status |
| Primary Health Check | OCI Health Check | Primary (active) | Monitors primary cluster availability |
| Secondary Health Check | OCI Health Check | Primary (active) | TCP readiness check for standby cluster |
| DR Failover Function | OCI Functions | Secondary (standby) | Automated failover: scales operator, updates DNS |
| Failover Alarm | OCI Monitoring | Secondary (standby) | Triggers function when primary health check fails |
| Cert-Manager Webhook | Kubernetes | Both (optional) | DNS-01 certificates when HTTP-01 unreachable |
| External DNS | Kubernetes | Both (optional) | Automatic DNS record management per cluster |
| Remote State | Terraform | Both | Cross-cluster discovery of encryption keys, LB IPs, health check IDs |
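The steering behavior that ties these components together can be modeled as priority-ordered answer selection: traffic goes to the first answer that is enabled and whose health check passes. A minimal sketch in Python — the `answers` structure and field names are illustrative, not the OCI API:

```python
from typing import Optional

def steer(answers: list, healthy: dict) -> Optional[str]:
    """FAILOVER-style steering: answers are tried in priority order; the
    first one that is not disabled and whose health check passes wins."""
    for answer in answers:  # assumed shape: {"name", "rdata", "is_disabled"}
        if not answer["is_disabled"] and healthy.get(answer["name"], False):
            return answer["rdata"]
    return None  # no healthy, enabled endpoint remains

# Example: primary listed first, so it wins while healthy and enabled
answers = [
    {"name": "primary", "rdata": "203.0.113.10", "is_disabled": False},
    {"name": "secondary", "rdata": "203.0.113.20", "is_disabled": False},
]
```

Disabling the primary answer (which is what the failover function does via the health check's `is_disabled` flag) shifts resolution to the secondary without touching the secondary's configuration.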
## Automated Failover Workflow
When the primary cluster becomes unavailable, the following automated sequence occurs:
1. Detection: OCI Alarm detects primary health check failure (configurable pending duration, e.g., `PT3M`)
2. Alarm Fires: Alarm transitions to FIRING state and publishes to OCI Notification Topic
3. Function Invocation: DR failover function is triggered on the secondary cluster
4. Pre-Failover Validation: Function validates the primary has been down for the configured threshold (default 180 seconds) to prevent false positives
5. Operator Scaling: Function scales the `humio-operator` deployment from 0 to 1 replicas via the Kubernetes API
6. Log Collector Startup: Humio operator brings up Log Collector pods, recovering from the primary's object storage bucket
7. Pod Readiness (optional): Function waits for target pod count to reach ready state
8. DNS Update: Function updates the primary health check's `is_disabled` flag, causing the DNS steering policy to route traffic to the secondary
9. Cooldown: Function persists cooldown state in Object Storage to prevent repeated failover attempts
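The validation, scaling, DNS, and cooldown steps above can be sketched as a single orchestration function. This is a minimal illustration, not the deployed function: the `k8s`, `dns`, and `store` client objects and their methods are hypothetical stand-ins for the Kubernetes API, OCI DNS/Health Check calls, and Object Storage.

```python
from typing import Optional

DOWN_THRESHOLD = 180      # default: primary must be unhealthy this long (seconds)
COOLDOWN_SECONDS = 1800   # assumed cooldown window between failover attempts

def should_fail_over(first_failure_ts: float,
                     last_failover_ts: Optional[float],
                     now: float) -> bool:
    """Pre-failover validation: avoid false positives and repeated attempts."""
    if now - first_failure_ts < DOWN_THRESHOLD:
        return False  # primary not down long enough; could be a transient blip
    if last_failover_ts is not None and now - last_failover_ts < COOLDOWN_SECONDS:
        return False  # a recent failover already ran; honor the cooldown
    return True

def run_failover(k8s, dns, store, now: float) -> None:
    """Steps 5-9 of the workflow (client interfaces are hypothetical)."""
    k8s.scale_deployment("humio-operator", replicas=1)  # operator scaling
    k8s.wait_for_ready_pods(timeout=600)                # optional pod readiness
    dns.disable_primary_health_check()  # flip is_disabled; steering re-routes
    store.put("last_failover_ts", now)  # persist cooldown state
```

The cooldown check and the down-duration check are deliberately separate: the first guards against flapping after a failover, the second against acting on a brief outage.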
## Two-Phase Promotion
After a failover event, the secondary cluster can be promoted to full active status through a two-phase process:
| Phase | Variable Setting | Behavior | Purpose |
|---|---|---|---|
| Phase 1 | `dr = "active"`, `dr_use_dedicated_routing = false` | Generic Kubernetes service selectors route all traffic to available digest pods. UI and Ingest node pools begin scaling up. | Zero-downtime transition - traffic flows immediately |
| Phase 2 | `dr = "active"`, `dr_use_dedicated_routing = true` | Pool-specific Kubernetes service selectors: UI traffic routes to UI pods, Ingest traffic routes to Ingest pods | Optimal load distribution after all pools are ready |
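The difference between the two phases comes down to the label selector each Kubernetes Service uses. A sketch, assuming hypothetical label keys (`app` and `node-pool` are illustrative, not necessarily the chart's real labels):

```python
def service_selector(pool: str, dedicated_routing: bool) -> dict:
    """Phase 1: a generic selector matches any available pod.
    Phase 2: a pool-specific selector pins traffic to its own node pool."""
    if dedicated_routing:  # dr_use_dedicated_routing = true
        return {"app": "humio", "node-pool": pool}  # hypothetical label keys
    return {"app": "humio"}  # dr_use_dedicated_routing = false

# Phase 1: UI and Ingest Services share whatever pods are ready
# Phase 2: each Service routes only to its own pool once it has scaled up
```

Running Phase 1 first means traffic is never blocked waiting for the UI and Ingest pools to finish scaling; Phase 2 then tightens routing once every pool is ready.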
## Remote State Configuration
DR deployments use Terraform remote state to share configuration between the primary and secondary clusters:
| Direction | Data Shared | Purpose |
|---|---|---|
| Primary to Secondary | Storage encryption key, Ingest LB IP, Health check IDs, Steering policy ID | Standby cluster needs primary's encryption key for bucket access and health check IDs for failover automation |
| Secondary to Primary | Ingest LB IP, Health check IDs | Primary creates health check monitors for the secondary endpoint |
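Cross-cluster discovery works by reading the other cluster's Terraform state and extracting its outputs. As a sketch, a Terraform state document (v4 JSON schema) exposes outputs under a top-level `outputs` map; the output names below are illustrative, not the module's actual output names:

```python
import json

def read_remote_outputs(state_json: str) -> dict:
    """Extract output values from a Terraform state document (v4 schema)."""
    state = json.loads(state_json)
    return {name: out["value"] for name, out in state.get("outputs", {}).items()}

# Example: the secondary reads the primary's state to learn shared values
primary_state = json.dumps({
    "version": 4,
    "outputs": {
        "ingest_lb_ip": {"value": "203.0.113.10", "type": "string"},
        "health_check_id": {"value": "ocid1.httpmonitor.oc1..example",
                            "type": "string"},
    },
})
```

In practice this is what Terraform's `terraform_remote_state` data source does for you; the sketch only shows where the shared values live in the state document.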