Disaster Recovery Architecture

The infrastructure supports automated disaster recovery through a primary/secondary cluster pair with DNS-based failover.

Backend parameters are supplied via .hcl files in backend-configs/. The DR architecture comprises the following components:

| Component | Type | Deployed On | Purpose |
|---|---|---|---|
| Global DNS Zone | OCI DNS | Primary (active) | Central DNS zone for failover steering |
| DNS Steering Policy | OCI DNS | Primary (active) | Routes traffic based on health check status |
| Primary Health Check | OCI Health Check | Primary (active) | Monitors primary cluster availability |
| Secondary Health Check | OCI Health Check | Primary (active) | TCP readiness check for standby cluster |
| DR Failover Function | OCI Functions | Secondary (standby) | Automated failover: scales operator, updates DNS |
| Failover Alarm | OCI Monitoring | Secondary (standby) | Triggers function when primary health check fails |
| Cert-Manager Webhook | Kubernetes | Both (optional) | DNS-01 certificates when HTTP-01 is unreachable |
| External DNS | Kubernetes | Both (optional) | Automatic DNS record management per cluster |
| Remote State | Terraform | Both | Cross-cluster discovery of encryption keys, LB IPs, health check IDs |
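As a sketch of what one of the backend-configs/ files might contain (the bucket name, state key, and region below are placeholders, not values from this repository):

```hcl
# Hypothetical backend-configs/primary.hcl, passed to `terraform init -backend-config=...`
# All values are illustrative placeholders.
bucket = "example-tfstate-bucket"
key    = "dr/primary/terraform.tfstate"
region = "us-ashburn-1"
```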

Automated Failover Workflow

When the primary cluster becomes unavailable, the following automated sequence occurs:

  1. Detection: OCI Alarm detects primary health check failure (configurable pending duration, e.g., PT3M)

  2. Alarm Fires: Alarm transitions to FIRING state and publishes to OCI Notification Topic

  3. Function Invocation: DR failover function is triggered on the secondary cluster

  4. Pre-Failover Validation: Function validates the primary has been down for the configured threshold (default 180 seconds) to prevent false positives

  5. Operator Scaling: Function scales the humio-operator deployment from 0 to 1 replica via the Kubernetes API

  6. Log Collector Startup: Humio operator brings up Log Collector pods, recovering from the primary's object storage bucket

  7. Pod Readiness (optional): Function waits for target pod count to reach ready state

  8. DNS Update: Function updates the primary health check's is_disabled flag, causing the DNS steering policy to route traffic to the secondary

  9. Cooldown: Function persists cooldown state in Object Storage to prevent repeated failover attempts
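The detection side of this workflow (steps 1–3) can be sketched in Terraform as an alarm publishing to a notification topic that invokes the function. The resource names, metric query, and variables below are illustrative assumptions, not taken from this repository:

```hcl
# Hypothetical alarm → topic → function wiring (names and query are illustrative)
resource "oci_ons_notification_topic" "dr_failover" {
  compartment_id = var.compartment_id
  name           = "dr-failover"
}

resource "oci_ons_subscription" "invoke_failover_function" {
  compartment_id = var.compartment_id
  topic_id       = oci_ons_notification_topic.dr_failover.id
  protocol       = "ORACLE_FUNCTIONS"
  endpoint       = var.failover_function_ocid # OCID of the DR failover function
}

resource "oci_monitoring_alarm" "primary_unhealthy" {
  compartment_id        = var.compartment_id
  metric_compartment_id = var.compartment_id
  display_name          = "primary-health-check-failed"
  namespace             = "oci_healthchecks"            # assumed metric namespace
  query                 = "HTTP.isHealthy[1m].mean() < 1" # illustrative query
  severity              = "CRITICAL"
  pending_duration      = "PT3M" # must fail for 3 minutes before FIRING (step 1)
  is_enabled            = true
  destinations          = [oci_ons_notification_topic.dr_failover.id]
}
```

The `pending_duration` here mirrors the configurable window from step 1, and the function's own 180-second validation in step 4 acts as a second, independent guard against transient blips.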

Two-Phase Promotion

After a failover event, the secondary cluster can be promoted to full active status through a two-phase process.

| Phase | Variable Setting | Behavior | Purpose |
|---|---|---|---|
| Phase 1 | `dr = "active"`, `dr_use_dedicated_routing = false` | Generic Kubernetes service selectors route all traffic to available digest pods. UI and Ingest node pools begin scaling up. | Zero-downtime transition; traffic flows immediately |
| Phase 2 | `dr = "active"`, `dr_use_dedicated_routing = true` | Pool-specific Kubernetes service selectors: UI traffic routes to UI pods, Ingest traffic routes to Ingest pods | Optimal load distribution after all pools are ready |
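In practice, promotion is driven by flipping just these two variables between applies. A sketch of the corresponding variable settings (the variable names are the ones used above; the file layout is illustrative):

```hcl
# Phase 1: generic routing, zero-downtime cutover
dr                       = "active"
dr_use_dedicated_routing = false

# Phase 2, applied once the UI and Ingest pools are ready:
# dr                       = "active"
# dr_use_dedicated_routing = true
```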

Remote State Configuration

DR deployments use Terraform remote state to share configuration between the primary and secondary clusters:

| Direction | Data Shared | Purpose |
|---|---|---|
| Primary to Secondary | Storage encryption key, Ingest LB IP, health check IDs, steering policy ID | Standby cluster needs the primary's encryption key for bucket access and health check IDs for failover automation |
| Secondary to Primary | Ingest LB IP, health check IDs | Primary creates health check monitors for the secondary endpoint |
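The cross-cluster sharing above uses Terraform's `terraform_remote_state` data source. A minimal sketch from the secondary cluster's side, assuming an S3-compatible backend against OCI Object Storage (the backend settings and output names are illustrative, not taken from this repository):

```hcl
# Hypothetical: secondary cluster reading the primary's published outputs.
data "terraform_remote_state" "primary" {
  backend = "s3" # assumption: OCI Object Storage via its S3-compatible API
  config = {
    bucket = "example-tfstate-bucket"
    key    = "dr/primary/terraform.tfstate"
    region = "us-ashburn-1"
  }
}

locals {
  # Output names below are placeholders for whatever the primary stack exports.
  primary_encryption_key_id = data.terraform_remote_state.primary.outputs.storage_encryption_key_id
  primary_health_check_id   = data.terraform_remote_state.primary.outputs.health_check_id
}
```

For this to work, the primary stack must declare matching `output` blocks; anything not exported as an output is invisible to the consuming cluster.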