Architecture
Encryption key synchronization:
Primary generates the key on first deploy and exports it as a sensitive Terraform output.
Secondary reads the key via data.terraform_remote_state and creates a Kubernetes secret with the same value.
AZURE_RECOVER_FROM_* environment variables are set on the standby cluster as soon as it is provisioned, but they are only consumed once the single LogScale pod is started during the DR promotion procedure.
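The key hand-off above could be sketched in Terraform roughly as follows. This is a minimal illustration, not the repo's actual code: it assumes an azurerm state backend and a key generated with random_password, and the output name logscale_encryption_key, the secret name, the namespace, and all backend values are hypothetical placeholders. The two halves live in different state files.

```hcl
# --- Primary state file: generate the key once and export it ---
# Assumption: the key is produced by random_password; its value is stored in
# state, so it stays the same on every apply after the first.
resource "random_password" "encryption_key" {
  length  = 64
  special = false
}

output "logscale_encryption_key" { # hypothetical output name
  value     = random_password.encryption_key.result
  sensitive = true
}

# --- Secondary state file: read the key and mirror it into Kubernetes ---
data "terraform_remote_state" "primary" {
  backend = "azurerm"
  config = {
    resource_group_name  = "example-tfstate-rg" # placeholder
    storage_account_name = "exampletfstate"     # placeholder
    container_name       = "tfstate"            # placeholder
    key                  = "primary.tfstate"    # placeholder
  }
}

resource "kubernetes_secret" "encryption_key" {
  metadata {
    name      = "logscale-encryption-key" # hypothetical
    namespace = "logging"                 # hypothetical
  }

  # The provider base64-encodes values in "data" when it creates the secret.
  data = {
    "encryption-key" = data.terraform_remote_state.primary.outputs.logscale_encryption_key
  }
}
```

Because the secondary reads the value from the primary's state rather than generating its own, both clusters end up with the same key without the key being copied by hand.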
Global DNS and Azure Function-based failover scaler:
A global DNS name (for example, ${global_logscale_hostname}.${global_dns_zone_name}) is managed in Azure DNS with a CNAME pointing to Azure Traffic Manager. When the primary health check fails, Traffic Manager routes traffic to the secondary cluster.
On the standby Azure cluster (dr="standby"), an event-driven chain (Traffic Manager Health Check → Azure Monitor Alert → Azure Function) scales the Humio operator from 0 → 1 so it can reconcile the already-declared nodeCount=1 and start the single LogScale pod. The Azure Function does not change spec.nodeCount, and it does not scale back down automatically; if you disable the Azure Function, you must scale the operator manually.
Terraform only deploys this Azure Function when dr="standby" for that state file. When you promote the secondary to dr="active" and re-apply, the Azure Function resources are removed automatically.
This reduces reliance on in-cluster polling: failover is triggered by the same health-check signal that drives DNS failover, and Kubernetes access is granted to the Azure Function via AKS credentials.
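Deploying a component only in standby mode is commonly expressed in Terraform with count. The sketch below assumes a dr input variable and a hypothetical local module path; the module name and its location are illustrative and not taken from this repo.

```hcl
variable "dr" {
  type        = string
  description = "DR role of this state file, for example \"active\" or \"standby\"."
}

# Hypothetical module wrapping the failover Azure Function and its alert wiring.
module "dr_failover_function" {
  source = "./modules/dr-failover-function" # hypothetical path

  # One instance while this state is the standby, none otherwise.
  count = var.dr == "standby" ? 1 : 0
}
```

With this pattern, changing dr to "active" drops the count to 0, so the next apply plans the function resources for destruction, which matches the removal behavior described above.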
Azure Function Configuration Options:
The DR failover Azure Function includes several configurable features to prevent false failovers and handle transient errors:
| Setting | Default | Range | Description |
|---|---|---|---|
| dr_failover_function_pre_failover_failure_seconds | 180 | 0-600 | Minimum number of consecutive seconds the primary must be failing before failover. Set to 0 for immediate failover (testing only). |
| dr_failover_function_cooldown_seconds | 300 | 0-3600 | Minimum time in seconds between failovers, to prevent flapping. |
| dr_failover_function_max_retries | 3 | 0-10 | Retry attempts for Kubernetes API calls on transient errors. |
These can be configured in the secondary tfvars or left at defaults. Pre-failover validation always runs to ensure the Azure Function doesn't trigger failover on brief network blips or when the primary has recovered (for example, during failback). The retry logic handles transient Kubernetes API failures during the scaling operation.
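The defaults and allowed ranges listed above can be enforced at plan time with variable validation. How the repo actually declares these variables is not shown here; the following is a sketch that uses the documented names, defaults, and ranges, with descriptions and error messages written for illustration.

```hcl
variable "dr_failover_function_pre_failover_failure_seconds" {
  type        = number
  default     = 180
  description = "Consecutive seconds the primary must be failing before failover."

  validation {
    condition = (
      var.dr_failover_function_pre_failover_failure_seconds >= 0 &&
      var.dr_failover_function_pre_failover_failure_seconds <= 600
    )
    error_message = "Value must be between 0 and 600."
  }
}

variable "dr_failover_function_cooldown_seconds" {
  type        = number
  default     = 300
  description = "Minimum seconds between failovers."

  validation {
    condition = (
      var.dr_failover_function_cooldown_seconds >= 0 &&
      var.dr_failover_function_cooldown_seconds <= 3600
    )
    error_message = "Value must be between 0 and 3600."
  }
}

variable "dr_failover_function_max_retries" {
  type        = number
  default     = 3
  description = "Retry attempts for Kubernetes API calls on transient errors."

  validation {
    condition = (
      var.dr_failover_function_max_retries >= 0 &&
      var.dr_failover_function_max_retries <= 10
    )
    error_message = "Value must be between 0 and 10."
  }
}
```

Declared this way, an out-of-range value in the secondary tfvars fails during terraform plan instead of reaching the Azure Function.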
Example tfvars configuration for faster failover (testing):
# Immediate failover for testing (set to 180 for production)
dr_failover_function_pre_failover_failure_seconds = 0

Deterministic Storage Container Naming
Storage account names must be globally unique in Azure. This repo
intentionally includes a short random prefix
(random_string.name-modifier) in
local.resource_name_prefix, so the exact storage
account/container names are:
Stable within a state file (the random prefix is stored in Terraform state)
Not knowable in advance before the first apply
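Both properties follow from how the random provider works: the generated value is written to state on the first apply and reused afterwards. The sketch below illustrates the idea only; the prefix length, the way the prefix is composed, and the local example_storage_account_name are assumptions, not this repo's definitions.

```hcl
resource "random_string" "name-modifier" {
  length  = 4     # assumed length
  upper   = false # storage account names allow only lowercase letters and digits
  special = false
}

locals {
  # Assumed composition; the real local may combine other inputs.
  resource_name_prefix = "${random_string.name-modifier.result}logscale"

  # Hypothetical derived name, shown only to illustrate how the prefix propagates.
  example_storage_account_name = "${local.resource_name_prefix}sa"
}
```

Since random_string.name-modifier.result does not exist until the resource has been created, any name derived from it is unknown at plan time on a fresh state.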
For DR operations, do not guess names. Use Terraform outputs in each state:
terraform output -raw storage_account_name
terraform output -raw storage_container_name
terraform output -raw storage_blob_endpoint

Important
The current DR design does not require the primary to know the
secondary container name in advance for RBAC. The standby cluster reads the
primary's storage details via
primary_remote_state_config and performs the
cross-region firewall update from the standby side. See
Cross-Region Storage Access for DR Recovery.
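On the standby side, reading the primary's storage details might look like the sketch below. The output names match the terraform output commands shown earlier; the object shape of primary_remote_state_config and the azurerm backend are assumptions, and the firewall update itself is not shown.

```hcl
variable "primary_remote_state_config" {
  description = "Backend settings for the primary's state file (assumed shape)."
  type = object({
    resource_group_name  = string
    storage_account_name = string
    container_name       = string
    key                  = string
  })
}

data "terraform_remote_state" "primary" {
  backend = "azurerm" # assumed backend type
  config  = var.primary_remote_state_config
}

locals {
  # Primary storage details, resolved from state rather than guessed.
  primary_storage_account_name   = data.terraform_remote_state.primary.outputs.storage_account_name
  primary_storage_container_name = data.terraform_remote_state.primary.outputs.storage_container_name
  primary_storage_blob_endpoint  = data.terraform_remote_state.primary.outputs.storage_blob_endpoint
}
```

Resolving the names this way keeps the dependency one-directional: the standby depends on the primary's outputs, and the primary needs no knowledge of the standby.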