Architecture

Encryption key synchronization:

  • Primary generates the key on first deploy and exports it as a sensitive Terraform output.

  • Secondary reads the key via data.terraform_remote_state and creates a Kubernetes secret with the same value.

  • AZURE_RECOVER_FROM_* environment variables are set on the standby cluster as soon as it is provisioned, but they are only consumed once the single LogScale pod is started during the DR promotion procedure.
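The key hand-off described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repo's actual code: the `random_password` generator, the output name `encryption_key`, the backend settings, and the secret name, namespace, and data key are all hypothetical placeholders.

```terraform
# --- Primary state (sketch) ---
# Assumption: the key is generated with random_password; the repo may do this differently.
resource "random_password" "logscale_encryption_key" {
  length  = 64
  special = false
}

# Exported as a sensitive output so it is redacted in plan/apply output.
output "encryption_key" {
  value     = random_password.logscale_encryption_key.result
  sensitive = true
}

# --- Secondary state (sketch, lives in a separate configuration) ---
# Reads the primary's outputs from its remote state. All config values are placeholders.
data "terraform_remote_state" "primary" {
  backend = "azurerm"
  config = {
    resource_group_name  = "example-tfstate-rg"
    storage_account_name = "exampletfstate"
    container_name       = "tfstate"
    key                  = "primary.terraform.tfstate"
  }
}

# Creates a Kubernetes secret carrying the same key value as the primary.
resource "kubernetes_secret" "logscale_encryption_key" {
  metadata {
    name      = "logscale-encryption-key" # hypothetical name
    namespace = "logging"                 # hypothetical namespace
  }

  data = {
    "encryption-key" = data.terraform_remote_state.primary.outputs.encryption_key
  }
}
```

Marking an output as sensitive only hides it from CLI output. The value is still stored in the state file, so in a setup like this both state backends need to be access-controlled.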

Global DNS and Azure Function-based failover scaler:

  • A global DNS name (for example, ${global_logscale_hostname}.${global_dns_zone_name}) is managed in Azure DNS with a CNAME pointing to Azure Traffic Manager. When the primary health check fails, Traffic Manager routes traffic to the secondary cluster.

  • On the standby Azure cluster (dr="standby"), an event-driven chain (Traffic Manager Health Check → Azure Monitor Alert → Azure Function) scales the Humio operator from 0 → 1 so it can reconcile the already-declared nodeCount=1 and start the single LogScale pod. The Azure Function does not change spec.nodeCount, and it does not scale back down automatically; if you disable the Azure Function, you must scale the operator manually.

  • Terraform only deploys this Azure Function when dr="standby" for that state file. When you promote the secondary to dr="active" and re-apply, the Azure Function resources are removed automatically.

  • This reduces reliance on in-cluster polling: failover is triggered by the same health-check signal that drives DNS failover, and Kubernetes access is granted to the Azure Function via AKS credentials.
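The standby-only deployment of the Azure Function can be expressed with a conditional `count`. The snippet below is a minimal sketch: the variable name `dr` follows the `dr="standby"` notation used above, while the module name and source path are hypothetical.

```terraform
variable "dr" {
  type        = string
  description = "DR role of this state file: \"active\" or \"standby\"."
}

# The failover function exists only while this state is the standby.
# When dr changes to "active", count drops to 0 and the next apply destroys it.
module "dr_failover_function" {
  source = "./modules/dr-failover-function" # hypothetical module path
  count  = var.dr == "standby" ? 1 : 0
}
```

With this pattern, promoting the secondary needs no separate cleanup step for the function: changing `dr` and re-applying is enough.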

Azure Function Configuration Options:

The DR failover Azure Function exposes three settings that guard against false failovers and handle transient errors:

| Variable | Default | Range | Description |
| --- | --- | --- | --- |
| dr_failover_function_pre_failover_failure_seconds | 180 | 0-600 | Minimum number of consecutive seconds the primary must be failing before failover is triggered. Set to 0 for immediate failover (testing only). |
| dr_failover_function_cooldown_seconds | 300 | 0-3600 | Minimum time, in seconds, between failovers, to prevent flapping. |
| dr_failover_function_max_retries | 3 | 0-10 | Number of retry attempts for Kubernetes API calls that fail with transient errors. |

These variables can be set in the secondary cluster's tfvars file or left at their defaults. Pre-failover validation always runs, so the Azure Function does not trigger a failover on brief network blips or after the primary has already recovered (for example, during failback). The retry logic handles transient Kubernetes API failures during the scaling operation.
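For illustration, the defaults and ranges listed above could be enforced with Terraform `validation` blocks, as in the sketch below. Whether the repo declares these variables this way is an assumption. Only the variable names, defaults, and ranges come from this document.

```terraform
variable "dr_failover_function_pre_failover_failure_seconds" {
  type        = number
  default     = 180
  description = "Consecutive seconds the primary must be failing before failover."

  validation {
    condition = (
      var.dr_failover_function_pre_failover_failure_seconds >= 0 &&
      var.dr_failover_function_pre_failover_failure_seconds <= 600
    )
    error_message = "Must be between 0 and 600 seconds."
  }
}

variable "dr_failover_function_cooldown_seconds" {
  type        = number
  default     = 300
  description = "Minimum seconds between failovers."

  validation {
    condition = (
      var.dr_failover_function_cooldown_seconds >= 0 &&
      var.dr_failover_function_cooldown_seconds <= 3600
    )
    error_message = "Must be between 0 and 3600 seconds."
  }
}

variable "dr_failover_function_max_retries" {
  type        = number
  default     = 3
  description = "Retry attempts for Kubernetes API calls on transient errors."

  validation {
    condition = (
      var.dr_failover_function_max_retries >= 0 &&
      var.dr_failover_function_max_retries <= 10
    )
    error_message = "Must be between 0 and 10."
  }
}
```

Declared like this, an out-of-range value in tfvars fails at plan time instead of surfacing later as unexpected failover behavior.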

Example tfvars configuration for faster failover (testing):

terraform
# Immediate failover for testing (set to 180 for production)
dr_failover_function_pre_failover_failure_seconds = 0

Deterministic Storage Container Naming

Storage account names must be globally unique in Azure. This repo intentionally includes a short random prefix (random_string.name-modifier) in local.resource_name_prefix, so the exact storage account/container names are:

  • Stable within a state file (the random prefix is stored in Terraform state)

  • Not knowable in advance before the first apply
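A rough sketch of why the names behave this way is shown below. Only `random_string.name-modifier`, `local.resource_name_prefix`, and the `storage_account_name` output name come from this document. The string length, the `logscale` and `sa` name parts, and the way the final name is assembled are assumptions for illustration.

```terraform
# Generated once on first apply and then persisted in this state file,
# so later applies reuse the same value.
resource "random_string" "name-modifier" {
  length  = 4     # assumed length
  special = false
  upper   = false # storage account names allow only lowercase letters and digits
}

locals {
  # Random part first, so every name built from the prefix inherits it.
  resource_name_prefix = "${random_string.name-modifier.result}logscale" # "logscale" is a placeholder

  # Storage account names are limited to 24 characters.
  storage_account_name = substr("${local.resource_name_prefix}sa", 0, 24)
}

# Exposed as an output so operators can read the real name instead of guessing it.
output "storage_account_name" {
  value = local.storage_account_name
}
```

Because the random value is only known after the first apply, the resulting names cannot be hard-coded in the other region's configuration ahead of time.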

For DR operations, do not guess names. Use Terraform outputs in each state:

shell
terraform output -raw storage_account_name
terraform output -raw storage_container_name
terraform output -raw storage_blob_endpoint

Important

The current DR design does not require the primary to pre-know the secondary container name for RBAC. The standby cluster reads the primary's storage details via primary_remote_state_config and performs the cross-region firewall update from the standby side. See Cross-Region Storage Access for DR Recovery.