Disaster Recovery Operations Guide

This guide covers disaster recovery (DR) operations for the primary and standby clusters.

See also Disaster Recovery Technical Reference.

Overview

Two clusters are managed using separate Terraform state files:

  • Primary (eastus): production, dr="active".

  • Secondary (westus): standby, dr="standby", minimal capacity. It reads the primary's Azure Blob Storage container using the same encryption key, which is pulled from the primary's remote state, and keeps all LogScale pods scaled to zero until a failover/promotion is initiated.
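
The remote state lookup described above can be sketched as follows. This is a minimal illustration, assuming the primary's state lives in an azurerm backend; the resource group, storage account, container, state key, and the output name storage_encryption_key are hypothetical placeholders, not values taken from this deployment.

```hcl
# Hypothetical sketch: the standby configuration reads the primary's state.
# All names below are placeholders; substitute your own backend settings.
data "terraform_remote_state" "primary" {
  backend = "azurerm"

  config = {
    resource_group_name  = "example-tfstate-rg" # placeholder
    storage_account_name = "exampletfstate"     # placeholder
    container_name       = "tfstate"            # placeholder
    key                  = "primary.tfstate"    # placeholder
  }
}

locals {
  # The output name is an assumption; use whatever the primary actually exports.
  primary_encryption_key = data.terraform_remote_state.primary.outputs.storage_encryption_key
}
```

Only root-module outputs of the primary are visible through terraform_remote_state, so the key has to be exported as an output there for a lookup like this to work.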

Region flexibility:

  • The regions shown (eastus and westus) are examples only. You can deploy in any Azure regions supported by your organization. Update azure_resource_group_region in your tfvars, the remote state configuration, and any region-specific references (for example Traffic Manager/DNS) to match your chosen regions.
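
As an illustration of the region settings above, a standby tfvars file might look like the sketch below. The file name is hypothetical; only azure_resource_group_region and dr are taken from this guide, and the values are the example ones used here.

```hcl
# secondary.tfvars (hypothetical file name; example values only)
# The primary would use its own region (for example "eastus") and dr = "active".
azure_resource_group_region = "westus"
dr                          = "standby"
```

If you change the region here, remember that the remote state configuration and any Traffic Manager/DNS references need the matching update.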

Key features:

  • Automated encryption key synchronization (no hardcoding). Applying the standby requires the primary's key, supplied either from remote state or as an explicit value.

  • Cross-region storage access: the primary storage firewall is updated to allow the secondary's NAT gateway IP, and the primary storage account key is read from remote state (AZURE_RECOVER_FROM_ACCOUNTKEY). Terraform also assigns the Storage Blob Data Reader role to the standby AKS managed identity.

  • Alerts toggle automatically via ENABLE_ALERTS based on dr (true for active, false for standby).

  • Standby keeps the Humio operator scaled to 0 replicas; on failover, an Azure Function (or a manual step) scales it to 1. nodeCount is already set to 1 in the HumioCluster manifest. There is no automatic scale-down.

  • Manual, controlled promotion by changing dr and applying Terraform.
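
The dr-driven alert toggle can be sketched as below. This is an assumption about how such a toggle could be expressed, not the module's actual code; the local name enable_alerts and the output are hypothetical, and how the value reaches ENABLE_ALERTS is not shown.

```hcl
# Hypothetical sketch of deriving the alert setting from the dr variable.
variable "dr" {
  type        = string
  description = "DR role of this cluster, for example \"active\" or \"standby\"."
}

locals {
  # true when the cluster is active, false when it is standby.
  enable_alerts = var.dr == "active"
}

output "enable_alerts" {
  # Exposed only so the derived value can be inspected after an apply.
  value = local.enable_alerts
}
```

Under this sketch, promoting the standby means changing dr to "active" in its tfvars and applying, at which point the derived value flips to true.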

Key capabilities:

| Feature | Primary (Active) | Secondary (Standby) |
|---|---|---|
| Region | eastus | westus |
| Cluster Type | Advanced (full production) | Standby (Humio operator off) |
| Node Pools | All pools per cluster_size (system/digest/ingress/ingest/ui/kafka) | System/digest/ingress/kafka only; UI and Ingest node pools not created |
| Humio nodeCount | cluster_size digest count | nodeCount=1 declared, but no pods run until operator is scaled up |
| Humio operator | 1 replica | 0 replicas until failover |
| Replication Factor | Production value | 1 (overridden) |
| Auto Rebalance | Enabled | Disabled |
| Storage Container | terraform output -raw storage_container_name | terraform output -raw storage_container_name |
| Encryption Key | Generated on first deploy | Pulled from primary state (required for standby apply) |
| Terraform Workspace | primary | secondary |
| DR Mode | dr = "active" | dr = "standby" |