Terraform Modules for Disaster Recovery (DR)

This implementation introduces two new Terraform modules specifically for disaster recovery. These modules automate the critical DR operations that would otherwise require manual intervention.

Why These Modules Are Needed

In a disaster recovery scenario, two things must happen quickly:

  1. Traffic must be redirected from the failed primary cluster to the healthy secondary cluster

  2. The secondary cluster must start up and begin serving requests

Without automation, an operator would need to manually update DNS records and scale up Kubernetes deployments - a process that could take 15-30 minutes or more. The modules below reduce this to under 10 minutes with no human intervention.

module.global-dns

The following table provides a summary of this module:

| Purpose | Summary | Deployed on |
| --- | --- | --- |
| Provides automatic traffic failover between primary and secondary clusters using Azure Traffic Manager. | When users access your LogScale cluster, they use a single global DNS name (like logscale.example.com). This module creates an Azure Traffic Manager that continuously monitors both clusters' health. If the primary cluster becomes unhealthy, Traffic Manager automatically routes all traffic to the secondary cluster - no DNS changes needed, no manual intervention required. | Primary (active) cluster only, when manage_global_dns = true |

Key resources created:

| Resource | Purpose |
| --- | --- |
| azurerm_traffic_manager_profile | Manages health-based routing between clusters using Priority routing method |
| azurerm_traffic_manager_external_endpoint (primary) | Points to primary cluster's load balancer IP (priority 1) |
| azurerm_dns_cname_record | Creates the global hostname CNAME pointing to Traffic Manager (optional - see below) |

Secondary Endpoint Registration:

The secondary (standby) cluster automatically registers itself with Traffic Manager using azapi_resource.traffic_manager_secondary_endpoint. This resource:

  • Is created when manage_global_dns = false and the primary Traffic Manager endpoint ID is available via remote state (so it persists through DR promotion)

  • Adds the secondary cluster's load balancer IP as a priority 2 endpoint

  • Requires no manual configuration - the standby cluster discovers the Traffic Manager profile from the primary's remote state

This approach eliminates the need for the primary cluster to know the secondary's IP address in advance, simplifying the deployment sequence.

DNS Configuration Options:

The module supports two DNS configurations depending on where your DNS zone is hosted:

| Scenario | global_dns_create_azure_record | global_dns_zone_resource_group | Action Required |
| --- | --- | --- | --- |
| DNS in Azure | true | Resource group name | Module creates CNAME automatically |
| DNS external (AWS Route 53, etc.) | false | "" (empty) | Manually create CNAME in external DNS |

When DNS is managed externally (e.g., AWS Route 53):

If your DNS zone is hosted outside Azure (common when humio.net or similar domains are managed in AWS Route 53), set:

text
global_dns_zone_name           = "azure-dr.humio.net"  # Used for Traffic Manager host header
global_dns_zone_resource_group = ""                    # Empty - no Azure DNS zone
global_dns_create_azure_record = false                 # Skip Azure DNS CNAME creation

Then manually create a CNAME record in your external DNS provider:

| Record Type | Name | Value | TTL (seconds) |
| --- | --- | --- | --- |
| CNAME | <global_logscale_hostname>.<zone> | <global_logscale_hostname>.trafficmanager.net | 60 |

Example for AWS Route 53:

text
Record Name: logscale.azure-dr.humio.net
Record Type: CNAME
Record Value: <tm-profile>.trafficmanager.net
TTL: 60 seconds
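
Once the record exists, a quick check can confirm that it points at Traffic Manager. The sketch below is illustrative only: points_to_traffic_manager is a hypothetical helper name, example-profile is a placeholder, and the helper assumes the CNAME target is passed in as text (for example the output of dig +short CNAME, which includes a trailing dot).

```shell
# Hypothetical helper: succeeds when the given CNAME target is under
# trafficmanager.net. Accepts the trailing dot that dig prints.
points_to_traffic_manager() {
  case "$1" in
    *.trafficmanager.net|*.trafficmanager.net.) return 0 ;;
    *) return 1 ;;
  esac
}

# Example with a literal placeholder value; in practice pass the output of:
#   dig +short CNAME logscale.azure-dr.humio.net
if points_to_traffic_manager "example-profile.trafficmanager.net."; then
  echo "CNAME targets Traffic Manager"
fi
```

If the check fails, the record most likely still points at a cluster-specific hostname or IP rather than the Traffic Manager FQDN.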

Note

Unlike OCI (which requires NS record delegation for subdomain zones), Azure Traffic Manager uses its own *.trafficmanager.net domain. You only need a simple CNAME record pointing to the Traffic Manager FQDN - no NS delegation required.

Traffic Manager Priority Routing:

The following diagram provides an overview of the traffic routing:

[Figure: Traffic Manager Priority Routing]

Priority Routing Logic:

The following table shows priority routing logic:

| Primary Status | Secondary Status | Traffic Routed To |
| --- | --- | --- |
| Online | Online | Primary (Priority 1) |
| Online | Degraded | Primary (Priority 1) |
| Degraded | Online | Secondary (Priority 2) |
| Degraded | Degraded | No healthy endpoint |
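
The table above can be read as a simple decision rule. The following sketch restates it as a shell function; route_target is a hypothetical name used only for illustration, and the real decision is made inside Azure Traffic Manager, not in any script.

```shell
# Hypothetical helper that mirrors the priority routing table.
# Arguments: primary status, secondary status (Online or Degraded).
route_target() {
  primary_status=$1
  secondary_status=$2
  if [ "$primary_status" = "Online" ]; then
    echo "primary"      # priority 1 wins whenever it is healthy
  elif [ "$secondary_status" = "Online" ]; then
    echo "secondary"    # priority 2 is used only when priority 1 is Degraded
  else
    echo "none"         # no healthy endpoint
  fi
}

route_target Online Degraded    # prints: primary
route_target Degraded Online    # prints: secondary
```

The first call corresponds to normal standby operation, the second to the state after a failover.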

Expected Profile Status During Normal Operations:

When the secondary cluster is in standby mode (dr="standby"), the Traffic Manager profile will show a status of "Degraded". This is expected and correct behavior, not an error.

Why this happens:

  • The secondary cluster's humio-operator is scaled to 0 replicas (standby mode)

  • No LogScale pods are running on the secondary cluster

  • The secondary endpoint fails health checks and shows "Degraded"

  • Traffic Manager marks the overall profile as "Degraded" when any endpoint is unhealthy

What you should see:

| Component | Expected Status | Notes |
| --- | --- | --- |
| Primary endpoint | Online | Actively serving traffic |
| Secondary endpoint | Degraded | Expected - standby mode, no LogScale pods |
| Profile status | Degraded | Expected - reflects secondary's standby state |
| Traffic routing | Working | Routes to primary (highest priority Online endpoint) |

Verification:

Despite the "Degraded" profile status, traffic routes correctly:

bash
# Global DR FQDN should return HTTP 200
curl -sk https://logscale.azure-dr.humio.net/api/v1/status
# DNS resolves to primary IP
dig +short logscale.azure-dr.humio.net
# Returns: <tm-profile>.trafficmanager.net → <primary-ip>

When to be concerned:

Investigate only if the primary endpoint shows "Degraded" while the primary cluster should be healthy. That combination indicates an actual issue.

Health Check Configuration:

| Setting | Value | Description |
| --- | --- | --- |
| Protocol | HTTPS | Secure health probes |
| Port | 443 | Standard HTTPS port |
| Path | /api/v1/status | LogScale health endpoint |
| Interval | 30 seconds | Probe frequency |
| Timeout | 10 seconds | Max wait for response |
| Tolerated Failures | 3 | Failures before marking Degraded |
| Host Header | logscale.azure-dr.humio.net | Required for ingress routing |

Failover Timing:

  • Detection: ~90 seconds (3 failures × 30s interval)

  • DNS propagation: ~60 seconds (TTL)

  • Total failover time: ~2-3 minutes
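
The estimate above is plain arithmetic on the health check settings. A minimal sketch, using the values from the health check table (the variable names are illustrative, not module inputs):

```shell
probe_interval=30      # seconds between probes
tolerated_failures=3   # failures before the endpoint is marked Degraded
dns_ttl=60             # seconds, TTL of the DNS answer

detection=$((tolerated_failures * probe_interval))
total=$((detection + dns_ttl))
echo "detection=${detection}s dns=${dns_ttl}s total=${total}s"
# prints: detection=90s dns=60s total=150s
```

150 seconds is 2.5 minutes, which is where the ~2-3 minute figure comes from. Lowering the interval or the tolerated failures shortens detection at the cost of more sensitivity to transient probe failures.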

module.dr-failover-function

The following table provides a summary of this module:

| Purpose | Summary | Deployed on |
| --- | --- | --- |
| Automatically starts the LogScale application on the secondary cluster when the primary fails. | The secondary cluster runs in a minimal "standby" state to save costs - the Humio operator is scaled to zero, so no LogScale pods are running. When the primary cluster fails, this module's Azure Function automatically scales up the Humio operator, which then starts the LogScale pod to recover from the primary's data. This happens automatically, triggered by the same health check that Traffic Manager uses. | Secondary (standby) cluster only, when dr = "standby" and dr_failover_function_enabled = true |

Key resources created:

The following table shows the key resources created:

| Resource | Purpose |
| --- | --- |
| azurerm_service_plan | Consumption-based (Y1) plan for cost efficiency |
| azurerm_linux_function_app | Python 3.11 function that scales the Humio operator |
| azurerm_role_assignment | Grants the function "AKS Cluster Admin" role to manage deployments |
| azurerm_monitor_action_group | Connects the alert to the function |
| azurerm_monitor_metric_alert | Fires when primary Traffic Manager endpoint becomes unhealthy |
| azurerm_storage_account | Storage account for Function App (deployed in same region as Function App) |

Metric Alert Configuration:

The metric alert monitors the health state of the primary Traffic Manager external endpoint:

| Setting | Value | Description |
| --- | --- | --- |
| Metric Namespace | Microsoft.Network/trafficManagerProfiles | Traffic Manager profile metrics |
| Metric Name | ProbeAgentCurrentEndpointStateByProfileResourceId | Endpoint health state (1=healthy, 0=unhealthy) |
| Aggregation | Maximum | Use maximum value in evaluation window |
| Operator | LessThan | Alert when value drops below threshold |
| Threshold | 1 | Fires when endpoint is unhealthy (state < 1) |
| Frequency | PT1M | Evaluate every 1 minute |
| Window Size | PT1M | Evaluate over 1-minute window |
| Dimension filter | EndpointName = <primary-endpoint-name> | Filters the metric to the primary external endpoint |
| Skip metric validation | true | Allows alert creation even if the metric is temporarily unavailable |

Implementation note:

The module scopes the alert to the Traffic Manager profile and uses the EndpointName dimension (extracted from the primary endpoint resource ID) to target only the primary endpoint.
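
To make the alert condition concrete, the sketch below evaluates one window of samples the way the settings describe: take the Maximum, then compare it with the threshold using LessThan. alert_fires is a hypothetical helper for illustration; the actual evaluation happens in Azure Monitor, and the handling of an empty window here is an assumption.

```shell
# alert_fires SAMPLE...: succeeds (alert fires) when the Maximum of the
# samples in the window is below the threshold of 1.
alert_fires() {
  [ "$#" -gt 0 ] || return 1    # assumption: no samples means nothing to evaluate
  max=0
  for sample in "$@"; do
    if [ "$sample" -gt "$max" ]; then max=$sample; fi
  done
  [ "$max" -lt 1 ]
}

if alert_fires 0 0; then echo "all samples unhealthy: alert fires"; fi
if ! alert_fires 0 1; then echo "one healthy sample: no alert"; fi
```

Because the aggregation is Maximum, a single healthy sample in the window is enough to keep the alert from firing, which filters out brief probe blips.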

How it works:

The following diagram provides an overview of the process:

[Figure: DR Azure Failover Process]

Failover chain timing:

| Stage | Duration |
| --- | --- |
| Traffic Manager detection | ~30-60 seconds |
| Azure Monitor alert evaluation | ~60 seconds |
| Pre-failover validation (configurable) | ~180 seconds (default) |
| Azure Function execution | ~10-20 seconds |
| Total (detection → function complete) | ~4-5 minutes |
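
The total is the sum of the stages, and the pre-failover validation dominates it. The sketch below adds up the minimum and maximum stage durations from the table for a given validation period; chain_total is a hypothetical helper name, and the argument corresponds to dr_failover_function_pre_failover_failure_seconds.

```shell
# chain_total PRE_FAILOVER_SECONDS: prints the min-max duration of the chain.
chain_total() {
  pre_failover=$1
  min=$((30 + 60 + pre_failover + 10))   # detection + alert + validation + function
  max=$((60 + 60 + pre_failover + 20))
  echo "${min}-${max} seconds"
}

chain_total 180   # prints: 280-320 seconds (default validation period)
chain_total 0     # prints: 100-140 seconds (validation disabled)
```

With the default of 180 seconds the range works out to roughly 4.7 to 5.3 minutes, in line with the ~4-5 minute figure in the table.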

Configuration options (in tfvars):

| Variable | Default | Description |
| --- | --- | --- |
| dr_failover_function_pre_failover_failure_seconds | 180 | Seconds primary must be failing before triggering failover (set to 0 for testing) |
| dr_failover_function_cooldown_seconds | 300 | Minimum time between failovers to prevent flapping |
| dr_failover_function_max_retries | 3 | Retry attempts for Kubernetes API calls |
| dr_failover_function_location | null | Azure region override for Function App deployment. If not set, defaults to the resource group region. Useful when the primary region lacks quota for consumption-based (Y1) Function Apps. |
| dr_failover_function_sku | Y1 | SKU for the Function App Service Plan. Options: Y1 (Consumption), EP1/EP2/EP3 (Premium), B1/B2/B3 (Basic). Use B1 if Consumption/Premium quota is unavailable in your region. |
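
The first three settings act as guards around the failover action. The sketch below shows one way such guards could combine. It is not the function's actual code (which is Python and not reproduced here): should_failover and retry are hypothetical helpers, and treating the retry setting as a total attempt count is an assumption.

```shell
# Defaults from the table above.
PRE_FAILOVER_FAILURE_SECONDS=180
COOLDOWN_SECONDS=300
MAX_RETRIES=3

# should_failover FAILING_FOR SINCE_LAST: succeeds only when the primary has
# been failing long enough and the cooldown since the last failover has passed.
should_failover() {
  [ "$1" -ge "$PRE_FAILOVER_FAILURE_SECONDS" ] || return 1
  [ "$2" -ge "$COOLDOWN_SECONDS" ] || return 1
  return 0
}

# retry COMMAND...: runs the command until it succeeds, at most MAX_RETRIES times.
retry() {
  attempt=1
  while [ "$attempt" -le "$MAX_RETRIES" ]; do
    "$@" && return 0
    attempt=$((attempt + 1))
  done
  return 1
}

should_failover 200 600 && echo "failover allowed"
should_failover 60 600 || echo "primary not failing long enough"
should_failover 200 120 || echo "still in cooldown"
```

Setting the validation period to 0, as suggested for testing, makes the first guard pass immediately, so only the cooldown can block a failover.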

Function App SKU Selection:

Azure Function Apps support multiple pricing tiers. The choice depends on quota availability in your target region:

| SKU | Type | Quota Required | Use Case |
| --- | --- | --- | --- |
| Y1 | Consumption | Dynamic VMs | Default, cheapest, pay-per-execution |
| EP1/EP2/EP3 | Premium | ElasticPremium VMs | Pre-warmed instances, VNet integration |
| B1/B2/B3 | Basic | BS Series | Fallback when Consumption/Premium unavailable |

Checking Azure Quota:

If deployment fails with quota errors, check available quota:

shell
# Check Dynamic VMs quota (for Y1 Consumption plan)
az vm list-usage --location <region> -o table | grep -i dynamic

# Check BS Series quota (for B1/B2/B3 Basic plan)
az vm list-usage --location <region> -o table | grep -i "BS Family"

Example tfvars for Basic SKU (quota workaround):

text
# Use Basic plan when Consumption (Y1) quota is unavailable
dr_failover_function_sku      = "B1"
dr_failover_function_location = "eastus2"

Cross-Region Deployment:

The Function App can be deployed to a different Azure region than the AKS cluster if quota constraints prevent deployment in the primary region. This is configured using the dr_failover_function_location variable:

text
# Deploy Function App to westus when eastus2 lacks Y1 quota
dr_failover_function_location = "westus"

Why this works:

The Azure Function reaches AKS through Azure's control plane API (ARM) and the cluster's Kubernetes API endpoint, not the pod network, so it does not need to run in the same region as the cluster. The Function App's Managed Identity is granted the "Azure Kubernetes Service Cluster Admin Role" on the AKS cluster. Azure RBAC assignments are not tied to a region, so the assignment works regardless of where the Function App is deployed.

AKS Authorized IP Ranges (Critical):

When AKS is configured with authorized IP ranges (ip_ranges_allowed_to_kubeapi), the Function App's outbound IPs must be included in the authorized list. Otherwise, the function will fail to connect to the Kubernetes API with connection timeout errors.

This is handled automatically by the azapi_update_resource.aks_authorized_ips_for_function resource in data-sources.tf, which:

  1. Reads the Function App's possible_outbound_ip_addresses after deployment

  2. Merges these IPs with the existing AKS authorized IP ranges

  3. Updates the AKS cluster's apiServerAccessProfile.authorizedIPRanges
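
The merge in step 2 amounts to a de-duplicated union of two lists. The sketch below illustrates that on plain strings; merge_ip_ranges is a hypothetical helper, the addresses are example values, and the module itself performs the merge in Terraform rather than in shell.

```shell
# merge_ip_ranges EXISTING NEW: both arguments are comma-separated lists.
# Prints the de-duplicated union as one comma-separated line.
merge_ip_ranges() {
  {
    printf '%s\n' "$1" | tr ',' '\n'
    printf '%s\n' "$2" | tr ',' '\n'
  } | sed '/^$/d' | LC_ALL=C sort -u | paste -sd, -
}

# Existing AKS ranges first, Function App outbound IPs second.
merge_ip_ranges "10.0.0.0/24,20.1.2.3" "20.1.2.3,20.1.2.4"
# prints: 10.0.0.0/24,20.1.2.3,20.1.2.4
```

Taking the union rather than replacing the list matters: the existing operator and CI ranges must stay authorized after the Function App IPs are added.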

Why this requires a separate resource:

A circular dependency exists because:

  • The AKS cluster must be created before the Function App (Function App needs AKS resource ID for RBAC)

  • The Function App's outbound IPs are only known after creation

  • The AKS cluster needs the Function App's IPs in its authorized ranges

The azapi_update_resource.aks_authorized_ips_for_function resource breaks this cycle by updating AKS after both resources exist.

Troubleshooting Function App connectivity:

If the Function App logs show connection timeout errors like:

text
Connection to <aks-cluster>.hcp.<region>.azmk8s.io timed out

Verify the function's outbound IPs are in the AKS authorized ranges:

shell
# Get Function App outbound IPs
az functionapp show \
  --name <function-app-name> \
  --resource-group <rg-name> \
  --query "possibleOutboundIpAddresses" -o tsv | tr ',' '\n'

# Get AKS authorized IP ranges
az aks show \
  --name <aks-cluster-name> \
  --resource-group <rg-name> \
  --query "apiServerAccessProfile.authorizedIpRanges" -o tsv

# If IPs are missing, apply the data-sources update:
terraform apply -var-file=secondary-<region>.tfvars \
  -target=azapi_update_resource.aks_authorized_ips_for_function
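
Comparing the two lists by eye is error-prone when there are many addresses. The sketch below prints the Function App IPs that do not appear in the authorized list; missing_ips is a hypothetical helper, the addresses are example values, and it only does exact matching (with an optional /32 suffix), so an IP covered by a wider CIDR range would still be reported.

```shell
# missing_ips FUNCTION_IPS AUTHORIZED
#   FUNCTION_IPS: comma-separated outbound IPs of the Function App
#   AUTHORIZED:   AKS authorized ranges separated by commas, spaces, or newlines
# Prints each Function App IP that has no exact match in AUTHORIZED.
missing_ips() {
  authorized=$(printf '%s\n' "$2" | tr ', \t' '\n\n\n' | sed 's|/32$||')
  printf '%s\n' "$1" | tr ',' '\n' | while read -r ip; do
    [ -n "$ip" ] || continue
    printf '%s\n' "$authorized" | grep -qxF "$ip" || echo "$ip"
  done
}

missing_ips "20.1.2.3,20.1.2.4" "10.0.0.0/24 20.1.2.3/32"
# prints: 20.1.2.4
```

In practice the two arguments would be the outputs of the az functionapp show and az aks show commands above. An empty result suggests every outbound IP is already authorized.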

Note

The function may have 25 or more possible outbound IPs, depending on the SKU. All of them must be included in the authorized IP ranges for reliable connectivity.

Troubleshooting Function Code Deployment:

When updating the Function App code, Terraform's zip_deploy_file may not always trigger a proper code update due to Azure's caching mechanisms. If the function continues to fail with errors referencing old code after a Terraform apply, use the Kudu zipdeploy API directly.

Symptoms of stale code deployment:

  • Function logs show errors referencing code/variables that have been removed

  • The FUNCTION_CODE_HASH app setting shows the new hash, but the function behavior doesn't match

  • Function returns HTTP 500 with empty response body

Diagnose by checking deployed code:

shell
# Get deployment credentials (a multiselect list keeps the tsv column order fixed)
CREDS=$(az functionapp deployment list-publishing-profiles \
  --name <function-app-name> \
  --resource-group <rg-name> \
  --query "[?publishMethod=='MSDeploy'].[userName,userPWD]" -o tsv)
# Use dedicated names so the shell's own $USER variable is not overwritten
KUDU_USER=$(printf '%s\n' "$CREDS" | cut -f1)
KUDU_PASS=$(printf '%s\n' "$CREDS" | cut -f2)

# Check what's in wwwroot
curl -s -u "${KUDU_USER}:${KUDU_PASS}" \
  "https://<function-app-name>.scm.azurewebsites.net/api/vfs/site/wwwroot/" | jq '.[].name'

# Check function logs for errors (reads the first log file listed)
curl -s -u "${KUDU_USER}:${KUDU_PASS}" \
  "https://<function-app-name>.scm.azurewebsites.net/api/vfs/LogFiles/Application/Functions/Host/" | \
  jq -r '.[0].href' | xargs curl -s -u "${KUDU_USER}:${KUDU_PASS}" | tail -50

Force code deployment via Kudu API:

shell
# Navigate to the dr-failover-function module
cd modules/azure/dr-failover-function

# Get deployment credentials (the Kudu user name is the app name prefixed with "$")
# Dedicated names avoid overwriting the shell's own $USER variable
KUDU_USER='$<function-app-name>'
KUDU_PASS=$(az functionapp deployment list-publishing-profiles \
  --name <function-app-name> \
  --resource-group <rg-name> \
  --query "[?publishMethod=='MSDeploy'].userPWD" -o tsv)

# Deploy directly via Kudu zipdeploy API
curl -X POST -u "${KUDU_USER}:${KUDU_PASS}" \
  --data-binary @function_app.zip \
  "https://<function-app-name>.scm.azurewebsites.net/api/zipdeploy?isAsync=false" \
  -H "Content-Type: application/zip" \
  --max-time 300

# Verify deployment succeeded
curl -s -u "${KUDU_USER}:${KUDU_PASS}" \
  "https://<function-app-name>.scm.azurewebsites.net/api/deployments" | \
  jq '.[0] | {status, complete, end_time}'

After Kudu deployment, restart the function:

shell
# Restart to clear any cached code
az functionapp restart \
  --name <function-app-name> \
  --resource-group <rg-name>

# Wait 30 seconds for restart, then test
sleep 30

# Test the function
FUNC_KEY=$(az functionapp keys list \
  --name <function-app-name> \
  --resource-group <rg-name> \
  --query "functionKeys.default" -o tsv)

curl -X POST "https://<function-app-name>.azurewebsites.net/api/dr_failover_trigger?code=${FUNC_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"schemaId":"azureMonitorCommonAlertSchema","data":{"essentials":{"firedDateTime":"2024-01-01T00:00:00Z"}}}'

Why Terraform deployment may fail to update code:

The zip_deploy_file attribute in the azurerm_linux_function_app resource tells Terraform which zip package to deploy, but Azure Functions may keep the old code cached in memory. The FUNCTION_CODE_HASH app setting is used to trigger redeployment when code changes, but the actual extraction and loading of new code depends on Azure's deployment infrastructure.

When Terraform shows "apply complete" but the code isn't updated:

  1. The hash changed and Terraform updated the Function App resource

  2. Azure accepted the new zip file

  3. But the worker process continued running with cached code

The Kudu /api/zipdeploy endpoint bypasses this by:

  1. Uploading the zip directly to the deployment infrastructure

  2. Triggering a full rebuild (when SCM_DO_BUILD_DURING_DEPLOYMENT=true)

  3. Forcing the worker to reload from the new deployment