Failover Timing Reference

Note

The timing measurements in this section were taken during testing with an xsmall cluster size and an advanced cluster type, without any real applications running on the AKS cluster. Production timings may vary depending on cluster size, workload, and resource contention.

Failover Timing Breakdown

| Phase | Duration | Description |
|---|---|---|
| 1. Alert Detection | ~30-90s | Azure Monitor detects Traffic Manager probe failures (depends on probe interval) |
| 2. Alert Evaluation | ~1-5 min | Azure Monitor alert rule evaluation period (typically 1-5 minutes) |
| 3. Pre-Failover Holdoff | Configurable | PRE_FAILOVER_FAILURE_SECONDS - Function waits to prevent flapping |
| 4. Cooldown Check | 0s | If a failover happened within COOLDOWN_SECONDS, abort |
| 5. Function Execution | ~10-20s | Authentication + K8s API calls to scale operator |
| 6. Endpoint Disable | ~2-5s | Disables primary TM endpoint to prevent automatic failback |

Configuration Variables

| Variable | Default | Testing | Description |
|---|---|---|---|
| dr_failover_function_pre_failover_failure_seconds | 180 | 0 | Seconds to wait after alert before triggering failover |
| dr_failover_function_cooldown_seconds | 300 | 60 | Minimum seconds between failovers |
| dr_failover_function_max_retries | 3 | 3 | Maximum retry attempts for K8s API calls |
| traffic_manager_health_check_interval | 30 | 10 | Seconds between health probes (min: 10) |
| traffic_manager_health_check_timeout | 10 | 5 | Probe timeout in seconds (min: 5) |
| traffic_manager_tolerated_failures | 3 | 0 | Failures before unhealthy (0 = immediate) |
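The two columns of the table can be read as two presets. The following is a minimal sketch, not part of the deployment itself: the `FailoverSettings` class, its shortened field names, and the `problems` helper are hypothetical and exist only to show the Default and Testing values side by side and to check them against the minimums stated in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverSettings:
    """Hypothetical container for the values in the table above.

    Field names are shortened forms of the Terraform variable names.
    Defaults match the Default column.
    """
    pre_failover_failure_seconds: int = 180
    cooldown_seconds: int = 300
    max_retries: int = 3
    health_check_interval: int = 30
    health_check_timeout: int = 10
    tolerated_failures: int = 3

    def problems(self) -> list[str]:
        """Return violations of the minimums stated in the table."""
        found = []
        if self.health_check_interval < 10:
            found.append("health_check_interval must be at least 10 seconds")
        if self.health_check_timeout < 5:
            found.append("health_check_timeout must be at least 5 seconds")
        return found


# Default column of the table.
PRODUCTION = FailoverSettings()

# Testing column of the table.
TESTING = FailoverSettings(
    pre_failover_failure_seconds=0,
    cooldown_seconds=60,
    health_check_interval=10,
    health_check_timeout=5,
    tolerated_failures=0,
)
```

Both presets pass the check; a value below a stated minimum, such as a 5-second probe interval, would be reported by `problems()`.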

How the Holdoff Works

  1. Alert fires → HTTP trigger invokes the Azure Function

  2. Function records first-failure timestamp as an annotation on the humio-operator Deployment (logscale.dr/degraded-since-epoch)

  3. Function checks holdoff: now - degraded_since >= PRE_FAILOVER_FAILURE_SECONDS

  4. If NOT enough time has passed → Function returns HTTP 202 and takes no action

  5. If enough time has passed → Function patches humio-operator replicas 0 → 1, records logscale.dr/last-failover-epoch, and disables the primary TM endpoint

Note

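The function is only invoked when the Azure Monitor alert fires. There is no timer-based retry mechanism, so an invocation that returns HTTP 202 during the holdoff is not retried; failover proceeds only on a later alert invocation that arrives after the holdoff has elapsed.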
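The holdoff steps above, together with the cooldown check from the timing breakdown, can be sketched as a pure decision function. This is an illustrative sketch under stated assumptions, not the function's actual code: the name `decide`, its parameters, and the returned action strings are hypothetical, and reading or writing the Deployment annotations is left to the caller.

```python
from typing import Optional, Tuple


def decide(now: float,
           degraded_since: Optional[float],
           last_failover: Optional[float],
           holdoff_seconds: int = 180,
           cooldown_seconds: int = 300) -> Tuple[str, float]:
    """Return (action, degraded_since) for one invocation.

    degraded_since stands in for logscale.dr/degraded-since-epoch and
    last_failover for logscale.dr/last-failover-epoch (None if unset).
    The action strings are hypothetical labels:
      "wait"     -> respond HTTP 202 and take no action
      "cooldown" -> abort, a failover happened too recently
      "failover" -> scale the operator and disable the primary endpoint
    """
    # Step 2: the first invocation records the first-failure timestamp.
    if degraded_since is None:
        degraded_since = now

    # Steps 3-4: hold off until the degradation has lasted long enough.
    if now - degraded_since < holdoff_seconds:
        return "wait", degraded_since

    # Cooldown check: abort if a failover happened within the cooldown window.
    if last_failover is not None and now - last_failover < cooldown_seconds:
        return "cooldown", degraded_since

    # Step 5: enough time has passed and no recent failover.
    return "failover", degraded_since


# First alert at t=1000 only records the timestamp; an alert at t=1200
# (200 seconds later, past the 180-second holdoff) triggers the failover.
first = decide(1000.0, None, None)
later = decide(1200.0, first[1], None)
```

With `holdoff_seconds=0`, as in the testing configuration, the first invocation already returns `"failover"` because the elapsed time of zero is not less than the holdoff.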

Total Expected Time (Detection → Function Complete)

| Configuration | Pre-Failover Wait | Total Time |
|---|---|---|
| Production | 180s (3 min) | ~4-5 minutes |
| Testing | 0s (immediate) | ~1-2 minutes |

Post-Failover Timeline

After the Azure Function completes, additional time is required for full service restoration:

| Stage | Duration |
|---|---|
| Operator creates HumioCluster resources | ~30-60 seconds |
| LogScale pods scheduled and started | ~60-120 seconds |
| LogScale initialization (Kafka connect, recovery) | ~60-180 seconds |
| Traffic Manager detects secondary healthy | ~30 seconds |
| Total (Function Complete → Service Ready) | ~4-7 minutes |

End-to-End Timeline Summary

| Configuration | Detection → Function | Function → Service | Total |
|---|---|---|---|
| Production (pre_failover=180s) | ~4-5 min | ~4-7 min | ~9-12 minutes |
| Testing (pre_failover=0s) | ~1-2 min | ~4-7 min | ~5-9 minutes |

Configuring for Testing vs Production

For faster testing (apply to both primary and secondary):

Primary cluster (primary-<region>.tfvars):

terraform
# Traffic Manager health check - faster detection
traffic_manager_health_check_interval = 10 # Probe every 10 seconds (min)
traffic_manager_health_check_timeout = 5 # 5 second timeout
traffic_manager_tolerated_failures = 0 # First failure = unhealthy

Secondary cluster (secondary-<region>.tfvars):

terraform
# Azure Function - immediate failover
dr_failover_function_pre_failover_failure_seconds = 0 # No holdoff delay
dr_failover_function_cooldown_seconds = 60 # 1 minute cooldown

Testing timeline: ~1-2 minutes (10s probe + 1 min alert evaluation + immediate function execution)

For production (prevent false positives):

Primary cluster:

terraform
traffic_manager_health_check_interval = 30 # Probe every 30 seconds
traffic_manager_health_check_timeout = 10 # 10 second timeout
traffic_manager_tolerated_failures = 3 # 3 failures before unhealthy

Secondary cluster:

terraform
dr_failover_function_pre_failover_failure_seconds = 180 # 3 minute holdoff
dr_failover_function_cooldown_seconds = 300 # 5 minute cooldown

Production timeline: ~5-7 minutes (90s of probe failures + 1 min alert evaluation + 3 min holdoff)
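The arithmetic behind the two timeline summaries can be checked with a short sketch. It assumes the 1-minute alert evaluation used in those summaries and adds the function execution (~10-20s) and endpoint disable (~2-5s) ranges from the timing breakdown; the function name and its parameters are hypothetical, and a longer alert evaluation period would widen the result.

```python
from typing import Tuple


def detection_to_function_seconds(
        probe_failure_seconds: int,
        alert_seconds: int,
        holdoff_seconds: int,
        function_seconds: Tuple[int, int] = (10, 20),
        endpoint_disable_seconds: Tuple[int, int] = (2, 5)) -> Tuple[int, int]:
    """Return (low, high) seconds from first probe failure to function complete."""
    # Probe detection, alert evaluation, and holdoff are taken as single values.
    fixed = probe_failure_seconds + alert_seconds + holdoff_seconds
    # Function execution and endpoint disable are ranges, so add each bound.
    low = fixed + function_seconds[0] + endpoint_disable_seconds[0]
    high = fixed + function_seconds[1] + endpoint_disable_seconds[1]
    return low, high


# Testing: 10s probe + 1 min alert + no holdoff.
testing = detection_to_function_seconds(10, 60, 0)       # (82, 95)

# Production: 90s of probe failures + 1 min alert + 3 min holdoff.
production = detection_to_function_seconds(90, 60, 180)  # (342, 355)
```

Under these assumptions the testing case comes to roughly 1.5 minutes and the production case to a little under 6 minutes, both inside the ranges quoted above.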