# Failover Timing Reference
> **Note:** The timing measurements in this section were taken during testing with an `xsmall` cluster size and an `advanced` cluster type, without any real applications running on the AKS cluster. Production timings may vary depending on cluster size, workload, and resource contention.
## Failover Timing Breakdown
| Phase | Duration | Description |
|---|---|---|
| 1. Alert Detection | ~30-90s | Azure Monitor detects Traffic Manager probe failures (depends on probe interval) |
| 2. Alert Evaluation | ~1-5 min | Azure Monitor alert rule evaluation period (typically 1-5 minutes) |
| 3. Pre-Failover Holdoff | Configurable | PRE_FAILOVER_FAILURE_SECONDS - Function waits to prevent flapping |
| 4. Cooldown Check | 0s | If a failover happened within COOLDOWN_SECONDS, abort |
| 5. Function Execution | ~10-20s | Authentication + K8s API calls to scale operator |
| 6. Endpoint Disable | ~2-5s | Disables primary TM endpoint to prevent automatic failback |
## Configuration Variables
| Variable | Default | Testing | Description |
|---|---|---|---|
| `dr_failover_function_pre_failover_failure_seconds` | 180 | 0 | Seconds to wait after alert before triggering failover |
| `dr_failover_function_cooldown_seconds` | 300 | 60 | Minimum seconds between failovers |
| `dr_failover_function_max_retries` | 3 | 3 | Maximum retry attempts for K8s API calls |
| `traffic_manager_health_check_interval` | 30 | 10 | Seconds between health probes (min: 10) |
| `traffic_manager_health_check_timeout` | 10 | 5 | Probe timeout in seconds (min: 5) |
| `traffic_manager_tolerated_failures` | 3 | 0 | Failures before unhealthy (0 = immediate) |
## How the Holdoff Works
1. Alert fires → HTTP trigger invokes the Azure Function
2. Function records the first-failure timestamp as an annotation on the humio-operator Deployment (`logscale.dr/degraded-since-epoch`)
3. Function checks the holdoff: `now - degraded_since >= PRE_FAILOVER_FAILURE_SECONDS`
4. If NOT enough time has passed → Function returns HTTP 202 and takes no action
5. If enough time has passed → Function patches humio-operator replicas 0→1, records `logscale.dr/last-failover-epoch`, and disables the primary TM endpoint
> **Note:** The function is only invoked when the Azure Monitor alert fires. There is no timer-based retry mechanism.
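The holdoff and cooldown decision described above can be sketched as follows. This is a minimal illustration with hypothetical names (`should_fail_over` and the constants are not taken from the actual Function code), using the production defaults from the configuration table:

```python
import time

# Production defaults from the configuration table above.
PRE_FAILOVER_FAILURE_SECONDS = 180
COOLDOWN_SECONDS = 300

def should_fail_over(degraded_since_epoch, last_failover_epoch=None, now=None):
    """Mirror the holdoff/cooldown checks; returns (trigger, reason)."""
    now = time.time() if now is None else now
    # Cooldown check: abort if a failover happened within COOLDOWN_SECONDS
    if last_failover_epoch is not None and now - last_failover_epoch < COOLDOWN_SECONDS:
        return (False, "cooldown")
    # Holdoff check: the cluster must have been degraded long enough
    if now - degraded_since_epoch < PRE_FAILOVER_FAILURE_SECONDS:
        return (False, "holdoff")  # the Function returns HTTP 202 here
    return (True, "failover")

# Degraded for 200s with no recent failover -> trigger failover
print(should_fail_over(degraded_since_epoch=1000, now=1200))  # (True, 'failover')
```

In the real flow, `degraded_since_epoch` comes from the `logscale.dr/degraded-since-epoch` annotation and `last_failover_epoch` from `logscale.dr/last-failover-epoch`.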
## Total Expected Time (Detection → Function Complete)
| Configuration | Pre-Failover Wait | Total Time |
|---|---|---|
| Production | 180s (3 min) | ~4-5 minutes |
| Testing | 0s (immediate) | ~1-2 minutes |
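The production total can be sanity-checked by summing the phase ranges from the breakdown table. This assumes a 1-minute alert evaluation window (the table allows 1-5 minutes, so longer evaluation periods push the total higher):

```python
# Phase ranges in seconds, taken from the breakdown table; alert evaluation
# is assumed fixed at ~60s here (an assumption, not a measured value).
phases = {
    "alert_detection": (30, 90),
    "alert_evaluation": (60, 60),
    "pre_failover_holdoff": (180, 180),  # production default
    "function_execution": (10, 20),
    "endpoint_disable": (2, 5),
}
low = sum(lo for lo, _ in phases.values())   # 282s
high = sum(hi for _, hi in phases.values())  # 355s
print(f"~{low / 60:.1f}-{high / 60:.1f} minutes")  # ~4.7-5.9 minutes
```

This lands close to the "~4-5 minutes" quoted for production; with the testing settings (0s holdoff, 10s detection), the same sum drops to roughly 1-2 minutes.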
## Post-Failover Timeline

After the Azure Function completes, additional time is required for full service restoration:
| Stage | Duration |
|---|---|
| Operator creates HumioCluster resources | ~30-60 seconds |
| LogScale pods scheduled and started | ~60-120 seconds |
| LogScale initialization (Kafka connect, recovery) | ~60-180 seconds |
| Traffic Manager detects secondary healthy | ~30 seconds |
| Total (Function Complete → Service Ready) | ~4-7 minutes |
## End-to-End Timeline Summary
| Configuration | Detection → Function | Function → Service | Total |
|---|---|---|---|
| Production (pre_failover=180s) | ~4-5 min | ~4-7 min | ~9-12 minutes |
| Testing (pre_failover=0s) | ~1-2 min | ~4-7 min | ~5-9 minutes |
## Configuring for Testing vs Production
For faster testing (apply to both primary and secondary clusters):
Primary cluster (`primary-<region>.tfvars`):

```hcl
# Traffic Manager health check - faster detection
traffic_manager_health_check_interval = 10 # Probe every 10 seconds (min)
traffic_manager_health_check_timeout  = 5  # 5 second timeout
traffic_manager_tolerated_failures    = 0  # First failure = unhealthy
```
Secondary cluster (`secondary-<region>.tfvars`):

```hcl
# Azure Function - immediate failover
dr_failover_function_pre_failover_failure_seconds = 0  # No holdoff delay
dr_failover_function_cooldown_seconds             = 60 # 1 minute cooldown
```

Testing timeline: ~1-2 minutes (10s probe + 1 min alert + immediate function)
For production (prevent false positives):
Primary cluster:

```hcl
traffic_manager_health_check_interval = 30 # Probe every 30 seconds
traffic_manager_health_check_timeout  = 10 # 10 second timeout
traffic_manager_tolerated_failures    = 3  # 3 failures before unhealthy
```

Secondary cluster:

```hcl
dr_failover_function_pre_failover_failure_seconds = 180 # 3 minute holdoff
dr_failover_function_cooldown_seconds             = 300 # 5 minute cooldown
```

Production timeline: ~5-7 minutes (90s probe failures + 1 min alert + 3 min holdoff)