Failover Timing Summary
This section documents the expected time from primary failure detection to secondary cluster activation.
The failover process has multiple phases, each with configurable timing:
| Phase | Duration | Description |
|---|---|---|
| 1. Alert Detection | ~30-90s | Azure Monitor detects Traffic Manager probe failures (depends on probe interval) |
| 2. Alert Evaluation | ~1-5 min | Azure Monitor alert rule evaluation period (typically 1-5 minutes) |
| 3. Pre-Failover Holdoff | Configurable | `PRE_FAILOVER_FAILURE_SECONDS` - Function waits to prevent flapping |
| 4. Cooldown Check | 0s | If a failover happened within `COOLDOWN_SECONDS`, abort |
| 5. Function Execution | ~10-20s | Authentication + K8s API calls to scale operator |
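The per-phase ranges above can be combined into a rough best-case/worst-case window. A minimal illustrative sketch (the durations come from the table; the 180 s holdoff is the production default, and the summation itself is purely arithmetic, not a guarantee):

```python
# Each phase maps to a (min_seconds, max_seconds) range from the table above.
PHASES = {
    "alert_detection":      (30, 90),    # depends on probe interval
    "alert_evaluation":     (60, 300),   # 1-5 minute evaluation period
    "pre_failover_holdoff": (180, 180),  # production default; 0 for testing
    "cooldown_check":       (0, 0),
    "function_execution":   (10, 20),
}

def total_range(phases):
    """Sum the lower and upper bounds of each phase independently."""
    lo = sum(low for low, _ in phases.values())
    hi = sum(high for _, high in phases.values())
    return lo, hi

lo, hi = total_range(PHASES)
print(f"Detection -> function complete: {lo}-{hi} s")
```

The worst case (slow probe detection plus a full 5-minute evaluation period) exceeds the typical ~4-5 minute figure quoted later; the typical figure assumes alert evaluation completes near its lower bound.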
Configuration Variables
Timing-related configuration variables are shown in the following table:
| Variable | Default | Testing | Description |
|---|---|---|---|
| `dr_failover_function_pre_failover_failure_seconds` | 180 | 0 | Seconds to wait after alert before triggering failover |
| `dr_failover_function_cooldown_seconds` | 300 | 60 | Minimum seconds between failovers |
| `dr_failover_function_max_retries` | 3 | 3 | Maximum retry attempts for K8s API calls |
| `traffic_manager_health_check_interval` | 30 | 10 | Seconds between health probes (min: 10) |
| `traffic_manager_health_check_timeout` | 10 | 5 | Probe timeout in seconds (min: 5) |
| `traffic_manager_tolerated_failures` | 3 | 0 | Failures before unhealthy (0 = immediate) |
How the Holdoff Works
The holdoff works as follows:

1. Alert fires → the HTTP trigger invokes the Azure Function.
2. The function records the first-failure timestamp as an annotation on the humio-operator Deployment (`logscale.dr/degraded-since-epoch`).
3. The function checks the holdoff: `now - degraded_since >= PRE_FAILOVER_FAILURE_SECONDS`.
4. If not enough time has passed, the function returns HTTP 202 and takes no action.
5. If enough time has passed, the function patches `humio-operator` replicas 0→1 and records `logscale.dr/last-failover-epoch`.
Note: The function is only invoked when the Azure Monitor alert fires. There is no timer-based retry mechanism.
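The holdoff and cooldown decisions above can be sketched as a single pure function. This is a minimal illustration of the documented logic, not the actual Azure Function source; the annotation reads and K8s patch are omitted, and all timestamps are Unix epochs as the annotation names suggest:

```python
PRE_FAILOVER_FAILURE_SECONDS = 180  # production default
COOLDOWN_SECONDS = 300              # production default

def decide(now, degraded_since, last_failover):
    """Return (http_status, action) for one alert-triggered invocation.

    degraded_since: epoch from logscale.dr/degraded-since-epoch
    last_failover:  epoch from logscale.dr/last-failover-epoch, or None
    """
    # Cooldown check: abort if a failover happened too recently.
    if last_failover is not None and now - last_failover < COOLDOWN_SECONDS:
        return 202, "cooldown-active"
    # Holdoff check: require a sustained failure before acting.
    if now - degraded_since < PRE_FAILOVER_FAILURE_SECONDS:
        return 202, "holdoff"
    # Holdoff satisfied: scale the operator 0 -> 1.
    return 200, "scale-operator-0-to-1"
```

Because there is no timer-based retry, a 202 response only becomes a failover if the alert fires again after the holdoff window has elapsed.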
Total Expected Time (Detection → Function Complete)
The following table shows the total expected time between detection and function completion:
| Configuration | Pre-Failover Wait | Total Time |
|---|---|---|
| Production | 180s (3 min) | ~4-5 minutes |
| Testing | 0s (immediate) | ~1-2 minutes |
Post-Failover Timeline
After Azure Function completes, additional time is required for full service restoration:
| Stage | Duration |
|---|---|
| Operator creates HumioCluster resources | ~30-60 seconds |
| LogScale pods scheduled and started | ~60-120 seconds |
| LogScale initialization (Kafka connect, recovery) | ~60-180 seconds |
| Traffic Manager detects secondary healthy | ~30 seconds |
| Total (Function Complete → Service Ready) | ~4-7 minutes |
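Summing the stage ranges above gives the bounds behind the total row. A quick arithmetic sketch (the ranges are from the table; the ~4-7 minute figure quoted there additionally allows for scheduling and image-pull overhead not itemised here):

```python
# (min_seconds, max_seconds) per post-failover stage, from the table above.
STAGES = {
    "operator_creates_humiocluster": (30, 60),
    "pods_scheduled_and_started":    (60, 120),
    "logscale_initialization":       (60, 180),
    "traffic_manager_detects":       (30, 30),
}

lo = sum(a for a, _ in STAGES.values())  # best case, seconds
hi = sum(b for _, b in STAGES.values())  # worst case, seconds
```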
End-to-End Timeline Summary
| Configuration | Detection → Function | Function → Service | Total |
|---|---|---|---|
| Production (pre_failover=180s) | ~4-5 min | ~4-7 min | ~9-12 minutes |
| Testing (pre_failover=0s) | ~1-2 min | ~4-7 min | ~5-9 minutes |
Configuring for Testing vs Production
For faster testing, apply the following settings to both the primary and secondary clusters.
Primary cluster (primary-centralus.tfvars):
```hcl
# Traffic Manager health check - faster detection
traffic_manager_health_check_interval = 10 # Probe every 10 seconds (min)
traffic_manager_health_check_timeout  = 5  # 5 second timeout
traffic_manager_tolerated_failures    = 0  # First failure = unhealthy
```
Secondary cluster (secondary-eastus2.tfvars):
```hcl
# Azure Function - immediate failover
dr_failover_function_pre_failover_failure_seconds = 0  # No holdoff delay
dr_failover_function_cooldown_seconds             = 60 # 1 minute cooldown
```

Testing timeline: ~1-2 minutes (10s probe + 1 min alert + immediate function).
For production, use conservative settings to prevent false positives.
Primary cluster:
```hcl
traffic_manager_health_check_interval = 30 # Probe every 30 seconds
traffic_manager_health_check_timeout  = 10 # 10 second timeout
traffic_manager_tolerated_failures    = 3  # 3 failures before unhealthy
```

Secondary cluster:

```hcl
dr_failover_function_pre_failover_failure_seconds = 180 # 3 minute holdoff
dr_failover_function_cooldown_seconds             = 300 # 5 minute cooldown
```

Production timeline: ~5-7 minutes (90s probe failures + 1 min alert + 3 min holdoff).