Failover Timing Summary

This section documents the expected time from primary failure detection to secondary cluster activation.

The failover process has multiple phases, each with configurable timing:

| Phase | Duration | Description |
|---|---|---|
| 1. Alert Detection | ~30-90s | Azure Monitor detects Traffic Manager probe failures (depends on probe interval) |
| 2. Alert Evaluation | ~1-5 min | Azure Monitor alert rule evaluation period (typically 1-5 minutes) |
| 3. Pre-Failover Holdoff | Configurable | PRE_FAILOVER_FAILURE_SECONDS - Function waits to prevent flapping |
| 4. Cooldown Check | 0s | If a failover happened within COOLDOWN_SECONDS, abort |
| 5. Function Execution | ~10-20s | Authentication + K8s API calls to scale operator |

Configuration Variables

Timing-related configuration variables are shown in the following table:

| Variable | Default | Testing | Description |
|---|---|---|---|
| dr_failover_function_pre_failover_failure_seconds | 180 | 0 | Seconds to wait after alert before triggering failover |
| dr_failover_function_cooldown_seconds | 300 | 60 | Minimum seconds between failovers |
| dr_failover_function_max_retries | 3 | 3 | Maximum retry attempts for K8s API calls |
| traffic_manager_health_check_interval | 30 | 10 | Seconds between health probes (min: 10) |
| traffic_manager_health_check_timeout | 10 | 5 | Probe timeout in seconds (min: 5) |
| traffic_manager_tolerated_failures | 3 | 0 | Failures before unhealthy (0 = immediate) |

How the Holdoff Works

Holdoff works as follows:

  • Alert fires → HTTP trigger invokes the Azure Function

  • Function records first-failure timestamp as an annotation on the humio-operator Deployment (logscale.dr/degraded-since-epoch)

  • Function checks holdoff: now - degraded_since >= PRE_FAILOVER_FAILURE_SECONDS

  • If not enough time has passed → Function returns HTTP 202 and takes no action

  • If enough time has passed → Function patches humio-operator replicas 0→1 and records logscale.dr/last-failover-epoch

Note

The function is only invoked when the Azure Monitor alert fires. There is no timer-based retry mechanism.
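
The holdoff and cooldown checks described above can be sketched as a small pure function. This is a minimal illustration rather than the deployed function's code: the name `decide`, the `Decision` type, and the action labels are hypothetical, and reading or writing the Deployment annotations is left out. The phase order (holdoff first, then cooldown) follows the phase table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    action: str          # "wait", "abort_cooldown", or "failover" (hypothetical labels)
    degraded_since: int  # epoch seconds to store as logscale.dr/degraded-since-epoch


def decide(
    now: int,
    degraded_since: Optional[int],  # existing degraded-since annotation, if any
    last_failover: Optional[int],   # existing last-failover annotation, if any
    pre_failover_failure_seconds: int,
    cooldown_seconds: int,
) -> Decision:
    # First invocation for this outage: record the first-failure timestamp.
    if degraded_since is None:
        degraded_since = now

    # Holdoff: act only when now - degraded_since >= PRE_FAILOVER_FAILURE_SECONDS.
    if now - degraded_since < pre_failover_failure_seconds:
        return Decision("wait", degraded_since)  # caller would return HTTP 202

    # Cooldown: abort if a failover happened within COOLDOWN_SECONDS.
    if last_failover is not None and now - last_failover < cooldown_seconds:
        return Decision("abort_cooldown", degraded_since)

    # Caller would patch humio-operator replicas 0 -> 1 and record last-failover-epoch.
    return Decision("failover", degraded_since)
```

Under these assumptions, a first invocation with a nonzero holdoff always returns `wait`. Because the function runs only when the alert fires, the failover itself then depends on a later invocation arriving after the holdoff has elapsed. With a holdoff of 0, the first invocation proceeds straight to the cooldown check.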

Total Expected Time (Detection → Function Complete)

The following table shows the total expected time from detection to function completion:

| Configuration | Pre-Failover Wait | Total Time |
|---|---|---|
| Production | 180s (3 min) | ~4-5 minutes |
| Testing | 0s (immediate) | ~1-2 minutes |

Post-Failover Timeline

After the Azure Function completes, additional time is required for full service restoration:

| Stage | Duration |
|---|---|
| Operator creates HumioCluster resources | ~30-60 seconds |
| LogScale pods scheduled and started | ~60-120 seconds |
| LogScale initialization (Kafka connect, recovery) | ~60-180 seconds |
| Traffic Manager detects secondary healthy | ~30 seconds |
| Total (Function Complete → Service Ready) | ~4-7 minutes |

End-to-End Timeline Summary

| Configuration | Detection → Function | Function → Service | Total |
|---|---|---|---|
| Production (pre_failover=180s) | ~4-5 min | ~4-7 min | ~9-12 minutes |
| Testing (pre_failover=0s) | ~1-2 min | ~4-7 min | ~5-9 minutes |

Configuring for Testing vs Production

For faster testing, apply the following settings to the primary and secondary clusters.

Primary cluster (primary-centralus.tfvars):

```terraform
# Traffic Manager health check - faster detection
traffic_manager_health_check_interval = 10 # Probe every 10 seconds (min)
traffic_manager_health_check_timeout  = 5  # 5 second timeout
traffic_manager_tolerated_failures    = 0  # First failure = unhealthy
```

Secondary cluster (secondary-eastus2.tfvars):

```terraform
# Azure Function - immediate failover
dr_failover_function_pre_failover_failure_seconds = 0   # No holdoff delay
dr_failover_function_cooldown_seconds             = 60  # 1 minute cooldown
```

Testing timeline: ~1-2 minutes (10s probe + 1min alert + immediate function).

For production, use more conservative values to prevent false positives.

Primary cluster:

```terraform
traffic_manager_health_check_interval = 30 # Probe every 30 seconds
traffic_manager_health_check_timeout  = 10 # 10 second timeout
traffic_manager_tolerated_failures    = 3  # 3 failures before unhealthy
```

Secondary cluster:

```terraform
dr_failover_function_pre_failover_failure_seconds = 180 # 3 minute holdoff
dr_failover_function_cooldown_seconds             = 300 # 5 minute cooldown
```
dr_failover_function_cooldown_seconds             = 300 # 5 minute cooldown

Production timeline: ~5-7 minutes (90s probe failures + 1min alert + 3min holdoff).
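
The parenthetical breakdowns in the two timeline lines above can be reproduced with a short calculation. This is a rough sketch under stated assumptions: the probe window is taken as `interval * max(tolerated_failures, 1)`, which is inferred from the "10s probe" and "90s probe failures" figures in this section and is not Traffic Manager's documented formula. The alert evaluation period is assumed to be 60 seconds, and function execution time is excluded. The function names are hypothetical.

```python
def probe_window_seconds(interval: int, tolerated_failures: int) -> int:
    # Assumption inferred from this section's figures: 30s x 3 failures = 90s,
    # and with 0 tolerated failures a single 10s probe interval is counted.
    return interval * max(tolerated_failures, 1)


def detection_to_trigger_seconds(
    interval: int,
    tolerated_failures: int,
    alert_eval_seconds: int,
    holdoff_seconds: int,
) -> int:
    # Probe failures + alert evaluation + pre-failover holdoff.
    return (
        probe_window_seconds(interval, tolerated_failures)
        + alert_eval_seconds
        + holdoff_seconds
    )


# Testing values: interval=10, tolerated_failures=0, holdoff=0
testing = detection_to_trigger_seconds(10, 0, 60, 0)
# Production values: interval=30, tolerated_failures=3, holdoff=180
production = detection_to_trigger_seconds(30, 3, 60, 180)

print(f"testing: {testing}s (~{testing / 60:.1f} min)")
print(f"production: {production}s (~{production / 60:.1f} min)")
```

With these inputs the sketch gives 70 seconds for testing and 330 seconds for production. A longer alert evaluation period (the phase table allows up to 5 minutes) raises both figures accordingly.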