Failover Timing Summary

This section documents the expected time from primary failure detection to secondary cluster activation.

Failover Timing Summary

The failover process has multiple phases, each with configurable timing:

Phase	Duration	Description
1. Alert Detection	~30-90s	Azure Monitor detects Traffic Manager probe failures (depends on probe interval)
2. Alert Evaluation	~1-5 min	Azure Monitor alert rule evaluation period (typically 1-5 minutes)
3. Pre-Failover Holdoff	Configurable	`PRE_FAILOVER_FAILURE_SECONDS` - Function waits to prevent flapping
4. Cooldown Check	0s	If a failover happened within `COOLDOWN_SECONDS`, abort
5. Function Execution	~10-20s	Authentication + K8s API calls to scale operator

Configuration Variables

Timing-related configuration variables are shown in the following table:

Variable	Default	Testing	Description
`dr_failover_function_pre_failover_failure_seconds`	180	0	Seconds to wait after alert before triggering failover
`dr_failover_function_cooldown_seconds`	300	60	Minimum seconds between failovers
`dr_failover_function_max_retries`	3	3	Maximum retry attempts for K8s API calls
`traffic_manager_health_check_interval`	30	10	Seconds between health probes (min: 10)
`traffic_manager_health_check_timeout`	10	5	Probe timeout in seconds (min: 5)
`traffic_manager_tolerated_failures`	3	0	Failures before unhealthy (0 = immediate)

How the Holdoff Works

Holdoff works as follows:

Alert fires → HTTP trigger invokes the Azure Function
Function records first-failure timestamp as an annotation on the humio-operator Deployment (logscale.dr/degraded-since-epoch)
Function checks holdoff: now - degraded_since >= PRE_FAILOVER_FAILURE_SECONDS
If not enough time has passed → Function returns HTTP 202 and takes no action
If enough time has passed → Function patches humio-operator replicas 0→1 and records logscale.dr/last-failover-epoch

Note

The function is only invoked when the Azure Monitor alert fires. There is no timer-based retry mechanism.

Total Expected Time (Detection → Function Complete)

The following table shows the totla expected time between detection and function complete:

Configuration	Pre-Failover Wait	Total Time
Production	180s (3 min)	~4-5 minutes
Testing	0s (immediate)	~1-2 minutes

Post-Failover Timeline

After Azure Function completes, additional time is required for full service restoration:

Stage	Duration
Operator creates HumioCluster resources	~30-60 seconds
LogScale pods scheduled and started	~60-120 seconds
LogScale initialization (Kafka connect, recovery)	~60-180 seconds
Traffic Manager detects secondary healthy	~30 seconds
Total (Function Complete → Service Ready)	~4-7 minutes

End-to-End Timeline Summary

Configuration	Detection → Function	Function → Service	Total
Production (pre_failover=180s)	~4-5 min	~4-7 min	~9-12 minutes
Testing (pre_failover=0s)	~1-2 min	~4-7 min	~5-9 minutes

Configuring for Testing vs Production

For faster testing, apply the following to both primary and secondary.

Primary cluster (primary-centralus.tfvars):

terraform

# Traffic Manager health check - faster detection
traffic_manager_health_check_interval = 10 # Probe every 10 seconds (min)
traffic_manager_health_check_timeout  = 5  # 5 second timeout
traffic_manager_tolerated_failures    = 0  # First failure = unhealthy

Secondary cluster (secondary-eastus2.tfvars):

terraform

# Azure Function - immediate failover
dr_failover_function_pre_failover_failure_seconds = 0   # No holdoff delay
dr_failover_function_cooldown_seconds             = 60  # 1 minute cooldown

Testing timeline: ~1-2 minutes (10s probe + 1min alert + immediate function)

For production (prevent false positives).

Primary cluster:

terraform

traffic_manager_health_check_interval = 30 # Probe every 30 seconds
traffic_manager_health_check_timeout  = 10 # 10 second timeout
traffic_manager_tolerated_failures    = 3  # 3 failures before unhealthy

Secondary cluster:

terraform

dr_failover_function_pre_failover_failure_seconds = 180 # 3 minute holdoff
dr_failover_function_cooldown_seconds             = 300 # 5 minute cooldown

Production timeline: ~5-7 minutes (90s probe failures + 1min alert + 3min holdoff).

Versions of this Page

Deployment Overview

Planning Your Deployment

Instance Sizing

Storage Architecture

Installing Using Containers

Installing On Bare Metal or Cloud Instance

Reference Architectures

Installing Load Balancers

Deploying Auxiliary Services

Configuration Settings

Managing Your Deployment

Testing Your Deployment

Failover Timing Summary

Failover Timing Summary

Configuration Variables

How the Holdoff Works

Note

Total Expected Time (Detection → Function Complete)

Post-Failover Timeline

End-to-End Timeline Summary

Configuring for Testing vs Production

Enter search term