Failover Timing Reference

Note

The timing measurements in this section were taken during testing with an xsmall cluster size and an advanced cluster type, without any real applications running on the AKS cluster. Production timings may vary depending on cluster size, workload, and resource contention.

Failover Timing Breakdown

Phase	Duration	Description
1. Alert Detection	~30-90s	Azure Monitor detects Traffic Manager probe failures (depends on the probe interval and tolerated failures)
2. Alert Evaluation	~1-5 min	Azure Monitor alert rule evaluation period
3. Pre-Failover Holdoff	Configurable	Function waits PRE_FAILOVER_FAILURE_SECONDS before acting, to prevent flapping
4. Cooldown Check	0s	Function aborts if a failover already happened within the last COOLDOWN_SECONDS
5. Function Execution	~10-20s	Authentication plus K8s API calls to scale the operator
6. Endpoint Disable	~2-5s	Function disables the primary TM endpoint to prevent automatic failback

Configuration Variables

Variable	Default	Testing	Description
`dr_failover_function_pre_failover_failure_seconds`	180	0	Seconds to wait after alert before triggering failover
`dr_failover_function_cooldown_seconds`	300	60	Minimum seconds between failovers
`dr_failover_function_max_retries`	3	3	Maximum retry attempts for K8s API calls
`traffic_manager_health_check_interval`	30	10	Seconds between health probes (min: 10)
`traffic_manager_health_check_timeout`	10	5	Probe timeout in seconds (min: 5)
`traffic_manager_tolerated_failures`	3	0	Failures before unhealthy (0 = immediate)
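
The minimums listed in the table can be checked before a deployment is applied. The following is a minimal sketch, not part of the product: the `check_probe_settings` helper and the `MINIMUMS` mapping are hypothetical names, and only the limits stated in the table above (interval min 10, timeout min 5, tolerated failures from 0) are encoded. Azure may enforce further constraints that this sketch does not cover.

```python
# Minimums copied from the Configuration Variables table; nothing else is enforced.
MINIMUMS = {
    "traffic_manager_health_check_interval": 10,
    "traffic_manager_health_check_timeout": 5,
    "traffic_manager_tolerated_failures": 0,
}


def check_probe_settings(settings: dict) -> list:
    """Return a list of human-readable problems; an empty list means the values pass."""
    problems = []
    for name, minimum in MINIMUMS.items():
        value = settings.get(name)
        if value is None:
            problems.append(f"{name} is not set")
        elif value < minimum:
            problems.append(f"{name}={value} is below the minimum of {minimum}")
    return problems


# The testing values from the table pass the check.
testing_values = {
    "traffic_manager_health_check_interval": 10,
    "traffic_manager_health_check_timeout": 5,
    "traffic_manager_tolerated_failures": 0,
}
print(check_probe_settings(testing_values))
```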

How the Holdoff Works

Alert fires → HTTP trigger invokes the Azure Function
Function records first-failure timestamp as an annotation on the humio-operator Deployment (logscale.dr/degraded-since-epoch)
Function checks holdoff: now - degraded_since >= PRE_FAILOVER_FAILURE_SECONDS
If NOT enough time has passed → Function returns HTTP 202 and takes no action
If enough time has passed → Function patches humio-operator replicas 0→1, records logscale.dr/last-failover-epoch, and disables the primary TM endpoint
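
The decision described in the steps above can be sketched as a pure function. This is an illustration under stated assumptions, not the actual Azure Function code: the `decide` function, its parameter names, and the action labels are hypothetical, and running the cooldown check after the holdoff check follows the order of the phase table rather than any confirmed implementation detail.

```python
from typing import Optional, Tuple


def decide(
    now: int,
    degraded_since: Optional[int],
    last_failover: Optional[int],
    holdoff_seconds: int,
    cooldown_seconds: int,
) -> Tuple[str, int]:
    """Return (action, degraded_since value to store). All times are epoch seconds."""
    # First alert for this outage: remember when degradation was first seen
    # (the document stores this as logscale.dr/degraded-since-epoch).
    if degraded_since is None:
        degraded_since = now

    # Holdoff: not enough time has passed, so take no action (HTTP 202 in the document).
    if now - degraded_since < holdoff_seconds:
        return "wait", degraded_since

    # Cooldown: a recent failover (logscale.dr/last-failover-epoch) blocks another one.
    # The label "skip_cooldown" is hypothetical; the document only says the function aborts.
    if last_failover is not None and now - last_failover < cooldown_seconds:
        return "skip_cooldown", degraded_since

    # Otherwise the function would scale the operator and disable the primary endpoint.
    return "failover", degraded_since


print(decide(1000, None, None, 180, 300))  # first alert with the production holdoff
print(decide(1200, 1000, None, 180, 300))  # later alert, 200s after the first failure
print(decide(1000, None, None, 0, 60))     # testing holdoff of 0
```

With a holdoff of 0 the first invocation already satisfies the check, which matches the "immediate" behaviour of the testing configuration. With a non-zero holdoff, a "wait" result only turns into a failover on a later invocation.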

Note

The function is only invoked when the Azure Monitor alert fires. There is no timer-based retry mechanism.

Total Expected Time (Detection → Function Complete)

Configuration	Pre-Failover Wait	Total Time
Production	180s (3 min)	~4-5 minutes
Testing	0s (immediate)	~1-2 minutes

Post-Failover Timeline

After the Azure Function completes, additional time is required for full service restoration:

Stage	Duration
Operator creates HumioCluster resources	~30-60 seconds
LogScale pods scheduled and started	~60-120 seconds
LogScale initialization (Kafka connect, recovery)	~60-180 seconds
Traffic Manager detects secondary healthy	~30 seconds
Total (Function Complete → Service Ready)	~4-7 minutes

End-to-End Timeline Summary

Configuration	Detection → Function	Function → Service	Total
Production (pre_failover=180s)	~4-5 min	~4-7 min	~9-12 minutes
Testing (pre_failover=0s)	~1-2 min	~4-7 min	~5-9 minutes

Configuring for Testing vs Production

For faster testing (apply to both primary and secondary):

Primary cluster (primary-<region>.tfvars):

terraform

# Traffic Manager health check - faster detection
traffic_manager_health_check_interval = 10 # Probe every 10 seconds (min)
traffic_manager_health_check_timeout = 5 # 5 second timeout
traffic_manager_tolerated_failures = 0 # First failure = unhealthy

Secondary cluster (secondary-<region>.tfvars):

terraform

# Azure Function - immediate failover
dr_failover_function_pre_failover_failure_seconds = 0 # No holdoff delay
dr_failover_function_cooldown_seconds = 60 # 1 minute cooldown

Testing timeline: ~1-2 minutes (10s probe + 1min alert + immediate function)

For production (prevent false positives):

Primary cluster:

terraform

traffic_manager_health_check_interval = 30 # Probe every 30 seconds
traffic_manager_health_check_timeout = 10 # 10 second timeout
traffic_manager_tolerated_failures = 3 # 3 failures before unhealthy

Secondary cluster:

terraform

dr_failover_function_pre_failover_failure_seconds = 180 # 3 minute holdoff
dr_failover_function_cooldown_seconds = 300 # 5 minute cooldown

Production timeline: ~5-7 minutes (90s probe failures + 1min alert + 3min holdoff)
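
The arithmetic in the two timeline lines above can be reproduced with a short sketch. The `detection_to_trigger_seconds` helper is hypothetical, and the probe term is an assumption chosen to mirror the figures quoted in this section (10s for interval 10 with 0 tolerated failures, 90s for interval 30 with 3 tolerated failures). The 1-minute alert evaluation is the low end of the 1-5 minute range, and function execution time is not included.

```python
def detection_to_trigger_seconds(
    probe_interval: int,
    tolerated_failures: int,
    alert_eval_seconds: int,
    holdoff_seconds: int,
) -> int:
    """Estimate seconds from the first failed probe until the function acts."""
    # Assumption: at least one probe interval is always needed to see a failure.
    probe_seconds = probe_interval * max(tolerated_failures, 1)
    return probe_seconds + alert_eval_seconds + holdoff_seconds


testing = detection_to_trigger_seconds(10, 0, 60, 0)
production = detection_to_trigger_seconds(30, 3, 60, 180)

print(f"testing: {testing}s (~{testing / 60:.1f} min)")           # testing: 70s (~1.2 min)
print(f"production: {production}s (~{production / 60:.1f} min)")  # production: 330s (~5.5 min)
```

A longer alert evaluation period raises both estimates by the same amount.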
