DR Failover Timing

This section documents the expected time from primary failure detection to secondary cluster activation.

Pre-failover validation runs for `dr_failover_function_pre_failover_failure_seconds` seconds before failover is triggered (set it to 0 to skip validation, for testing only).

Configurable Timing Variables

The following Terraform variables control DR failover timing:

| Variable | Default | Testing | Description |
|---|---|---|---|
| `dr_failover_function_primary_health_check_interval_seconds` | `60` | `10` | Health check probe interval |
| `dr_failover_function_alarm_pending_duration` | `"PT1M"` | `"PT1M"` (minimum) | Time the alarm must fire before triggering (OCI minimum is 1 minute) |
| `dr_failover_function_absent_detection_period` | `"2m"` | `"1m"` | Absent-metrics detection window |
| `dr_failover_function_pre_failover_failure_seconds` | `180` | `0` | Pre-failover validation duration |
| `dr_failover_function_alarm_repeat_notification_duration` | `"PT10M"` | `"PT5M"` | Re-notification interval |
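For a time-boxed failover test, these variables can be overridden on the command line instead of editing tfvars. A minimal sketch (variable names are taken from the table above; any other required variables are assumed to come from your existing tfvars):

```shell
# Testing-only overrides -- never use these values in production.
terraform plan \
  -var 'dr_failover_function_primary_health_check_interval_seconds=10' \
  -var 'dr_failover_function_pre_failover_failure_seconds=0' \
  -var 'dr_failover_function_absent_detection_period=1m' \
  -var 'dr_failover_function_alarm_repeat_notification_duration=PT5M'
```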

Note

Always use default values in production to prevent false failovers. See the Function Configuration (What You Tune From tfvars) section for the complete configuration reference, including the testing-only HCL example.

Timing Breakdown

Based on actual simulation results from a primary-down scenario:

| Stage | Configuration | Observed Duration |
|---|---|---|
| LB Backends Unhealthy Detection | NetworkPolicy applied | ~55 seconds |
| OCI Monitoring Alarm FIRING | 60s pending duration + metric aggregation | ~281 seconds (from LB unhealthy) |
| Function Execution + Operator Scaling | Authentication + K8s API PATCH | ~49 seconds |
| Pre-Failover Validation | `dr_failover_function_pre_failover_failure_seconds = 180` (default) | ~180 seconds (included in alarm latency) |
| Total (Failover Initiated → Operator Scaled) | | ~385 seconds (~6.4 minutes) |

Note

The alarm trigger latency (~281s) includes the OCI Monitoring alarm pending duration, metric aggregation window, and pre-failover validation. In testing mode (dr_failover_function_pre_failover_failure_seconds = 0), this can be significantly reduced.
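As a sanity check, the validation and pending windows account for most of the observed alarm latency; the remainder is metric aggregation and evaluation lag (a rough decomposition using the numbers above, not an exact model of OCI's alarm evaluation):

```shell
# Rough decomposition of the observed ~281s alarm trigger latency.
validation=180   # dr_failover_function_pre_failover_failure_seconds (default)
pending=60       # "PT1M" alarm pending duration
observed=281     # observed LB-unhealthy -> alarm FIRING latency
aggregation=$((observed - validation - pending))
echo "metric aggregation / evaluation lag: ~${aggregation}s"
```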

Post-Failover Timeline

After Function completes and Operator is scaled, additional time is required for full service restoration:

| Stage | Observed Duration |
|---|---|
| LogScale pod scheduled and started | ~43 seconds |
| Secondary endpoint healthy (LB backends) | ~1 second (after pod ready) |
| Total (Operator Scaled → Service Ready) | ~44 seconds |

Note

The fast pod startup and endpoint health are possible because the secondary cluster already has Kafka running. The LogScale pod only needs to start, connect to Kafka, and pass health checks.
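To reproduce the pod-readiness measurement, something like the following can be used (a sketch; the label and namespace match the verification commands later in this section, and the 120s timeout comfortably covers the ~43s observed startup):

```shell
# Block until the secondary LogScale pod reports Ready.
kubectl wait pod -n logging \
  -l app.kubernetes.io/name=humio \
  --for=condition=Ready --timeout=120s
```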

End-to-End Timeline Summary

Based on actual primary-down simulation results:

| Milestone | Elapsed Time | Delta |
|---|---|---|
| Failover initiated | +0s | - |
| Primary LB backends unhealthy | +55s | +55s |
| OCI Monitoring Alarm FIRING | +336s | +281s |
| Secondary operator scaled 0→1 | +385s | +49s |
| Secondary LogScale pod Ready | +428s | +43s |
| Secondary endpoint healthy (DR complete) | +429s | +1s |
| TOTAL FAILOVER TIME | ~429s (~7m 9s) | - |

Key Metrics:

  • Alarm Trigger Latency: 281s (from LB unhealthy to alarm FIRING)

  • Function + Scaling Time: 49s (from alarm FIRING to operator scaled)

  • Pod Startup Time: 43s (from operator scaled to pod ready)
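The per-stage deltas above sum to the total; a quick arithmetic check:

```shell
# Sum of the per-stage deltas from the end-to-end timeline.
total=$((55 + 281 + 49 + 43 + 1))
echo "total failover time: ${total}s"
```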

| Configuration | Failover → Operator Scaled | Operator Scaled → Service Ready | Total |
|---|---|---|---|
| Default (production) | ~385s (~6.4 min) | ~44s | ~429s (~7 min) |
| Testing (`pre_failover_failure_seconds = 0`) | ~150-200s (~2.5-3 min) | ~44s | ~200-250s (~3-4 min) |

DR Promotion Scaling Timeline

The following timings were observed during DR promotion testing from standby (1 pod) to active (8 pods):

| Phase | Duration | Details |
|---|---|---|
| Phase 1: Terraform Apply | ~2-3 min | Update `dr="active"`, `dr_use_dedicated_routing=false` |
| Pod Scale-up (1→8) | ~3-5 min | 3 digest + 3 ingest + 2 UI pods |
| All Pods Ready | ~5-8 min total | All 8 pods Running with 1/1 Ready |
| Phase 2: Enable Routing | ~1 min | Update `dr_use_dedicated_routing=true` |
| Service Endpoint Update | Immediate | Services switch to pool-specific selectors |
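The two-phase promotion can be driven from the CLI roughly as follows (a sketch; the variable names `dr` and `dr_use_dedicated_routing` are from the table above, and any other required variables are assumed to come from your tfvars):

```shell
# Phase 1: promote to active; keep routing on the shared selectors.
terraform apply -var 'dr=active' -var 'dr_use_dedicated_routing=false'

# Wait for all 8 pods (3 digest + 3 ingest + 2 UI) to become Ready.
kubectl wait pod -n logging -l app.kubernetes.io/name=humio \
  --for=condition=Ready --timeout=600s

# Phase 2: switch services to pool-specific selectors.
terraform apply -var 'dr=active' -var 'dr_use_dedicated_routing=true'
```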

Verification commands:

```shell
# Check all pods are running
kubectl get pods -n logging -l app.kubernetes.io/name=humio

# Verify node pool status
kubectl get humiocluster -n logging -o jsonpath='{.items[0].status.nodePoolStatus}' | python3 -m json.tool

# Test API endpoint
curl -sSk -o /dev/null -w "%{http_code}" https://<GLOBAL_FQDN>/api/v1/status

# Verify license
kubectl get humiocluster -n logging -o jsonpath='{.items[0].status.licenseStatus}'
```
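When scripting these checks, it is often more useful to poll the status endpoint until it returns 200 rather than checking once. A small helper (`wait_for_healthy` is a hypothetical name, not part of the repository; the timeout defaults to 300s):

```shell
# Poll a URL until it returns HTTP 200 or the timeout expires.
wait_for_healthy() {
  local url=$1
  local deadline=$(( $(date +%s) + ${2:-300} ))
  while [ "$(curl -sSk -o /dev/null -w '%{http_code}' "$url")" != "200" ]; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep 5
  done
}

# Example: wait_for_healthy "https://<GLOBAL_FQDN>/api/v1/status" 300
```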

Key observations from testing:

  • Zero downtime achieved using two-phase promotion

  • All pods shared the same bootstrap token (no split-brain)

  • License valid and recognized after promotion

  • Ingest and UI traffic working normally post-promotion

Comparison: AWS vs GCP vs OCI Timing

| Stage | AWS | GCP | OCI (Observed) |
|---|---|---|---|
| LB Backend Unhealthy Detection | ~30s (10s × 3 failures) | ~60-120s (60s interval) | ~55s |
| Alarm/Alert Trigger | ~60s | ~60s | ~281s (includes pre-failover validation) |
| Function Execution + Operator Scaling | ~10-20s | ~10-20s | ~49s |
| Pod Startup | ~60-120s | ~60-120s | ~43s |
| Total (Detection → Service Ready) | ~160-220s | ~190-320s | ~429s (~7 min) |

Note

OCI's longer alarm trigger latency is due to the default `pre_failover_failure_seconds = 180` setting, which validates that the primary is truly unhealthy before triggering failover.

AWS's faster detection is due to Route53's 10-second health check interval.

Testing vs Production Configuration

| Setting | Testing | Production |
|---|---|---|
| `dr_failover_function_pre_failover_failure_seconds` | `0` | `180` |
| Pre-failover validation | Skipped | Enabled |
| Protection against transient failures | None | ~3 minutes |

Important

Always use dr_failover_function_pre_failover_failure_seconds = 180 (default) in production to prevent false failovers.