DR Failover Timing

This section documents the expected time from primary failure detection to secondary cluster activation.

Pre-failover validation runs for `dr_failover_function_pre_failover_failure_seconds` seconds before failover is triggered (set it to 0 to skip validation, for testing only).

Configurable Timing Variables

The following Terraform variables control DR failover timing:

| Variable | Default | Testing | Description |
|---|---|---|---|
| `dr_failover_function_primary_health_check_interval_seconds` | `60` | `10` | Health check probe interval |
| `dr_failover_function_alarm_pending_duration` | `"PT1M"` | `"PT1M"` (minimum) | Time the alarm must fire before triggering (OCI minimum is 1 minute) |
| `dr_failover_function_absent_detection_period` | `"2m"` | `"1m"` | Absent-metrics detection window |
| `dr_failover_function_pre_failover_failure_seconds` | `180` | `0` | Pre-failover validation duration |
| `dr_failover_function_alarm_repeat_notification_duration` | `"PT10M"` | `"PT5M"` | Re-notification interval |
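For a time-boxed failover test, these variables can be overridden on the command line instead of editing tfvars. A minimal sketch (variable names are taken from the table above; any other required variables are assumed to come from your existing tfvars):

```shell
# Testing-only overrides -- never use these values in production.
terraform plan \
  -var 'dr_failover_function_primary_health_check_interval_seconds=10' \
  -var 'dr_failover_function_pre_failover_failure_seconds=0' \
  -var 'dr_failover_function_absent_detection_period=1m' \
  -var 'dr_failover_function_alarm_repeat_notification_duration=PT5M'
```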

Note

Always use default values in production to prevent false failovers. See the Function Configuration (What You Tune From tfvars) section for the complete configuration reference, including the testing-only HCL example.

Timing Breakdown

Based on actual simulation results from a primary-down scenario:

| Stage | Configuration | Observed Duration |
|---|---|---|
| LB Backends Unhealthy Detection | NetworkPolicy applied | ~55 seconds |
| OCI Monitoring Alarm FIRING | 60s pending duration + metric aggregation | ~281 seconds (from LB unhealthy) |
| Function Execution + Operator Scaling | Authentication + K8s API PATCH | ~49 seconds |
| Pre-Failover Validation | `dr_failover_function_pre_failover_failure_seconds = 180` (default) | ~180 seconds (included in alarm latency) |
| Total (Failover Initiated → Operator Scaled) | | ~385 seconds (~6.4 minutes) |

Note

The alarm trigger latency (~281s) includes the OCI Monitoring alarm pending duration, metric aggregation window, and pre-failover validation. In testing mode (dr_failover_function_pre_failover_failure_seconds = 0), this can be significantly reduced.
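As a sanity check, the validation and pending windows account for most of the observed alarm latency; the remainder is metric aggregation and evaluation lag (a rough decomposition using the numbers above, not an exact model of OCI's alarm evaluation):

```shell
# Rough decomposition of the observed ~281s alarm trigger latency.
validation=180   # dr_failover_function_pre_failover_failure_seconds (default)
pending=60       # "PT1M" alarm pending duration
observed=281     # observed LB-unhealthy -> alarm FIRING latency
aggregation=$((observed - validation - pending))
echo "metric aggregation / evaluation lag: ~${aggregation}s"
```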

Post-Failover Timeline

After Function completes and Operator is scaled, additional time is required for full service restoration:

| Stage | Observed Duration |
|---|---|
| LogScale pod scheduled and started | ~43 seconds |
| Secondary endpoint healthy (LB backends) | ~1 second (after pod ready) |
| Total (Operator Scaled → Service Ready) | ~44 seconds |

Note

The fast pod startup and endpoint health are possible because the secondary cluster already has Kafka running. The LogScale pod only needs to start, connect to Kafka, and pass health checks.
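To reproduce the pod-readiness measurement, something like the following can be used (a sketch; the label and namespace match the verification commands later in this section, and the 120s timeout comfortably covers the ~43s observed startup):

```shell
# Block until the secondary LogScale pod reports Ready.
kubectl wait pod -n logging \
  -l app.kubernetes.io/name=humio \
  --for=condition=Ready --timeout=120s
```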

End-to-End Timeline Summary

Based on actual primary-down simulation results:

| Milestone | Elapsed Time | Delta |
|---|---|---|
| Failover initiated | +0s | - |
| Primary LB backends unhealthy | +55s | +55s |
| OCI Monitoring Alarm FIRING | +336s | +281s |
| Secondary operator scaled 0→1 | +385s | +49s |
| Secondary LogScale pod Ready | +428s | +43s |
| Secondary endpoint healthy (DR complete) | +429s | +1s |
| TOTAL FAILOVER TIME | ~429s (~7m 9s) | - |

Key Metrics:

  • Alarm Trigger Latency: 281s (from LB unhealthy to alarm FIRING)

  • Function + Scaling Time: 49s (from alarm FIRING to operator scaled)

  • Pod Startup Time: 43s (from operator scaled to pod ready)
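The per-stage deltas above sum to the total; a quick arithmetic check:

```shell
# Sum of the per-stage deltas from the end-to-end timeline.
total=$((55 + 281 + 49 + 43 + 1))
echo "total failover time: ${total}s"
```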

| Configuration | Failover → Operator Scaled | Operator Scaled → Service Ready | Total |
|---|---|---|---|
| Default (production) | ~385s (~6.4 min) | ~44s | ~429s (~7 min) |
| Testing (`pre_failover_failure_seconds = 0`) | ~150-200s (~2.5-3 min) | ~44s | ~200-250s (~3-4 min) |

DR Promotion Scaling Timeline

The following timings were observed during DR promotion testing from standby (1 pod) to active (8 pods):

| Phase | Duration | Details |
|---|---|---|
| Phase 1: Terraform Apply | ~2-3 min | Update `dr="active"`, `dr_use_dedicated_routing=false` |
| Pod Scale-up (1→8) | ~3-5 min | 3 digest + 3 ingest + 2 UI pods |
| All Pods Ready | ~5-8 min total | All 8 pods Running with 1/1 Ready |
| Phase 2: Enable Routing | ~1 min | Update `dr_use_dedicated_routing=true` |
| Service Endpoint Update | Immediate | Services switch to pool-specific selectors |
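The two-phase promotion can be driven from the CLI roughly as follows (a sketch; the variable names `dr` and `dr_use_dedicated_routing` are from the table above, and any other required variables are assumed to come from your tfvars):

```shell
# Phase 1: promote to active; keep routing on the shared selectors.
terraform apply -var 'dr=active' -var 'dr_use_dedicated_routing=false'

# Wait for all 8 pods (3 digest + 3 ingest + 2 UI) to become Ready.
kubectl wait pod -n logging -l app.kubernetes.io/name=humio \
  --for=condition=Ready --timeout=600s

# Phase 2: switch services to pool-specific selectors.
terraform apply -var 'dr=active' -var 'dr_use_dedicated_routing=true'
```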

Verification commands:

```shell
# Check all pods are running
kubectl get pods -n logging -l app.kubernetes.io/name=humio

# Verify node pool status
kubectl get humiocluster -n logging -o jsonpath='{.items[0].status.nodePoolStatus}' | python3 -m json.tool

# Test API endpoint
curl -sSk -o /dev/null -w "%{http_code}" https://<GLOBAL_FQDN>/api/v1/status

# Verify license
kubectl get humiocluster -n logging -o jsonpath='{.items[0].status.licenseStatus}'
```
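When scripting these checks, it is often more useful to poll the status endpoint until it returns 200 rather than checking once. A small helper (`wait_for_healthy` is a hypothetical name, not part of the repository; the timeout defaults to 300s):

```shell
# Poll a URL until it returns HTTP 200 or the timeout expires.
wait_for_healthy() {
  local url=$1
  local deadline=$(( $(date +%s) + ${2:-300} ))
  while [ "$(curl -sSk -o /dev/null -w '%{http_code}' "$url")" != "200" ]; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep 5
  done
}

# Example: wait_for_healthy "https://<GLOBAL_FQDN>/api/v1/status" 300
```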

Key observations from testing:

  • Zero downtime achieved using two-phase promotion

  • All pods shared the same bootstrap token (no split-brain)

  • License valid and recognized after promotion

  • Ingest and UI traffic working normally post-promotion

Comparison: AWS vs GCP vs OCI Timing

| Stage | AWS | GCP | OCI (Observed) |
|---|---|---|---|
| LB Backend Unhealthy Detection | ~30s (10s × 3 failures) | ~60-120s (60s interval) | ~55s |
| Alarm/Alert Trigger | ~60s | ~60s | ~281s (includes pre-failover validation) |
| Function Execution + Operator Scaling | ~10-20s | ~10-20s | ~49s |
| Pod Startup | ~60-120s | ~60-120s | ~43s |
| Total (Detection → Service Ready) | ~160-220s | ~190-320s | ~429s (~7 min) |

Note

OCI's longer alarm trigger latency is due to the default `pre_failover_failure_seconds = 180` setting, which validates that the primary is truly unhealthy before triggering failover.

AWS's faster detection is due to Route53's 10-second health check interval.

Testing vs Production Configuration

| Setting | Testing | Production |
|---|---|---|
| `dr_failover_function_pre_failover_failure_seconds` | `0` | `180` |
| Pre-failover validation | Skipped | Enabled |
| Protection against transient failures | None | ~3 minutes |

Important

Always use dr_failover_function_pre_failover_failure_seconds = 180 (default) in production to prevent false failovers.