DR Failover Timing
This section documents the expected time from primary failure detection to secondary cluster activation.
Pre-failover validation runs for dr_failover_function_pre_failover_failure_seconds seconds (set to 0 for testing only).
Configurable Timing Variables
The following Terraform variables control DR failover timing:
| Variable | Default | Testing | Description |
|---|---|---|---|
| dr_failover_function_primary_health_check_interval_seconds | 60 | 10 | Health check probe interval |
| dr_failover_function_alarm_pending_duration | "PT1M" | "PT1M" (minimum) | Time the alarm condition must persist before the alarm fires (OCI minimum is 1 minute) |
| dr_failover_function_absent_detection_period | "2m" | "1m" | Absent-metrics detection window |
| dr_failover_function_pre_failover_failure_seconds | 180 | 0 | Pre-failover validation duration |
| dr_failover_function_alarm_repeat_notification_duration | "PT10M" | "PT5M" | Re-notification interval |
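For convenience, the Testing column above can be collected into a single tfvars override block. This is an illustrative sketch assembled from the table values; see the Function Configuration (What You Tune From tfvars) section for the authoritative testing-only example.

```hcl
# Testing-only overrides. Never apply these in production.
dr_failover_function_primary_health_check_interval_seconds = 10
dr_failover_function_alarm_pending_duration                = "PT1M" # OCI minimum
dr_failover_function_absent_detection_period               = "1m"
dr_failover_function_pre_failover_failure_seconds          = 0      # skips pre-failover validation
dr_failover_function_alarm_repeat_notification_duration    = "PT5M"
```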
Note
Always use default values in production to prevent false failovers. See the Function Configuration (What You Tune From tfvars) section for the complete configuration reference, including the testing-only HCL example.
Timing Breakdown
Based on actual simulation results from a primary-down scenario:
| Stage | Configuration | Observed Duration |
|---|---|---|
| LB Backends Unhealthy Detection | NetworkPolicy applied | ~55 seconds |
| OCI Monitoring Alarm FIRING | 60s pending duration + metric aggregation | ~281 seconds (from LB unhealthy) |
| Function Execution + Operator Scaling | Authentication + K8s API PATCH | ~49 seconds |
| Pre-Failover Validation | dr_failover_function_pre_failover_failure_seconds = 180 (default) | ~180 seconds (included in alarm latency) |
| Total (Failover Initiated → Operator Scaled) | | ~385 seconds (~6.4 minutes) |
Note
The alarm trigger latency (~281s) includes the OCI Monitoring alarm pending duration,
metric aggregation window, and pre-failover validation. In testing mode
(dr_failover_function_pre_failover_failure_seconds = 0), this can be
significantly reduced.
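A back-of-the-envelope decomposition of the ~281 s alarm latency, assuming the pending duration and validation window are simply additive. The residual is attributed to metric aggregation plus overhead; this split is an estimate, not an observed measurement.

```shell
# Rough split of the observed ~281s alarm trigger latency.
OBSERVED_ALARM_LATENCY=281   # seconds, from the timing breakdown above
PENDING_DURATION=60          # PT1M alarm pending duration
PRE_FAILOVER_VALIDATION=180  # default pre_failover_failure_seconds
RESIDUAL=$((OBSERVED_ALARM_LATENCY - PENDING_DURATION - PRE_FAILOVER_VALIDATION))
echo "metric aggregation + overhead: ~${RESIDUAL}s"
```

With the defaults above this attributes roughly 41 seconds to aggregation and overhead, which is why setting the validation window to 0 in testing removes most, but not all, of the latency.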
Post-Failover Timeline
After Function completes and Operator is scaled, additional time is required for full service restoration:
| Stage | Observed Duration |
|---|---|
| LogScale pod scheduled and started | ~43 seconds |
| Secondary endpoint healthy (LB backends) | ~1 second (after pod ready) |
| Total (Operator Scaled → Service Ready) | ~44 seconds |
Note
The fast pod startup and endpoint health is possible because the secondary cluster already has Kafka running. The LogScale pod only needs to start, connect to Kafka, and pass health checks.
End-to-End Timeline Summary
Based on actual primary-down simulation results:
| Milestone | Elapsed Time | Delta |
|---|---|---|
| Failover initiated | +0s | - |
| Primary LB backends unhealthy | +55s | +55s |
| OCI Monitoring Alarm FIRING | +336s | +281s |
| Secondary operator scaled 0→1 | +385s | +49s |
| Secondary LogScale pod Ready | +428s | +43s |
| Secondary endpoint healthy (DR complete) | +429s | +1s |
| TOTAL FAILOVER TIME | ~429s (~7m 9s) | - |
Key Metrics:
Alarm Trigger Latency: 281s (from LB unhealthy to alarm FIRING)
Function + Scaling Time: 49s (from alarm FIRING to operator scaled)
Pod Startup Time: 43s (from operator scaled to pod ready)
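The milestone deltas above can be cross-checked by summing the per-stage durations; a minimal sketch using the observed values:

```shell
# Cross-check: the per-stage deltas should sum to the total failover time.
LB_UNHEALTHY=55       # primary LB backends unhealthy
ALARM_FIRING=281      # alarm trigger latency
OPERATOR_SCALED=49    # function execution + operator scaling
POD_READY=43          # LogScale pod startup
ENDPOINT_HEALTHY=1    # LB backend health after pod ready
echo "total: $((LB_UNHEALTHY + ALARM_FIRING + OPERATOR_SCALED + POD_READY + ENDPOINT_HEALTHY))s"
```

The sum matches the ~429 s total reported in the timeline table.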
| Configuration | Failover → Operator Scaled | Operator Scaled → Service Ready | Total |
|---|---|---|---|
| Default (production) | ~385s (~6.4 min) | ~44s | ~429s (~7 min) |
| Testing (pre_failover_failure_seconds = 0) | ~150-200s (~2.5-3 min) | ~44s | ~200-250s (~3-4 min) |
DR Promotion Scaling Timeline
The following timings were observed during DR promotion testing from standby (1 pod) to active (8 pods):
| Phase | Duration | Details |
|---|---|---|
| Phase 1: Terraform Apply | ~2-3 min | Update dr="active", dr_use_dedicated_routing=false |
| Pod Scale-up (1→8) | ~3-5 min | 3 digest + 3 ingest + 2 UI pods |
| All Pods Ready | ~5-8 min total | All 8 pods Running with 1/1 Ready |
| Phase 2: Enable Routing | ~1 min | Update dr_use_dedicated_routing=true |
| Service Endpoint Update | Immediate | Services switch to pool-specific selectors |
Verification commands:
# Check all pods are running
kubectl get pods -n logging -l app.kubernetes.io/name=humio
# Verify node pool status
kubectl get humiocluster -n logging -o jsonpath='{.items[0].status.nodePoolStatus}' | python3 -m json.tool
# Test API endpoint
curl -sSk -o /dev/null -w "%{http_code}" https://<GLOBAL_FQDN>/api/v1/status
# Verify license
kubectl get humiocluster -n logging -o jsonpath='{.items[0].status.licenseStatus}'

Key observations from testing:
Zero downtime achieved using two-phase promotion
All pods shared same bootstrap token (no split-brain)
License valid and recognized after promotion
Ingest and UI traffic working normally post-promotion
Comparison: AWS vs GCP vs OCI Timing
| Stage | AWS | GCP | OCI (Observed) |
|---|---|---|---|
| LB Backend Unhealthy Detection | ~30s (10s × 3 failures) | ~60-120s (60s interval) | ~55s |
| Alarm/Alert Trigger | ~60s | ~60s | ~281s (includes pre-failover validation) |
| Function Execution + Operator Scaling | ~10-20s | ~10-20s | ~49s |
| Pod Startup | ~60-120s | ~60-120s | ~43s |
| Total (Detection → Service Ready) | ~160-220s | ~190-320s | ~429s (~7 min) |
Note
OCI's longer alarm trigger latency is due to the default pre_failover_failure_seconds = 180 setting, which validates that the primary is truly unhealthy before triggering failover.
AWS's faster detection is due to Route53's 10-second health check interval.
Testing vs Production Configuration
| Setting | Testing | Production |
|---|---|---|
| dr_failover_function_pre_failover_failure_seconds | 0 | 180 |
| Pre-failover validation | Skipped | Enabled |
| Protection against transient failures | None | ~3 minutes |
Important
Always use dr_failover_function_pre_failover_failure_seconds = 180
(default) in production to prevent false failovers.