Stage 3: Failover Testing
Test failover before relying on DR in production. The following steps simulate a primary failure and walk through promotion.
Simulate Primary Failure
Choose one of the following approaches.
Option A -- Scale down primary LogScale:
kubectl scale deployment humio-operator -n log --replicas=0 --context=<primary>
Option B -- Cordon all primary nodes:
kubectl cordon --all --context=<primary>
Observe Failover
The failover mechanism depends on your configuration:
| Mechanism | Behavior |
|---|---|
| Global Load Balancer | Health check fails on primary. GLB routes traffic to secondary within 30-60 seconds. |
| Cloud Function | Consecutive health check failures exceed dr_cloud_function_pre_failover_failure_seconds (default: 180s). Function triggers and scales up standby node pools. |
| DNS only | Manual intervention required. Update DNS records to point to secondary. |
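To compare the observed failover time against the windows in the table, one option is to poll the public endpoint from outside both clusters and record how long it takes to answer again. The following is a minimal sketch, not part of the DR tooling: the function name wait_for_healthy is hypothetical, and the URL in the usage line is a placeholder (the /api/v1/status path is an assumption about your LogScale endpoint; substitute whatever your health check actually probes).

```shell
# Poll a health URL until it returns HTTP 200 and report how long that took.
# Usage: wait_for_healthy <url> [timeout_seconds] [interval_seconds]
wait_for_healthy() {
  local url="$1" timeout="${2:-300}" interval="${3:-5}"
  local start now code
  start=$(date +%s)
  while :; do
    # -s: silent, -o /dev/null: discard body, -w: print only the status code,
    # -m 5: give up on a single request after 5 seconds.
    code=$(curl -s -o /dev/null -w '%{http_code}' -m 5 "$url" || true)
    now=$(date +%s)
    if [ "$code" = "200" ]; then
      echo "healthy after $((now - start))s"
      return 0
    fi
    if [ $((now - start)) -ge "$timeout" ]; then
      echo "not healthy within ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Example (placeholder URL, 10-minute timeout, 5-second interval):
#   wait_for_healthy "https://<logscale-hostname>/api/v1/status" 600 5
```

Start the poll immediately after simulating the failure. With the DNS-only mechanism the function will simply time out until the records are updated by hand.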
Promote Secondary to Active
Promotion uses a two-phase approach to avoid a traffic blackhole during the transition.
Why two phases? The dr_use_dedicated_routing variable
controls how the NodePort service selects which pods
receive traffic:
| Value | Service Selector | Effect |
|---|---|---|
| false | Broad label (app.kubernetes.io/name: humio) | Any running LogScale pod can serve any request (ingest, search, or UI) |
| true | Pool-specific labels (k8s-app: logscale-ingest, etc.) | Each traffic type routes only to pods on its designated node pool |
If you set it to true immediately while some node pools are still scaling, the service has zero matching backends for that traffic type and requests are dropped. Phase 1 avoids this by accepting traffic on whatever pods are ready.
Phase 1 -- Broad routing (safe while node pools scale up)
Update secondary.tfvars:
dr = "active"
dr_use_dedicated_routing = false # any pod serves any traffic type while pools scale
# logscale_cluster_type remains "advanced"; do NOT change it
Apply:
terraform apply
This creates all node pools and starts LogScale pods. Because the service selector is broad, traffic flows to whichever pods come up first, so there is no blackhole even if some pools are still scaling.
Important
The gcp_recover_from_* variables must remain in the
secondary's tfvars during and after promotion. The recovery environment
variables (GCP_RECOVER_FROM_BUCKET, etc.) are injected
whenever gcp_recover_from_bucket is set, regardless of
the dr value. This ensures LogScale can recover data from the
primary's GCS bucket during promotion.
Warning
NEVER remove gcp_recover_from_* variables after
promotion. Removing them changes the pod spec hash, causing the operator
to recreate pods with new PVCs, resulting in DATA LOSS. These
variables are harmlessly ignored after initial recovery (LogScale
reads them once at snapshot load).
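Because removing these variables is destructive, it can help to gate every apply on the secondary behind a quick check of the tfvars file. The sketch below is an illustration only: check_recovery_vars is a hypothetical helper, and it tests only gcp_recover_from_bucket, the one variable named explicitly above. Extend the pattern to the other gcp_recover_from_* variables your configuration uses.

```shell
# Refuse to proceed unless the tfvars file still sets gcp_recover_from_bucket.
# Usage: check_recovery_vars <path-to-tfvars>
check_recovery_vars() {
  local file="$1"
  # Match an uncommented assignment such as:  gcp_recover_from_bucket = "..."
  # A commented-out line (leading '#') does not match.
  if grep -Eq '^[[:space:]]*gcp_recover_from_bucket[[:space:]]*=' "$file"; then
    echo "ok: gcp_recover_from_bucket is set in $file"
  else
    echo "error: gcp_recover_from_bucket missing from $file; do not apply" >&2
    return 1
  fi
}

# Example: only run the apply when the check passes.
#   check_recovery_vars secondary.tfvars && terraform apply
```

The check is purely textual and does not validate the value, so it catches an accidental deletion but not a wrong bucket name.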
Phase 2 -- Dedicated routing (after all pools are healthy)
Proceed once all node pools are running and pods are scheduled on their designated pools.
Update secondary.tfvars:
dr_use_dedicated_routing = true # ingest→ingest pods, search→UI pods, etc.
Apply:
terraform apply
This switches to pool-specific selectors matching the production topology.
Only do this after confirming all pools have healthy pods
(kubectl get pods -n log -o wide).
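The manual check can also be scripted so that Phase 2 is applied only when every pool-specific selector already matches at least one pod. This is a rough sketch under stated assumptions: pools_have_running_pods is a hypothetical helper, it runs against the current kubectl context (add --context as needed), and it treats the Running phase as a proxy for "healthy", which is weaker than a readiness check. Only the k8s-app=logscale-ingest label comes from the table above; supply the remaining pool labels from your own deployment.

```shell
# Succeed only if every given label selector matches at least one Running pod.
# Usage: pools_have_running_pods <namespace> <selector> [<selector> ...]
pools_have_running_pods() {
  local ns="$1" selector count missing=0
  shift
  for selector in "$@"; do
    # -o name prints one line per pod; count the lines.
    # tr strips the padding that some wc implementations add.
    count=$(kubectl get pods -n "$ns" -l "$selector" \
      --field-selector=status.phase=Running -o name | wc -l | tr -d ' ')
    echo "$selector: $count running pod(s)"
    [ "$count" -gt 0 ] || missing=1
  done
  return "$missing"
}

# Example: pass one selector per dedicated pool before switching routing.
#   pools_have_running_pods log k8s-app=logscale-ingest <other-pool-selectors>
```

A selector that matches zero pods here is exactly the case that would leave the service with no backends once dedicated routing is enabled.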
Restore Primary (After Testing)
After verifying the secondary is serving traffic correctly:
1. Uncordon primary nodes (if you used Option B) or scale the operator back up (Option A).
2. Re-run terraform apply on the primary to restore it to active state.
3. Update DNS or GLB configuration to shift traffic back to the primary.
4. Demote the secondary back to standby by reverting secondary.tfvars to the original values.