Stage 3: Failover Testing

Test failover before relying on DR in production. The following steps simulate a primary failure and walk through promotion.

Simulate Primary Failure

Choose one of the following approaches.

Option A -- Scale down the humio-operator deployment on the primary:

shell
kubectl scale deployment humio-operator -n log --replicas=0 --context=<primary>

Option B -- Cordon all primary nodes:

shell
kubectl get nodes -o name --context=<primary> | xargs kubectl cordon --context=<primary>

Observe Failover

The failover mechanism depends on your configuration:

| Mechanism | Behavior |
|---|---|
| Global Load Balancer | Health check fails on primary. GLB routes traffic to secondary within 30-60 seconds. |
| Cloud Function | Consecutive health check failures exceed dr_cloud_function_pre_failover_failure_seconds (default: 180s). Function triggers and scales up standby node pools. |
| DNS only | Manual intervention required. Update DNS records to point to secondary. |
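
To compare what you observe against the windows listed for each mechanism, it can help to time how long the public endpoint stays unhealthy after the simulated failure. The following is a minimal sketch, assuming curl is installed and that you supply the health URL your load balancer or Cloud Function probes. The function name, polling interval, and timeout are illustrative and are not part of the DR tooling.

```shell
# Poll a health URL until it answers successfully, then print the elapsed time.
# $1: URL to probe (assumption: you know which URL your health check uses)
# $2: seconds between probes (default 5)
# $3: give up after this many seconds (default 600)
wait_for_healthy() {
  url=$1
  interval=${2:-5}
  timeout=${3:-600}
  start=$(date +%s)
  while :; do
    # -f makes curl exit non-zero on HTTP error codes; --max-time bounds each probe
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "healthy after $(( $(date +%s) - start ))s"
      return 0
    fi
    if [ $(( $(date +%s) - start )) -ge "$timeout" ]; then
      echo "still unhealthy after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

Start the function immediately after triggering the failure, passing your own health URL as the first argument. With the Cloud Function mechanism, expect the reported time to exceed the configured failure threshold, since node pools still have to scale up after the function triggers.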

Promote Secondary to Active

Promotion uses a two-phase approach to avoid a traffic blackhole during the transition.

Why two phases? The dr_use_dedicated_routing variable controls how the NodePort service selects which pods receive traffic:

| Value | Service Selector | Effect |
|---|---|---|
| false | Broad label (app.kubernetes.io/name: humio) | Any running LogScale pod can serve any request: ingest, search, or UI |
| true | Pool-specific labels (k8s-app: logscale-ingest, etc.) | Each traffic type routes only to pods on its designated node pool |

If you set dr_use_dedicated_routing to true immediately, while some node pools are still scaling up, the service has zero matching backends for the affected traffic type and those requests are dropped. Phase 1 avoids this by accepting traffic on whatever pods are ready.

Phase 1 -- Broad routing (safe while node pools scale up)

Update secondary.tfvars:

terraform
dr                       = "active"
dr_use_dedicated_routing = false   # any pod serves any traffic type while pools scale
# logscale_cluster_type remains "advanced"; do NOT change it

Apply:

shell
terraform apply

This creates all node pools and starts LogScale pods. Because the service selector is broad, traffic flows to whichever pods come up first, so there is no blackhole even if some pools are still scaling.

Important

The gcp_recover_from_* variables must remain in the secondary's tfvars during and after promotion. The recovery environment variables (GCP_RECOVER_FROM_BUCKET, etc.) are injected whenever gcp_recover_from_bucket is set, regardless of the dr value. This ensures LogScale can recover data from the primary's GCS bucket during promotion.

Warning

NEVER remove gcp_recover_from_* variables after promotion. Removing them changes the pod spec hash, causing the operator to recreate pods with new PVCs, which results in DATA LOSS. These variables are harmlessly ignored after initial recovery (LogScale reads them once at snapshot load).
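
One way to reduce the risk of removing these variables by accident is a small guard that runs before every apply. This is a minimal sketch, not part of the DR tooling: the function name is hypothetical, and it checks only gcp_recover_from_bucket, the one variable named explicitly above. Extend the pattern to cover the other gcp_recover_from_* variables your deployment sets.

```shell
# Fail if the given tfvars file no longer assigns gcp_recover_from_bucket.
# $1: path to the tfvars file (for example secondary.tfvars)
check_recover_vars() {
  tfvars=$1
  # Match an uncommented assignment such as: gcp_recover_from_bucket = "..."
  if grep -Eq '^[[:space:]]*gcp_recover_from_bucket[[:space:]]*=' "$tfvars"; then
    echo "ok: gcp_recover_from_bucket is set in $tfvars"
  else
    echo "ABORT: gcp_recover_from_bucket is missing from $tfvars" >&2
    return 1
  fi
}
```

A typical invocation would be: check_recover_vars secondary.tfvars && terraform apply. A commented-out assignment counts as missing, so the apply is skipped in that case too.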

Phase 2 -- Dedicated routing (after all pools are healthy)

Once all node pools are running and pods are scheduled on their designated pools, update secondary.tfvars:

terraform
dr_use_dedicated_routing = true   # ingest->ingest pods, search->UI pods, etc.

Apply:

shell
terraform apply

This switches to pool-specific selectors matching the production topology. Only do this after confirming all pools have healthy pods (kubectl get pods -n log -o wide).
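
The manual check can also be expressed as a gate that blocks until every pool-specific selector has Ready pods. This is a minimal sketch under stated assumptions: the function name is hypothetical, the current kubectl context points at the secondary cluster, and you pass the selectors yourself. Only k8s-app=logscale-ingest appears in this guide, so take the remaining selectors from your own deployment.

```shell
# Wait until pods matching each label selector in namespace "log" are Ready.
# Arguments: one or more label selectors, e.g. k8s-app=logscale-ingest
require_ready_pods() {
  if [ "$#" -eq 0 ]; then
    echo "usage: require_ready_pods <selector> [<selector>...]" >&2
    return 2
  fi
  for selector in "$@"; do
    # kubectl wait exits non-zero on timeout or when no pod matches the selector
    kubectl wait pod -n log -l "$selector" \
      --for=condition=Ready --timeout=300s || return 1
  done
}
```

For example: require_ready_pods k8s-app=logscale-ingest && terraform apply. Because kubectl wait fails when a selector matches no pods, an empty pool stops the switch to dedicated routing instead of passing silently.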

Restore Primary (After Testing)

After verifying the secondary is serving traffic correctly:

  1. Uncordon primary nodes (if you used Option B) or scale the operator back up (Option A).

  2. Re-run terraform apply on the primary to restore it to active state.

  3. Update DNS or GLB configuration to shift traffic back to the primary.

  4. Demote the secondary back to standby by reverting secondary.tfvars to the original values.
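
Step 1 can be sketched as a helper that reverses whichever simulation you chose. This is an illustration only: the function name is hypothetical, and a replica count of 1 for humio-operator is an assumption, so use the value your deployment had before the test.

```shell
# Undo the failure simulation on the primary cluster.
# $1: kubectl context of the primary cluster
# $2: "A" if the operator was scaled down, "B" if nodes were cordoned
restore_primary_simulation() {
  ctx=$1
  case "$2" in
    A)
      # Assumption: the operator originally ran with a single replica
      kubectl scale deployment humio-operator -n log --replicas=1 --context="$ctx"
      ;;
    B)
      # Uncordon every node in the cluster, one at a time
      for node in $(kubectl get nodes -o name --context="$ctx"); do
        kubectl uncordon "$node" --context="$ctx" || return 1
      done
      ;;
    *)
      echo "usage: restore_primary_simulation <context> A|B" >&2
      return 2
      ;;
  esac
}
```

Steps 2 through 4 are left as manual steps here, since they depend on how your Terraform workspaces and DNS or GLB configuration are organized.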