Stage 3: Failover Testing

Test failover before relying on DR in production. The following steps simulate a primary failure and walk through promotion.

Simulate Primary Failure

Choose one of the following approaches.

Option A -- Scale down the LogScale operator on the primary:

shell

kubectl scale deployment humio-operator -n log --replicas=0 --context=<primary>

Option B -- Cordon all primary nodes:

shell

kubectl get nodes -o name --context=<primary> | xargs kubectl cordon --context=<primary>

Observe Failover

The failover mechanism depends on your configuration:

Mechanism	Behavior
Global Load Balancer	Health check fails on primary. GLB routes traffic to secondary within 30-60 seconds.
Cloud Function	Health checks fail continuously for longer than dr_cloud_function_pre_failover_failure_seconds (default: 180s). The function then triggers and scales up the standby node pools.
DNS only	Manual intervention required. Update DNS records to point to secondary.
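
To compare the observed switch-over time against the figures in the table, a small polling loop can be started right after the primary is taken down. This is a minimal sketch, not part of the product: the function name wait_for_healthy, the URL argument, the 5-second default interval, and the attempt limit are all assumptions. Point it at whatever health URL your load balancer or DNS name exposes.

```shell
# Hypothetical helper: poll a URL until it answers with an HTTP success
# status, then print how many seconds that took.
# Usage: wait_for_healthy <url> [interval_seconds] [max_attempts]
wait_for_healthy() {
  url=$1
  interval=${2:-5}
  max_attempts=${3:-120}
  start=$(date +%s)
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    # -f: treat HTTP errors as failure; -s: silent; --max-time: do not hang
    if curl -fs --max-time 5 -o /dev/null "$url"; then
      echo "healthy after $(( $(date +%s) - start ))s"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "still unhealthy after $max_attempts attempts" >&2
  return 1
}
```

The reported time is only as precise as the polling interval, so treat it as a rough check rather than a measurement.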

Promote Secondary to Active

Promotion uses a two-phase approach to avoid a traffic blackhole during the transition.

Why two phases? The dr_use_dedicated_routing variable controls how the NodePort service selects which pods receive traffic:

Value	Service Selector	Effect
false	Broad label (app.kubernetes.io/name: humio)	Any running LogScale pod can serve any request — ingest, search, or UI
true	Pool-specific labels (k8s-app: logscale-ingest, etc.)	Each traffic type routes only to pods on its designated node pool

If you set dr_use_dedicated_routing to true immediately, while some node pools are still scaling, the service has zero matching backends for the affected traffic type and those requests are dropped. Phase 1 avoids this by accepting traffic on whatever pods are ready.

Phase 1 -- Broad routing (safe while node pools scale up)

Update secondary.tfvars:

terraform

dr                       = "active"
dr_use_dedicated_routing = false   # any pod serves any traffic type while pools scale
# logscale_cluster_type remains "advanced" -- do NOT change it

Apply:

shell

terraform apply

This creates all node pools and starts LogScale pods. Because the service selector is broad, traffic flows to whichever pods come up first — no blackhole even if some pools are still scaling.

Important

The gcp_recover_from_* variables must remain in the secondary's tfvars during and after promotion. The recovery environment variables (GCP_RECOVER_FROM_BUCKET, etc.) are injected whenever gcp_recover_from_bucket is set, regardless of the dr value. This ensures LogScale can recover data from the primary's GCS bucket during promotion.

Warning

NEVER remove gcp_recover_from_* variables after promotion. Removing them changes the pod spec hash, causing the operator to recreate pods with new PVCs — resulting in DATA LOSS. These variables are harmlessly ignored after initial recovery (LogScale reads them once at snapshot load).
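
One way to reduce the risk of removing these variables by accident is a guard that runs before every apply. The following is a minimal sketch under stated assumptions: the function name check_recover_vars is hypothetical, and it checks only gcp_recover_from_bucket because that is the one variable named in full here. Extend the pattern to cover the other gcp_recover_from_* variables you use.

```shell
# Hypothetical pre-apply guard: refuse to continue if the tfvars file
# no longer assigns gcp_recover_from_bucket.
# Usage: check_recover_vars secondary.tfvars && terraform apply
check_recover_vars() {
  tfvars=$1
  # Match an uncommented assignment such as: gcp_recover_from_bucket = "..."
  if grep -Eq '^[[:space:]]*gcp_recover_from_bucket[[:space:]]*=' "$tfvars"; then
    echo "ok: gcp_recover_from_bucket is set in $tfvars"
  else
    echo "ABORT: gcp_recover_from_bucket missing from $tfvars" >&2
    return 1
  fi
}
```

The check is purely textual. It does not validate the value or detect variables supplied from another file.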

Phase 2 -- Dedicated routing (after all pools are healthy)

Proceed once all node pools are running and pods are scheduled on their designated pools.

Update secondary.tfvars:

terraform

dr_use_dedicated_routing = true   # ingest -> ingest pods, search -> UI pods, etc.

Apply:

shell

terraform apply

This switches to pool-specific selectors matching the production topology. Only do this after confirming all pools have healthy pods (kubectl get pods -n log -o wide).
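
The confirmation step can also be scripted so that it fails loudly when a pool has no backends. This is a sketch, not an official check: the function name check_pool_backends is hypothetical, and it takes the k8s-app label values as arguments because only logscale-ingest is named in this section. It counts pods in the Running phase, which is a weaker condition than Ready, and it runs against the current kubectl context.

```shell
# Hypothetical helper: for each k8s-app label value given, count Running
# pods in the log namespace and fail if any label has none.
# Usage: check_pool_backends logscale-ingest <other-pool-label-values...>
check_pool_backends() {
  rc=0
  for app in "$@"; do
    # One line of output per matching pod; strip padding some wc builds add.
    count=$(kubectl get pods -n log -l "k8s-app=$app" \
      --field-selector=status.phase=Running -o name | wc -l | tr -d ' ')
    echo "$app: $count running"
    [ "$count" -gt 0 ] || rc=1
  done
  return "$rc"
}
```

A non-zero exit status means at least one label matched no running pods, in which case switching to dedicated routing would drop that traffic type.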

Restore Primary (After Testing)

After verifying the secondary is serving traffic correctly:

1. Uncordon primary nodes (if you used Option B) or scale the operator back up (Option A).
2. Re-run terraform apply on the primary to restore it to active state.
3. Update DNS or GLB configuration to shift traffic back to the primary.
4. Demote the secondary back to standby by reverting secondary.tfvars to the original values.
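
For the uncordon step, a loop over the node list is one way to reverse Option B. This is a minimal sketch: the function name uncordon_all is hypothetical, and the context name you pass is whatever you substituted for <primary> earlier. For Option A, the equivalent is to rerun the kubectl scale command with the replica count your operator deployment had before the test, which is not stated here.

```shell
# Hypothetical helper: uncordon every node in the given kubectl context.
# Usage: uncordon_all <primary-context>
uncordon_all() {
  ctx=$1
  # "kubectl get nodes -o name" prints one node/<name> entry per line.
  for node in $(kubectl get nodes -o name --context="$ctx"); do
    kubectl uncordon "$node" --context="$ctx"
  done
}
```

Uncordoning a node that is already schedulable does no harm, so the helper can be rerun safely.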
