Stage 2: Failover

The steps described in the following sections are performed when a disaster scenario occurs, such as when a node pool becomes unavailable or the entire Azure region becomes inaccessible. Execute these steps when the primary cluster is unreachable and you need to activate the secondary cluster to restore service.

The architecture is shown in the following diagram:

Secondary Readiness Required Steps

Run on secondary to ensure it's ready to promote later.

On standby, the HumioCluster already declares nodeCount=1, but the Humio operator is scaled to 0. When the Humio Operator is scaled to 1 (by the Azure Function on health check failure or manually), it reconciles the HumioCluster and starts a single LogScale pod. That pod detects the AZURE_RECOVER_FROM_* environment variables, fetches the global snapshot from the primary Azure Blob container, patches bucket/region references to the secondary, marks the primary container read-only, clears host/partition assignments, and writes the patched snapshot to the secondary container.

Scale the Humio operator on secondary
- With Azure Function enabled (default): Traffic Manager health check failure → Azure Monitor Alert → Azure Function scales humio-operator replicas to 1. No further action needed.
- Manually (for example, for tests or if Azure Function is disabled):
  shell
  kubectl --context aks-secondary -n logging scale deploy humio-operator --replicas=1
Do not patch spec.nodeCount it is already set to 1. The operator must be running for the Humio pod to start.
What Happens After Operator Starts:
1. The Humio operator reconciles and creates the Humio pod.
2. The pod reads AZURE_RECOVER_FROM_* env vars.
3. It lists and downloads the latest global-snapshot.json from the primary container.
4. It patches the snapshot to reference the secondary container/region and credentials.
5. It loads the patched snapshot into memory.
6. The cluster starts up with the recovered metadata state; data segments remain in the primary container (read-only access).

Spot-check pods on secondary:

shell

kubectl --context aks-secondary -n logging get pods
# Expect humio-operator (1/1), one Humio pod once recovery starts, and Kafka components running

DNS Architecture and Traffic Flow

This section explains how users access LogScale from a DNS perspective and how traffic flows through the system during normal operations and failover scenarios.

Public Access URLs

Both access methods are public and reach the LogScale cluster via the internet:

URL Type	Example	Purpose
Cluster-specific	<prefix>.<region>.cloudapp.azure.com	Direct access to a specific cluster
Global DR	logscale.azure-dr.humio.net	Failover-aware access via Traffic Manager

DNS Resolution Chain

Cluster-specific URL (direct access):

text

<prefix>.<region>.cloudapp.azure.com
         │
         │  Azure DNS (cloudapp.azure.com zone)
         ▼
   <public-ip> (Public IP of cluster LB)

This is an Azure-assigned public DNS name with format <dns-label>.<region>.cloudapp.azure.com. It always resolves to the same cluster's load balancer IP.

Global DR URL (failover-aware access):

text

logscale.azure-dr.humio.net
         │
         │  AWS Route 53 CNAME
         ▼
<tm-profile>.trafficmanager.net
         │
         │  Azure Traffic Manager (Priority routing)
         ▼
   <primary-ip> (Primary) ──or──▶ <secondary-ip> (if primary unhealthy)

The global DR URL resolves through a three-tier DNS chain:

AWS Route 53: CNAME record pointing to Traffic Manager
Azure Traffic Manager: Health-based routing between clusters
Cluster Load Balancer: Public IP of the healthy cluster

Traffic Flow Diagrams

Normal Operation (Primary Healthy):

During Failover (Primary Unhealthy):

Why Two URL Types?

Aspect	Cluster-Specific URL	Global DR URL
Failover	No automatic failover	Automatic via Traffic Manager
Use case	Direct cluster management, debugging	Production client access
DNS TTL	Standard Azure DNS	Low TTL (60s) for fast failover
Health checks	None	Traffic Manager probes /api/v1/status

Client Configuration Recommendation

For production clients (log shippers, applications):

Use the global DR URL: https://<global_hostname>.<dns_zone>
Automatic failover with no client reconfiguration needed

For cluster administration:

Use cluster-specific URLs for direct access
Primary: https://<primary-prefix>.<region>.cloudapp.azure.com
Secondary: https://<secondary-prefix>.<region>.cloudapp.azure.com

Verification Commands:

shell

# Trace DNS resolution for global DR URL
dig +trace <global_hostname>.<dns_zone>

# Check Traffic Manager routing
dig <global_hostname>.trafficmanager.net

# Verify which cluster is currently serving traffic
curl -s "https://<global_hostname>.<dns_zone>/api/v1/status" | jq '.version'

# Compare with direct cluster access
curl -s "https://<prefix>.<region>.cloudapp.azure.com/api/v1/status" | jq '.version'

Traffic Routing During Failover

During failover, all traffic is directed to the single LogScale digest pod via Kubernetes label selectors. Understanding these labels is critical for troubleshooting traffic routing issues.

Label Selectors Used:

Label	Value	Purpose
`app.kubernetes.io/name`	humio	Generic label matching ANY LogScale pod
`humio.com/node-pool`	<cluster-name>	Pool-specific label for digest pods

Verify DR Recovery Succeeded (logs and snapshot)

Verifying failover recovery:

logscale

DataSnapshotLoader
| #kind != threaddumps

You should see messages similar to:

text

Checking bucket storage localAndHttpWereEmpty=true
Trying to fetch a global snapshot from bucket storage azure if one exists in bucket=secondarylogscalecontainer
Fetching global snapshot from bucket storage azure found no snapshot to fetch.
Trying to fetch a global snapshot as recovery source from bucket storage in azure
Trying to fetch a global snapshot from bucket storage azure if one exists in bucket=primarylogscalecontainer
Fetched global snapshot from bucket storage azure found snapshot with epochOffset={epoch=0 offset=1329609}
Fetched a global snapshot as recovery source from bucket storage in azure and got snapshot with epochOffset={epoch=0 offset=1329609} now patching...
Snapshots to choose from, last is better...; List({epoch=0 offset=1329609},azure)) using kafkaMinOffsetOpt of Some(KafkaMinOffset(...))
Selecting snapshot from source=azure with epochOffset={epoch=0 offset=1329609}
updateSnapshotForDisasterRecovery: Patching region using from=eastus to=westus on bucketId=1
updateSnapshotForDisasterRecovery: Patching bucket using from=primarylogscalecontainer to=secondarylogscalecontainer on bucketId=1
updateSnapshotForDisasterRecovery: Patching access configs from RECOVER_FROM on bucketId=1
updateSnapshotForDisasterRecovery: setting readOnly=true on bucketId=1 keyPrefix= new value for bucket=secondarylogscalecontainer

Inspect the snapshot file on a secondary pod:

kubectl config use-context aks-secondary
POD=$(kubectl -n logging get pod -l app.kubernetes.io/name=humio -o jsonpath='{.items[0].metadata.name}')
kubectl -n logging exec "$POD" -c humio -- cat /data/humio-data/global-data-snapshot.json | head -60
# Confirm bucket/region reflect the secondary and readOnly flags are correct

Ready to promote when:

Operator is 1/1 on secondary.
Kafka components exist on secondary after re-apply.
DataSnapshotLoader logs match the expected sequence above.
Snapshot file shows patched region/bucket pointing to secondary; encryption keys match.

Versions of this Page

Deployment Overview

Planning Your Deployment

Instance Sizing

Authentication and identity providers

Other authentication methods

Storage Architecture

Installing Using Containers

Installing On Bare Metal or Cloud Instance

Reference Architectures

Installing Load Balancers

Deploying Auxiliary Services

Configuration Settings

Managing Your Deployment

Testing Your Deployment

Stage 2: Failover

Secondary Readiness Required Steps

DNS Architecture and Traffic Flow

Traffic Routing During Failover

Verify DR Recovery Succeeded (logs and snapshot)

Enter search term