# Troubleshooting

## Identify Current State

```shell
# Check Traffic Manager status
az network traffic-manager endpoint list \
  --profile-name <tm-profile> \
  --resource-group <rg> \
  -o table

# Check AKS cluster status
az aks show --resource-group <rg> --name <cluster-name> --query "provisioningState"

# Check LogScale pods
kubectl get pods -n logging --context <context>
```

## DNS Resolution

```shell
# Resolve global DR URL
dig <global-hostname>.<zone>
nslookup <global-hostname>.<zone>

# Trace full resolution
dig +trace <global-hostname>.<zone>
```

## External Connectivity

```shell
IP="$(dig +short <global-hostname>.<zone> A | head -n1)"

# Test TCP connectivity
nc -zv "$IP" 443

# Test HTTPS
curl -vk --connect-timeout 8 "https://<global-hostname>.<zone>/"
```
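During a failover, DNS and Traffic Manager may take a minute or two to converge, so a single `curl` can fail even when recovery is on track. A small retry helper makes the check less noisy; `check_url` is a hypothetical name (not part of the repo), and it treats any successful fetch as healthy:

```shell
# Hypothetical helper: poll a URL until it responds, up to N attempts.
check_url() {
  local url="$1" attempts="${2:-5}" i
  for i in $(seq 1 "$attempts"); do
    if curl -skf --connect-timeout 8 -o /dev/null "$url"; then
      echo "OK after $i attempt(s)"
      return 0
    fi
    sleep 2
  done
  echo "FAILED after $attempts attempt(s)" >&2
  return 1
}

# Example, using the same placeholder as above:
# check_url "https://<global-hostname>.<zone>/" 10
```

Because `curl -f` is used, HTTP 4xx/5xx responses count as failures, not just connection errors.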

## Traffic Manager Health

```shell
# Check endpoint status
az network traffic-manager endpoint show \
  --profile-name <tm-profile> \
  --name <endpoint-name> \
  --resource-group <rg> \
  --type externalEndpoints \
  --query endpointMonitorStatus
```
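The returned `endpointMonitorStatus` value is what Traffic Manager acts on when routing. A small lookup helper (the function name is ours; the status values and their meanings are taken from the Azure Traffic Manager monitoring documentation) can make the output self-explanatory:

```shell
# Hypothetical helper: translate documented endpointMonitorStatus values
# into plain English for the on-call engineer.
explain_tm_status() {
  case "$1" in
    Online)           echo "Healthy: endpoint passes probes and can receive traffic" ;;
    Degraded)         echo "Unhealthy: endpoint is failing health probes" ;;
    CheckingEndpoint) echo "Pending: first health probe has not completed yet" ;;
    Disabled)         echo "Endpoint is disabled; it is not probed or returned" ;;
    Inactive)         echo "Parent Traffic Manager profile is disabled" ;;
    Stopped)          echo "The app/service the endpoint points to is stopped" ;;
    *)                echo "Unrecognized status: $1" ;;
  esac
}

# Example: feed in the value from the query above
# explain_tm_status "$(az network traffic-manager endpoint show ... -o tsv)"
```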

## Kubernetes Components

```shell
# Check ingress controller
kubectl -n logging-ingress get pods
kubectl -n logging-ingress get svc -o wide

# Check LogScale pods
kubectl -n logging get pods
kubectl -n logging get humiocluster -o yaml

# Check humio-operator
kubectl -n logging get deploy humio-operator
```

## Storage Access

```shell
# Verify storage firewall allows secondary NAT IP
az storage account show \
  --name <primary-storage-account> \
  --resource-group <primary-rg> \
  --query "networkRuleSet.ipRules"

# Check encryption key secret exists
kubectl get secret logscale-storage-encryption-key -n logging

# Compare encryption key hashes
kubectl get secret -n logging logscale-storage-encryption-key \
  --context aks-primary \
  -o jsonpath='{.data.azure-storage-encryption-key}' | base64 -d | shasum -a 256

kubectl get secret -n logging logscale-storage-encryption-key \
  --context aks-secondary \
  -o jsonpath='{.data.azure-storage-encryption-key}' | base64 -d | shasum -a 256
```
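Eyeballing two 64-character digests is error-prone. A small wrapper (a sketch of ours, not part of the repo) turns the comparison into an explicit pass/fail, assuming the two digests have been captured into variables:

```shell
# Hypothetical wrapper: fail loudly when the two SHA-256 digests differ.
compare_key_hashes() {
  if [ "$1" = "$2" ]; then
    echo "MATCH: both clusters use the same encryption key"
  else
    echo "MISMATCH: check primary_remote_state_config in the secondary tfvars" >&2
    return 1
  fi
}

# Feed it the two digests captured by the commands above, e.g.:
# compare_key_hashes "$primary_hash" "$secondary_hash"
```

The non-zero exit code on mismatch also makes the check usable from CI or a pre-failback gate.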

## Alert Status

```shell
# Get subscription ID
SUB_ID=$(az account show --query id -o tsv)

# Check fired alerts
az rest --method get \
  --uri "https://management.azure.com/subscriptions/${SUB_ID}/providers/Microsoft.AlertsManagement/alerts?api-version=2019-05-05-preview" \
  -o json | jq '.value[] | select(.properties.essentials.monitorCondition == "Fired")'
```

## Function App Logs

```shell
# Get deployment credentials
CREDS=$(az functionapp deployment list-publishing-profiles \
  --name <function-app-name> \
  --resource-group <rg-name> \
  --query "[?publishMethod=='MSDeploy'].{user:userName,pass:userPWD}" -o tsv)
USER=$(echo "$CREDS" | cut -f1)
PASS=$(echo "$CREDS" | cut -f2)

# Check function logs
curl -s -u "${USER}:${PASS}" \
  "https://<function-app-name>.scm.azurewebsites.net/api/vfs/LogFiles/Application/Functions/Host/" | \
  jq -r '.[0].href' | xargs curl -s -u "${USER}:${PASS}" | tail -50
```
## Common Issues and Solutions

| Issue | Cause | Solution |
|---|---|---|
| TLS timeout on global URL | Traffic Manager returning unhealthy endpoint | Check endpoint status; verify the ingress controller is running |
| 403 from storage | NAT Gateway IP not in firewall rules | Re-apply secondary Terraform to update the primary storage firewall |
| Function timeout to AKS | Function IPs not in authorized ranges | Verify `azapi_update_resource.aks_authorized_ips_for_function` was applied |
| Encryption key mismatch | Remote state not configured | Verify `primary_remote_state_config` in the secondary tfvars |
| Alert keeps firing | Alert not cleared after failback | Close the alert via API: `az rest --method post --uri ".../changeState?newState=Closed"` |
| Profile shows "Degraded" | Secondary in standby mode | Expected behavior; verify the primary endpoint is "Online" |
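The "close alert" command in the table is abbreviated. A hedged expansion, based on the documented `changestate` action of the Microsoft.AlertsManagement REST API, looks like the following; both IDs are placeholders, and the `az rest` call itself is left commented out so the URI can be inspected first:

```shell
# Placeholders to fill in; the alert ID comes from the listing in Alert Status.
SUB_ID="<subscription-id>"
ALERT_ID="<alert-id>"

# Documented change-state action (same API version as the alert listing above).
URI="https://management.azure.com/subscriptions/${SUB_ID}/providers/Microsoft.AlertsManagement/alerts/${ALERT_ID}/changestate?api-version=2019-05-05-preview&newState=Closed"
echo "$URI"   # inspect before sending
# az rest --method post --uri "$URI"
```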
## Failover Simulation

The repository includes a comprehensive DR simulation script at `test/simulate-azure-dr-failover.sh`. This script automates controlled end-to-end testing of the Azure Monitor → Action Group → Function App chain.

Available scenarios:

| Scenario | Description |
|---|---|
| `status` | Show current cluster and Traffic Manager state |
| `primary-down` | Scale primary operator to 0, trigger failover |
| `failover` | Disable the Traffic Manager endpoint to trigger failover |
| `transient-outage` | Anti-flap test (brief outage, no failover expected) |
| `failback` | Restore primary and sync data from secondary (Phases 1-3 and 5) |
| `clean-secondary` | Reset secondary cluster post-failback (Phase 4) |

Usage:

```shell
# Auto-detect kubectl contexts from Terraform state
./test/simulate-azure-dr-failover.sh status

# Specify kubectl contexts directly (faster, skips auto-detection)
./test/simulate-azure-dr-failover.sh status \
  --primary-context aks-primary \
  --secondary-context aks-secondary

# Full failover test
./test/simulate-azure-dr-failover.sh primary-down

# Failback after failover
./test/simulate-azure-dr-failover.sh failback
./test/simulate-azure-dr-failover.sh clean-secondary
```

> **Note:** To trigger failover during testing, nginx-ingress must be scaled to 0 in addition to `humio-operator`. The `primary-down` scenario handles this automatically. Scaling only `humio-operator` may leave nginx's default backend responding to Traffic Manager health probes, preventing the endpoint from being marked unhealthy.
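If you need to reproduce this manually, the scale-downs the scenario performs can be sketched as a dry run. The commands are echoed rather than executed, and the ingress deployment name is an assumption; check yours with `kubectl -n logging-ingress get deploy` before running anything:

```shell
# Dry-run sketch: print the scale-down commands primary-down would issue.
# "ingress-nginx-controller" is an assumed deployment name, not verified.
for target in logging/humio-operator logging-ingress/ingress-nginx-controller; do
  ns="${target%%/*}" deploy="${target##*/}"
  echo kubectl --context aks-primary -n "$ns" scale "deploy/$deploy" --replicas=0
done
```

Remove the `echo` only once both deployment names are confirmed against your clusters.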

For the full runbook procedure and rollback notes, see the Operations Guide.