# Troubleshooting

## Identify Current State

```shell
# Check Traffic Manager status
az network traffic-manager endpoint list \
  --profile-name <tm-profile> \
  --resource-group <rg> \
  -o table

# Check AKS cluster status
az aks show --resource-group <rg> --name <cluster-name> --query "provisioningState"

# Check LogScale pods
kubectl get pods -n logging --context <context>
```

## DNS Resolution

```shell
# Resolve global DR URL
dig <global-hostname>.<zone>
nslookup <global-hostname>.<zone>

# Trace full resolution
dig +trace <global-hostname>.<zone>
```

## External Connectivity

```shell
IP="$(dig +short <global-hostname>.<zone> A | head -n1)"

# Test TCP connectivity
nc -zv "$IP" 443

# Test HTTPS
curl -vk --connect-timeout 8 "https://<global-hostname>.<zone>/"
```
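During a failover, DNS and Traffic Manager may take a minute or two to converge, so a single `curl` can fail even when recovery is on track. A small retry helper makes the check less noisy; `check_url` is a hypothetical name (not part of the repo), and it treats any successful fetch as healthy:

```shell
# Hypothetical helper: poll a URL until it responds, up to N attempts.
check_url() {
  local url="$1" attempts="${2:-5}" i
  for i in $(seq 1 "$attempts"); do
    if curl -skf --connect-timeout 8 -o /dev/null "$url"; then
      echo "OK after $i attempt(s)"
      return 0
    fi
    sleep 2
  done
  echo "FAILED after $attempts attempt(s)" >&2
  return 1
}

# Example, using the same placeholder as above:
# check_url "https://<global-hostname>.<zone>/" 10
```

Because `curl -f` is used, HTTP 4xx/5xx responses count as failures, not just connection errors.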

## Traffic Manager Health

```shell
# Check endpoint status
az network traffic-manager endpoint show \
  --profile-name <tm-profile> \
  --name <endpoint-name> \
  --resource-group <rg> \
  --type externalEndpoints \
  --query endpointMonitorStatus
```
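The returned `endpointMonitorStatus` value is what Traffic Manager acts on when routing. A small lookup helper (the function name is ours; the status values and their meanings are taken from the Azure Traffic Manager monitoring documentation) can make the output self-explanatory:

```shell
# Hypothetical helper: translate documented endpointMonitorStatus values
# into plain English for the on-call engineer.
explain_tm_status() {
  case "$1" in
    Online)           echo "Healthy: endpoint passes probes and can receive traffic" ;;
    Degraded)         echo "Unhealthy: endpoint is failing health probes" ;;
    CheckingEndpoint) echo "Pending: first health probe has not completed yet" ;;
    Disabled)         echo "Endpoint is disabled; it is not probed or returned" ;;
    Inactive)         echo "Parent Traffic Manager profile is disabled" ;;
    Stopped)          echo "The app/service the endpoint points to is stopped" ;;
    *)                echo "Unrecognized status: $1" ;;
  esac
}

# Example: feed in the value from the query above
# explain_tm_status "$(az network traffic-manager endpoint show ... -o tsv)"
```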

## Kubernetes Components

```shell
# Check ingress controller
kubectl -n logging-ingress get pods
kubectl -n logging-ingress get svc -o wide

# Check LogScale pods
kubectl -n logging get pods
kubectl -n logging get humiocluster -o yaml

# Check humio-operator
kubectl -n logging get deploy humio-operator
```

## Storage Access

```shell
# Verify storage firewall allows secondary NAT IP
az storage account show \
  --name <primary-storage-account> \
  --resource-group <primary-rg> \
  --query "networkRuleSet.ipRules"

# Check encryption key secret exists
kubectl get secret logscale-storage-encryption-key -n logging

# Compare encryption key hashes
kubectl get secret -n logging logscale-storage-encryption-key \
  --context aks-primary \
  -o jsonpath='{.data.azure-storage-encryption-key}' | base64 -d | shasum -a 256

kubectl get secret -n logging logscale-storage-encryption-key \
  --context aks-secondary \
  -o jsonpath='{.data.azure-storage-encryption-key}' | base64 -d | shasum -a 256
```
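Eyeballing two 64-character digests is error-prone. A small wrapper (a sketch of ours, not part of the repo) turns the comparison into an explicit pass/fail, assuming the two digests have been captured into variables:

```shell
# Hypothetical wrapper: fail loudly when the two SHA-256 digests differ.
compare_key_hashes() {
  if [ "$1" = "$2" ]; then
    echo "MATCH: both clusters use the same encryption key"
  else
    echo "MISMATCH: check primary_remote_state_config in the secondary tfvars" >&2
    return 1
  fi
}

# Feed it the two digests captured by the commands above, e.g.:
# compare_key_hashes "$primary_hash" "$secondary_hash"
```

The non-zero exit code on mismatch also makes the check usable from CI or a pre-failback gate.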

## Alert Status

```shell
# Get subscription ID
SUB_ID=$(az account show --query id -o tsv)

# Check fired alerts
az rest --method get \
  --uri "https://management.azure.com/subscriptions/${SUB_ID}/providers/Microsoft.AlertsManagement/alerts?api-version=2019-05-05-preview" \
  -o json | jq '.value[] | select(.properties.essentials.monitorCondition == "Fired")'
```

## Function App Logs

```shell
# Get deployment credentials
CREDS=$(az functionapp deployment list-publishing-profiles \
  --name <function-app-name> \
  --resource-group <rg-name> \
  --query "[?publishMethod=='MSDeploy'].{user:userName,pass:userPWD}" -o tsv)
USER=$(echo "$CREDS" | cut -f1)
PASS=$(echo "$CREDS" | cut -f2)

# Check function logs
curl -s -u "${USER}:${PASS}" \
  "https://<function-app-name>.scm.azurewebsites.net/api/vfs/LogFiles/Application/Functions/Host/" | \
  jq -r '.[0].href' | xargs curl -s -u "${USER}:${PASS}" | tail -50
```
## Common Issues and Solutions

| Issue | Cause | Solution |
|---|---|---|
| TLS timeout on global URL | Traffic Manager returning unhealthy endpoint | Check endpoint status; verify the ingress controller is running |
| 403 from storage | NAT Gateway IP not in firewall rules | Re-apply secondary Terraform to update the primary storage firewall |
| Function timeout to AKS | Function IPs not in authorized ranges | Verify `azapi_update_resource.aks_authorized_ips_for_function` was applied |
| Encryption key mismatch | Remote state not configured | Verify `primary_remote_state_config` in the secondary tfvars |
| Alert keeps firing | Alert not cleared after failback | Close the alert via API: `az rest --method post --uri ".../changeState?newState=Closed"` |
| Profile shows "Degraded" | Secondary in standby mode | Expected behavior; verify the primary endpoint is "Online" |
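The "close alert" command in the table is abbreviated. A hedged expansion, based on the documented `changestate` action of the Microsoft.AlertsManagement REST API, looks like the following; both IDs are placeholders, and the `az rest` call itself is left commented out so the URI can be inspected first:

```shell
# Placeholders to fill in; the alert ID comes from the listing in Alert Status.
SUB_ID="<subscription-id>"
ALERT_ID="<alert-id>"

# Documented change-state action (same API version as the alert listing above).
URI="https://management.azure.com/subscriptions/${SUB_ID}/providers/Microsoft.AlertsManagement/alerts/${ALERT_ID}/changestate?api-version=2019-05-05-preview&newState=Closed"
echo "$URI"   # inspect before sending
# az rest --method post --uri "$URI"
```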
## Failover Simulation

The repository includes a comprehensive DR simulation script at `test/simulate-azure-dr-failover.sh`. This script automates controlled end-to-end testing of the Azure Monitor → Action Group → Function App chain.

Available scenarios:

| Scenario | Description |
|---|---|
| `status` | Show current cluster and Traffic Manager state |
| `primary-down` | Scale primary operator to 0, trigger failover |
| `failover` | Disable the Traffic Manager endpoint to trigger failover |
| `transient-outage` | Anti-flap test (brief outage, no failover expected) |
| `failback` | Restore primary and sync data from secondary (Phases 1-3 and 5) |
| `clean-secondary` | Reset secondary cluster post-failback (Phase 4) |

Usage:

```shell
# Auto-detect kubectl contexts from Terraform state
./test/simulate-azure-dr-failover.sh status

# Specify kubectl contexts directly (faster, skips auto-detection)
./test/simulate-azure-dr-failover.sh status \
  --primary-context aks-primary \
  --secondary-context aks-secondary

# Full failover test
./test/simulate-azure-dr-failover.sh primary-down

# Failback after failover
./test/simulate-azure-dr-failover.sh failback
./test/simulate-azure-dr-failover.sh clean-secondary
```

> **Note:** To trigger failover during testing, nginx-ingress must be scaled to 0 in addition to `humio-operator`. The `primary-down` scenario handles this automatically. Scaling only `humio-operator` may leave nginx's default backend responding to Traffic Manager health probes, preventing the endpoint from being marked unhealthy.
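If you need to reproduce this manually, the scale-downs the scenario performs can be sketched as a dry run. The commands are echoed rather than executed, and the ingress deployment name is an assumption; check yours with `kubectl -n logging-ingress get deploy` before running anything:

```shell
# Dry-run sketch: print the scale-down commands primary-down would issue.
# "ingress-nginx-controller" is an assumed deployment name, not verified.
for target in logging/humio-operator logging-ingress/ingress-nginx-controller; do
  ns="${target%%/*}" deploy="${target##*/}"
  echo kubectl --context aks-primary -n "$ns" scale "deploy/$deploy" --replicas=0
done
```

Remove the `echo` only once both deployment names are confirmed against your clusters.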

For the full runbook procedure and rollback notes, see the Operations Guide.