Troubleshooting
Identify Current State
# Check Traffic Manager status
az network traffic-manager endpoint list \
--profile-name <tm-profile> \
--resource-group <rg> \
-o table
# Check AKS cluster status
az aks show --resource-group <rg> --name <cluster-name> --query "provisioningState"
# Check LogScale pods
kubectl get pods -n logging --context <context>DNS Resolution
# Resolve global DR URL
dig <global-hostname>.<zone>
nslookup <global-hostname>.<zone>
# Trace full resolution
dig +trace <global-hostname>.<zone>External Connectivity
IP="$(dig +short <global-hostname>.<zone> A | head -n1)"
# Test TCP connectivity
nc -zv "$IP" 443
# Test HTTPS
curl -vk --connect-timeout 8 "https://<global-hostname>.<zone>/"Traffic Manager Health
# Check endpoint status
az network traffic-manager endpoint show \
--profile-name <tm-profile> \
--name <endpoint-name> \
--resource-group <rg> \
--type externalEndpoints \
--query endpointMonitorStatusKubernetes Components
# Check ingress controller
kubectl -n logging-ingress get pods
kubectl -n logging-ingress get svc -o wide
# Check LogScale pods
kubectl -n logging get pods
kubectl -n logging get humiocluster -o yaml
# Check humio-operator
kubectl -n logging get deploy humio-operatorStorage Access
# Verify storage firewall allows secondary NAT IP
az storage account show \
--name <primary-storage-account> \
--resource-group <primary-rg> \
--query "networkRuleSet.ipRules"
# Check encryption key secret exists
kubectl get secret logscale-storage-encryption-key -n logging
# Compare encryption key hashes
kubectl get secret -n logging logscale-storage-encryption-key \
--context aks-primary \
-o jsonpath='{.data.azure-storage-encryption-key}' | base64 -d | shasum -a 256
kubectl get secret -n logging logscale-storage-encryption-key \
--context aks-secondary \
-o jsonpath='{.data.azure-storage-encryption-key}' | base64 -d | shasum -a 256Alert Status
# Get subscription ID
SUB_ID=$(az account show --query id -o tsv)
# Check fired alerts
az rest --method get \
--uri "https://management.azure.com/subscriptions/${SUB_ID}/providers/Microsoft.AlertsManagement/alerts?api-version=2019-05-05-preview" \
-o json | jq '.value[] | select(.properties.essentials.monitorCondition == "Fired")'Function App Logs
# Get deployment credentials
CREDS=$(az functionapp deployment list-publishing-profiles \
--name <function-app-name> \
--resource-group <rg-name> \
--query "[?publishMethod=='MSDeploy'].{user:userName,pass:userPWD}" -o tsv)
USER=$(echo "$CREDS" | cut -f1)
PASS=$(echo "$CREDS" | cut -f2)
# Check function logs
curl -s -u "${USER}:${PASS}" \
"https://<function-app-name>.scm.azurewebsites.net/api/vfs/LogFiles/Application/Functions/Host/" | \
jq -r '.[0].href' | xargs curl -s -u "${USER}:${PASS}" | tail -50Common Issues and Solutions
| Issue | Cause | Solution |
|---|---|---|
| TLS timeout on global URL | Traffic Manager returning unhealthy endpoint | Check endpoint status; verify ingress controller is running |
| 403 from storage | NAT Gateway IP not in firewall rules | Re-apply secondary Terraform to update primary storage firewall |
| Function timeout to AKS | Function IPs not in authorized ranges |
Verify
azapi_update_resource.aks_authorized_ips_for_function
applied
|
| Encryption key mismatch | Remote state not configured |
Verify primary_remote_state_config in secondary
tfvars
|
| Alert keeps firing | Alert not cleared after failback |
Close alert via API: az rest --method post --uri
".../changeState?newState=Closed"
|
| Profile shows "Degraded" | Secondary in standby mode | Expected behavior; verify primary endpoint is "Online" |
Failover Simulation
The repository includes a comprehensive DR simulation script at
test/simulate-azure-dr-failover.sh. This script
automates controlled end-to-end testing of the Azure Monitor โ Action
Group โ Function App chain.
Available scenarios:
| Scenario | Description |
|---|---|
| status | Show current cluster and Traffic Manager state |
| primary-down | Scale primary operator to 0, trigger failover |
| failover | Disable Traffic Manager endpoint to trigger failover |
| transient-outage | Anti-flap test (brief outage, no failover expected) |
| failback | Restore primary and sync data from secondary (Phases 1-3 and 5) |
| clean-secondary | Reset secondary cluster post-failback (Phase 4) |
Usage:
# Auto-detect kubectl contexts from Terraform state
./test/simulate-azure-dr-failover.sh status
# Specify kubectl contexts directly (faster, skips auto-detection)
./test/simulate-azure-dr-failover.sh status --primary-context aks-primary --secondary-context aks-secondary
# Full failover test
./test/simulate-azure-dr-failover.sh primary-down
# Failback after failover
./test/simulate-azure-dr-failover.sh failback
./test/simulate-azure-dr-failover.sh clean-secondaryNote
To trigger failover during testing, nginx-ingress must be scaled to 0 in addition to humio-operator. The primary-down scenario handles this automatically. Scaling only humio-operator may leave nginx's default backend responding to Traffic Manager health probes, preventing the endpoint from being marked unhealthy.
For the full runbook procedure and rollback notes, see the Operations Guide.