Troubleshooting
Identify Current State
shell
# Check Traffic Manager status
az network traffic-manager endpoint list \
  --profile-name <tm-profile> \
  --resource-group <rg> \
  -o table

# Check AKS cluster status
az aks show --resource-group <rg> --name <cluster-name> --query "provisioningState"

# Check LogScale pods
kubectl get pods -n logging --context <context>

DNS Resolution
shell
# Resolve global DR URL
dig <global-hostname>.<zone>
nslookup <global-hostname>.<zone>

# Trace full resolution
dig +trace <global-hostname>.<zone>

External Connectivity
shell
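# Resolve the first IPv4 address. dig +short also prints CNAME targets
# (the Traffic Manager chain), so keep only lines that look like addresses.
IP="$(dig +short <global-hostname>.<zone> A | grep -E '^[0-9]+(\.[0-9]+){3}$' | head -n1)"

# Test TCP connectivity
nc -zv "$IP" 443

# Test HTTPS
curl -vk --connect-timeout 8 "https://<global-hostname>.<zone>/"

Traffic Manager Health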
shell
# Check endpoint status
az network traffic-manager endpoint show \
  --profile-name <tm-profile> \
  --name <endpoint-name> \
  --resource-group <rg> \
  --type externalEndpoints \
  --query endpointMonitorStatus

Kubernetes Components
shell
# Check ingress controller
kubectl -n logging-ingress get pods
kubectl -n logging-ingress get svc -o wide

# Check LogScale pods
kubectl -n logging get pods
kubectl -n logging get humiocluster -o yaml

# Check humio-operator
kubectl -n logging get deploy humio-operator

Storage Access
shell
# Verify storage firewall allows secondary NAT IP
az storage account show \
  --name <primary-storage-account> \
  --resource-group <primary-rg> \
  --query "networkRuleSet.ipRules"

# Check encryption key secret exists
kubectl get secret logscale-storage-encryption-key -n logging

# Compare encryption key hashes (the two outputs must be identical)
kubectl get secret -n logging logscale-storage-encryption-key \
  --context aks-primary \
  -o jsonpath='{.data.azure-storage-encryption-key}' | base64 -d | shasum -a 256
kubectl get secret -n logging logscale-storage-encryption-key \
  --context aks-secondary \
  -o jsonpath='{.data.azure-storage-encryption-key}' | base64 -d | shasum -a 256

Alert Status
shell
# Get subscription ID
SUB_ID=$(az account show --query id -o tsv)

# Check fired alerts
az rest --method get \
  --uri "https://management.azure.com/subscriptions/${SUB_ID}/providers/Microsoft.AlertsManagement/alerts?api-version=2019-05-05-preview" \
  -o json | jq '.value[] | select(.properties.essentials.monitorCondition == "Fired")'

Function App Logs
shell
# Get deployment credentials
# (stored in SCM_USER/SCM_PASS so the shell's own $USER is not overwritten)
CREDS=$(az functionapp deployment list-publishing-profiles \
  --name <function-app-name> \
  --resource-group <rg-name> \
  --query "[?publishMethod=='MSDeploy'].{user:userName,pass:userPWD}" -o tsv)
SCM_USER=$(echo "$CREDS" | cut -f1)
SCM_PASS=$(echo "$CREDS" | cut -f2)

# Check function logs (last 50 lines of the first file in the Host log directory)
curl -s -u "${SCM_USER}:${SCM_PASS}" \
  "https://<function-app-name>.scm.azurewebsites.net/api/vfs/LogFiles/Application/Functions/Host/" | \
  jq -r '.[0].href' | xargs curl -s -u "${SCM_USER}:${SCM_PASS}" | tail -50

Common Issues and Solutions
| Issue | Cause | Solution |
|---|---|---|
| TLS timeout on global URL | Traffic Manager returning unhealthy endpoint | Check endpoint status; verify ingress controller is running |
| 403 from storage | NAT Gateway IP not in firewall rules | Re-apply secondary Terraform to update primary storage firewall |
| Function timeout to AKS | Function IPs not in authorized ranges | Verify `azapi_update_resource.aks_authorized_ips_for_function` applied |
| Encryption key mismatch | Remote state not configured | Verify `primary_remote_state_config` in secondary tfvars |
| Alert keeps firing | Alert not cleared after failback | Close alert via API: `az rest --method post --uri ".../changeState?newState=Closed"` |
| Profile shows "Degraded" | Secondary in standby mode | Expected behavior; verify primary endpoint is "Online" |
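
For the encryption key mismatch row, the two hash commands from Storage Access can be folded into a single pass/fail check. The following is a minimal sketch, not part of this repo: the function names `sha256_stdin`, `key_hash`, and `compare_key_hashes` are hypothetical, and the context names in the usage comment are the ones that appear in the Storage Access commands.

```shell
# Hash stdin with whichever SHA-256 tool is installed and print only the digest.
sha256_stdin() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum | cut -d' ' -f1
  else
    shasum -a 256 | cut -d' ' -f1
  fi
}

# Print the SHA-256 of the decoded encryption key held by one cluster.
# $1 = kubectl context name. Fails if the secret is missing or empty, so that
# two missing secrets are never reported as a match.
key_hash() {
  local encoded
  encoded="$(kubectl get secret -n logging logscale-storage-encryption-key \
    --context "$1" \
    -o jsonpath='{.data.azure-storage-encryption-key}')" || return 1
  [ -n "$encoded" ] || return 1
  printf '%s' "$encoded" | base64 -d | sha256_stdin
}

# Compare the key hashes of two contexts.
# Prints MATCH (exit 0) or MISMATCH (exit 1); exits 2 if a key cannot be read.
compare_key_hashes() {
  local first second
  first="$(key_hash "$1")" || { echo "ERROR: cannot read key from $1" >&2; return 2; }
  second="$(key_hash "$2")" || { echo "ERROR: cannot read key from $2" >&2; return 2; }
  if [ "$first" = "$second" ]; then
    echo "MATCH"
  else
    echo "MISMATCH"
    return 1
  fi
}

# Usage (contexts as in the Storage Access commands):
#   compare_key_hashes aks-primary aks-secondary
```

Only the digests are compared and printed, so the key material itself never reaches the terminal or a log.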
Failover Simulation
This repo does not currently include a test/ DR simulation script. For a controlled end-to-end test of the Azure Monitor → Action Group → Function App chain, temporarily disable the primary Traffic Manager endpoint and verify that the standby cluster scales the humio-operator deployment to 1 replica.
For the full runbook procedure and rollback notes, see Testing the Alert Chain in the DR Operations Guide.
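
The manual simulation described above could be scripted roughly as follows. This is a sketch under assumptions, not a script shipped with this repo: the function names and the `TM_PROFILE`, `PRIMARY_ENDPOINT`, and `RG` variables are hypothetical, the endpoint is assumed to be of type `externalEndpoints` as in the Traffic Manager Health check, and the operator is assumed to be scaled through `.spec.replicas` of the `humio-operator` deployment in the `logging` namespace.

```shell
# Enable or disable the primary Traffic Manager endpoint.
# $1 = Enabled | Disabled
# Reads TM_PROFILE, PRIMARY_ENDPOINT and RG (hypothetical variable names).
set_primary_endpoint_status() {
  az network traffic-manager endpoint update \
    --profile-name "$TM_PROFILE" \
    --name "$PRIMARY_ENDPOINT" \
    --resource-group "$RG" \
    --type externalEndpoints \
    --endpoint-status "$1"
}

# Poll the humio-operator deployment until it has the wanted replica count.
# $1 = kubectl context, $2 = wanted replicas,
# $3 = number of attempts (default 30), $4 = seconds between attempts (default 10)
# Returns 0 when the count is reached, 1 when the attempts run out.
wait_for_operator_replicas() {
  local context wanted attempts delay current i
  context="$1"
  wanted="$2"
  attempts="${3:-30}"
  delay="${4:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    current="$(kubectl -n logging get deploy humio-operator \
      --context "$context" -o jsonpath='{.spec.replicas}' 2>/dev/null)" || current=""
    if [ "$current" = "$wanted" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (values are placeholders):
#   TM_PROFILE=<tm-profile> PRIMARY_ENDPOINT=<endpoint-name> RG=<rg>
#   set_primary_endpoint_status Disabled
#   wait_for_operator_replicas <secondary-context> 1 && echo "standby operator scaled up"
#   set_primary_endpoint_status Enabled   # restore the endpoint afterwards
```

Polling with a bounded number of attempts keeps the test from hanging if the alert chain never fires; the defaults allow about five minutes, which may need adjusting to the alert evaluation interval in use.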