# Troubleshooting

## Identify Current State

```shell
# Check Traffic Manager endpoint status
az network traffic-manager endpoint list \
  --profile-name <tm-profile> \
  --resource-group <rg> \
  -o table

# Check AKS cluster status
az aks show --resource-group <rg> --name <cluster-name> --query "provisioningState" -o tsv

# Check LogScale pods
kubectl get pods -n logging --context <context>
```

## DNS Resolution

```shell
# Resolve the global DR URL
dig <global-hostname>.<zone>
nslookup <global-hostname>.<zone>

# Trace the full resolution path
dig +trace <global-hostname>.<zone>
```
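To turn the raw answer into "which region is live", you can map the resolved address against the known ingress IPs. This is a sketch: the two IPs and the `resolve_region` helper are placeholders I'm introducing, not values from this repo.

```shell
# Assumed placeholder IPs -- substitute the real ingress public IPs per region.
PRIMARY_IP="203.0.113.10"
SECONDARY_IP="203.0.113.20"
GLOBAL_FQDN="<global-hostname>.<zone>"

resolve_region() {
  # Map a resolved IP to the region it belongs to.
  case "$1" in
    "$PRIMARY_IP")   echo "primary" ;;
    "$SECONDARY_IP") echo "secondary" ;;
    *)               echo "unknown" ;;
  esac
}

ANSWER="$(dig +short "$GLOBAL_FQDN" A 2>/dev/null | head -n1 || true)"
echo "Traffic Manager is steering to: $(resolve_region "$ANSWER")"
```

An `unknown` result usually means DNS is still serving a cached answer from before a failover; re-check after the Traffic Manager TTL expires.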

## External Connectivity

```shell
IP="$(dig +short <global-hostname>.<zone> A | head -n1)"

# Test TCP connectivity
nc -zv "$IP" 443

# Test HTTPS
curl -vk --connect-timeout 8 "https://<global-hostname>.<zone>/"
```
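When the verbose curl output is noisy, reducing the probe to a bare status code makes it easier to distinguish a connection failure from an application-level error. The `http_status` helper below is my own illustrative wrapper, not part of the repo:

```shell
http_status() {
  # Print the HTTP status code; curl emits "000" when TCP/TLS never completes.
  curl -sk -o /dev/null -w '%{http_code}' --connect-timeout 8 "$1" 2>/dev/null || true
}

# "000" -> suspect Traffic Manager or the ingress; 4xx/5xx -> the backend answered.
http_status "https://<global-hostname>.<zone>/"
```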

## Traffic Manager Health

```shell
# Check endpoint status
az network traffic-manager endpoint show \
  --profile-name <tm-profile> \
  --name <endpoint-name> \
  --resource-group <rg> \
  --type externalEndpoints \
  --query endpointMonitorStatus
```
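Each monitor status value points at a different follow-up. A small dispatcher (illustrative; the helper name and messages are mine) keeps the triage consistent:

```shell
interpret_status() {
  # Suggested next step for each Traffic Manager endpointMonitorStatus value.
  case "$1" in
    Online)           echo "healthy -- no action" ;;
    Degraded)         echo "health probes failing -- check ingress pods and probe path" ;;
    Disabled)         echo "endpoint disabled -- expected for the standby region" ;;
    CheckingEndpoint) echo "probe in progress -- wait and re-check" ;;
    *)                echo "unexpected status: $1" ;;
  esac
}

STATUS="$(az network traffic-manager endpoint show \
  --profile-name "<tm-profile>" \
  --name "<endpoint-name>" \
  --resource-group "<rg>" \
  --type externalEndpoints \
  --query endpointMonitorStatus -o tsv 2>/dev/null || true)"
interpret_status "$STATUS"
```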

## Kubernetes Components

```shell
# Check the ingress controller
kubectl -n logging-ingress get pods
kubectl -n logging-ingress get svc -o wide

# Check LogScale pods
kubectl -n logging get pods
kubectl -n logging get humiocluster -o yaml

# Check humio-operator
kubectl -n logging get deploy humio-operator
```

## Storage Access

```shell
# Verify the storage firewall allows the secondary NAT IP
az storage account show \
  --name <primary-storage-account> \
  --resource-group <primary-rg> \
  --query "networkRuleSet.ipRules"

# Check the encryption key secret exists
kubectl get secret logscale-storage-encryption-key -n logging

# Compare encryption key hashes across clusters
kubectl get secret -n logging logscale-storage-encryption-key \
  --context aks-primary \
  -o jsonpath='{.data.azure-storage-encryption-key}' | base64 -d | shasum -a 256

kubectl get secret -n logging logscale-storage-encryption-key \
  --context aks-secondary \
  -o jsonpath='{.data.azure-storage-encryption-key}' | base64 -d | shasum -a 256
```
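The two hash commands can be folded into a single comparison so a mismatch is unambiguous and an empty key is never mistaken for a match. `key_hash` and `compare_hashes` are helper names introduced here for illustration, not part of the repo:

```shell
key_hash() {
  # Hash the key from one cluster context without printing the key material.
  raw="$(kubectl get secret -n logging logscale-storage-encryption-key \
    --context "$1" \
    -o jsonpath='{.data.azure-storage-encryption-key}' 2>/dev/null || true)"
  [ -n "$raw" ] || return 0   # missing secret -> print nothing
  printf '%s' "$raw" | base64 -d | shasum -a 256 | cut -d' ' -f1
}

compare_hashes() {
  # MATCH only when both hashes are non-empty and identical.
  if [ -n "$1" ] && [ "$1" = "$2" ]; then echo "MATCH"; else echo "MISMATCH"; fi
}

compare_hashes "$(key_hash aks-primary)" "$(key_hash aks-secondary)"
```

A `MISMATCH` here points at the "Encryption key mismatch" row in the table below: the secondary was likely applied without the primary's remote state.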

## Alert Status

```shell
# Get the subscription ID
SUB_ID=$(az account show --query id -o tsv)

# Check fired alerts
az rest --method get \
  --uri "https://management.azure.com/subscriptions/${SUB_ID}/providers/Microsoft.AlertsManagement/alerts?api-version=2019-05-05-preview" \
  -o json | jq '.value[] | select(.properties.essentials.monitorCondition == "Fired")'
```

## Function App Logs

```shell
# Get deployment credentials
CREDS=$(az functionapp deployment list-publishing-profiles \
  --name <function-app-name> \
  --resource-group <rg-name> \
  --query "[?publishMethod=='MSDeploy'].{user:userName,pass:userPWD}" -o tsv)
USER=$(echo "$CREDS" | cut -f1)
PASS=$(echo "$CREDS" | cut -f2)

# Fetch the most recently modified host log via the Kudu VFS API
curl -s -u "${USER}:${PASS}" \
  "https://<function-app-name>.scm.azurewebsites.net/api/vfs/LogFiles/Application/Functions/Host/" | \
  jq -r 'max_by(.mtime).href' | xargs curl -s -u "${USER}:${PASS}" | tail -50
```
## Common Issues and Solutions

| Issue | Cause | Solution |
|---|---|---|
| TLS timeout on global URL | Traffic Manager returning unhealthy endpoint | Check endpoint status; verify ingress controller is running |
| 403 from storage | NAT Gateway IP not in firewall rules | Re-apply secondary Terraform to update primary storage firewall |
| Function timeout to AKS | Function IPs not in authorized ranges | Verify `azapi_update_resource.aks_authorized_ips_for_function` applied |
| Encryption key mismatch | Remote state not configured | Verify `primary_remote_state_config` in secondary tfvars |
| Alert keeps firing | Alert not cleared after failback | Close alert via API: `az rest --method post --uri ".../changeState?newState=Closed"` |
| Profile shows "Degraded" | Secondary in standby mode | Expected behavior; verify primary endpoint is "Online" |
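The truncated `changeState` call in the table expands roughly as follows. This is a sketch: the alert GUID comes from the fired-alerts query in Alert Status, and the api-version is an assumption carried over from that same query.

```shell
SUB_ID="$(az account show --query id -o tsv 2>/dev/null || echo "<subscription-id>")"
ALERT_ID="<alert-id>"   # the "name" GUID of the fired alert from the list query

# Assumed Alerts Management changeState URI shape; newState is a query parameter.
URI="https://management.azure.com/subscriptions/${SUB_ID}/providers/Microsoft.AlertsManagement/alerts/${ALERT_ID}/changeState?api-version=2019-05-05-preview&newState=Closed"

az rest --method post --uri "$URI" || echo "close-alert call failed (check az login)"
```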
## Failover Simulation

This repo does not currently include a `test/` DR simulation script. For a controlled end-to-end test of the Azure Monitor → Action Group → Function App chain, temporarily disable the primary Traffic Manager endpoint and verify that the standby cluster scales `humio-operator` to 1 replica.
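A minimal sketch of that test, assuming placeholder names throughout (the endpoint name and the `aks-secondary` context are illustrative; the `|| true` guards only let the sketch degrade gracefully when run without cloud access):

```shell
PROFILE="<tm-profile>"
RG="<rg>"
ENDPOINT="<primary-endpoint>"

# 1. Disable the primary endpoint to force Traffic Manager onto the secondary.
az network traffic-manager endpoint update \
  --profile-name "$PROFILE" --resource-group "$RG" \
  --name "$ENDPOINT" --type externalEndpoints \
  --endpoint-status Disabled || true

# 2. Once the alert chain fires, confirm the standby operator scaled up (expect 1).
kubectl --context aks-secondary -n logging get deploy humio-operator \
  -o jsonpath='{.spec.replicas}' || true

# 3. Roll back: re-enable the primary endpoint.
az network traffic-manager endpoint update \
  --profile-name "$PROFILE" --resource-group "$RG" \
  --name "$ENDPOINT" --type externalEndpoints \
  --endpoint-status Enabled || true
```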

For the full runbook procedure and rollback notes, see *Testing the Alert Chain* in the DR Operations Guide.