Troubleshooting
| Problem | Likely Cause | Resolution |
|---|---|---|
| Secondary cannot read primary's snapshots | Encryption key mismatch | Verify the remote state config on the secondary. Re-run `terraform apply` on the secondary to re-sync the key. |
| GCS access denied on secondary | Missing cross-region IAM bindings | Run `terraform apply -target=module.gke` on the secondary to re-apply the IAM bindings. |
| Cloud Function not triggering | Failure threshold not reached | Check `dr_cloud_function_pre_failover_failure_seconds` (default 180 s, i.e. 3 minutes of consecutive failures are required). |
| GLB health check failing on a healthy cluster | NodePort service missing or misconfigured | Verify that the NodePort service exists: `kubectl get svc -n log`. Re-apply the LogScale module if it is missing. |
| Promotion stuck after Phase 1 | Routing flag misconfigured | Ensure `dr_use_dedicated_routing = false` for Phase 1 and `true` for Phase 2. Apply each phase separately. |
| Old or missing data after failover | Bucket mapping incorrect | Verify that `GCP_RECOVER_FROM_REPLACE_BUCKET` has the correct `old-bucket/new-bucket` mapping. |
| Missing repos after promotion | Recovery env vars not on pods | Verify that `GCP_RECOVER_FROM_BUCKET` is set on the LogScale pods. If it is missing, ensure `gcp_recover_from_bucket` is set in tfvars and re-apply. |
| DNS not resolving to secondary after failover | DNS TTL propagation delay | Wait for TTL expiry. Use `dig +trace` to verify propagation. For faster cutover, set low TTLs (60 s) before testing. |
| Node pools not scaling during promotion | Insufficient quota in secondary region | Check GCE quotas in the secondary region. Request increases before DR testing. |
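
Several of the read-only checks in the Resolution entries above can be collected into a small script. The following is a minimal sketch under stated assumptions, not part of the module: the function names (`mapping_is_valid`, `pod_env_names`, `run_checks`) are hypothetical, the namespace and hostname are supplied by the caller, and `mapping_is_valid` only checks that a value has the `old-bucket/new-bucket` shape. It cannot confirm that the bucket names themselves are the right ones.

```shell
#!/bin/sh
# Hypothetical helper functions for the checks in the troubleshooting table.
# Names are illustrative; nothing here modifies cluster or Terraform state.

# Return 0 if $1 looks like "old-bucket/new-bucket":
# exactly one slash, with a non-empty name on each side.
mapping_is_valid() {
  case "$1" in
    */*/*|/*|*/|"") return 1 ;;   # extra slash, empty side, or empty value
    */*)            return 0 ;;   # exactly one slash, both sides non-empty
    *)              return 1 ;;   # no slash at all
  esac
}

# Print "<pod>: <env var names>" for the first container of each pod.
# Only variables declared directly under `env` are listed; anything
# injected through `envFrom` will not appear (assumption to keep in mind).
pod_env_names() {
  kubectl get pods -n "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[0].env[*].name}{"\n"}{end}'
}

# Run the read-only checks. $1 = namespace, $2 = hostname to trace.
run_checks() {
  kubectl get svc -n "$1"                        # is the NodePort service present?
  pod_env_names "$1" | grep GCP_RECOVER_FROM     # are the recovery env vars on the pods?
  dig +trace "$2"                                # has DNS propagated?
}
```

A possible invocation, assuming the `log` namespace from the table and a placeholder hostname, is `run_checks log logscale.example.com`. To sanity-check a mapping value before applying it, call `mapping_is_valid "old-bucket/new-bucket" && echo ok`.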