Troubleshooting
| Problem | Likely Cause | Resolution |
|---|---|---|
| Secondary cannot read primary's snapshots | Encryption key mismatch | Verify the remote state config on the secondary. Re-run `terraform apply` on the secondary to re-sync the key. |
| GCS access denied on secondary | Missing cross-region IAM bindings | Run `terraform apply -target=module.gke` on the secondary to re-apply IAM bindings. |
| Cloud Function not triggering | Failure threshold not reached | Check `dr_cloud_function_pre_failover_failure_seconds` (default 180s, so 3 minutes of consecutive failures are required). |
| GLB health check failing on a healthy cluster | NodePort service missing or misconfigured | Verify the NodePort service exists: `kubectl get svc -n log`. Re-apply the LogScale module if it is missing. |
| Promotion stuck after Phase 1 | Routing flag misconfigured | Ensure `dr_use_dedicated_routing = false` for Phase 1 and `true` for Phase 2. Apply each phase separately. |
| Old or missing data after failover | Bucket mapping incorrect | Verify `GCP_RECOVER_FROM_REPLACE_BUCKET` has the correct old-bucket/new-bucket mapping. |
| Missing repos after promotion | Recovery env vars not on pods | Verify `GCP_RECOVER_FROM_BUCKET` is set on the LogScale pods. If it is missing, ensure `gcp_recover_from_bucket` is set in tfvars and re-apply. |
| DNS not resolving to secondary after failover | DNS TTL propagation delay | Wait for TTL expiry. Use `dig +trace` to verify propagation. For faster cutover, set low TTLs (60s) before testing. |
| Node pools not scaling during promotion | Insufficient quota in secondary region | Check GCE quotas in the secondary region. Request increases before DR testing. |
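
For the bucket mapping row above, a quick sanity check of the mapping string can catch a swapped or malformed value before a failover test. The following is a minimal sketch, not part of the LogScale or Terraform tooling: the function name `check_bucket_mapping` and the bucket names are hypothetical, and it assumes the value has the form `old-bucket/new-bucket` as described in the table.

```shell
# Hypothetical helper: compares a mapping string against the expected buckets.
# Assumption: the mapping has the form "old-bucket/new-bucket".
check_bucket_mapping() {
  mapping=$1
  expected_old=$2
  expected_new=$3

  # Reject values that contain no slash at all.
  case $mapping in
    */*) ;;
    *) echo "malformed: expected old-bucket/new-bucket"; return 2 ;;
  esac

  old=${mapping%%/*}   # text before the first slash
  new=${mapping#*/}    # text after the first slash

  if [ "$old" = "$expected_old" ] && [ "$new" = "$expected_new" ]; then
    echo "ok"
  else
    echo "mismatch: got old=$old new=$new"
    return 1
  fi
}
```

For example, `check_bucket_mapping "$GCP_RECOVER_FROM_REPLACE_BUCKET" primary-bkt secondary-bkt` prints `ok` only when the old bucket comes first and the new bucket second; the bucket names here are placeholders for your own.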