DNS Architecture — FQDN Locking Details

Automatic Failback Prevention

Design goal: After failover, traffic must never return to the primary automatically. A recovering primary may pass health probes while still holding inconsistent data or missing ingested data.

Two independent layers prevent automatic failback:

  1. Lambda locks the primary health check FQDN. During failover, the Lambda changes the FQDN on the primary's Route53 health check to failover-locked.invalid. The .invalid TLD is reserved by RFC 2606 and never resolves, so the health check fails with NXDOMAIN regardless of the primary cluster's actual state, and it keeps failing until an operator restores the original FQDN. DNS remains on the secondary.

  2. terraform apply on the primary workspace does not undo the lock during DR. The primary health check is defined in Terraform, but lifecycle { ignore_changes = [fqdn, invert_healthcheck] } tells Terraform to ignore drift in those two attributes. Running terraform apply on the primary workspace therefore leaves the Lambda's runtime change in place and does not restore the original FQDN.
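
The locking step in layer 1 can be sketched as follows. This is a minimal illustration, not the actual Lambda: the function name lock_primary_health_check and its parameters are hypothetical. It uses the boto3 Route 53 update_health_check call and assumes the health check is defined by domain name only. If the health check also had an IP address configured, Route 53 would probe that address and the FQDN swap alone would not force a failure.

```python
LOCK_FQDN = "failover-locked.invalid"  # a name under the reserved .invalid TLD


def lock_primary_health_check(route53, health_check_id, lock_fqdn=LOCK_FQDN):
    """Point the primary health check at a name that cannot resolve.

    `route53` is any object exposing boto3's Route 53 `update_health_check`
    call, for example `boto3.client("route53")`. Both the function name and
    its parameters are illustrative assumptions.
    """
    response = route53.update_health_check(
        HealthCheckId=health_check_id,
        FullyQualifiedDomainName=lock_fqdn,
    )
    # Return the FQDN that Route 53 reports back so the caller can log it.
    return response["HealthCheck"]["HealthCheckConfig"]["FullyQualifiedDomainName"]
```

In a Lambda handler this might be called as lock_primary_health_check(boto3.client("route53"), health_check_id). Passing the client in as an argument keeps the function easy to exercise with a stub.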

Why not health check inversion? An earlier approach set Inverted=true on the primary health check. Inversion flips the result in both directions, though. While the primary is up, the inverted check reports "unhealthy", which is the desired lock. When the primary goes down and the probe fails, inversion flips that failure to "healthy", and Route53 routes traffic back to the broken primary. The FQDN swap avoids this: the health check always targets a name that cannot resolve, so it produces the same failure signal regardless of primary state.
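
The difference between the two approaches can be shown with a small truth table. This is a simplified model for illustration only. The functions reported_healthy and probe_result are hypothetical names and stand in for Route53's behaviour; they are not part of any AWS API.

```python
def reported_healthy(probe_ok, inverted=False):
    """Status Route53 acts on: the raw probe result, flipped if inverted."""
    return probe_ok != inverted


def probe_result(primary_up, fqdn_locked=False):
    """A locked FQDN cannot resolve, so the probe fails whatever the primary does."""
    return primary_up and not fqdn_locked


for primary_up in (True, False):
    inverted = reported_healthy(probe_result(primary_up), inverted=True)
    locked = reported_healthy(probe_result(primary_up, fqdn_locked=True))
    print(f"primary_up={primary_up}: inverted={inverted}, locked={locked}")

# Prints:
# primary_up=True: inverted=False, locked=False
# primary_up=False: inverted=True, locked=False
```

With inversion, a primary that is down is reported as healthy. With the FQDN lock, the primary is reported as unhealthy in both cases.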

Health Check FQDN State Transitions

Health Check State                 | FQDN                      | Route53 sees...              | DNS routes to...
-----------------------------------+---------------------------+------------------------------+---------------------------------
Normal operation                   | <primary-hostname>.<zone> | Primary healthy              | Primary (via PRIMARY record)
Primary down                       | <primary-hostname>.<zone> | Primary unhealthy            | Secondary (via SECONDARY record)
Primary recovered, but FQDN locked | failover-locked.invalid   | Primary unhealthy (NXDOMAIN) | Secondary (failback prevented)
Operator restores FQDN             | <primary-hostname>.<zone> | Primary healthy              | Primary (failback complete)
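
The final transition, where an operator restores the FQDN, might look like the sketch below. The source does not describe the restore procedure, so everything here is an assumption: the function name restore_primary_health_check, its parameters, and the choice to guard on the current FQDN are all hypothetical. The boto3 calls get_health_check and update_health_check and the HealthCheckVersion field are part of the Route 53 API.

```python
LOCK_FQDN = "failover-locked.invalid"


def restore_primary_health_check(route53, health_check_id, original_fqdn):
    """Manually point a locked health check back at the primary's FQDN.

    Returns True if the FQDN was restored, False if the health check was
    not locked. `route53` is expected to behave like boto3.client("route53").
    """
    current = route53.get_health_check(HealthCheckId=health_check_id)["HealthCheck"]
    current_fqdn = current["HealthCheckConfig"].get("FullyQualifiedDomainName")
    if current_fqdn != LOCK_FQDN:
        # Not locked (or already restored): do nothing rather than overwrite.
        return False
    route53.update_health_check(
        HealthCheckId=health_check_id,
        # Passing the version just read makes Route 53 reject the update if
        # the health check was changed by someone else in the meantime.
        HealthCheckVersion=current["HealthCheckVersion"],
        FullyQualifiedDomainName=original_fqdn,
    )
    return True
```

Keeping the restore as an explicit operator action, rather than something a health probe can trigger, matches the design goal that failback is never automatic.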