DNS Architecture — FQDN Locking Details
Automatic Failback Prevention
Design goal: After failover, traffic must never automatically return to the primary. A recovering primary may pass health probes but have inconsistent data or missing ingest.
Two independent layers prevent automatic failback:
Lambda locks the primary health check FQDN: during failover, the Lambda swaps the primary Route53 health check FQDN to failover-locked.invalid (the .invalid TLD is reserved by RFC 2606 and never resolves). From then on the health check fails with NXDOMAIN regardless of the primary cluster's actual state, until an operator restores the original FQDN. DNS remains on the secondary.
terraform apply on the primary cannot undo the lock during DR: the primary health check FQDN is defined in Terraform, but `lifecycle { ignore_changes = [fqdn, invert_healthcheck] }` prevents Terraform from reverting the Lambda's runtime changes. Running terraform apply on the primary workspace will not restore the original FQDN.
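The lock step in the first layer could look roughly like the sketch below. It is not the actual Lambda: the function name and the `health_check_id` parameter are hypothetical, and `route53` is assumed to behave like `boto3.client("route53")`, whose `get_health_check` and `update_health_check` calls are the only API used.

```python
LOCKED_FQDN = "failover-locked.invalid"  # value taken from the text above


def lock_primary_health_check(route53, health_check_id):
    """Point the primary health check at a name that never resolves.

    `route53` is assumed to behave like boto3.client("route53").
    `health_check_id` is a hypothetical parameter supplied by the caller.
    Returns True if the FQDN was changed, False if it was already locked.
    """
    current = route53.get_health_check(HealthCheckId=health_check_id)
    config = current["HealthCheck"]["HealthCheckConfig"]
    if config.get("FullyQualifiedDomainName") == LOCKED_FQDN:
        return False  # already locked; nothing to do

    # Swap the FQDN so every probe fails at DNS resolution.
    route53.update_health_check(
        HealthCheckId=health_check_id,
        FullyQualifiedDomainName=LOCKED_FQDN,
    )
    return True
```

Checking the current FQDN first keeps the operation idempotent, so a Lambda that is retried or invoked twice during the same failover makes only one update call.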
Why not health check inversion? An earlier approach set
Inverted=true on the primary health check. However, inversion
is bidirectional: when the primary is down (health check returns failure),
inversion flips the result to "healthy", which keeps Route53 routing to the
broken primary. The FQDN swap avoids this: the health check targets a name
that never resolves, so it produces a consistent failure signal regardless
of primary state.
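The difference between the two approaches can be shown with a small truth-table model. This is a simplified illustration rather than Route53's real evaluation logic; the function and parameter names are invented for the example.

```python
def route53_sees_healthy(primary_up, inverted=False, fqdn_locked=False):
    """Return True if Route53 would treat the primary health check as healthy.

    Simplified model: a probe succeeds only if the primary is up and the
    health check FQDN still points at it.
    """
    # A locked FQDN never resolves, so the probe fails whatever the primary does.
    probe_ok = primary_up and not fqdn_locked
    # Inversion flips the probe result in both directions.
    return (not probe_ok) if inverted else probe_ok


for primary_up in (True, False):
    print(
        f"primary_up={primary_up}: "
        f"inverted -> {route53_sees_healthy(primary_up, inverted=True)}, "
        f"locked -> {route53_sees_healthy(primary_up, fqdn_locked=True)}"
    )
```

In this model the inverted check reports a down primary as healthy, while the locked check reports unhealthy in both cases, which is the property the design goal requires.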
Health Check FQDN State Transitions
| Health Check State | FQDN | Route53 sees... | DNS routes to... |
|---|---|---|---|
| Normal operation | `<primary-hostname>.<zone>` | Primary healthy | Primary (via PRIMARY record) |
| Primary down | `<primary-hostname>.<zone>` | Primary unhealthy | Secondary (via SECONDARY record) |
| Primary recovered, but FQDN locked | `failover-locked.invalid` | Primary unhealthy (NXDOMAIN) | Secondary (failback prevented) |
| Operator restores FQDN | `<primary-hostname>.<zone>` | Primary healthy | Primary (failback complete) |
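
The final transition, where an operator restores the FQDN, might be scripted along these lines. This is a sketch under assumptions, not the documented runbook: the function name and its parameters are hypothetical, and `route53` is again assumed to behave like `boto3.client("route53")`.

```python
LOCKED_FQDN = "failover-locked.invalid"  # value taken from the table above


def restore_primary_fqdn(route53, health_check_id, original_fqdn):
    """Manually unlock: put the original FQDN back on the primary health check.

    `original_fqdn` is the real primary hostname, supplied by the operator.
    Returns True if the FQDN was restored, False if the check was not locked.
    """
    if original_fqdn == LOCKED_FQDN:
        raise ValueError("original_fqdn must be the real primary hostname")

    current = route53.get_health_check(HealthCheckId=health_check_id)
    config = current["HealthCheck"]["HealthCheckConfig"]
    if config.get("FullyQualifiedDomainName") != LOCKED_FQDN:
        return False  # not locked; leave the health check untouched

    route53.update_health_check(
        HealthCheckId=health_check_id,
        FullyQualifiedDomainName=original_fqdn,
    )
    return True
```

Once the FQDN is restored, Route53 can see the primary as healthy again, so this should only be run after the primary's data and ingest have been verified.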