Operations Guide

This Operations Guide is made up of the following sections:

Proposal

This section states the intent, audience, and boundaries of the Disaster Recovery (DR) runbook.

Executive Summary

This section provides an overview of how Disaster Recovery (DR) is structured and what the primary and standby roles do.

Architecture

This section provides an overview of the DR architecture.

Architecture Considerations

This section explains the building blocks behind Disaster Recovery (DR) and how DNS, certificates, and automation fit together.

Terraform Sequence

This section explains the Terraform sequence that must be followed for a successful deployment.

Cluster Access

This section explains how to reach both private OKE clusters safely and manage kubeconfig contexts.

Certificate Management

This section covers the TLS certificate strategy for the global DR hostname and why DNS-01 validation is typically required.

Dynamic Secondary IP Lookup via Remote State

This section explains how the primary and secondary clusters exchange their nginx-ingress LoadBalancer IPs through Terraform remote state.

DR Deployment

This section covers the complete DR deployment process, from prerequisites through the three stages of DR configuration, failover, and promotion.

DR Failover Timing

This section documents the expected time from primary failure detection to secondary cluster activation. Pre-failover validation runs for the duration, in seconds, given by dr_failover_function_pre_failover_failure_seconds (set it to 0 only for testing).
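
As a rough illustration of how a pre-failover validation window affects timing, the sketch below assumes the setting means "the primary must be observed failing continuously for this many seconds before failover starts". That interpretation, the function name, and the sample format are assumptions made for illustration; the actual failover function may behave differently.

```python
from typing import Iterable, Optional, Tuple


def failover_trigger_time(
    samples: Iterable[Tuple[float, bool]],
    required_failure_seconds: float,
) -> Optional[float]:
    """Return the timestamp at which failover would start, or None.

    samples: (timestamp_seconds, primary_is_healthy) pairs in time order.
    required_failure_seconds: stands in for
        dr_failover_function_pre_failover_failure_seconds (assumed meaning:
        length of continuous failure required before failover).
    """
    failing_since: Optional[float] = None
    for timestamp, healthy in samples:
        if healthy:
            # Any healthy observation resets the validation window.
            failing_since = None
            continue
        if failing_since is None:
            failing_since = timestamp
        if timestamp - failing_since >= required_failure_seconds:
            return timestamp
    return None


# Hypothetical health-check history: a short blip, then a sustained outage.
history = [
    (0, True), (10, False), (20, False), (30, True),
    (40, False), (50, False), (60, False), (70, False),
]
window_30 = failover_trigger_time(history, 30)
window_0 = failover_trigger_time(history, 0)
```

Under this assumed behavior, a window of 0 starts failover on the first failed check, including a transient blip, which is consistent with reserving that value for testing.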

Quick Reference

This section provides quick-reference information for DR operations.

Known Issues and Recommendations

This section lists known issues and recommended mitigations for DR operations.

Disaster Recovery Additional Resources

This section lists related documentation.