Architecture Considerations
This section of the documentation explains the building blocks behind Disaster Recovery (DR) so that you understand how DNS, certificates, and automation fit together.
DR-Specific Modules Overview
This deployment uses three specialized modules to enable automated DR failover. Each module serves a distinct purpose in the failover chain:
module.global-dns

Manages OCI DNS resources for traffic steering between primary and secondary clusters.

Purpose: Provide a single global FQDN (`logscale-dr.oci-dr.humio.net`) that automatically routes to the healthy cluster

Deployed on: Primary cluster only (`manage_global_dns = true`)

Key resources created:

- OCI DNS Zone (optional, can use existing)
- DNS Steering Policy with failover rules
- Steering Policy Attachment linking the policy to the zone
- (When `use_external_health_check = true`) One HTTPS health check monitor (OCI steering policies allow only one monitor). Probes `/api/v1/status` using the global FQDN as Host/SNI. Targets include the primary LB IP and, when `secondary_ingest_lb_ip` is known at apply time, the secondary LB IP.
- (When `use_external_health_check = true`) Optional secondary TCP monitor (created only when `dr = "active"` and `secondary_ingest_lb_ip` is known); not used by the steering policy
How it works:

- With `use_external_health_check = true`: OCI Health Check monitors are created for observability only (OCI console dashboards, DR function pre-validation). They do not influence DNS routing. The steering policy always uses FILTER → PRIORITY → LIMIT (no HEALTH rule).
- With `use_external_health_check = false` (recommended): the steering policy has no attached monitor; the standby DR failover function flips the steering policy answers' `is_disabled` flag so the FILTER rule removes disabled answers.
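The `is_disabled` flip reduces to a small piece of pure logic. A minimal sketch (field names mirror OCI steering-policy answers; the real DR function performs the equivalent update through the OCI DNS API, and the function name here is illustrative):

```python
def set_active_pool(answers, active_pool):
    """Disable every steering-policy answer outside `active_pool`.

    Each answer is a dict with at least `pool` and `is_disabled`,
    mirroring OCI DNS steering-policy answer fields. The FILTER rule
    then drops disabled answers, so only the active pool's LB IP is
    returned for the global FQDN.
    """
    for answer in answers:
        answer["is_disabled"] = answer["pool"] != active_pool
    return answers


answers = [
    {"name": "primary", "pool": "primary", "rdata": "203.0.113.10", "is_disabled": False},
    {"name": "secondary", "pool": "secondary", "rdata": "203.0.113.20", "is_disabled": True},
]
set_active_pool(answers, "secondary")  # failover: primary disabled, secondary enabled
```

Because no HEALTH rule is attached in this mode, this flag is the single source of truth for which cluster receives traffic.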
external-dns (cluster-local DNS)

This repo can optionally deploy external-dns (OCI provider) via `module.pre-install` when `external_dns_enabled = true`.

DR-safe behavior (when `dr != ""`):

- The global DR FQDN is owned by `module.global-dns` (OCI steering policy). external-dns must not manage it.
- external-dns is configured with `source=service` (not Ingress), and only Services with the `external-dns.alpha.kubernetes.io/hostname` annotation are published.
- The `nginx-ingress` controller Service is annotated with a per-cluster hostname:
  - Primary (`dr = "active"`): `${primary_logscale_hostname}.${dns_zone_name}`
  - Secondary (`dr = "standby"`): `${secondary_logscale_hostname}.${dns_zone_name}`

This provides stable, direct per-cluster endpoints for validation and debugging while keeping global failover controlled by OCI DNS Traffic Management.
module.cert-manager-oci-webhook

Enables DNS-01 certificate validation using OCI DNS, using native Terraform Kubernetes resources (no external Helm chart dependency).

Note: DNS-01 is recommended when HTTP-01 is likely to fail, for example:

- Firewall rules (`public_lb_cidrs`) restrict port 80, so Let's Encrypt cannot reach the HTTP-01 challenge, and/or:
- A DR standby cluster needs the global FQDN certificate issued before failover (the standby typically doesn't receive traffic for the global FQDN until failover)

Purpose: Issue Let's Encrypt certificates without requiring HTTP traffic to reach the ingress

Deployed on: Any workspace with `cert_dns01_provider = "oci"` and `cert_dns01_webhook_enabled = true` when DNS-01 is needed

Key resources created:

- OCI credentials secret (`kubernetes_secret_v1` with write-only data)
- Webhook RBAC resources:
  - `kubernetes_service_account.webhook` - ServiceAccount for the webhook pod
  - `kubernetes_role_binding.webhook_auth_reader` - Read the extension-apiserver-authentication ConfigMap (kube-system)
  - `kubernetes_cluster_role_binding.webhook_auth_delegator` - Delegate auth to the core apiserver
  - `kubernetes_cluster_role.domain_solver` - Grant cert-manager permission to use the webhook
  - `kubernetes_cluster_role_binding.domain_solver` - Bind domain-solver to the cert-manager ServiceAccount
  - `kubernetes_role.secret_reader` - Read OCI credential secrets
  - `kubernetes_role_binding.secret_reader` - Bind secret-reader to the webhook ServiceAccount
- Webhook TLS PKI (Issuer/Certificate) + APIService registration
- Webhook Service + Deployment
- DNS-01 ClusterIssuer (`letsencrypt-cluster-issuer`)

Why DNS-01: HTTP-01 validation requires Let's Encrypt to reach port 80. When firewall rules (`public_lb_cidrs`) restrict access, DNS-01 validates via TXT records in OCI DNS instead.
How the Webhook Works (Step-by-Step)
The webhook enables DNS-01 ACME challenges by creating temporary TXT records in OCI DNS. Here's the complete flow:
| Step | Component | Action |
|---|---|---|
| 1 | Ingress | Annotation triggers cert-manager to request a certificate |
| 2 | cert-manager | Creates ACME Order with DNS-01 challenge type |
| 3 | ClusterIssuer | Routes challenge to OCI DNS webhook via APIService |
| 4 | Webhook | Reads OCI API credentials from oci-dns-credentials secret |
| 5 | Webhook | Creates _acme-challenge.{domain} TXT record in OCI DNS |
| 6 | Let's Encrypt | Queries DNS, finds the token, validates domain ownership |
| 7 | cert-manager | Stores issued certificate in Kubernetes Secret |
| 8 | Webhook | Removes the challenge TXT record (cleanup) |
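The record name in step 5 follows the fixed ACME DNS-01 convention (RFC 8555): the token always lives at `_acme-challenge.` prefixed to the domain being validated. A tiny illustrative helper:

```python
def acme_challenge_record(domain: str) -> str:
    """Return the TXT record name where ACME DNS-01 expects the token.

    Trailing dots are stripped so fully-qualified input is accepted.
    """
    return f"_acme-challenge.{domain.rstrip('.')}"


record = acme_challenge_record("logscale-dr.oci-dr.humio.net")
# record == "_acme-challenge.logscale-dr.oci-dr.humio.net"
```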
HTTP-01 vs DNS-01 Comparison:
| Aspect | HTTP-01 | DNS-01 (Webhook) |
|---|---|---|
| Firewall rules | Must allow Let's Encrypt IPs to port 80 | No inbound access needed |
| Load balancer access | Required | Not required |
| Wildcard certs | Not supported | Supported |
| Pre-failover cert issuance | May fail during DNS switch | Works anytime |
| Complexity | Simple | Requires webhook + OCI credentials |
DR Failover Function

Automates scaling of the Humio operator when the primary cluster health check fails.

Purpose: Automatically start LogScale on the standby cluster when the primary becomes unhealthy

Deployed on: Standby cluster only (`dr = "standby"` and `dr_failover_function_enabled = true`)

Key resources created:

- OCI Function Application and Function (Python container)
- Health Check monitor for the primary cluster (if `create_primary_health_check_monitor = true`)
- OCI Monitoring Alarm triggered by health check failures
- ONS Notification Topic connecting the alarm to the function
- IAM policies for the function to access the OKE cluster

Failover chain: Health Check fails → Alarm fires → ONS notifies → Function invoked → Function cleans up the stale TLS secret → Function scales humio-operator from 0 → 1 → Function waits for LogScale pods to become ready → Function updates the DNS steering policy → Operator reconciles the HumioCluster → LogScale pod starts and recovers from the primary bucket
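The function's portion of that chain can be sketched as straight-line Python against stubbed clients. This is a sketch only: the method names (`delete_secret`, `scale_deployment`, `wait_for_ready_pods`, `set_active_pool`) and the secret name are illustrative, not the repo's actual helpers, and the real function authenticates via OCI Resource Principal.

```python
def run_failover(k8s, dns, ready_count=1, ready_timeout=300):
    """Sketch of the failover sequence executed by the OCI Function.

    `k8s` and `dns` stand in for authenticated Kubernetes / OCI DNS
    clients. Defaults mirror dr_failover_function_pod_ready_count and
    dr_failover_function_pod_ready_timeout.
    """
    k8s.delete_secret("logscale-cluster-tls")           # clean up stale TLS secret (name illustrative)
    k8s.scale_deployment("humio-operator", replicas=1)  # 0 -> 1; operator reconciles nodeCount=1
    if not k8s.wait_for_ready_pods("logscale", ready_count, ready_timeout):
        raise TimeoutError("LogScale pods not ready before timeout")
    dns.set_active_pool("secondary")                    # flip steering-policy answers last
```

Note the ordering: DNS is only flipped after pods are ready, so clients are never steered to a cluster that cannot serve them.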
Encryption Key Synchronization

- The primary generates the key on first deploy and exports it as a sensitive Terraform output
- The secondary reads the key via `data.terraform_remote_state` and creates a Kubernetes secret with the same value
- `S3_RECOVER_FROM_*` environment variables are set on the standby cluster as soon as it is provisioned, but they are only consumed once the single LogScale pod is started during the DR promotion procedure
Global DNS and OCI Function-based Failover Scaler

- A global DNS name is managed in OCI DNS with failover records pointing to the primary and secondary load balancer IPs
- On the standby OCI cluster (`dr = "standby"`), an event-driven chain (Health Check → Monitoring Alarm → ONS Topic → OCI Function) scales the Humio operator from 0 → 1 so it can reconcile the already-declared `nodeCount = 1` and start the single LogScale pod
- The OCI Function does not change `spec.nodeCount`, and it does not scale back down automatically
- Terraform only deploys this OCI Function when `dr = "standby"` for that workspace. When you promote the secondary to `dr = "active"` and re-apply, the Function resources are removed automatically
Function Configuration (What You Tune From tfvars)
The failover automation runs on the standby workspace only (`dr = "standby"`). These root variables are the practical settings you tune:

| tfvars key | Default | Description |
|---|---|---|
| `dr_failover_function_absent_detection_period` | `"2m"` | Absent-metrics window for the alarm query. |
| `dr_failover_function_alarm_pending_duration` | `"PT1M"` | Alarm pending duration (OCI minimum is 1 minute). |
| `dr_failover_function_alarm_repeat_notification_duration` | `"PT10M"` | Alarm re-notification interval while firing. |
| `dr_failover_function_create_primary_health_check_monitor` | `true` | Create the primary monitor in the standby region (only used when `use_external_health_check = true`). |
| `dr_failover_function_log_retention_days` | `14` | Set to 30/60/90/120/150/180 for standby (OCI API requirement). |
| `dr_failover_function_pod_ready_count` | `1` | Minimum number of LogScale pods that must be ready before DNS is updated. |
| `dr_failover_function_pod_ready_timeout` | `300` | Maximum seconds to wait for LogScale pods to become ready after scaling the operator. |
| `dr_failover_function_pre_failover_failure_seconds` | `180` | Minimum consecutive seconds the primary must be failing before scaling the operator. Use 0 for testing only. |
| `dr_failover_function_primary_health_check_interval_seconds` | `60` | Primary monitor probe interval (shorter = faster detection). |
| `dr_failover_function_skip_secondary_health_check` | `false` | Skip secondary health check gating for simulations. |
| `dr_failover_function_use_lb_health_metrics` | `true` | DR failover alarm mode. When `true` (recommended), uses OCI Monitoring Classic LB backend health metrics. |
| `use_external_health_check` | `false` | DNS steering policy mode. When `false` (recommended), the steering policy does not attach an OCI Health Checks monitor and failover is controlled by the DR function toggling `is_disabled` on steering policy answers. |
| Mode | Variable Setting | Alarm Namespace | How it Works |
|---|---|---|---|
| LB Backend Health (Recommended) | `dr_failover_function_use_lb_health_metrics = true` | `oci_lbaas` | Monitors unhealthy backend count from within OCI (not impacted by `public_lb_cidrs`). Uses the `unhealthyBackendServers` metric. |
| External Health Checks | `dr_failover_function_use_lb_health_metrics = false` | `oci_healthchecks` | Uses external vantage points (AWS, Azure, GCP). May be blocked by `public_lb_cidrs`. |
Why LB Backend Health is Recommended:

When `public_lb_cidrs` restricts load balancer access to specific IP ranges (a security best practice), external health check vantage points are blocked. This causes:

- Health checks to always report "unhealthy"
- The DR alarm to fire continuously (a false positive)

The steering policy does not use a HEALTH rule. DNS routing is controlled exclusively by the DR failover function via the `is_disabled` flag on steering policy answers.
With LB backend health metrics (`dr_failover_function_use_lb_health_metrics = true`):

- Health monitoring runs from within OCI infrastructure (LB to backends)
- Not affected by `public_lb_cidrs` security list restrictions
- Accurately reflects actual backend health status
Internal defaults worth knowing (not exposed in tfvars by default):

- The cooldown is 300s by default and is persisted by the function as an annotation on the humio-operator Deployment (`logscale.dr/last-failover-epoch` by default), so it survives cold starts.
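A minimal sketch of that cooldown check (the annotation key and 300s default come from the text above; reading the epoch from the Deployment annotation rather than process memory is what lets the check survive Fn cold starts):

```python
import time

COOLDOWN_SECONDS = 300  # internal default noted above
ANNOTATION = "logscale.dr/last-failover-epoch"


def in_cooldown(annotations, now=None):
    """True while the last recorded failover is within the cooldown window.

    `annotations` is the humio-operator Deployment's annotation dict;
    a missing annotation means no prior failover, so never in cooldown.
    """
    now = time.time() if now is None else now
    last = float(annotations.get(ANNOTATION, 0))
    return (now - last) < COOLDOWN_SECONDS
```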
Example HCL (testing-only):

```hcl
dr_failover_function_primary_health_check_interval_seconds = 10      # faster probing
dr_failover_function_absent_detection_period               = "1m"
dr_failover_function_alarm_pending_duration                = "PT1M"  # OCI minimum
dr_failover_function_pre_failover_failure_seconds          = 0       # skip validation
dr_failover_function_alarm_repeat_notification_duration    = "PT5M"
dr_failover_function_log_retention_days                    = 30
```

OCIR Image Build Configuration
The DR failover OCI Function requires a Docker image in OCI Container Registry (OCIR). Terraform fully automates the image build and push process.
What the Function Does

The container packages a Python application that executes automated DR failover logic. When invoked via the Health Check → Alarm → ONS → Function chain, it:
1. Authenticates to OKE using OCI Resource Principal credentials
2. Validates failover conditions - confirms the primary is truly unhealthy, not just a transient blip
3. Enforces cooldown periods - prevents flapping between clusters
4. Cleans up the stale TLS secret - deletes the HumioCluster TLS secret to prevent a CA certificate mismatch when the operator scales up (cert-manager recreates it with the correct CA)
5. Scales humio-operator from 0 → 1 replicas via a Kubernetes API PATCH
6. Waits for LogScale pods to become ready (configurable timeout, default 300s)
7. Updates the DNS steering policy - discovers the secondary LB IP and updates steering policy answers

The operator then reconciles and starts the LogScale pod for DR recovery.
Container dependencies: OCI SDK, HTTP client with retry logic, YAML parsing for kubeconfig handling
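Step 6 (the readiness wait) reduces to a bounded poll loop. A sketch with an injected status query and sleep function for testability, where `required` and `timeout` correspond to `dr_failover_function_pod_ready_count` and `dr_failover_function_pod_ready_timeout` (`get_ready_count` is a hypothetical stand-in for a Kubernetes pod-status query):

```python
import time


def wait_for_ready(get_ready_count, required=1, timeout=300, interval=10, sleep=time.sleep):
    """Poll until at least `required` LogScale pods report Ready, or give up.

    Returns True on success, False once `timeout` seconds elapse; the
    caller aborts the failover (no DNS update) on False.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_ready_count() >= required:
            return True
        sleep(interval)
    return False
```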
Automated Build Process
When auto_build_image = true (default), Terraform handles the entire image lifecycle:
| Step | Resource | Action |
|---|---|---|
| 1 | oci_artifacts_container_repository | Creates OCIR repository for the function image |
| 2 | oci_identity_auth_token | Generates auth token for OCIR authentication |
| 3 | null_resource.docker_build_push | Builds and pushes the Docker image |
| 4 | oci_functions_function | Deploys function with the new image digest |
Content-based tagging: The image tag is derived from source file hashes (`v-<sha256-prefix>`), ensuring the function updates only when code changes.
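The idea behind content-based tagging can be sketched as follows (the exact file set, hash ordering, and prefix length used by the repo may differ; this just illustrates why the tag, and therefore the function image, changes only when code does):

```python
import hashlib


def image_tag(sources):
    """Derive a v-<sha256-prefix> tag from source file contents.

    `sources` maps file name -> bytes; hashing names and contents in
    sorted order keeps the tag stable across runs.
    """
    digest = hashlib.sha256()
    for name in sorted(sources):
        digest.update(name.encode())
        digest.update(sources[name])
    return "v-" + digest.hexdigest()[:12]
```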
Prerequisites:

- Docker installed and running locally
- `docker buildx` available (for `--platform linux/amd64`)
Configuration (secondary tfvars):

```hcl
# Enable auto-build (default: true)
dr_failover_function_auto_build_image = true

# OCIR username for Docker login
# The login format used by Terraform is: <namespace>/<username>
# where namespace is auto-fetched from your tenancy.
#
# For native IAM users: use your OCI username (e.g., "rabdalla")
# For IDCS users: include the prefix (e.g., "oracleidentitycloudservice/user@email.com")
#
# To determine your user type, run:
#   oci iam user get --user-id <your-user-ocid> --query 'data."identity-provider-id"'
# If the result is "null", you're a native IAM user.
ocir_username = "rabdalla"

# user_ocid is used for auth token creation (already required for OCI authentication)
user_ocid = "ocid1.user.oc1..xxxxx"
```

Note:
No manual auth token generation is required. Terraform creates and manages the auth token automatically.
Troubleshooting OCIR Authentication:
If you see an "Unauthorized" error during docker login, verify that:
- Your user type matches the username format (native IAM vs IDCS)
- The `ocir_username` value is correct for your identity type
- Run the OCI CLI command above to check your `identity-provider-id`
Using a Custom Image
To use a pre-built or custom image instead of auto-building, set `dr_failover_function_auto_build_image = false`.
When disabled, you must provide a valid OCIR image URI that the function can pull. The image must be built for `linux/amd64` architecture and formatted as a single-arch image (not a manifest list) for OCI Functions compatibility.
Object Storage Bucket Naming (Repo Behavior)

In this repo, `module.logscale-storage` creates the LogScale data bucket with a deterministic name derived from `cluster_name`:

- Bucket name pattern: `<cluster_name>-logscale-data`
- Example primary: `dr-primary-logscale-data`
- Example standby: `dr-secondary-logscale-data`
- The bucket name is exported as a Terraform output (`storage_bucket_name`) and is intended to be consumed via `terraform_remote_state`. The namespace is auto-discovered from the tenancy.
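Expressed as code, the naming convention above is a pure function of `cluster_name`:

```python
def bucket_name(cluster_name: str) -> str:
    """Deterministic LogScale data-bucket name used by this repo."""
    return f"{cluster_name}-logscale-data"
```

Because the name is deterministic, the standby workspace could derive it locally, but consuming the `storage_bucket_name` output via remote state keeps the two workspaces from drifting if the pattern ever changes.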
Practical IAM note: DR recovery requires that the standby cluster's LogScale pod can read the primary bucket via OCI Object Storage's S3-compatible API. This repo assumes the required IAM permissions already exist (typically via compartment/tenancy policies for the Object Storage user used to generate S3 credentials).
Cloud Provider Comparison (AWS vs GCP vs OCI)
This table summarizes implementation differences across cloud providers for LogScale DR failover:
| Feature | AWS | GCP | OCI |
|---|---|---|---|
| Traffic Routing | Route53 Failover Policy | Global Load Balancer | DNS Steering Policy |
| Health Checks | Route53 HTTPS checks | GLB/Uptime checks | OCI Health Checks |
| Failover Trigger | CloudWatch Alarm → SNS → Lambda | Cloud Monitoring Alert → Pub/Sub → Cloud Function | Monitoring Alarm → ONS → Function |
| Serverless Runtime | Lambda (Python 3.12) | Cloud Functions Gen2 (Python 3.11) | OCI Functions (Python 3.9+) |
| K8s Authentication | STS presigned URL (k8s-aws-v1) | GKE Workload Identity | Resource Principal + OKE token |
| K8s Client | Official kubernetes Python client | Official kubernetes Python client | requests library |
| Operator Scaling | apps_v1.patch_namespaced_deployment() | apps_v1.patch_namespaced_deployment() | PATCH via requests |
| Storage | S3 with cross-region IAM | GCS with cross-region IAM | Object Storage with IAM policies |
| Encryption Key Sharing | Remote state lookup | Remote state lookup | Remote state lookup |
| Encryption Key Secret | ${cluster}-s3-storage-encryption | ${cluster}-gcp-storage-encryption-key | ${cluster}-oci-storage-encryption |
| Encryption Key Secret Key | s3-storage-encryption-key | gcp-storage-encryption-key | oci-storage-encryption-key |
Network Security Configuration
For a detailed OCI networking and security reference (VCN layout, subnets, NSGs, LBs, request flow), see Network Security Configuration (Logscale DR).
Operational prerequisites for DR:
- OCI NSG rules are unidirectional: ensure both LB NSG egress to the Worker NSG and the matching Worker NSG ingress for NodePorts 30000-32767 (see `worker_ingress_lb_nodeport` in `modules/oci/core/main.tf`).
- If `public_lb_cidrs` restricts access, prefer DNS-01 for cert issuance (HTTP-01 validation will fail).
- For private API endpoints (recommended), set `bastion_client_allow_list` and use the bastion tunnel; for public endpoint mode, set `control_plane_allowed_cidrs`.