Architecture Considerations

This section explains the building blocks behind Disaster Recovery (DR) so that you understand how DNS, certificates, and automation fit together.

DR-Specific Modules Overview

This deployment uses three specialized modules to enable automated DR failover. Each module serves a distinct purpose in the failover chain:

module.global-dns

Manages OCI DNS resources for traffic steering between primary and secondary clusters.

  • Purpose: Provide a single global FQDN (logscale-dr.oci-dr.humio.net) that automatically routes to the healthy cluster

  • Deployed on: Primary cluster only (manage_global_dns = true)

  • Key resources created:

    • OCI DNS Zone (optional, can use existing)

    • DNS Steering Policy with failover rules

    • Steering Policy Attachment linking the policy to the zone

    • (When use_external_health_check=true) One HTTPS health check monitor (OCI steering policies allow only one monitor)

    • Probes /api/v1/status using the global FQDN as Host/SNI

    • Targets include the primary LB IP and (when secondary_ingest_lb_ip is known at apply time) the secondary LB IP

    • (When use_external_health_check=true) Optional secondary TCP monitor (created only when dr="active" and secondary_ingest_lb_ip is known); not used by the steering policy

  • How it works:

    • With use_external_health_check=true: OCI Health Check monitors are created for observability only (OCI console dashboards, DR function pre-validation). They do not influence DNS routing. The steering policy always uses FILTER → PRIORITY → LIMIT (no HEALTH rule).

    • With use_external_health_check=false (recommended): the steering policy has no attached monitor; the DR failover function on the standby flips the is_disabled flag on the steering policy answers, and the FILTER rule removes the disabled answers.
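The FILTER, PRIORITY, LIMIT rule order can be pictured with a small simulation. This is a minimal sketch in plain Python, not the OCI API: the Answer class, the priority values, the initial flag states, and the documentation-range IP addresses are illustrative assumptions. Only the is_disabled flag and the rule order come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    name: str
    rdata: str                 # load balancer IP (placeholder addresses below)
    priority: int              # assumed: lower value is preferred by the PRIORITY rule
    is_disabled: bool = False  # the flag the DR failover function toggles


def resolve(answers, limit=1):
    """Apply FILTER, then PRIORITY, then LIMIT to a list of answers."""
    enabled = [a for a in answers if not a.is_disabled]   # FILTER drops disabled answers
    ranked = sorted(enabled, key=lambda a: a.priority)    # PRIORITY orders what is left
    return [a.rdata for a in ranked[:limit]]              # LIMIT keeps the first N


answers = [
    Answer("primary", "192.0.2.10", priority=1),
    Answer("secondary", "198.51.100.20", priority=2, is_disabled=True),
]
before = resolve(answers)

# Simulated failover: disable the primary answer, enable the secondary one.
answers[0].is_disabled = True
answers[1].is_disabled = False
after = resolve(answers)

print(before, after)
```

Because no HEALTH rule is involved, the resolved answer changes only when a flag is flipped, which is why routing stays under the control of the failover function.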

external-dns (cluster-local DNS)

This repo can optionally deploy external-dns (OCI provider) via module.pre-install when external_dns_enabled=true.

DR-safe behavior (when dr != ""):

  • The global DR FQDN is owned by module.global-dns (OCI steering policy). external-dns must not manage it.

  • external-dns is configured with source=service (not Ingress), and only Services with external-dns.alpha.kubernetes.io/hostname are published.

  • The nginx-ingress controller Service is annotated with a per-cluster hostname:

    • Primary (dr="active"): ${primary_logscale_hostname}.${dns_zone_name}

    • Secondary (dr="standby"): ${secondary_logscale_hostname}.${dns_zone_name}

This provides stable, direct per-cluster endpoints for validation/debugging while keeping global failover controlled by OCI DNS Traffic Management.
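The hostname selection above can be sketched as a small function. This is an illustration only: the function name and the example hostnames and zone are hypothetical, and the real logic lives in the Terraform configuration.

```python
def cluster_hostname(dr, primary_logscale_hostname, secondary_logscale_hostname, dns_zone_name):
    """Return the per-cluster FQDN for the ingress Service annotation, or None when DR is off."""
    if dr == "active":
        return f"{primary_logscale_hostname}.{dns_zone_name}"
    if dr == "standby":
        return f"{secondary_logscale_hostname}.{dns_zone_name}"
    return None  # dr == "": the DR-safe behavior described above does not apply


# Hypothetical values, chosen only to show the shape of the result.
primary = cluster_hostname("active", "logscale-primary", "logscale-secondary", "oci-dr.humio.net")
standby = cluster_hostname("standby", "logscale-primary", "logscale-secondary", "oci-dr.humio.net")
print(primary)
print(standby)
```

Neither result equals the global FQDN, so external-dns and module.global-dns never compete for the same record.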

module.cert-manager-oci-webhook

Enables DNS-01 certificate validation using OCI DNS, using native Terraform Kubernetes resources (no external Helm chart dependency).

Note: DNS-01 is recommended when HTTP-01 is likely to fail, for example:

  • Firewall rules (public_lb_cidrs) restrict port 80 so Let's Encrypt cannot reach the HTTP-01 challenge

  • A DR standby cluster needs the global FQDN certificate issued before failover (standby typically doesn't receive traffic for the global FQDN until failover)

  • Purpose: Issue Let's Encrypt certificates without requiring HTTP traffic to reach the ingress

  • Deployed on: Any workspace with cert_dns01_provider="oci" and cert_dns01_webhook_enabled=true when DNS-01 is needed

  • Key resources created:

    • OCI credentials secret (kubernetes_secret_v1 with write-only data)

    • Webhook RBAC resources:

      • kubernetes_service_account.webhook - ServiceAccount for webhook pod

      • kubernetes_role_binding.webhook_auth_reader - Read extension-apiserver-authentication ConfigMap (kube-system)

      • kubernetes_cluster_role_binding.webhook_auth_delegator - Delegate auth to core apiserver

      • kubernetes_cluster_role.domain_solver - Grant cert-manager permission to use webhook

      • kubernetes_cluster_role_binding.domain_solver - Bind domain-solver to cert-manager ServiceAccount

      • kubernetes_role.secret_reader - Read OCI credential secrets

      • kubernetes_role_binding.secret_reader - Bind secret-reader to webhook ServiceAccount

    • Webhook TLS PKI (Issuer/Certificate) + APIService registration

    • Webhook Service + Deployment

    • DNS-01 ClusterIssuer (letsencrypt-cluster-issuer)

  • Why DNS-01: HTTP-01 validation requires Let's Encrypt to reach port 80. When firewall rules (public_lb_cidrs) restrict access, DNS-01 validates via TXT records in OCI DNS instead.

How the Webhook Works (Step-by-Step)

The webhook enables DNS-01 ACME challenges by creating temporary TXT records in OCI DNS. Here's the complete flow:

| Step | Component | Action |
| --- | --- | --- |
| 1 | Ingress | Annotation triggers cert-manager to request a certificate |
| 2 | cert-manager | Creates ACME Order with DNS-01 challenge type |
| 3 | ClusterIssuer | Routes challenge to OCI DNS webhook via APIService |
| 4 | Webhook | Reads OCI API credentials from oci-dns-credentials secret |
| 5 | Webhook | Creates _acme-challenge.{domain} TXT record in OCI DNS |
| 6 | Let's Encrypt | Queries DNS, finds the token, validates domain ownership |
| 7 | cert-manager | Stores issued certificate in Kubernetes Secret |
| 8 | Webhook | Removes the challenge TXT record (cleanup) |
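To make step 5 concrete, the sketch below computes the record name and TXT value that a DNS-01 solver publishes, following RFC 8555 section 8.4 (the value is the unpadded base64url encoding of the SHA-256 digest of the key authorization). The function name and the sample key authorization string are assumptions for illustration; this is not the webhook's actual code.

```python
import base64
import hashlib


def dns01_record(domain, key_authorization):
    """Return (record_name, txt_value) for an ACME DNS-01 challenge."""
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    # base64url without '=' padding, as required by the ACME specification
    txt_value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"_acme-challenge.{domain}", txt_value


# The key authorization is normally "<token>.<account-key-thumbprint>";
# the string here is a made-up placeholder.
name, value = dns01_record("logscale-dr.oci-dr.humio.net", "example-token.example-thumbprint")
print(name)
print(value)
```

Let's Encrypt only needs to read this TXT record over DNS, which is why no inbound access to the load balancer is required.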

HTTP-01 vs DNS-01 Comparison:

| Aspect | HTTP-01 | DNS-01 (Webhook) |
| --- | --- | --- |
| Firewall rules | Must allow Let's Encrypt IPs to port 80 | No inbound access needed |
| Load balancer access | Required | Not required |
| Wildcard certs | Not supported | Supported |
| Pre-failover cert issuance | May fail during DNS switch | Works anytime |
| Complexity | Simple | Requires webhook + OCI credentials |

module.dr-failover-function

Automates scaling of the Humio operator when the primary cluster's health check fails.

  • Purpose: Automatically start LogScale on the standby cluster when primary becomes unhealthy

  • Deployed on: Standby cluster only (dr = "standby" and dr_failover_function_enabled = true)

  • Key resources created:

    • OCI Function Application and Function (Python container)

    • Health Check monitor for primary cluster (if create_primary_health_check_monitor = true)

    • OCI Monitoring Alarm triggered by health check failures

    • ONS Notification Topic connecting alarm to function

    • IAM policies for function to access OKE cluster

  • Failover chain: Health Check fails → Alarm fires → ONS notifies → Function invoked → Function cleans up stale TLS secret → Function scales humio-operator from 0 to 1 → Function waits for LogScale pods to become ready → Function updates DNS steering policy → Operator reconciles HumioCluster → LogScale pod starts and recovers from primary bucket
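The function-side portion of this chain can be sketched as an ordered sequence with a readiness gate before the DNS update. This is a hedged illustration, not the shipped function: the step names in the steps dictionary, the polling interval, and the behavior on timeout (leaving DNS untouched) are assumptions.

```python
import time


def wait_until(check, timeout_s, poll_s=5.0, clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or timeout_s seconds have elapsed."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def run_failover(steps, pods_ready, pod_ready_timeout=300):
    """Run the steps in order; update DNS only after pods report ready.

    steps holds hypothetical callables: cleanup_tls_secret, scale_operator, update_dns.
    """
    steps["cleanup_tls_secret"]()
    steps["scale_operator"](replicas=1)
    if not wait_until(pods_ready, pod_ready_timeout):
        return "timeout"  # assumption: DNS is not switched if pods never become ready
    steps["update_dns"]()
    return "failed_over"


# Dry run with stand-in steps that only record the call order.
log = []
steps = {
    "cleanup_tls_secret": lambda: log.append("cleanup"),
    "scale_operator": lambda replicas: log.append(f"scale={replicas}"),
    "update_dns": lambda: log.append("dns"),
}
result = run_failover(steps, pods_ready=lambda: True)
print(result, log)
```

Passing the steps in as callables keeps the ordering logic testable without an OKE cluster or OCI credentials.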

Encryption Key Synchronization

  • Primary generates the key on first deploy and exports it as a sensitive Terraform output

  • Secondary reads the key via data.terraform_remote_state and creates a Kubernetes secret with the same value

  • S3_RECOVER_FROM_* environment variables are set on the standby cluster as soon as it is provisioned, but they are only consumed once the single LogScale pod is started during the DR promotion procedure

Global DNS and OCI Function-based Failover Scaler

  • A global DNS name is managed in OCI DNS with failover records pointing to the primary and secondary load balancer IPs

  • On the standby OCI cluster (dr="standby"), an event-driven chain (Health Check → Monitoring Alarm → ONS Topic → OCI Function) scales the Humio operator from 0 → 1 so it can reconcile the already-declared nodeCount=1 and start the single LogScale pod

  • The OCI Function does not change spec.nodeCount, and it does not scale back down automatically

  • Terraform only deploys this OCI Function when dr="standby" for that workspace. When you promote the secondary to dr="active" and re-apply, the Function resources are removed automatically

Function Configuration (What You Tune From tfvars)

The failover automation runs on the standby workspace only (dr="standby"). These root variables are the practical settings you tune:

| tfvars key | Default | Description |
| --- | --- | --- |
| dr_failover_function_absent_detection_period | "2m" | Absent-metrics window for the alarm query. |
| dr_failover_function_alarm_pending_duration | "PT1M" | Alarm pending duration (OCI minimum is 1 minute). |
| dr_failover_function_alarm_repeat_notification_duration | "PT10M" | Alarm re-notification interval while firing. |
| dr_failover_function_create_primary_health_check_monitor | true | Create primary monitor in standby region (only used when use_external_health_check=true). |
| dr_failover_function_log_retention_days | 14 | Set to 30/60/90/120/150/180 for standby (OCI API requirement). |
| dr_failover_function_pod_ready_count | 1 | Minimum number of LogScale pods that must be ready before DNS is updated. |
| dr_failover_function_pod_ready_timeout | 300 | Maximum seconds to wait for LogScale pods to become ready after scaling operator. |
| dr_failover_function_pre_failover_failure_seconds | 180 | Minimum consecutive seconds primary must be failing before scaling the operator. Use 0 for testing only. |
| dr_failover_function_primary_health_check_interval_seconds | 60 | Primary monitor probe interval (shorter = faster detection). |
| dr_failover_function_skip_secondary_health_check | false | Skip secondary health check gating for simulations. |
| dr_failover_function_use_lb_health_metrics | true | DR failover alarm mode. When true (recommended), uses OCI Monitoring Classic LB backend health metrics. |
| use_external_health_check | false | DNS steering policy mode. When false (recommended), the steering policy does not attach an OCI Health Checks monitor and failover is controlled by the DR function toggling is_disabled on steering policy answers. |

Health Monitoring Modes

| Mode | Variable Setting | Alarm Namespace | How it Works |
| --- | --- | --- | --- |
| LB Backend Health (Recommended) | dr_failover_function_use_lb_health_metrics = true | oci_lbaas | Monitors unhealthy backend count from within OCI (not impacted by public_lb_cidrs). Uses unhealthyBackendServers metric. |
| External Health Checks | dr_failover_function_use_lb_health_metrics = false | oci_healthchecks | Uses external vantage points (AWS, Azure, GCP). May be blocked by public_lb_cidrs. |

Why LB Backend Health is Recommended:

When public_lb_cidrs restricts load balancer access to specific IP ranges (a security best practice), the external health check vantage points are blocked.

This causes:

  • Health checks to always report "unhealthy"

  • The DR alarm to fire continuously (false positive)

Note that the steering policy does not use a HEALTH rule. DNS routing is controlled exclusively by the DR failover function via the is_disabled flag on steering policy answers.

With LB backend health metrics (dr_failover_function_use_lb_health_metrics = true):

  • Health monitoring runs from within OCI infrastructure (LB to backends)

  • Not affected by public_lb_cidrs security list restrictions

  • Accurately reflects actual backend health status

Internal defaults worth knowing (not exposed in tfvars by default):

Cooldown is 300s by default and is persisted by the function as an annotation on the humio-operator Deployment (logscale.dr/last-failover-epoch by default), so it survives cold starts.
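The two time-based guards (the 300s cooldown and dr_failover_function_pre_failover_failure_seconds) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the annotation value is assumed to be a Unix epoch stored as a string, and an unparsable value is treated as "no previous failover".

```python
import time

COOLDOWN_ANNOTATION = "logscale.dr/last-failover-epoch"  # default annotation name from the text


def in_cooldown(annotations, now=None, cooldown_s=300):
    """True if the last recorded failover happened less than cooldown_s seconds ago."""
    now = time.time() if now is None else now
    raw = annotations.get(COOLDOWN_ANNOTATION)
    if raw is None:
        return False  # no failover recorded on the Deployment yet
    try:
        last = float(raw)
    except ValueError:
        return False  # assumption: ignore a malformed annotation value
    return (now - last) < cooldown_s


def failing_long_enough(failing_since, now, min_failure_s=180):
    """True once the primary has been failing for at least min_failure_s consecutive seconds."""
    return (now - failing_since) >= min_failure_s


annotations = {COOLDOWN_ANNOTATION: "1000"}
print(in_cooldown(annotations, now=1200))       # 200s after the last failover
print(in_cooldown(annotations, now=1300))       # exactly 300s after the last failover
print(failing_long_enough(1000, now=1179))      # 179s of failure
print(failing_long_enough(1000, now=1180))      # 180s of failure
```

Storing the timestamp on the Deployment rather than in function memory is what lets the guard survive cold starts, since each invocation reads it back from the cluster.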

Example HCL (testing-only):

terraform
dr_failover_function_primary_health_check_interval_seconds = 10 # faster probing
dr_failover_function_absent_detection_period = "1m"
dr_failover_function_alarm_pending_duration = "PT1M" # OCI minimum
dr_failover_function_pre_failover_failure_seconds = 0 # skip validation
dr_failover_function_alarm_repeat_notification_duration = "PT5M"
dr_failover_function_log_retention_days = 30

OCIR Image Build Configuration

The DR failover OCI Function requires a Docker image in OCI Container Registry (OCIR). Terraform fully automates the image build and push process.

What the Function Does

The container packages a Python application that executes automated DR failover logic. When invoked via the Health Check → Alarm → ONS → Function chain, it:

  • Authenticates to OKE using OCI Resource Principal credentials

  • Validates failover conditions - confirms primary is truly unhealthy, not just a transient blip

  • Enforces cooldown periods - prevents flapping between clusters

  • Cleans up stale TLS secret - deletes the HumioCluster TLS secret to prevent CA certificate mismatch when operator scales up (cert-manager recreates it with the correct CA)

  • Scales humio-operator from 0 → 1 replicas via Kubernetes API PATCH

  • Waits for LogScale pods to become ready (configurable timeout, default 300s)

  • Updates DNS steering policy - discovers secondary LB IP and updates steering policy answers

The operator then reconciles and starts the LogScale pod for DR recovery.
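As an illustration of the scaling step, the sketch below builds the pieces of a Kubernetes scale-subresource PATCH without sending it. It is not the function's actual code: the helper name, the API server URL, the namespace ("logging"), and the token are placeholder assumptions. The URL path and the merge-patch content type are standard Kubernetes API conventions.

```python
import json


def build_scale_patch(api_server, namespace, deployment, replicas, token):
    """Return (url, headers, body) for patching a Deployment's scale subresource."""
    url = (
        f"{api_server.rstrip('/')}/apis/apps/v1/namespaces/{namespace}"
        f"/deployments/{deployment}/scale"
    )
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/merge-patch+json",
    }
    body = json.dumps({"spec": {"replicas": replicas}})
    return url, headers, body


# Placeholder endpoint, namespace, and token; only the Deployment name comes from the text.
url, headers, body = build_scale_patch(
    "https://203.0.113.5:6443/", "logging", "humio-operator", 1, "example-token"
)
print(url)
print(body)
```

With the requests library, the result could be sent as requests.patch(url, headers=headers, data=body, timeout=10), supplying the cluster CA bundle through the verify argument.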

Container dependencies: OCI SDK, HTTP client with retry logic, YAML parsing for kubeconfig handling

Automated Build Process

When auto_build_image = true (default), Terraform handles the entire image lifecycle:

| Step | Resource | Action |
| --- | --- | --- |
| 1 | oci_artifacts_container_repository | Creates OCIR repository for the function image |
| 2 | oci_identity_auth_token | Generates auth token for OCIR authentication |
| 3 | null_resource.docker_build_push | Builds and pushes the Docker image |
| 4 | oci_functions_function | Deploys function with the new image digest |

Content-based tagging: The image tag is derived from source file hashes (v-<sha256-prefix>), ensuring the function updates only when code changes.
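The idea behind content-based tagging can be sketched in a few lines. The exact inputs Terraform hashes are not stated here, so the details below are assumptions: the function name, hashing every file under a directory in sorted order, including relative paths in the digest, and a 12-character prefix.

```python
import hashlib
import tempfile
from pathlib import Path


def content_tag(source_dir, prefix_len=12):
    """Derive a tag of the form v-<sha256-prefix> from the files under source_dir."""
    digest = hashlib.sha256()
    root = Path(source_dir)
    # Sorted traversal keeps the tag stable regardless of filesystem ordering.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return f"v-{digest.hexdigest()[:prefix_len]}"


with tempfile.TemporaryDirectory() as d:
    (Path(d) / "func.py").write_text("print('v1')\n")
    tag_before = content_tag(d)
    tag_same = content_tag(d)          # unchanged sources give the same tag
    (Path(d) / "func.py").write_text("print('v2')\n")
    tag_after = content_tag(d)         # edited sources give a new tag

print(tag_before, tag_after)
```

A tag that changes only when the sources change lets an apply with no code edits leave the deployed function untouched.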

Prerequisites

  • Docker installed and running locally

  • docker buildx available (for --platform linux/amd64)

Configuration (secondary tfvars)
terraform
# Enable auto-build (default: true)
dr_failover_function_auto_build_image = true
# OCIR username for Docker login
# The login format used by Terraform is: <namespace>/<username>
# where namespace is auto-fetched from your tenancy.
#
# For native IAM users: use your OCI username (e.g., "rabdalla")
# For IDCS users: include the prefix (e.g., "oracleidentitycloudservice/user@email.com")
#
# To determine your user type, run:
# oci iam user get --user-id <your-user-ocid> --query 'data."identity-provider-id"'
# If the result is "null", you're a native IAM user.
ocir_username = "rabdalla"
# user_ocid is used for auth token creation (already required for OCI authentication)
user_ocid = "ocid1.user.oc1..xxxxx"

Note

No manual auth token generation is required. Terraform creates and manages the auth token automatically.

Troubleshooting OCIR Authentication:

If you see an "Unauthorized" error during docker login, verify that:

  • Your user type matches the username format (native IAM vs IDCS)

  • The ocir_username value is correct for your identity type

To confirm your user type, run the OCI CLI command above and check the identity-provider-id value.
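The username rules can be summarized in a short sketch. The helper names and the namespace value "mytenancy" are hypothetical. The IDCS prefix and the <namespace>/<username> login format are taken from the configuration comments above, and the user type is inferred from whether identity-provider-id is null.

```python
IDCS_PREFIX = "oracleidentitycloudservice"  # prefix shown in the tfvars comments


def ocir_username_for(user_name, identity_provider_id):
    """Return the ocir_username value: bare name for native IAM users, prefixed for IDCS users."""
    if identity_provider_id is None:  # CLI query returned null: native IAM user
        return user_name
    return f"{IDCS_PREFIX}/{user_name}"


def docker_login_user(namespace, ocir_username):
    """Compose the docker login user in the <namespace>/<username> form."""
    return f"{namespace}/{ocir_username}"


native = docker_login_user("mytenancy", ocir_username_for("rabdalla", None))
idcs = docker_login_user(
    "mytenancy", ocir_username_for("user@email.com", "ocid1.saml2idp.oc1..xxxxx")
)
print(native)
print(idcs)
```

An "Unauthorized" response usually means the composed login string does not match the user type, for example a missing prefix on an IDCS account.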

Using a Custom Image

To use a pre-built or custom image instead of auto-building, set: dr_failover_function_auto_build_image = false.

When disabled, you must provide a valid OCIR image URI that the function can pull. The image must be built for `linux/amd64` architecture and formatted as a single-arch image (not a manifest list) for OCI Functions compatibility.

Object Storage Bucket Naming (Repo Behavior)

In this repo, module.logscale-storage creates the LogScale data bucket with a deterministic name derived from cluster_name:

  • Bucket name pattern: <cluster_name>-logscale-data

  • Example primary: dr-primary-logscale-data

  • Example standby: dr-secondary-logscale-data

  • The bucket name is exported as a Terraform output (storage_bucket_name) and is intended to be consumed via terraform_remote_state. The namespace is auto-discovered from the tenancy.
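The naming pattern is simple enough to express directly. The helper below is a hypothetical illustration of the pattern, not code from the repo.

```python
def logscale_bucket_name(cluster_name):
    """Return the deterministic data bucket name for a cluster."""
    return f"{cluster_name}-logscale-data"


primary_bucket = logscale_bucket_name("dr-primary")
standby_bucket = logscale_bucket_name("dr-secondary")
print(primary_bucket)
print(standby_bucket)
```

Because the name is a pure function of cluster_name, the standby workspace can predict the primary bucket name, although reading the storage_bucket_name output from remote state remains the intended path.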

Practical IAM note: DR recovery requires that the standby cluster's LogScale pod can read the primary bucket via OCI Object Storage's S3-compatible API. This repo assumes the required IAM permissions already exist (typically via compartment/tenancy policies for the Object Storage user used to generate S3 credentials).

Cloud Provider Comparison (AWS vs GCP vs OCI)

This table summarizes implementation differences across cloud providers for LogScale DR failover:

| Feature | AWS | GCP | OCI |
| --- | --- | --- | --- |
| Traffic Routing | Route53 Failover Policy | Global Load Balancer | DNS Steering Policy |
| Health Checks | Route53 HTTPS checks | GLB/Uptime checks | OCI Health Checks |
| Failover Trigger | CloudWatch Alarm → SNS → Lambda | Cloud Monitoring Alert → Pub/Sub → Cloud Function | Monitoring Alarm → ONS → Function |
| Serverless Runtime | Lambda (Python 3.12) | Cloud Functions Gen2 (Python 3.11) | OCI Functions (Python 3.9+) |
| K8s Authentication | STS presigned URL (k8s-aws-v1) | GKE Workload Identity | Resource Principal + OKE token |
| K8s Client | Official kubernetes Python client | Official kubernetes Python client | requests library |
| Operator Scaling | apps_v1.patch_namespaced_deployment() | apps_v1.patch_namespaced_deployment() | PATCH via requests |
| Storage | S3 with cross-region IAM | GCS with cross-region IAM | Object Storage with IAM policies |
| Encryption Key Sharing | Remote state lookup | Remote state lookup | Remote state lookup |
| Encryption Key Secret | ${cluster}-s3-storage-encryption | ${cluster}-gcp-storage-encryption-key | ${cluster}-oci-storage-encryption |
| Encryption Key Secret Key | s3-storage-encryption-key | gcp-storage-encryption-key | oci-storage-encryption-key |

Network Security Configuration

For a detailed OCI networking and security reference (VCN layout, subnets, NSGs, LBs, request flow), see Network Security Configuration (Logscale DR).

Operational prerequisites for DR:

  • OCI NSG rules are unidirectional: ensure that both the LB NSG egress rule to the Worker NSG and the matching Worker NSG ingress rule for NodePorts 30000-32767 exist (see modules/oci/core/main.tf worker_ingress_lb_nodeport).

  • If public_lb_cidrs restricts access, prefer DNS-01 for cert issuance (HTTP-01 validation will fail).

  • For private API endpoints (recommended), set bastion_client_allow_list and use the bastion tunnel; for public endpoint mode, set control_plane_allowed_cidrs.