Architecture Considerations

This section explains the building blocks behind Disaster Recovery (DR) so that you understand how DNS, certificates, and failover automation fit together.

DR-Specific Modules Overview

This deployment uses three specialized modules to enable automated DR failover. Each module serves a distinct purpose in the failover chain:

module.global-dns

Manages OCI DNS resources for traffic steering between primary and secondary clusters.

Purpose: Provide a single global FQDN (logscale-dr.oci-dr.humio.net) that automatically routes to the healthy cluster
Deployed on: Primary cluster only (manage_global_dns = true)
Key resources created:
- OCI DNS Zone (optional, can use existing)
- DNS Steering Policy with failover rules
- Steering Policy Attachment linking the policy to the zone
- (When use_external_health_check=true) One HTTPS health check monitor (OCI steering policies allow only one monitor):
  - Probes /api/v1/status using the global FQDN as Host/SNI
  - Targets include the primary LB IP and (when secondary_ingest_lb_ip is known at apply time) the secondary LB IP
- (When use_external_health_check=true) Optional secondary TCP monitor (created only when dr="active" and secondary_ingest_lb_ip is known); not used by the steering policy
How it works:
- With use_external_health_check=true: OCI Health Check monitors are created for observability only (OCI console dashboards, DR function pre-validation). They do not influence DNS routing. The steering policy always uses FILTER → PRIORITY → LIMIT (no HEALTH rule).
- With use_external_health_check=false (recommended): the steering policy has no attached monitor. The DR failover function on the standby cluster toggles the is_disabled flag on steering policy answers, and the FILTER rule removes disabled answers.
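
The FILTER, PRIORITY, LIMIT evaluation described above can be illustrated with a small simulation. This is a minimal sketch, not the OCI implementation: the Answer class, the example IP addresses, the priority values, and the assumption that a lower priority value is served first are all hypothetical and only serve to show why flipping is_disabled changes the resolved answer.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    name: str
    rdata: str                 # load balancer IP (example values below are made up)
    priority: int              # assumption: lower value is preferred
    is_disabled: bool = False  # the flag the DR function toggles


def resolve(answers, limit=1):
    """Apply FILTER, then PRIORITY, then LIMIT to a list of answers."""
    enabled = [a for a in answers if not a.is_disabled]   # FILTER: drop disabled answers
    ordered = sorted(enabled, key=lambda a: a.priority)   # PRIORITY: best answer first
    return [a.rdata for a in ordered[:limit]]             # LIMIT: return at most `limit`


answers = [
    Answer("primary", "203.0.113.10", priority=1),
    Answer("secondary", "203.0.113.20", priority=2),
]

before = resolve(answers)       # primary is enabled, so it wins
answers[0].is_disabled = True   # what a failover would flip (assumed direction)
after = resolve(answers)        # only the secondary survives the FILTER rule
```

Because no HEALTH rule is involved, the resolved answer changes only when something explicitly edits the flag, which is the behavior the recommended mode relies on.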

external-dns (cluster-local DNS)

This repo can optionally deploy external-dns (OCI provider) via module.pre-install when external_dns_enabled=true.

DR-safe behavior (when dr != ""):

The global DR FQDN is owned by module.global-dns (OCI steering policy). external-dns must not manage it.
external-dns is configured with source=service (not Ingress), and only Services with external-dns.alpha.kubernetes.io/hostname are published.
The nginx-ingress controller Service is annotated with a per-cluster hostname:
- Primary (dr="active"): ${primary_logscale_hostname}.${dns_zone_name}
- Secondary (dr="standby"): ${secondary_logscale_hostname}.${dns_zone_name}

This provides stable, direct per-cluster endpoints for validation/debugging while keeping global failover controlled by OCI DNS Traffic Management.
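
As an illustration of the per-cluster naming rule, the sketch below derives the hostname annotation from the dr mode. The function names and the example hostnames and zone are assumptions made for this example; only the annotation key and the "<hostname>.<dns_zone_name>" pattern come from the text above.

```python
HOSTNAME_ANNOTATION = "external-dns.alpha.kubernetes.io/hostname"


def per_cluster_hostname(dr, primary_logscale_hostname,
                         secondary_logscale_hostname, dns_zone_name):
    """Return the cluster-local FQDN for the given DR mode."""
    if dr == "active":
        host = primary_logscale_hostname
    elif dr == "standby":
        host = secondary_logscale_hostname
    else:
        raise ValueError(f"unsupported dr mode: {dr!r}")
    return f"{host}.{dns_zone_name}"


def ingress_service_annotations(dr, primary, secondary, zone):
    """Annotation that external-dns (source=service) would pick up."""
    return {HOSTNAME_ANNOTATION: per_cluster_hostname(dr, primary, secondary, zone)}


# Hypothetical example values
standby = ingress_service_annotations(
    "standby", "logscale-primary", "logscale-secondary", "oci-dr.humio.net"
)
```

Note that the global DR FQDN never appears here, which matches the rule that external-dns must not manage it.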

module.cert-manager-oci-webhook

Enables DNS-01 certificate validation using OCI DNS, using native Terraform Kubernetes resources (no external Helm chart dependency).

Note: DNS-01 is recommended when HTTP-01 is likely to fail, for example:

- Firewall rules (public_lb_cidrs) restrict port 80, so Let's Encrypt cannot reach the HTTP-01 challenge
- A DR standby cluster needs the global FQDN certificate issued before failover (the standby typically doesn't receive traffic for the global FQDN until failover)
Purpose: Issue Let's Encrypt certificates without requiring HTTP traffic to reach the ingress
Deployed on: Any workspace with cert_dns01_provider="oci" and cert_dns01_webhook_enabled=true when DNS-01 is needed
Key resources created:
- OCI credentials secret (kubernetes_secret_v1 with write-only data)
- Webhook RBAC resources:
  - kubernetes_service_account.webhook - ServiceAccount for webhook pod
  - kubernetes_role_binding.webhook_auth_reader - Read extension-apiserver-authentication ConfigMap (kube-system)
  - kubernetes_cluster_role_binding.webhook_auth_delegator - Delegate auth to core apiserver
  - kubernetes_cluster_role.domain_solver - Grant cert-manager permission to use webhook
  - kubernetes_cluster_role_binding.domain_solver - Bind domain-solver to cert-manager ServiceAccount
  - kubernetes_role.secret_reader - Read OCI credential secrets
  - kubernetes_role_binding.secret_reader - Bind secret-reader to webhook ServiceAccount
- Webhook TLS PKI (Issuer/Certificate) + APIService registration
- Webhook Service + Deployment
- DNS-01 ClusterIssuer (letsencrypt-cluster-issuer)
Why DNS-01: HTTP-01 validation requires Let's Encrypt to reach port 80. When firewall rules (public_lb_cidrs) restrict access, DNS-01 validates domain ownership via TXT records in OCI DNS instead.

How the Webhook Works (Step-by-Step)

The webhook enables DNS-01 ACME challenges by creating temporary TXT records in OCI DNS. Here's the complete flow:

Step	Component	Action
1	Ingress	Annotation triggers cert-manager to request a certificate
2	cert-manager	Creates ACME Order with DNS-01 challenge type
3	ClusterIssuer	Routes challenge to OCI DNS webhook via APIService
4	Webhook	Reads OCI API credentials from `oci-dns-credentials` secret
5	Webhook	Creates `_acme-challenge.{domain}` TXT record in OCI DNS
6	Let's Encrypt	Queries DNS, finds the token, validates domain ownership
7	cert-manager	Stores issued certificate in Kubernetes Secret
8	Webhook	Removes the challenge TXT record (cleanup)
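
Steps 5 and 6 can be made concrete with a short sketch of what a DNS-01 solver publishes. Per the ACME specification (RFC 8555), the TXT value is the unpadded base64url encoding of the SHA-256 digest of the key authorization. The function names and the sample key authorization below are made up for illustration; this is not the webhook's actual code.

```python
import base64
import hashlib


def challenge_record_name(domain):
    """Name of the TXT record for a domain; a wildcard is validated at its base name."""
    base = domain[2:] if domain.startswith("*.") else domain
    return f"_acme-challenge.{base.rstrip('.')}"


def challenge_txt_value(key_authorization):
    """base64url(SHA-256(key authorization)) without '=' padding."""
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


record = challenge_record_name("logscale-dr.oci-dr.humio.net")
value = challenge_txt_value("example-token.example-account-thumbprint")  # made-up input
```

Let's Encrypt computes the same digest on its side and compares it with the TXT record it finds in DNS, which is why no inbound traffic to the load balancer is needed.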

HTTP-01 vs DNS-01 Comparison:

Aspect	HTTP-01	DNS-01 (Webhook)
Firewall rules	Must allow Let's Encrypt IPs to port 80	No inbound access needed
Load balancer access	Required	Not required
Wildcard certs	Not supported	Supported
Pre-failover cert issuance	May fail during DNS switch	Works anytime
Complexity	Simple	Requires webhook + OCI credentials

module.dr-failover-function

Automates scaling of the Humio operator when the primary cluster health check fails.

Purpose: Automatically start LogScale on the standby cluster when primary becomes unhealthy
Deployed on: Standby cluster only (dr = "standby" and dr_failover_function_enabled = true)
Key resources created:
- OCI Function Application and Function (Python container)
- Health Check monitor for primary cluster (if create_primary_health_check_monitor = true)
- OCI Monitoring Alarm triggered by health check failures
- ONS Notification Topic connecting alarm to function
- IAM policies for function to access OKE cluster
Failover chain: Health Check fails → Alarm fires → ONS notifies → Function invoked → Function cleans up stale TLS secret → Function scales humio-operator from 0 → 1 → Function waits for LogScale pods to become ready → Function updates DNS steering policy → Operator reconciles HumioCluster → LogScale pod starts and recovers from primary bucket

Encryption Key Synchronization

Primary generates the key on first deploy and exports it as a sensitive Terraform output
Secondary reads the key via data.terraform_remote_state and creates a Kubernetes secret with the same value
S3_RECOVER_FROM_* environment variables are set on the standby cluster as soon as it is provisioned, but they are only consumed once the single LogScale pod is started during the DR promotion procedure
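
To illustrate what "the same value" means in practice, the sketch below builds the Secret manifest each side would hold. The secret name and data key follow the OCI column of the cloud provider comparison table on this page; the namespace, the cluster names, and the function itself are assumptions for this example, and in the repo this is done by Terraform rather than Python.

```python
import base64


def encryption_secret_manifest(cluster_name, namespace, key_value):
    """Kubernetes Secret holding the shared storage encryption key."""
    encoded = base64.b64encode(key_value.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "type": "Opaque",
        "metadata": {
            "name": f"{cluster_name}-oci-storage-encryption",
            "namespace": namespace,  # hypothetical namespace
        },
        "data": {"oci-storage-encryption-key": encoded},
    }


shared_key = "example-key-material"  # stands in for the sensitive Terraform output
primary = encryption_secret_manifest("dr-primary", "logging", shared_key)
secondary = encryption_secret_manifest("dr-secondary", "logging", shared_key)
```

The two manifests differ in name but carry identical data, which is what allows the standby pod to decrypt segments written by the primary.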

Global DNS and OCI Function-based Failover Scaler

A global DNS name is managed in OCI DNS with failover records pointing to the primary and secondary load balancer IPs
On the standby OCI cluster (dr="standby"), an event-driven chain (Health Check → Monitoring Alarm → ONS Topic → OCI Function) scales the Humio operator from 0 → 1 so it can reconcile the already-declared nodeCount=1 and start the single LogScale pod
The OCI Function does not change spec.nodeCount, and it does not scale back down automatically
Terraform only deploys this OCI Function when dr="standby" for that workspace. When you promote the secondary to dr="active" and re-apply, the Function resources are removed automatically
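
The scale-up itself is a single Kubernetes API call. The sketch below only builds the request (URL, headers, body) against the standard Deployment scale subresource so that it can be inspected without a cluster. The namespace, API server address, and token are placeholders, and the function name is hypothetical; the real function's request construction may differ.

```python
import json


def build_scale_patch(api_server, namespace, deployment, replicas, token):
    """Return (url, headers, body) for a merge-patch on the scale subresource."""
    url = (
        f"{api_server.rstrip('/')}/apis/apps/v1/namespaces/{namespace}"
        f"/deployments/{deployment}/scale"
    )
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/merge-patch+json",
    }
    # Only the operator's replica count changes; HumioCluster spec.nodeCount is untouched.
    body = json.dumps({"spec": {"replicas": replicas}})
    return url, headers, body


url, headers, body = build_scale_patch(
    "https://oke.example.invalid:6443",  # placeholder endpoint
    "logging",                           # hypothetical namespace
    "humio-operator",
    1,
    "PLACEHOLDER_TOKEN",
)
```

With the requests library, sending it would look like requests.patch(url, headers=headers, data=body, verify=ca_bundle_path), where ca_bundle_path is whatever CA file the kubeconfig provides.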

Function Configuration (What You Tune From tfvars)

The failover automation runs on the standby workspace only (dr="standby"). These root variables are the practical settings you tune:

tfvars key	Default	Description
`dr_failover_function_absent_detection_period`	"2m"	Absent-metrics window for the alarm query.
`dr_failover_function_alarm_pending_duration`	"PT1M"	Alarm pending duration (OCI minimum is 1 minute).
`dr_failover_function_alarm_repeat_notification_duration`	"PT10M"	Alarm re-notification interval while firing.
`dr_failover_function_create_primary_health_check_monitor`	true	Create primary monitor in standby region (only used when `use_external_health_check=true`).
`dr_failover_function_log_retention_days`	14	Function log retention in days. On standby, set to one of 30/60/90/120/150/180 (OCI API requirement).
`dr_failover_function_pod_ready_count`	1	Minimum number of LogScale pods that must be ready before DNS is updated.
`dr_failover_function_pod_ready_timeout`	300	Maximum seconds to wait for LogScale pods to become ready after scaling operator.
`dr_failover_function_pre_failover_failure_seconds`	180	Minimum consecutive seconds primary must be failing before scaling the operator. Use `0` for testing only.
`dr_failover_function_primary_health_check_interval_seconds`	60	Primary monitor probe interval (shorter = faster detection).
`dr_failover_function_skip_secondary_health_check`	false	Skip secondary health check gating for simulations.
`dr_failover_function_use_lb_health_metrics`	true	DR failover alarm mode. When `true` (recommended), uses OCI Monitoring Classic LB backend health metrics.
`use_external_health_check`	false	DNS steering policy mode. When `false` (recommended), the steering policy does not attach an OCI Health Checks monitor and failover is controlled by the DR function toggling `is_disabled` on steering policy answers.
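
The effect of dr_failover_function_pre_failover_failure_seconds can be sketched as a check over recent probe results. This is an illustrative model only: the sample format (timestamp, healthy) and both function names are assumptions, and the real function may derive the failure duration differently.

```python
def consecutive_failure_seconds(samples, now):
    """Seconds since the current unbroken run of failures began.

    samples: list of (epoch_seconds, healthy) tuples, oldest first.
    """
    first_failure = None
    for timestamp, healthy in samples:
        if healthy:
            first_failure = None          # any success resets the run
        elif first_failure is None:
            first_failure = timestamp     # start of a new failure run
    return 0 if first_failure is None else now - first_failure


def should_fail_over(samples, now, pre_failover_failure_seconds=180):
    """Gate the operator scale-up on a sustained failure."""
    if pre_failover_failure_seconds == 0:
        return True  # testing only: validation is skipped
    return consecutive_failure_seconds(samples, now) >= pre_failover_failure_seconds


# With a 60 s probe interval, three failed probes plus one more interval reach 180 s.
outage = [(0, True), (60, False), (120, False), (180, False)]
blip = [(0, False), (60, True), (120, False)]
```

A shorter probe interval does not lower the 180 s threshold; it only makes the measured failure run more precise.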

Health Monitoring Modes

Mode	Variable Setting	Alarm Namespace	How it Works
LB Backend Health (Recommended)	`dr_failover_function_use_lb_health_metrics = true`	`oci_lbaas`	Monitors unhealthy backend count from within OCI (not impacted by `public_lb_cidrs`). Uses `unhealthyBackendServers` metric.
External Health Checks	`dr_failover_function_use_lb_health_metrics = false`	`oci_healthchecks`	Uses external vantage points (AWS, Azure, GCP). May be blocked by `public_lb_cidrs`.

Why LB Backend Health is Recommended:

When public_lb_cidrs restricts load balancer access to specific IP ranges (security best practice), external health check vantage points are blocked.

This causes:

Health checks to always report "unhealthy"
DR alarm to fire continuously (false positive)

Note that the steering policy does not use a HEALTH rule. DNS routing is controlled exclusively by the DR failover function via the is_disabled flag on steering policy answers.

With LB backend health metrics (dr_failover_function_use_lb_health_metrics = true):

Health monitoring runs from within OCI infrastructure (LB to backends)
Not affected by public_lb_cidrs security list restrictions
Accurately reflects actual backend health status

Internal defaults worth knowing (not exposed in tfvars by default):

Cooldown is 300s by default and is persisted by the function as an annotation on the humio-operator Deployment (logscale.dr/last-failover-epoch by default), so it survives cold starts.
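
A cooldown persisted as an annotation can be checked with very little state. The sketch below assumes the annotation value is an epoch timestamp stored as a string, which the annotation name suggests but the text does not confirm; the helper names are hypothetical.

```python
import time

COOLDOWN_ANNOTATION = "logscale.dr/last-failover-epoch"


def in_cooldown(annotations, now=None, cooldown_seconds=300):
    """True if the last recorded failover is more recent than the cooldown."""
    now = time.time() if now is None else now
    raw = annotations.get(COOLDOWN_ANNOTATION)
    if raw is None:
        return False                      # no failover recorded yet
    try:
        last_failover = float(raw)
    except ValueError:
        return False                      # assumption: ignore unparsable values
    return (now - last_failover) < cooldown_seconds


def stamp_failover(annotations, now):
    """Return a copy of the annotations with the failover time recorded."""
    updated = dict(annotations)
    updated[COOLDOWN_ANNOTATION] = str(int(now))
    return updated


stamped = stamp_failover({}, now=1_700_000_000)
```

Keeping the timestamp on the Deployment rather than in function memory is what lets a freshly started function instance still honor the cooldown.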

Example HCL (testing-only):

terraform

dr_failover_function_primary_health_check_interval_seconds = 10 # faster probing
dr_failover_function_absent_detection_period = "1m"
dr_failover_function_alarm_pending_duration = "PT1M" # OCI minimum
dr_failover_function_pre_failover_failure_seconds = 0 # skip validation
dr_failover_function_alarm_repeat_notification_duration = "PT5M"
dr_failover_function_log_retention_days = 30

OCIR Image Build Configuration

The DR failover OCI Function requires a Docker image in OCI Container Registry (OCIR). Terraform fully automates the image build and push process.

What the Function Does

The container packages a Python application that executes automated DR failover logic. When invoked via the Health Check → Alarm → ONS → Function chain, it:

Authenticates to OKE using OCI Resource Principal credentials
Validates failover conditions - confirms primary is truly unhealthy, not just a transient blip
Enforces cooldown periods - prevents flapping between clusters
Cleans up stale TLS secret - deletes the HumioCluster TLS secret to prevent CA certificate mismatch when operator scales up (cert-manager recreates it with the correct CA)
Scales humio-operator from 0 → 1 replicas via Kubernetes API PATCH
Waits for LogScale pods to become ready (configurable timeout, default 300s)
Updates DNS steering policy - discovers secondary LB IP and updates steering policy answers

The operator then reconciles and starts the LogScale pod for DR recovery.

Container dependencies: OCI SDK, HTTP client with retry logic, YAML parsing for kubeconfig handling
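
The sequence of actions listed above can be sketched as a small orchestration routine. Everything here is hypothetical scaffolding: the ops object and its method names stand in for the real OCI and Kubernetes calls, and the return strings are invented for the example. The sketch only shows the ordering and the readiness wait with a timeout.

```python
import time


def wait_for_ready(count_ready, required=1, timeout=300, interval=5,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until `required` pods are ready or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        if count_ready() >= required:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def run_failover(ops, pod_ready_count=1, pod_ready_timeout=300,
                 sleep=time.sleep, clock=time.monotonic):
    """Run the failover steps in order; `ops` supplies the side effects."""
    if not ops.primary_is_unhealthy():
        return "skipped: primary healthy"
    if ops.in_cooldown():
        return "skipped: cooldown"
    ops.delete_stale_tls_secret()       # cert-manager recreates it later
    ops.scale_operator(1)               # humio-operator 0 to 1
    ready = wait_for_ready(ops.ready_pod_count, pod_ready_count,
                           pod_ready_timeout, sleep=sleep, clock=clock)
    if not ready:
        return "failed: pods not ready"  # DNS is left untouched
    ops.update_steering_policy()
    return "failed over"
```

Injecting sleep and clock keeps the routine testable without waiting in real time; DNS is updated only after the readiness wait succeeds, so traffic is not steered to a cluster that cannot serve it.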

Automated Build Process

When auto_build_image = true (default), Terraform handles the entire image lifecycle:

Step	Resource	Action
1	`oci_artifacts_container_repository`	Creates OCIR repository for the function image
2	`oci_identity_auth_token`	Generates auth token for OCIR authentication
3	`null_resource.docker_build_push`	Builds and pushes the Docker image
4	`oci_functions_function`	Deploys function with the new image digest

Content-based tagging: The image tag is derived from source file hashes (v-<sha256-prefix>), ensuring the function updates only when code changes.
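
One way to derive such a tag is sketched below. The exact inputs, ordering, and prefix length used by the Terraform module are not stated here, so the 12-character prefix, the sorted file order, and the function name are assumptions chosen only to demonstrate the idea.

```python
import hashlib


def content_tag(files, prefix_len=12):
    """Derive a tag of the form v-<sha256-prefix> from file names and contents.

    files: mapping of file name to file bytes.
    """
    digest = hashlib.sha256()
    for name in sorted(files):                       # stable order, independent of input order
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(files[name]).digest())
    return "v-" + digest.hexdigest()[:prefix_len]


# Hypothetical source files for the function image
tag = content_tag({"func.py": b"print('hello')", "requirements.txt": b"oci\n"})
```

Because the tag depends only on content, re-running an apply with unchanged sources yields the same tag and therefore no new image build or function update.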

Prerequisites

Docker installed and running locally
docker buildx available (for --platform linux/amd64)

Configuration (secondary tfvars)

terraform

# Enable auto-build (default: true)
dr_failover_function_auto_build_image = true
# OCIR username for Docker login
# The login format used by Terraform is: <namespace>/<username>
# where namespace is auto-fetched from your tenancy.
#
# For native IAM users: use your OCI username (e.g., "rabdalla")
# For IDCS users: include the prefix (e.g., "oracleidentitycloudservice/user@email.com")
#
# To determine your user type, run:
# oci iam user get --user-id <your-user-ocid> --query 'data."identity-provider-id"'
# If the result is "null", you're a native IAM user.
ocir_username = "rabdalla"
# user_ocid is used for auth token creation (already required for OCI authentication)
user_ocid = "ocid1.user.oc1..xxxxx"

Note

No manual auth token generation is required. Terraform creates and manages the auth token automatically.

Troubleshooting OCIR Authentication:

If you see an "Unauthorized" error during docker login, verify that:

Your user type matches the username format (native IAM vs IDCS)
The ocir_username value is correct for your identity type
If unsure, run the OCI CLI command above to check your identity-provider-id
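
The username rules above can be summarized in a short sketch. It assumes that a non-null identity-provider-id always means an IDCS user, which is a simplification, and the function names and the example tenancy namespace and provider ID are hypothetical.

```python
IDCS_PREFIX = "oracleidentitycloudservice/"


def ocir_username_for(user_name, identity_provider_id):
    """Pick the ocir_username format from the identity-provider-id lookup."""
    if identity_provider_id is None:
        return user_name                       # native IAM user
    if user_name.startswith(IDCS_PREFIX):
        return user_name                       # prefix already present
    return IDCS_PREFIX + user_name             # assumption: non-null means IDCS


def docker_login_user(namespace, ocir_username):
    """Login string in the <namespace>/<username> format."""
    return f"{namespace}/{ocir_username}"


native = docker_login_user("mytenancy", ocir_username_for("rabdalla", None))
federated = docker_login_user(
    "mytenancy", ocir_username_for("user@email.com", "ocid1.saml2idp.oc1..example")
)
```

An "Unauthorized" response with a correct auth token usually points at a mismatch between these two formats.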

Using a Custom Image

To use a pre-built or custom image instead of auto-building, set: dr_failover_function_auto_build_image = false.

When disabled, you must provide a valid OCIR image URI that the function can pull. The image must be built for `linux/amd64` architecture and formatted as a single-arch image (not a manifest list) for OCI Functions compatibility.

Object Storage Bucket Naming (Repo Behavior)

In this repo, module.logscale-storage creates the LogScale data bucket with a deterministic name derived from cluster_name:

Bucket name pattern: <cluster_name>-logscale-data
Example primary: dr-primary-logscale-data
Example standby: dr-secondary-logscale-data
The bucket name is exported as a Terraform output (storage_bucket_name) and is intended to be consumed via terraform_remote_state. The namespace is auto-discovered from the tenancy.
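
The naming pattern is simple enough to reproduce when you need to predict a bucket name before reading remote state. The sketch below also composes the S3-compatible endpoint in OCI's usual "<namespace>.compat.objectstorage.<region>.oraclecloud.com" form; the function names and the namespace and region values are placeholders, not values from this repo.

```python
def logscale_bucket_name(cluster_name):
    """Deterministic bucket name: <cluster_name>-logscale-data."""
    return f"{cluster_name}-logscale-data"


def s3_compat_endpoint(namespace, region):
    """S3-compatible Object Storage endpoint for a tenancy namespace and region."""
    return f"https://{namespace}.compat.objectstorage.{region}.oraclecloud.com"


primary_bucket = logscale_bucket_name("dr-primary")
standby_bucket = logscale_bucket_name("dr-secondary")
endpoint = s3_compat_endpoint("examplenamespace", "us-ashburn-1")  # placeholder values
```

The standby reads from the primary bucket name during recovery, so a change to the primary cluster_name changes the recovery source as well.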

Practical IAM note: DR recovery requires that the standby cluster's LogScale pod can read the primary bucket via OCI Object Storage's S3-compatible API. This repo assumes the required IAM permissions already exist (typically via compartment/tenancy policies for the Object Storage user used to generate S3 credentials).

Cloud Provider Comparison (AWS vs GCP vs OCI)

This table summarizes implementation differences across cloud providers for LogScale DR failover:

Feature	AWS	GCP	OCI
Traffic Routing	Route53 Failover Policy	Global Load Balancer	DNS Steering Policy
Health Checks	Route53 HTTPS checks	GLB/Uptime checks	OCI Health Checks
Failover Trigger	CloudWatch Alarm → SNS → Lambda	Cloud Monitoring Alert → Pub/Sub → Cloud Function	Monitoring Alarm → ONS → Function
Serverless Runtime	Lambda (Python 3.12)	Cloud Functions Gen2 (Python 3.11)	OCI Functions (Python 3.9+)
K8s Authentication	STS presigned URL (k8s-aws-v1)	GKE Workload Identity	Resource Principal + OKE token
K8s Client	Official `kubernetes` Python client	Official `kubernetes` Python client	`requests` library
Operator Scaling	`apps_v1.patch_namespaced_deployment()`	`apps_v1.patch_namespaced_deployment()`	PATCH via requests
Storage	S3 with cross-region IAM	GCS with cross-region IAM	Object Storage with IAM policies
Encryption Key Sharing	Remote state lookup	Remote state lookup	Remote state lookup
Encryption Key Secret	`${cluster}-s3-storage-encryption`	`${cluster}-gcp-storage-encryption-key`	`${cluster}-oci-storage-encryption`
Encryption Key Secret Key	`s3-storage-encryption-key`	`gcp-storage-encryption-key`	`oci-storage-encryption-key`

Network Security Configuration

For a detailed OCI networking and security reference (VCN layout, subnets, NSGs, LBs, request flow), see Network Security Configuration (Logscale DR).

Operational prerequisites for DR:

OCI NSG rules are unidirectional: ensure that both the LB NSG egress rule to the Worker NSG and the matching Worker NSG ingress rule exist for NodePorts 30000-32767 (see modules/oci/core/main.tf worker_ingress_lb_nodeport).
If public_lb_cidrs restricts access, prefer DNS-01 for cert issuance (HTTP-01 validation will fail).
For private API endpoints (recommended), set bastion_client_allow_list and use the bastion tunnel; for public endpoint mode, set control_plane_allowed_cidrs.
