DR Deployment

This section covers the complete DR deployment process, from prerequisites through the three stages of DR configuration, failover, and promotion.

Pre-Deployment Verification

  • Prerequisites (once)

    • Terraform >= 1.5.7, kubectl 1.27+, OCI CLI configured.

    • Terraform backend: OCI Object Storage state bucket.

    • Access to both OKE clusters.

  • Quick checks:

    • terraform init works; terraform workspace list shows primary and secondary.

    • OCI identity / backend reachability:

shell
oci iam user get --user-id $OCI_USER_OCID
oci os ns get
terraform version

Storage + remote state:

  • Bucket names are deterministic (<cluster_name>-logscale-data) and exported via Terraform outputs; you generally do not set bucket names manually.

  • Standby applies require access to primary outputs via primary_remote_state_config (for encryption key + primary bucket details).
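The deterministic bucket naming can be sketched in shell (a minimal illustration; the cluster name here is an example):

```shell
# Buckets follow <cluster_name>-logscale-data and are exported as Terraform
# outputs, so you normally never construct these by hand.
CLUSTER_NAME="dr-primary"
BUCKET_NAME="${CLUSTER_NAME}-logscale-data"
echo "${BUCKET_NAME}"   # dr-primary-logscale-data
```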

Stage 1: DR Configuration Setup
Why Kafka Must Be Deployed First

Strimzi generates the Kafka TLS truststore secret (${name_prefix}-strimzi-kafka-cluster-ca-cert) only after Kafka is up.

Humio pods mount this secret and use its ca.password for KAFKA_COMMON_SSL_TRUSTSTORE_PASSWORD and /tmp/kafka/ca.p12 for KAFKA_COMMON_SSL_TRUSTSTORE_LOCATION.

If Humio starts before the secret exists, the pod fails to mount the volume and crashloops.

Deploy Kafka/Strimzi first, then let Humio start.

DR Recovery Environment Variables

OCI LogScale deployments use AWS S3-compatible environment variables for DR recovery, since OCI Object Storage provides an S3-compatible API. These variables are set automatically by Terraform when dr = "standby".

Environment Variable Reference

| Env Var | Purpose | Format | Example |
| --- | --- | --- | --- |
| S3_RECOVER_FROM_BUCKET | Source bucket name where LogScale fetches global-snapshot.json during DR boot | bucket-name | dr-primary-logscale-data |
| S3_RECOVER_FROM_REGION | Region of the source bucket; used to construct the S3 API endpoint | region-name | us-chicago-1 |
| S3_RECOVER_FROM_ENDPOINT_BASE | S3-compatible API base URL; required for OCI since it uses non-AWS endpoints | https://<endpoint> | https://axrgs2jgwnhx.compat.objectstorage.us-chicago-1.oraclecloud.com |
| S3_RECOVER_FROM_REPLACE_REGION | Substitution pattern to rewrite region references in recovered snapshot metadata | old/new | us-chicago-1/us-chicago-1 |
| S3_RECOVER_FROM_REPLACE_BUCKET | Substitution pattern to redirect new segment writes to the secondary bucket | old/new | dr-primary-logscale-data/dr-secondary-logscale-data |
| S3_RECOVER_FROM_ENCRYPTION_KEY | Secret reference for the decryption key; must match the primary's key to read encrypted data | secretKeyRef | See below |

  • Format Requirements:

    • S3_RECOVER_FROM_ENDPOINT_BASE: Required for OCI Object Storage. Format is https://<namespace>.compat.objectstorage.<region>.oraclecloud.com. Without this, LogScale defaults to AWS S3 endpoint format which will fail against OCI. Terraform automatically constructs this from the primary cluster's namespace and region via remote state.

    • S3_RECOVER_FROM_REPLACE_REGION: Format is old_region/new_region. When both clusters are in the same region, use us-chicago-1/us-chicago-1.

    • S3_RECOVER_FROM_REPLACE_BUCKET: Format is old_bucket/new_bucket. LogScale uses this to rewrite bucket references when loading snapshots from the primary cluster.

  • How Terraform Sets These Values:

    • s3_recover_from_bucket: Fetched from primary remote state (storage_bucket_name output) or set explicitly in tfvars

    • s3_recover_from_region: Fetched from primary remote state or set explicitly in tfvars

    • s3_recover_from_endpoint_base: Dynamically constructed from primary's namespace and region (https://<namespace>.compat.objectstorage.<region>.oraclecloud.com), or set explicitly in tfvars

    • s3_recover_from_replace_region: Dynamically generated as primary_region/secondary_region from remote state, or set explicitly

    • s3_recover_from_replace_bucket: Dynamically generated as primary_bucket/secondary_bucket using remote state values

    • Encryption key: Fetched from primary remote state and stored in a Kubernetes secret, then referenced via secretKeyRef
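The string construction Terraform performs can be sketched in shell (all values here are illustrative; in practice they come from primary remote state outputs):

```shell
# Values normally read from primary remote state (illustrative here)
PRIMARY_NAMESPACE="axrgs2jgwnhx"
PRIMARY_REGION="us-chicago-1"
PRIMARY_BUCKET="dr-primary-logscale-data"
SECONDARY_REGION="us-chicago-1"
SECONDARY_BUCKET="dr-secondary-logscale-data"

# S3-compatible endpoint base for OCI Object Storage
S3_RECOVER_FROM_ENDPOINT_BASE="https://${PRIMARY_NAMESPACE}.compat.objectstorage.${PRIMARY_REGION}.oraclecloud.com"

# old/new substitution patterns
S3_RECOVER_FROM_REPLACE_REGION="${PRIMARY_REGION}/${SECONDARY_REGION}"
S3_RECOVER_FROM_REPLACE_BUCKET="${PRIMARY_BUCKET}/${SECONDARY_BUCKET}"

echo "${S3_RECOVER_FROM_ENDPOINT_BASE}"
echo "${S3_RECOVER_FROM_REPLACE_REGION}"
echo "${S3_RECOVER_FROM_REPLACE_BUCKET}"
```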

Primary Setup (workspace: primary)

The primary cluster is provisioned as usual. The dr="active" variable is required.

Template primary-us-chicago-1.tfvars.example (copy to primary-us-chicago-1.tfvars locally)

terraform
workspace_name         = "primary"
dr                     = "active"
region                 = "us-chicago-1"
cluster_name           = "dr-primary"
logscale_public_fqdn   = "logscale-dr.example.com"

# Global DNS is managed from the active workspace only
manage_global_dns      = true
create_global_dns_zone = true
dns_zone_name          = "example.com"

# Security allowlists (examples) - replace with your real office/VPN IPs.
# Do NOT use 0.0.0.0/0.
bastion_client_allow_list = [
  "YOUR.PUBLIC.IP/32",
]

public_lb_cidrs = [
  "YOUR.PUBLIC.IP/32",
]

Commands:

shell
terraform workspace select primary
terraform apply -var-file=primary-us-chicago-1.tfvars

Verify:

shell
oci ce cluster get --cluster-id <cluster-ocid> --region us-chicago-1
terraform output
# Key outputs include: storage_bucket_name and storage_encryption_key_value (sensitive)
Secondary Setup (workspace: secondary)

The tfvars file for the secondary OKE cluster configures the HumioCluster environment variables S3_RECOVER_FROM_* to point to the primary bucket.

Standby Cluster Initial State:

When dr = "standby", the secondary cluster is deployed with digest/ingress/Kafka capacity, but Humio stays offline until the operator is scaled up:

  • Running Pods (initial state):

    • Kafka brokers: Running - Required for LogScale to function when scaled up

    • Cert-manager: Running - Maintains certificates automatically. For the global failover FQDN, prefer DNS-01 or a pre-issued/wildcard cert to avoid HTTP-01 timing issues during DNS flips; otherwise the ingress may briefly serve the default certificate until issuance completes.

    • Ingress controller: Running to keep load balancer target group healthy

  • Not Running:

    • Humio operator: 0 replicas until failover/promotion

    • LogScale pods: 0 replicas (operator is off; HumioCluster declares nodeCount=1)

The HumioCluster status may show "License Error" on standby clusters; this is expected. See Issue 2: HumioCluster Shows "License Error" on Standby Cluster for details.

Template secondary-us-chicago-1.tfvars.example (copy to secondary-us-chicago-1.tfvars locally)

terraform
workspace_name = "secondary"
dr             = "standby"
region         = "us-chicago-1"
cluster_name   = "dr-secondary"

# Standby does not manage global DNS objects
manage_global_dns      = false
create_global_dns_zone = false

primary_remote_state_config = {
  backend   = "oci"
  workspace = "primary"
  config = {
    bucket              = "your-terraform-state-bucket"
    namespace           = "your-namespace"
    region              = "us-chicago-1"
    key                 = "env:/logscale-oci-oke"
    auth                = "ApiKey"
    config_file_profile = "DEFAULT"
  }
}

# DR recovery (S3-compatible env vars for OCI Object Storage)
# IMPORTANT: S3_RECOVER_FROM_* refer to the PRIMARY (source) cluster
s3_recover_from_region         = "us-chicago-1"
s3_recover_from_replace_region = "us-chicago-1/us-chicago-1"
# s3_recover_from_bucket / s3_recover_from_endpoint_base / s3_recover_from_replace_bucket
# are inferred from primary remote state; override only for bootstrap/debug.

# Security allowlists (examples) - replace with your real office/VPN IPs.
# Do NOT use 0.0.0.0/0.
bastion_client_allow_list = [
  "YOUR.PUBLIC.IP/32",
]

public_lb_cidrs = [
  "YOUR.PUBLIC.IP/32",
]

Commands

shell
terraform workspace select secondary
terraform apply -var-file=secondary-us-chicago-1.tfvars

Verify

shell
# Encryption keys match (compare hashes)
kubectl get secret -n logging dr-primary-oci-storage-encryption --context oci-primary -o jsonpath='{.data.oci-storage-encryption-key}' | base64 -d | shasum -a 256
kubectl get secret -n logging dr-secondary-oci-storage-encryption --context oci-secondary -o jsonpath='{.data.oci-storage-encryption-key}' | base64 -d | shasum -a 256
# Pods minimal on secondary
kubectl get pods -n logging --context oci-secondary
HumioCluster Configuration (standby mode)

  • When dr="standby", Terraform configures the HumioCluster CR with minimal resources and DR-specific environment variables:

    • nodeCount = 1 (declared on the HumioCluster; no pods run until the operator is scaled up)

    • targetReplicationFactor = 1 (minimum viable value for a single node)

    • autoRebalancePartitions = false

Why targetReplicationFactor is 1 on standby:

The targetReplicationFactor represents the desired number of replicas for data segments when the cluster is operational.

Setting it to 1 on standby means that when the single node starts during failover, each segment needs only one replica, which lives on that node.

The Humio operator allows this standby configuration.

  • Environment variables set:

    • S3_STORAGE_ENCRYPTION_KEY (same key as primary, via Kubernetes secret)

    • S3_RECOVER_FROM_REGION (primary region)

    • S3_RECOVER_FROM_BUCKET (primary bucket name)

    • S3_RECOVER_FROM_ENDPOINT_BASE (OCI S3-compatible endpoint)

    • S3_RECOVER_FROM_ENCRYPTION_KEY (references the shared encryption secret)

    • S3_RECOVER_FROM_REPLACE_REGION (format: old_region/new_region)

    • S3_RECOVER_FROM_REPLACE_BUCKET (format: old_bucket/new_bucket)

    • ENABLE_ALERTS = "false" (disable alerts on standby)

Stage 2: Failover - Scale up Humio and read global snapshot
DR Failover Flow

The following table illustrates the automated DR failover sequence triggered by standby automation (Monitoring Alarm → ONS → Function) and enforced via OCI DNS steering policy.

Failover Sequence

| Step | Component | Action |
| --- | --- | --- |
| 1-3 | Normal Operation | DNS resolves to Primary IP; traffic flows to Primary cluster |
| 4 | Monitoring Signal | Detects primary is unhealthy (LB backend health metrics by default; OCI Health Checks when configured) |
| 5 | Monitoring Alarm | Fires after pending duration (default: 1 min) |
| 6 | ONS Topic | Receives alarm notification, invokes Function |
| 7 | OCI Function | Validates failure duration, scales humio-operator 0 → 1 |
| 8 | Humio Operator | Reconciles HumioCluster, creates LogScale pod |
| 9 | LogScale Pod | Reads global snapshot from Primary bucket |
| 10 | DNS Steering | Routes traffic to Secondary (now healthy) |

Note

The steering policy always uses FILTER → PRIORITY → LIMIT (no HEALTH rule), regardless of the use_external_health_check setting; use_external_health_check = false is the recommended configuration.

The DR failover function controls DNS routing by setting is_disabled on steering policy answers; the FILTER rule removes disabled answers. This prevents automatic failback, ensuring an operator must explicitly verify primary readiness before failing back.
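The FILTER → PRIORITY evaluation can be mimicked with a small shell sketch (the answer encoding below is illustrative, not the OCI API format):

```shell
# Each answer: name|rdata|is_disabled|priority (illustrative encoding).
# Here the primary answer has been disabled by the failover function.
ANSWERS="primary|203.0.113.10|true|1
secondary|203.0.113.20|false|2"

# FILTER: drop disabled answers; PRIORITY: take the lowest remaining priority
SELECTED=$(printf '%s\n' "${ANSWERS}" \
  | awk -F'|' '$3 == "false"' \
  | sort -t'|' -k4,4n \
  | head -n1 \
  | cut -d'|' -f2)
echo "${SELECTED}"   # 203.0.113.20 -> traffic goes to the secondary
```

With the primary answer re-enabled (is_disabled=false), the PRIORITY rule would pick it again, which is why failback requires an explicit operator action.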

Secondary Readiness Required Steps

On standby, the HumioCluster already declares nodeCount=1, but the Humio operator is scaled to 0. When the Humio operator is scaled to 1 (by the OCI Function on health check failure or manually), it reconciles the HumioCluster and starts a single LogScale pod.

  • Scale the Humio operator on secondary:

    • With OCI Function enabled (default): Health Check failure → Monitoring Alarm → ONS Topic → Function scales humio-operator replicas to 1.

    • Manually (e.g., for tests or if Function is disabled):

shell
kubectl --context oci-secondary -n logging scale deploy humio-operator --replicas=1

  • What Happens After Operator Starts:

    • The Humio operator reconciles and creates the Humio pod

    • The pod reads S3_RECOVER_FROM_* env vars (S3-compatible for OCI Object Storage)

    • It lists and downloads the latest global-snapshot.json from the primary bucket

    • It patches the snapshot to reference the secondary bucket/region using S3_RECOVER_FROM_REPLACE_* values

    • It loads the patched snapshot into memory

    • The cluster starts up with the recovered metadata state

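The patching step can be illustrated with a toy shell sketch (a simplified stand-in for the real global-snapshot.json rewrite, using the same old/new pattern format):

```shell
# Toy snapshot fragment; the real file contains full cluster metadata
SNAPSHOT='{"bucket":"dr-primary-logscale-data","region":"us-chicago-1"}'
S3_RECOVER_FROM_REPLACE_BUCKET="dr-primary-logscale-data/dr-secondary-logscale-data"

# Split the old/new pattern and rewrite bucket references
OLD_BUCKET="${S3_RECOVER_FROM_REPLACE_BUCKET%%/*}"
NEW_BUCKET="${S3_RECOVER_FROM_REPLACE_BUCKET##*/}"
PATCHED=$(printf '%s' "${SNAPSHOT}" | sed "s|${OLD_BUCKET}|${NEW_BUCKET}|g")
echo "${PATCHED}"
```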

What Data is Transferred in the Global Snapshot

The global snapshot is a JSON-based export of LogScale's internal cluster state at boot time. Understanding what transfers (and what doesn't) is critical for DR planning:

  • Transferred in the snapshot:

    • Dataspaces (repositories): All repository definitions, views, retention policies, and metadata

    • Bucket storage configurations: Provider info (S3/GCS/Azure/OCI), regions, bucket names, encryption settings, key prefixes

      • During DR recovery, these are patched with new credentials and marked as readOnly=true

    • Segment metadata: References to log data locations including bucket IDs, byte sizes, date ranges, epoch/offset information

      • Note: Only the metadata about segments is transferred, not the actual compressed log data files

    • Datasource configurations: Ingest token references, tags, parser associations

    • License information: License key and installation metadata

    • Cluster identifiers: Humio cluster ID, instance ID, Kafka epoch information

    • System configuration: Blacklisted queries, feature flags

  • Cleared during DR recovery patching (from Humio core):

    • All host entries: Completely dropped via dropAllHostsFromClusterForDisasterRecoveryBoot()

    • All partition assignments: Ingest, segment, and query coordination partitions deleted via dropAllPartitionsConfigsFromClusterForDisasterRecoveryBoot()

    • Segment ownership: ownerHosts, currentHosts, topEpoch, topOffset cleared from all segments

    • Datasource runtime state: currentSegments, ingestEpoch, ingestOffset cleared; ingestIdle set to true

    • Uploaded file host assignments: currentHosts cleared on all uploaded files via patchAllCurrentHostsForUploadedFilesForDisasterRecoveryBoot()

  • NOT in the snapshot (must be manually synced or handled separately):

    • Actual log data: The compressed log event files (segments) remain in the primary Object Storage bucket and are accessed read-only by the secondary via cross-region IAM policies

      • Segments can be GBs or TBs in size - transferring them would be impractical

      • Only the segment metadata (pointers to Object Storage objects) is in the snapshot

    • Kubernetes Secrets (stored in etcd, not Kafka):

      • License Secret (spec.license.secretKeyRef)

      • TLS/CA certificates (spec.tls.caSecretName, spec.ingress.secretName)

      • OAuth/SAML client secrets

      • SMTP/email credentials

      • Image pull secrets (spec.imagePullSecrets)

      • API token secrets for external clusters

      • Environment variable ConfigMaps/Secrets (spec.environmentVariablesSource)

    • Storage encryption keys: Synchronized via Terraform remote state, not the snapshot

      • Security best practice: encryption keys never transit through Kafka

    • Runtime state: Live Kafka consumer positions, query execution state, cache contents

    • Cloud identity configuration: ServiceAccount annotations for Workload Identity (not created by humio-operator)

    • Operator deployment: humio-operator must be deployed consistently in both clusters with matching configuration

Key insight: The global snapshot is LogScale's configuration and metadata state (~MBs), not your log data (~TBs). During DR, the secondary cluster reads the actual log events directly from the primary's Object Storage bucket using the segment metadata as a map.

Bucket storage configs are patched to point to primary storage as read-only

Kubernetes Resources Requiring Manual Sync (from humio-operator analysis):

Before executing a DR failover, ensure these resources exist in the secondary cluster:

| Resource Type | Examples | Sync Method |
| --- | --- | --- |
| License Secret | humio-license | Velero backup, External Secrets Operator, or manual copy |
| TLS Certificates | Ingress certs, CA certs | cert-manager (auto), or manual copy |
| Auth Secrets | OAuth/SAML client secrets | External Secrets Operator or manual copy |
| Image Pull Secrets | Registry credentials | Velero backup or manual copy |
| ServiceAccounts | Pod identity annotations | Terraform or manual configuration |
| RBAC | Roles, RoleBindings | Velero backup or GitOps |

Spot-check pods on secondary:

shell
kubectl --context oci-secondary -n logging get pods
# Expect humio-operator (1/1), one Humio pod once recovery starts, and Kafka components running
Verify DR Recovery Succeeded (logs and snapshot)

Log in to the secondary LogScale cluster UI, open the Humio repository, and run the following query:

logscale
DataSnapshotLoader
| #kind != threaddumps

You should see messages similar to:

text
Checking bucket storage localAndHttpWereEmpty=true
Trying to fetch a global snapshot from bucket storage s3 if one exists in bucket=dr-secondary-logscale-data
Fetching global snapshot from bucket storage s3 found no snapshot to fetch.
Trying to fetch a global snapshot as recovery source from bucket storage in s3
Trying to fetch a global snapshot from bucket storage s3 if one exists in bucket=dr-primary-logscale-data
Fetched global snapshot from bucket storage s3 found snapshot with epochOffset={epoch=0 offset=699094}
Fetched a global snapshot as recovery source from bucket storage in s3 and got snapshot with epochOffset={epoch=0 offset=699094} now patching...
Snapshots to choose from, last is better...: List(({epoch=0 offset=699094},s3)) using kafkaMinOffsetOpt of Some(KafkaMinOffset(...))
Selecting snapshot from source=s3 with epochOffset={epoch=0 offset=699094}
updateSnapshotForDisasterRecovery: Patching region using from=us-chicago-1 to=us-chicago-1 on bucketId=1
updateSnapshotForDisasterRecovery: Patching bucket using from=dr-primary-logscale-data to=dr-secondary-logscale-data on bucketId=1
updateSnapshotForDisasterRecovery: Patching access configs from RECOVER_FROM on bucketId=1
updateSnapshotForDisasterRecovery: setting readOnly=true on bucketId=1 keyPrefix= new value for bucket=dr-secondary-logscale-data

Note

The storage type shows s3 because OCI Object Storage uses the S3-compatible API. The logs show:

  • First checks the secondary bucket (dr-secondary-logscale-data) - finds no snapshot

  • Then fetches from the primary bucket (dr-primary-logscale-data) as the recovery source

  • Patches the snapshot to use the secondary bucket for new writes

  • Sets the primary bucket reference to readOnly=true

  • Ready to promote when:

    • Operator is 1/1 on secondary

    • Kafka components exist on secondary

    • DataSnapshotLoader logs match the expected sequence above

    • Snapshot file shows patched region/bucket pointing to secondary; encryption keys match

Stage 3: Promote Secondary to Active

Once the LogScale pod is running and has successfully read the global snapshot from the primary bucket, the cluster can be promoted to active status.

Zero-Downtime Promotion (Two-Phase Apply)

For zero-downtime DR promotion, use the two-phase terraform apply approach with the dr_use_dedicated_routing variable. This ensures traffic continues to flow to the existing digest pod while UI/Ingest pods scale up.

Understanding dr_use_dedicated_routing

In plain terms, this variable controls how Kubernetes services find LogScale pods:

  • dr_use_dedicated_routing = true (default): Services look for specific pod types. The UI service only routes to UI pods, and the ingest service only routes to ingest pods. This is optimal for production because each pod type is purpose-built for its workload.

  • dr_use_dedicated_routing = false: Services look for any LogScale pod, regardless of type. The UI service will route to digest pods, UI pods, or ingest pods - whichever are available.

Why this matters during promotion

In standby mode, the DR cluster runs a single "digest" pod that can handle all request types (UI, ingest, queries). When you promote to active, the cluster needs to scale up specialized UI and ingest pods, which takes 1-2 minutes.

If you promote with dr_use_dedicated_routing = true (the default), the services immediately start looking for UI pods that don't exist yet. Result: 503 errors until the new pods are ready.

If you promote with dr_use_dedicated_routing = false first, the services continue routing to the existing digest pod while the new pods scale up.

Result: zero downtime.

Once all pods are ready, you apply again with dr_use_dedicated_routing = true to enable optimal routing.
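The selector difference can be illustrated in shell (the pod label sets below are simplified stand-ins for the real Kubernetes labels):

```shell
# Pods as "name labels" (simplified); only a digest and a UI pod shown
PODS="digest-0 app.kubernetes.io/name=humio,humio.com/node-pool=prefix-digest
ui-0 app.kubernetes.io/name=humio,humio.com/node-pool=prefix-ui"

# Phase 1: generic selector matches every LogScale pod (digest keeps serving)
GENERIC=$(printf '%s\n' "${PODS}" | grep -c 'app.kubernetes.io/name=humio')
# Phase 2: pool-specific selector matches only the UI pods
POOL=$(printf '%s\n' "${PODS}" | grep -c 'humio.com/node-pool=prefix-ui')
echo "generic=${GENERIC} pool=${POOL}"   # generic=2 pool=1
```

If the pool-specific selector were applied before any UI pod existed, the pool count would be 0, which is exactly the zero-endpoint 503 scenario the two-phase apply avoids.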

dr_use_dedicated_routing Behavior Matrix:

| dr | dr_use_dedicated_routing | Selector Used | Use Case |
| --- | --- | --- | --- |
| "" (non-DR) | (ignored) | Pool-specific | Normal production routing |
| "active" | false | Generic (app.kubernetes.io/name=humio) | Phase 1 of promotion - zero downtime |
| "active" | true | Pool-specific | Phase 2 of promotion - optimal routing |
| "standby" | false | Generic (app.kubernetes.io/name=humio) | Standby waiting for failover |
| "standby" | true | Pool-specific | Standby with dedicated routing (rare) |

Why two phases are needed:

When promoting from dr="standby" to dr="active", the HumioCluster's node pool configuration changes from digest-only (1 pod) to the full production topology (digest + UI + ingest pods). Without the two-phase approach:

  • Service selectors immediately change to look for UI pods (humio.com/node-pool=<prefix>-ui)

  • UI pods don't exist yet (they take time to scale up)

  • Services have zero endpoints → 503 errors

With the two-phase approach:

  • Phase 1: Selectors use app.kubernetes.io/name=humio to match ALL LogScale pods (including existing digest pod)

  • Traffic continues to existing digest pod during UI/Ingest scale-up

  • Phase 2: After UI/Ingest pods are ready, selectors switch to pool-specific routing

Phase 1: Promote with Generic Selectors (Zero-Downtime)

shell
# Edit tfvars for Phase 1
vi secondary-us-chicago-1.tfvars
dr = "active"
dr_use_dedicated_routing = false # Generic selector - matches ALL pods
# Apply Phase 1
terraform workspace select secondary
terraform apply -var-file=secondary-us-chicago-1.tfvars
# Verify UI and Ingest pods are coming up
kubectl --context oci-secondary -n logging get pods -l humio.com/node-pool
# Wait until UI and Ingest pods show Running and Ready

Phase 2: Enable Dedicated Routing (After Pods Ready)

shell
# Edit tfvars for Phase 2 - Choose one of two options:
# Option A: Stay in DR mode with optimal routing
vi secondary-us-chicago-1.tfvars
dr = "active"
dr_use_dedicated_routing = true # Pool-specific selectors - optimal routing
# Option B: Exit DR mode entirely (also enables optimal routing)
vi secondary-us-chicago-1.tfvars
dr = "" # Non-DR mode - also uses pool-specific routing automatically
# dr_use_dedicated_routing is ignored when dr="" (always uses pool-specific)
# Apply Phase 2
terraform workspace select secondary
terraform apply -var-file=secondary-us-chicago-1.tfvars
# Verify services have correct endpoints
kubectl --context oci-secondary -n logging get endpoints
Standard Promotion (Single Apply)

If downtime during promotion is acceptable, you can use a single apply:

Actions

terraform
# Edit tfvars, switch to active
vi secondary-us-chicago-1.tfvars
dr = "active" # or dr = "" for non-DR mode (both work for promotion)
# dr_use_dedicated_routing defaults to true (pool-specific routing)
# Apply in secondary workspace
terraform workspace select secondary
terraform apply -var-file=secondary-us-chicago-1.tfvars

Note

Setting dr = "active" or dr = "" (empty string) both promote the cluster to active status. However, the choice determines whether the cluster remains part of the DR strategy:

  • dr = "active": The cluster remains part of the DR strategy with full DR infrastructure (global DNS, health checks, health monitoring). Use this if you want the promoted cluster to serve as the new primary in the DR pair.

  • dr = "": The cluster operates standalone without DR infrastructure. Use this if you want to remove the cluster from the DR strategy entirely after promotion.

Promoting the standby to dr="active" does not automatically move ownership of the OCI global DNS resources. In this repo, keep manage_global_dns=true only in a single workspace to avoid two states managing the same steering policy/zone. During an incident, the standby function updates the steering policy directly (Terraform will not fight those emergency updates).

  • What changes automatically:

    • Scales node groups to production sizes

    • Sets production replication factor and enables auto-rebalance

    • Enables alerts by setting ENABLE_ALERTS=true

    • Humio operator scales to 1 and HumioCluster nodeCount follows production values

S3_RECOVER_FROM_* Environment Variable Preservation

Important

The S3_RECOVER_FROM_* environment variables are intentionally kept when promoting from dr="standby" to dr="active". This is a deliberate design choice to prevent pod recreation during DR promotion.

Why env vars are preserved:

The humio-operator calculates a hash of the pod spec (including environment variables) to determine if pods need to be recreated. If env vars were removed during promotion:

  • The pod spec hash would change

  • The operator would delete and recreate all pods

  • Ephemeral PVCs would be deleted (data loss)

  • Recovered snapshot data would be lost

Why this is safe:

The S3_RECOVER_FROM_* env vars are only read at startup by DataSnapshotLoader.scala:

  • After successful recovery, the local snapshot has a valid Kafka epoch

  • Subsequent pod restarts use the local snapshot, NOT the recovery bucket

  • The primary bucket is marked readOnly=true in cluster state

  • Env vars are harmlessly ignored after initial recovery
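The hash-based recreation logic can be sketched in shell (the pod specs below are toy strings; the operator's real hash covers the full pod spec, and sha256sum from GNU coreutils is assumed):

```shell
# Toy pod specs: identical except for the recovery env var
SPEC_WITH_ENV='image=humio env=S3_RECOVER_FROM_BUCKET'
SPEC_WITHOUT_ENV='image=humio'

HASH_WITH=$(printf '%s' "${SPEC_WITH_ENV}" | sha256sum | cut -d' ' -f1)
HASH_WITHOUT=$(printf '%s' "${SPEC_WITHOUT_ENV}" | sha256sum | cut -d' ' -f1)

# Any spec change flips the hash, so dropping the env vars at promotion
# time would make the operator delete and recreate every pod
if [ "${HASH_WITH}" != "${HASH_WITHOUT}" ]; then
  echo "pod spec hash changed: operator would recreate pods"
fi
```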

Behavior matrix:

| Scenario | Behavior | Safe? |
| --- | --- | --- |
| Normal operation | Env vars ignored (local snapshot used) | Yes |
| Pod restart (same PVC) | Uses local snapshot, skips recovery bucket | Yes |
| New pod (fresh PVC) | Would re-fetch from recovery bucket | Only if cluster wiped |

Note

If you need to remove these env vars later (e.g., after the original primary is decommissioned), do so in a maintenance window when brief pod recreation is acceptable.

Resources destroyed during promotion:

When promoting from dr="standby" to dr="active", Terraform destroys the entire module.dr-failover-function because automated failover is no longer needed on an active cluster. The key resources removed are:

  • OCI Function and Application - The serverless function that scales the Humio operator during failover

  • OCI Monitoring Alarm - The alarm that detects primary cluster health failures

  • ONS Topic and Subscription - The notification chain connecting the alarm to the function

  • IAM Policies and Dynamic Group - Permissions allowing the function to access OKE and scale the operator

  • OCIR Repository and Auth Token - Container registry resources for the function image

  • NSG Rule - Network rule allowing function-to-OKE API communication

  • Kubernetes RBAC - ClusterRole and ClusterRoleBinding for operator scaling permissions

  • Function Logging - Log group and logs for function invocation auditing

In total, approximately 20 OCI and Kubernetes resources are removed. The DR-related Terraform outputs (dr_failover_alarm_id, dr_failover_function_*, dr_failover_topic_id) are also no longer available after promotion.

Verify promotion:

shell
kubectl get humiocluster -n logging --context oci-secondary -o jsonpath='{.spec.environmentVariables}' | jq '.[] | select(.name | startswith("S3_RECOVER"))'
# => S3_RECOVER_FROM_* vars still listed (intentionally preserved during promotion)
kubectl get humiocluster -n logging --context oci-secondary -o jsonpath='{.spec.nodeCount}'
# => production value
kubectl get pods -n logging --context oci-secondary
# => all pods running
OCI DNS Steering Policy Flow:

When you use the DR global DNS pattern (${global_logscale_hostname}.${dns_zone_name}) with OCI DNS Steering Policy failover records, ingestion and UI clients point at a single global FQDN.

In normal operation this record resolves to the primary load balancer IP and the secondary HumioCluster declares nodeCount=1 but runs no Humio pods because the operator is scaled to 0.

If the primary health check fails and OCI DNS Steering Policy updates the global DNS to return the secondary IP, the OCI Function failover scaler scales the Humio operator from 0 → 1 so the secondary can start the single digest pod and serve traffic. There is no automatic scale-down of the operator or Humio; scale back manually or by re-applying Terraform with dr="standby" after failback.

In this mode, failover/failback tests for ingestion use the same global FQDN and do not require manual DNS record changes; the DNS steering policy and OCI Function together handle the traffic switch.

To verify which cluster is currently serving traffic:

shell
GLOBAL_DR_FQDN="logscale-dr.oci-dr.humio.net" # Your global DR FQDN
dig +short "${GLOBAL_DR_FQDN}"
curl -I "https://${GLOBAL_DR_FQDN}"