DR Deployment
This section covers the complete DR deployment process, from prerequisites through the three stages of DR configuration, failover, and promotion.
Pre-Deployment Verification
Prerequisites (once)
Terraform >= 1.5.7, kubectl 1.27+, OCI CLI configured.
Terraform backend: OCI Object Storage state bucket.
Access to both OKE clusters.
Quick checks:
terraform init works; terraform workspace list shows primary and secondary.
OCI identity / backend reachability:
oci iam user get --user-id $OCI_USER_OCID
oci os ns get
terraform version

Storage + remote state:
Bucket names are deterministic (<cluster_name>-logscale-data) and exported via Terraform outputs; you generally do not set bucket names manually.
Standby applies require access to primary outputs via primary_remote_state_config (for the encryption key and primary bucket details).
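The deterministic naming can be sketched as a tiny helper (illustrative only; the authoritative names come from Terraform outputs):

```shell
# Bucket names follow the <cluster_name>-logscale-data convention.
bucket_name() {
  echo "$1-logscale-data"
}

bucket_name "dr-primary"     # dr-primary-logscale-data
bucket_name "dr-secondary"   # dr-secondary-logscale-data
```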
Stage 1: DR Configuration Setup
Strimzi generates the Kafka TLS truststore secret
(${name_prefix}-strimzi-kafka-cluster-ca-cert) only after Kafka is up.
Humio pods mount this secret and use its ca.password for
KAFKA_COMMON_SSL_TRUSTSTORE_PASSWORD and
/tmp/kafka/ca.p12 for
KAFKA_COMMON_SSL_TRUSTSTORE_LOCATION.
If Humio starts before the secret exists, the pod fails to mount the volume and crashloops.
Deploy Kafka/Strimzi first, then let Humio start.
DR Recovery Environment Variables
OCI LogScale deployments use AWS S3-compatible environment variables for DR recovery (OCI Object Storage provides an S3-compatible API). These variables are set automatically by Terraform when dr = "standby".
Environment Variable Reference
| Env Var | Purpose | Format | Example |
|---|---|---|---|
| S3_RECOVER_FROM_BUCKET | Source bucket name where LogScale fetches global-snapshot.json during DR boot | bucket-name | dr-primary-logscale-data |
| S3_RECOVER_FROM_REGION | Region of the source bucket; used to construct the S3 API endpoint | region-name | us-chicago-1 |
| S3_RECOVER_FROM_ENDPOINT_BASE | S3-compatible API base URL; required for OCI since it uses non-AWS endpoints | https://<endpoint> | https://axrgs2jgwnhx.compat.objectstorage.us-chicago-1.oraclecloud.com |
| S3_RECOVER_FROM_REPLACE_REGION | Substitution pattern to rewrite region references in recovered snapshot metadata | old/new | us-chicago-1/us-chicago-1 |
| S3_RECOVER_FROM_REPLACE_BUCKET | Substitution pattern to redirect new segment writes to the secondary bucket | old/new | dr-primary-logscale-data/dr-secondary-logscale-data |
| S3_RECOVER_FROM_ENCRYPTION_KEY | Secret reference for the decryption key; must match the primary's key to read encrypted data | secretKeyRef | See below |
Format Requirements:
S3_RECOVER_FROM_ENDPOINT_BASE: Required for OCI Object Storage. Format is https://<namespace>.compat.objectstorage.<region>.oraclecloud.com. Without it, LogScale defaults to the AWS S3 endpoint format, which fails against OCI. Terraform constructs this automatically from the primary cluster's namespace and region via remote state.
S3_RECOVER_FROM_REPLACE_REGION: Format is old_region/new_region. When both clusters are in the same region, use us-chicago-1/us-chicago-1.
S3_RECOVER_FROM_REPLACE_BUCKET: Format is old_bucket/new_bucket. LogScale uses this to rewrite bucket references when loading snapshots from the primary cluster.
How Terraform Sets These Values:
s3_recover_from_bucket: Fetched from primary remote state (storage_bucket_name output) or set explicitly in tfvars
s3_recover_from_region: Fetched from primary remote state or set explicitly in tfvars
s3_recover_from_endpoint_base: Dynamically constructed from the primary's namespace and region (https://<namespace>.compat.objectstorage.<region>.oraclecloud.com), or set explicitly in tfvars
s3_recover_from_replace_region: Dynamically generated as primary_region/secondary_region from remote state, or set explicitly
s3_recover_from_replace_bucket: Dynamically generated as primary_bucket/secondary_bucket using remote state values
Encryption key: Fetched from primary remote state and stored in a Kubernetes secret, then referenced via secretKeyRef
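The derivation above amounts to simple string assembly. A sketch, using the example values from this section rather than live remote-state outputs:

```shell
# Inputs that Terraform normally reads from primary remote state
# (values here are the examples used in this section).
primary_namespace="axrgs2jgwnhx"
primary_region="us-chicago-1"
secondary_region="us-chicago-1"
primary_bucket="dr-primary-logscale-data"
secondary_bucket="dr-secondary-logscale-data"

# S3-compatible endpoint for the primary's Object Storage namespace
endpoint_base="https://${primary_namespace}.compat.objectstorage.${primary_region}.oraclecloud.com"
# old/new substitution patterns applied to the recovered snapshot
replace_region="${primary_region}/${secondary_region}"
replace_bucket="${primary_bucket}/${secondary_bucket}"

echo "$endpoint_base"
echo "$replace_region"
echo "$replace_bucket"
```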
Primary Setup (workspace: primary)
The primary cluster is provisioned as usual; setting dr = "active" is required.
Template primary-us-chicago-1.tfvars.example (copy to primary-us-chicago-1.tfvars locally)
workspace_name = "primary"
dr = "active"
region = "us-chicago-1"
cluster_name = "dr-primary"
logscale_public_fqdn = "logscale-dr.example.com"
# Global DNS is managed from the active workspace only
manage_global_dns = true
create_global_dns_zone = true
dns_zone_name = "example.com"
# Security allowlists (examples) - replace with your real office/VPN IPs.
# Do NOT use 0.0.0.0/0.
bastion_client_allow_list = [
"YOUR.PUBLIC.IP/32",
]
public_lb_cidrs = [
"YOUR.PUBLIC.IP/32",
]

Commands:
terraform workspace select primary
terraform apply -var-file=primary-us-chicago-1.tfvars

Verify:
oci ce cluster get --cluster-id <cluster-ocid> --region us-chicago-1
terraform output
# Key outputs include: storage_bucket_name and storage_encryption_key_value (sensitive)

The tfvars file for the secondary OKE cluster configures the HumioCluster environment variables S3_RECOVER_FROM_* to point to the primary bucket.
Standby Cluster Initial State:
When dr = "standby", the secondary cluster is deployed with
digest/ingress/Kafka capacity, but Humio stays offline until the operator is scaled up:
Running Pods (initial state):
Kafka brokers: Running - Required for LogScale to function when scaled up
Cert-manager: Running - Maintains certificates automatically. For the global failover FQDN, prefer DNS-01 or a pre-issued/wildcard cert to avoid HTTP-01 timing issues during DNS flips; otherwise the ingress may briefly serve the default certificate until issuance completes.
Ingress controller: Running to keep load balancer target group healthy
Not Running:
Humio operator: 0 replicas until failover/promotion
LogScale pods: 0 replicas (operator is off; HumioCluster declares nodeCount=1)
HumioCluster "License Error": see Issue 2: HumioCluster Shows "License Error" on Standby Cluster for why this status is expected on standby clusters.
Template secondary-us-chicago-1.tfvars.example (copy to secondary-us-chicago-1.tfvars locally)
workspace_name = "secondary"
dr = "standby"
region = "us-chicago-1"
cluster_name = "dr-secondary"
# Standby does not manage global DNS objects
manage_global_dns = false
create_global_dns_zone = false
primary_remote_state_config = {
backend = "oci"
workspace = "primary"
config = {
bucket = "your-terraform-state-bucket"
namespace = "your-namespace"
region = "us-chicago-1"
key = "env:/logscale-oci-oke"
auth = "ApiKey"
config_file_profile = "DEFAULT"
}
}
# DR recovery (S3-compatible env vars for OCI Object Storage)
# IMPORTANT: S3_RECOVER_FROM_* refer to the PRIMARY (source) cluster
s3_recover_from_region = "us-chicago-1"
s3_recover_from_replace_region = "us-chicago-1/us-chicago-1"
# s3_recover_from_bucket / s3_recover_from_endpoint_base / s3_recover_from_replace_bucket
# are inferred from primary remote state; override only for bootstrap/debug.
# Security allowlists (examples) - replace with your real office/VPN IPs.
# Do NOT use 0.0.0.0/0.
bastion_client_allow_list = [
"YOUR.PUBLIC.IP/32",
]
public_lb_cidrs = [
"YOUR.PUBLIC.IP/32",
]

Commands:
terraform workspace select secondary
terraform apply -var-file=secondary-us-chicago-1.tfvars

Verify:
# Encryption keys match (compare hashes)
kubectl get secret -n logging dr-primary-oci-storage-encryption --context oci-primary -o jsonpath='{.data.oci-storage-encryption-key}' | base64 -d | shasum -a 256
kubectl get secret -n logging dr-secondary-oci-storage-encryption --context oci-secondary -o jsonpath='{.data.oci-storage-encryption-key}' | base64 -d | shasum -a 256
# Pods minimal on secondary
kubectl get pods -n logging --context oci-secondary
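A small helper to make the key comparison explicit (a sketch; feed it the two shasum values produced by the kubectl commands above):

```shell
# Compare the primary and secondary encryption key hashes.
compare_keys() {
  if [ "$1" = "$2" ]; then
    echo "MATCH"
  else
    echo "MISMATCH: the standby cannot decrypt the primary's data" >&2
    return 1
  fi
}

# Placeholder hashes; substitute the real shasum -a 256 output.
compare_keys "abc123" "abc123"   # MATCH
```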
When dr = "standby", Terraform configures the HumioCluster CR with minimal resources and DR-specific environment variables:
nodeCount = 1 (declared on the HumioCluster; no pods run until the operator is scaled up)
targetReplicationFactor = 1 (minimum viable value for a single node)
autoRebalancePartitions = false
Why targetReplicationFactor is 1 on standby:
The targetReplicationFactor represents the desired number of replicas
for data segments when the cluster is operational.
Setting it to 1 on standby means that when the single node starts during failover, each data segment needs only one replica (on that node).
The Humio operator allows this standby configuration.
Environment variables set:
S3_STORAGE_ENCRYPTION_KEY (same key as primary, via Kubernetes secret)
S3_RECOVER_FROM_REGION (primary region)
S3_RECOVER_FROM_BUCKET (primary bucket name)
S3_RECOVER_FROM_ENDPOINT_BASE (OCI S3-compatible endpoint)
S3_RECOVER_FROM_ENCRYPTION_KEY (references the shared encryption secret)
S3_RECOVER_FROM_REPLACE_REGION (format: old_region/new_region)
S3_RECOVER_FROM_REPLACE_BUCKET (format: old_bucket/new_bucket)
ENABLE_ALERTS = "false" (disable alerts on standby)
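On the HumioCluster CR, these land in spec.environmentVariables (standard Kubernetes EnvVar entries). A sketch of the standby shape, not generated output; the secret name and key match the verification commands earlier in this section:

```yaml
# Illustrative fragment of the standby HumioCluster spec
environmentVariables:
  - name: S3_RECOVER_FROM_BUCKET
    value: dr-primary-logscale-data
  - name: S3_RECOVER_FROM_REPLACE_BUCKET
    value: dr-primary-logscale-data/dr-secondary-logscale-data
  - name: S3_RECOVER_FROM_ENCRYPTION_KEY
    valueFrom:
      secretKeyRef:
        name: dr-secondary-oci-storage-encryption
        key: oci-storage-encryption-key
  - name: ENABLE_ALERTS
    value: "false"
```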
Stage 2: Failover - Scale up Humio and read global snapshot
The following table illustrates the automated DR failover sequence triggered by standby automation (Monitoring Alarm → ONS → Function) and enforced via the OCI DNS steering policy.
Failover Sequence
| Step | Component | Action |
|---|---|---|
| 1-3 | Normal Operation | DNS resolves to Primary IP, traffic flows to Primary cluster |
| 4 | Monitoring Signal | Detects primary is unhealthy (LB backend health metrics by default; OCI Health Checks when configured) |
| 5 | Monitoring Alarm | Fires after pending duration (default: 1 min) |
| 6 | ONS Topic | Receives alarm notification, invokes Function |
| 7 | OCI Function | Validates failure duration, scales humio-operator 0 → 1 |
| 8 | Humio Operator | Reconciles HumioCluster, creates LogScale pod |
| 9 | LogScale Pod | Reads global snapshot from Primary bucket |
| 10 | DNS Steering | Routes traffic to Secondary (now healthy) |
Note
The steering policy always uses FILTER → PRIORITY → LIMIT (no HEALTH rule), regardless of the use_external_health_check setting (use_external_health_check = false is the recommended default).
The DR failover function controls DNS routing by setting is_disabled on steering policy answers; the FILTER rule removes disabled answers. This prevents automatic failback, ensuring an operator must explicitly verify primary readiness before failing back.
On standby, the HumioCluster already declares nodeCount=1, but the
Humio operator is scaled to 0. When the Humio operator is scaled to 1 (by the OCI Function on
health check failure or manually), it reconciles the HumioCluster and starts a single LogScale
pod.
Scale the Humio operator on secondary:
With OCI Function enabled (default): Health Check failure → Monitoring Alarm → ONS Topic → Function scales humio-operator replicas to 1.
Manually (e.g., for tests or if Function is disabled):
kubectl --context oci-secondary -n logging scale deploy humio-operator --replicas=1
What Happens After Operator Starts:
The Humio operator reconciles and creates the Humio pod
The pod reads the S3_RECOVER_FROM_* env vars (S3-compatible for OCI Object Storage)
It lists and downloads the latest global-snapshot.json from the primary bucket
It patches the snapshot to reference the secondary bucket/region using the S3_RECOVER_FROM_REPLACE_* values
It loads the patched snapshot into memory
The cluster starts up with the recovered metadata state
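The patching step can be illustrated with a toy rewrite. The real snapshot is structured JSON patched by DataSnapshotLoader; this only demonstrates the old/new substitution semantics of the REPLACE_* patterns:

```shell
# Toy snapshot content; the real snapshot is structured JSON.
snapshot='bucket=dr-primary-logscale-data region=us-chicago-1'

# Split the old/new pattern on the slash.
replace_bucket='dr-primary-logscale-data/dr-secondary-logscale-data'
old_bucket="${replace_bucket%/*}"   # part before the slash
new_bucket="${replace_bucket#*/}"   # part after the slash

# Apply the substitution, as LogScale does when patching the snapshot.
patched=$(printf '%s' "$snapshot" | sed "s|$old_bucket|$new_bucket|g")
echo "$patched"   # bucket=dr-secondary-logscale-data region=us-chicago-1
```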
What Data is Transferred in the Global Snapshot
The global snapshot is a JSON-based export of LogScale's internal cluster state at boot time. Understanding what transfers (and what doesn't) is critical for DR planning:
Transferred in the snapshot:
Dataspaces (repositories): All repository definitions, views, retention policies, and metadata
Bucket storage configurations: Provider info (S3/GCS/Azure/OCI), regions, bucket names, encryption settings, key prefixes
During DR recovery, these are patched with new credentials and marked as
readOnly=true
Segment metadata: References to log data locations including bucket IDs, byte sizes, date ranges, epoch/offset information
Note: Only the metadata about segments is transferred, not the actual compressed log data files
Datasource configurations: Ingest token references, tags, parser associations
License information: License key and installation metadata
Cluster identifiers: Humio cluster ID, instance ID, Kafka epoch information
System configuration: Blacklisted queries, feature flags
Cleared during DR recovery patching (from Humio core):
All host entries: Completely dropped via dropAllHostsFromClusterForDisasterRecoveryBoot()
All partition assignments: Ingest, segment, and query coordination partitions deleted via dropAllPartitionsConfigsFromClusterForDisasterRecoveryBoot()
Segment ownership: ownerHosts, currentHosts, topEpoch, topOffset cleared from all segments
Datasource runtime state: currentSegments, ingestEpoch, ingestOffset cleared; ingestIdle set to true
Uploaded file host assignments: currentHosts cleared on all uploaded files via patchAllCurrentHostsForUploadedFilesForDisasterRecoveryBoot()
NOT in the snapshot (must be manually synced or handled separately):
Actual log data: The compressed log event files (segments) remain in the primary Object Storage bucket and are accessed read-only by the secondary via cross-region IAM policies
Segments can be GBs or TBs in size - transferring them would be impractical
Only the segment metadata (pointers to Object Storage objects) is in the snapshot
Kubernetes Secrets (stored in etcd, not Kafka):
License Secret (spec.license.secretKeyRef)
TLS/CA certificates (spec.tls.caSecretName, spec.ingress.secretName)
OAuth/SAML client secrets
SMTP/email credentials
Image pull secrets (spec.imagePullSecrets)
API token secrets for external clusters
Environment variable ConfigMaps/Secrets (spec.environmentVariablesSource)
Storage encryption keys: Synchronized via Terraform remote state, not the snapshot
Security best practice: encryption keys never transit through Kafka
Runtime state: Live Kafka consumer positions, query execution state, cache contents
Cloud identity configuration: ServiceAccount annotations for Workload Identity (not created by humio-operator)
Operator deployment: humio-operator must be deployed consistently in both clusters with matching configuration
Key insight: The global snapshot is LogScale's configuration and metadata state (~MBs), not your log data (~TBs). During DR, the secondary cluster reads the actual log events directly from the primary's Object Storage bucket using the segment metadata as a map.
Bucket storage configs are patched to point to primary storage as read-only
Kubernetes Resources Requiring Manual Sync (from humio-operator analysis):
Before executing a DR failover, ensure these resources exist in the secondary cluster:
| Resource Type | Examples | Sync Method |
|---|---|---|
| License Secret | humio-license | Velero backup, External Secrets Operator, or manual copy |
| TLS Certificates | Ingress certs, CA certs | cert-manager (auto), or manual copy |
| Auth Secrets | OAuth/SAML client secrets | External Secrets Operator or manual copy |
| Image Pull Secrets | Registry credentials | Velero backup or manual copy |
| ServiceAccounts | Pod identity annotations | Terraform or manual configuration |
| RBAC | Roles, RoleBindings | Velero backup or GitOps |
Spot-check pods on secondary:
kubectl --context oci-secondary -n logging get pods
# Expect humio-operator (1/1), one Humio pod once recovery starts, and Kafka components running

Log in to the secondary LogScale cluster UI, open the Humio repository, and run the following query:
DataSnapshotLoader
| #kind != threaddumps

You should see messages similar to:
Checking bucket storage localAndHttpWereEmpty=true
Trying to fetch a global snapshot from bucket storage s3 if one exists in bucket=dr-secondary-logscale-data
Fetching global snapshot from bucket storage s3 found no snapshot to fetch.
Trying to fetch a global snapshot as recovery source from bucket storage in s3
Trying to fetch a global snapshot from bucket storage s3 if one exists in bucket=dr-primary-logscale-data
Fetched global snapshot from bucket storage s3 found snapshot with epochOffset={epoch=0 offset=699094}
Fetched a global snapshot as recovery source from bucket storage in s3 and got snapshot with epochOffset={epoch=0 offset=699094} now patching...
Snapshots to choose from, last is better...: List(({epoch=0 offset=699094},s3)) using kafkaMinOffsetOpt of Some(KafkaMinOffset(...))
Selecting snapshot from source=s3 with epochOffset={epoch=0 offset=699094}
updateSnapshotForDisasterRecovery: Patching region using from=us-chicago-1 to=us-chicago-1 on bucketId=1
updateSnapshotForDisasterRecovery: Patching bucket using from=dr-primary-logscale-data to=dr-secondary-logscale-data on bucketId=1
updateSnapshotForDisasterRecovery: Patching access configs from RECOVER_FROM on bucketId=1
updateSnapshotForDisasterRecovery: setting readOnly=true on bucketId=1 keyPrefix= new value for bucket=dr-secondary-logscale-data

Note
The storage type shows s3 because OCI Object Storage uses the S3-compatible API. The logs show:
First it checks the secondary bucket (dr-secondary-logscale-data) and finds no snapshot
Then it fetches from the primary bucket (dr-primary-logscale-data) as the recovery source
It patches the snapshot to use the secondary bucket for new writes
It sets the primary bucket reference to readOnly=true
Ready to promote when:
Operator is 1/1 on secondary
Kafka components exist on secondary
DataSnapshotLoader logs match the expected sequence above
Snapshot file shows patched region/bucket pointing to secondary; encryption keys match
Stage 3: Promote Secondary to Active
Once the LogScale pod is running and has successfully read the global snapshot from the primary bucket, the cluster can be promoted to active status.
Zero-Downtime Promotion (Two-Phase Apply)
For zero-downtime DR promotion, use the two-phase terraform apply approach with the
dr_use_dedicated_routing variable. This ensures traffic continues to flow to
the existing digest pod while UI/Ingest pods scale up.
Understanding dr_use_dedicated_routing
In plain terms, this variable controls how Kubernetes services find LogScale pods
dr_use_dedicated_routing = true (default): Services look for specific pod types. The UI service only routes to UI pods, and the ingest service only routes to ingest pods. This is optimal for production because each pod type is purpose-built for its workload.
dr_use_dedicated_routing = false: Services look for any LogScale pod, regardless of type. The UI service will route to digest pods, UI pods, or ingest pods, whichever are available.
Why this matters during promotion
In standby mode, the DR cluster runs a single "digest" pod that can handle all request types (UI, ingest, queries). When you promote to active, the cluster needs to scale up specialized UI and ingest pods, which takes 1-2 minutes.
If you promote with dr_use_dedicated_routing = true (the default), the
services immediately start looking for UI pods that don't exist yet. Result: 503 errors until
the new pods are ready.
If you promote with dr_use_dedicated_routing = false first, the services
continue routing to the existing digest pod while the new pods scale up.
Result: zero downtime.
Once all pods are ready, you apply again with dr_use_dedicated_routing =
true to enable optimal routing.
dr_use_dedicated_routing Behavior Matrix:
| dr | dr_use_dedicated_routing | Selector Used | Use Case |
|---|---|---|---|
| "" (non-DR) | (ignored) | Pool-specific | Normal production routing |
| "active" | false | Generic (app.kubernetes.io/name=humio) | Phase 1 of promotion - zero downtime |
| "active" | true | Pool-specific | Phase 2 of promotion - optimal routing |
| "standby" | false | Generic (app.kubernetes.io/name=humio) | Standby waiting for failover |
| "standby" | true | Pool-specific | Standby with dedicated routing (rare) |
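The matrix above can be sketched as a small selector helper (labels are illustrative; <prefix> stands for the real node-pool prefix):

```shell
# Given dr mode and the dedicated-routing flag, return the service selector.
selector_for() {
  local dr="$1" dedicated="$2"
  # Non-DR mode always uses pool-specific routing; otherwise the flag decides.
  if [ -z "$dr" ] || [ "$dedicated" = "true" ]; then
    echo "humio.com/node-pool=<prefix>-ui"   # pool-specific (illustrative)
  else
    echo "app.kubernetes.io/name=humio"      # generic: matches all LogScale pods
  fi
}

selector_for "active" "false"   # app.kubernetes.io/name=humio (Phase 1)
selector_for "active" "true"    # humio.com/node-pool=<prefix>-ui (Phase 2)
selector_for "" "false"         # pool-specific (flag ignored when non-DR)
```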
Why two phases are needed:
When promoting from dr="standby" to dr="active",
the HumioCluster's node pool configuration changes from digest-only (1 pod) to the full
production topology (digest + UI + ingest pods). Without the two-phase approach:
Service selectors immediately change to look for UI pods (humio.com/node-pool=<prefix>-ui)
UI pods don't exist yet (it takes time to scale them up)
Services have zero endpoints → 503 errors
With the two-phase approach:
Phase 1: Selectors use app.kubernetes.io/name=humio to match ALL LogScale pods (including the existing digest pod)
Traffic continues to the existing digest pod during UI/Ingest scale-up
Phase 2: After UI/Ingest pods are ready, selectors switch to pool-specific routing
Phase 1: Promote with Generic Selectors (Zero-Downtime)
# Edit tfvars for Phase 1
vi secondary-us-chicago-1.tfvars
dr = "active"
dr_use_dedicated_routing = false # Generic selector - matches ALL pods
# Apply Phase 1
terraform workspace select secondary
terraform apply -var-file=secondary-us-chicago-1.tfvars
# Verify UI and Ingest pods are coming up
kubectl --context oci-secondary -n logging get pods -l humio.com/node-pool
# Wait until UI and Ingest pods show Running and Ready

Phase 2: Enable Dedicated Routing (After Pods Ready)
# Edit tfvars for Phase 2 - Choose one of two options:
# Option A: Stay in DR mode with optimal routing
vi secondary-us-chicago-1.tfvars
dr = "active"
dr_use_dedicated_routing = true # Pool-specific selectors - optimal routing
# Option B: Exit DR mode entirely (also enables optimal routing)
vi secondary-us-chicago-1.tfvars
dr = "" # Non-DR mode - also uses pool-specific routing automatically
# dr_use_dedicated_routing is ignored when dr="" (always uses pool-specific)
# Apply Phase 2
terraform workspace select secondary
terraform apply -var-file=secondary-us-chicago-1.tfvars
# Verify services have correct endpoints
kubectl --context oci-secondary -n logging get endpoints

If downtime during promotion is acceptable, you can use a single apply:
Actions
# Edit tfvars, switch to active
vi secondary-us-chicago-1.tfvars
dr = "active" # or dr = "" for non-DR mode (both work for promotion)
# dr_use_dedicated_routing defaults to true (pool-specific routing)
# Apply in secondary workspace
terraform workspace select secondary
terraform apply -var-file=secondary-us-chicago-1.tfvars

Note
Setting dr = "active" or dr = "" (empty string) both promote the
cluster to active status. However, the choice determines whether the cluster remains part of
the DR strategy:
dr = "active": The cluster remains part of the DR strategy with full DR infrastructure (global DNS, health checks, health monitoring). Use this if you want the promoted cluster to serve as the new primary in the DR pair.
dr = "": The cluster operates standalone without DR infrastructure. Use this if you want to remove the cluster from the DR strategy entirely after promotion.
Promoting the standby to dr="active" does not
automatically move ownership of the OCI global DNS resources. In this repo, keep
manage_global_dns=true only in a single workspace to avoid two states
managing the same steering policy/zone. During an incident, the standby function updates the
steering policy directly (Terraform will not fight those emergency updates).
What changes automatically:
Scales node groups to production sizes
Sets production replication factor and enables auto-rebalance
Enables alerts by setting ENABLE_ALERTS=true
Humio operator scales to 1 and HumioCluster nodeCount follows production values
S3_RECOVER_FROM_* Environment Variable Preservation
Important
The S3_RECOVER_FROM_* environment variables are intentionally kept when
promoting from dr="standby" to dr="active". This is a
deliberate design choice to prevent pod recreation during DR promotion.
Why env vars are preserved
The humio-operator calculates a hash of the pod spec (including environment variables) to determine if pods need to be recreated. If env vars were removed during promotion:
The pod spec hash would change
The operator would delete and recreate all pods
Ephemeral PVCs would be deleted (data loss)
Recovered snapshot data would be lost
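A minimal sketch of the hash argument, using cksum in place of the operator's real hash function (the spec strings below are toy stand-ins, not actual pod specs):

```shell
# Toy pod-spec representations: one with the DR env var, one without.
spec_with='env:S3_RECOVER_FROM_BUCKET=dr-primary-logscale-data;nodeCount=3'
spec_without='nodeCount=3'

# Hash both specs; any env var change produces a different hash.
hash_with=$(printf '%s' "$spec_with" | cksum)
hash_without=$(printf '%s' "$spec_without" | cksum)

if [ "$hash_with" != "$hash_without" ]; then
  echo "hash changed -> operator would recreate pods"
fi
```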
Why this is safe:
The S3_RECOVER_FROM_* env vars are only read at startup by DataSnapshotLoader.scala:
After successful recovery, the local snapshot has a valid Kafka epoch
Subsequent pod restarts use the local snapshot, NOT the recovery bucket
The primary bucket is marked readOnly=true in cluster state
Env vars are harmlessly ignored after initial recovery
Behavior matrix:
| Scenario | Behavior | Safe? |
|---|---|---|
| Normal operation | Env vars ignored (local snapshot used) | Yes |
| Pod restart (same PVC) | Uses local snapshot, skips recovery bucket | Yes |
| New pod (fresh PVC) | Would re-fetch from recovery bucket | Only if cluster wiped |
Note
If you need to remove these env vars later (e.g., after the original primary is decommissioned), do so in a maintenance window when brief pod recreation is acceptable.
Resources destroyed during promotion:
When promoting from dr="standby" to dr="active", Terraform
destroys the entire module.dr-failover-function because automated failover
is no longer needed on an active cluster. The key resources removed are:
OCI Function and Application - The serverless function that scales the Humio operator during failover
OCI Monitoring Alarm - The alarm that detects primary cluster health failures
ONS Topic and Subscription - The notification chain connecting the alarm to the function
IAM Policies and Dynamic Group - Permissions allowing the function to access OKE and scale the operator
OCIR Repository and Auth Token - Container registry resources for the function image
NSG Rule - Network rule allowing function-to-OKE API communication
Kubernetes RBAC -
ClusterRoleandClusterRoleBindingfor operator scaling permissionsFunction Logging - Log group and logs for function invocation auditing
In total, approximately 20 OCI and Kubernetes resources are removed. The DR-related
Terraform outputs (dr_failover_alarm_id,
dr_failover_function_*, dr_failover_topic_id) are also
no longer available after promotion.
Verify promotion:
kubectl get humiocluster -n logging --context oci-secondary -o jsonpath='{.spec.environmentVariables}' | jq '.[] | select(.name | startswith("S3_RECOVER"))'
# => S3_RECOVER_FROM_* vars are still present (intentionally preserved; see above)
kubectl get humiocluster -n logging --context oci-secondary -o jsonpath='{.spec.nodeCount}'
# => production value
kubectl get pods -n logging --context oci-secondary
# => all pods running

OCI DNS Steering Policy Flow:
When you use the DR global DNS pattern
(${global_logscale_hostname}.${dns_zone_name}) with OCI DNS Steering
Policy failover records, ingestion and UI clients point at a single global FQDN.
In normal operation this record resolves to the primary load balancer IP and the secondary
HumioCluster declares nodeCount=1 but runs no Humio pods because the
operator is scaled to 0.
If the primary health check fails and OCI DNS Steering Policy updates the global DNS to
return the secondary IP, the OCI Function failover scaler scales the Humio operator from 0 → 1
so the secondary can start the single digest pod and serve traffic. There is no automatic
scale-down of the operator or Humio; scale back manually or by re-applying Terraform with
dr="standby" after failback.
In this mode, failover/failback tests for ingestion use the same global FQDN and do not require manual DNS record changes; the DNS steering policy and OCI Function together handle the traffic switch.
To verify which cluster is currently serving traffic:
GLOBAL_DR_FQDN="logscale-dr.oci-dr.humio.net" # Your global DR FQDN
dig +short "${GLOBAL_DR_FQDN}"
curl -I "https://${GLOBAL_DR_FQDN}"