Operations Guide

This implementation bootstraps a standby LogScale cluster on EKS that can take over from a failed primary using LogScale's native bucket storage disaster recovery method, as documented in "Start a new LogScale cluster based on another with buckets". LogScale supports bootstrapping a fully independent cluster from the bucket storage of an existing cluster: the new cluster treats the source bucket as read-only and uses its own bucket for new writes.

Two clusters are managed via Terraform workspaces:

  • Primary (e.g., us-west-2): production, dr="active".

  • Secondary (e.g., us-east-2): standby, dr="standby", minimal capacity; reads the primary's S3 bucket using the exact same encryption key (pulled via TFE outputs) and keeps all LogScale pods scaled to zero until a failover/promotion is initiated.
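The workspace split above might be captured in tfvars files along these lines. This is an illustrative sketch: dr, aws_region, and the example regions come from this guide, but the exact file layout and any other values are assumptions.

```hcl
# primary.tfvars -- active production cluster (illustrative values)
aws_region = "us-west-2"
dr         = "active"

# secondary.tfvars -- standby cluster, minimal capacity
# aws_region = "us-east-2"
# dr         = "standby"
```

Selecting the matching Terraform workspace (primary or secondary) before applying keeps each cluster's state isolated.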

Upstream DR Procedure and This Implementation
| Upstream Requirement | This Implementation |
| --- | --- |
| Single node, empty data directory | Standby HumioCluster declares nodeCount=1; the Humio operator is scaled to 0 replicas until failover. When scaled up, a fresh pod starts with empty ephemeral storage. |
| Fresh Kafka cluster | Each EKS cluster runs its own independent Strimzi-managed Kafka cluster. |
| No existing snapshot | The standby cluster has never run LogScale; no local snapshot exists. |
| Empty target bucket | Dedicated S3 bucket per cluster (e.g., logscale-s3-dr-secondary-<region>-<account_id>). |
| S3_RECOVER_FROM_* env vars | Set automatically by Terraform from primary remote state outputs (S3_RECOVER_FROM_BUCKET, S3_RECOVER_FROM_REGION, etc.). |
| S3_RECOVER_FROM_REPLACE_* patterns | Set via tfvars (s3_recover_from_replace_region, s3_recover_from_replace_bucket) or auto-derived from remote state. |
| Encryption key from source cluster | Synchronized via TFE outputs or terraform_remote_state: the primary generates the key; the secondary reads it and creates an identical Kubernetes secret. |
| Extend to desired node count after recovery | Two-phase promotion: Phase 1 runs with a single digest pod; Phase 2 scales to the full production topology via terraform apply. |
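As a sketch of how the recovery source reaches the standby workspace, the two tfvars variables named above might be set explicitly when remote state derivation is not used. The bucket name and account ID below are illustrative only:

```hcl
# secondary.tfvars (illustrative) -- recovery source overrides; these are
# normally auto-derived from the primary workspace's remote state outputs
s3_recover_from_replace_region = "us-west-2"
s3_recover_from_replace_bucket = "logscale-s3-dr-primary-us-west-2-123456789012"
```

Terraform then maps these into the S3_RECOVER_FROM_* environment variables on the standby HumioCluster.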

Note

The upstream documentation states that the source bucket must be immutable from the point the disaster recovery process starts. In a real disaster scenario (unplanned failover), the primary cluster is already unreachable, so no new writes occur. For planned failover or cloning, the primary should be shut down gracefully before initiating recovery to ensure the global snapshot references no missing segments.

Key Design Decisions

Region flexibility:

The regions shown (us-west-2 and us-east-2) are examples only. You can deploy in any AWS regions supported by your organization. Update aws_region in your tfvars, the remote state configuration, and any region-specific references (e.g., Route53/ALB) to match your chosen regions.

Key features:

  • Automated encryption key synchronization (no hardcoding). Standby apply requires the primary key (TFE outputs or explicit value).

  • Cross-region S3 access via IRSA/IAM policies.

  • Alerts toggle automatically via ENABLE_ALERTS based on dr (true for active, false for standby).

  • Standby keeps the Humio operator scaled to 0; a Lambda (or manual action) scales the operator to 1 on failover. nodeCount is already set to 1 in the HumioCluster manifest; there is no automatic scale-down.

  • Automatic failback prevention: During failover, the Lambda swaps the primary Route53 health check FQDN to an unresolvable host (failover-locked.invalid), which makes Route53 treat the primary as permanently unhealthy. This prevents automatic DNS failback -- an operator must manually restore the health check FQDN to the original primary hostname after verifying primary readiness.

  • Manual, controlled promotion by changing dr and applying Terraform.
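The failback-prevention mechanism above can be pictured as a Terraform-managed Route53 health check whose FQDN the Lambda rewrites out-of-band. This is a hypothetical sketch; the resource name, hostname, and check parameters are assumptions, not the module's actual definitions:

```hcl
# Hypothetical sketch: health check targeting the primary's endpoint.
# During failover the Lambda rewrites this check's FQDN to
# "failover-locked.invalid", so Route53 treats the primary as permanently
# unhealthy and never fails DNS back automatically. An operator restores
# the original FQDN after verifying primary readiness.
resource "aws_route53_health_check" "primary" {
  fqdn              = "logscale.primary.example.com" # swapped by the Lambda
  port              = 443
  type              = "HTTPS"
  resource_path     = "/api/v1/status"
  failure_threshold = 3
  request_interval  = 30
}
```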

Key Capabilities
| Feature | Primary (Active) | Secondary (Standby) |
| --- | --- | --- |
| Region | var.aws_region (e.g., us-west-2) | var.aws_region (e.g., us-east-2) |
| Cluster Type | Advanced (full production) | Standby (Humio operator off) |
| Terraform Workspace | primary | secondary |
| Encryption Key | Generated on first deploy | Pulled from TFE outputs (required for standby apply) |
| Humio nodeCount | Digest count per cluster_size | nodeCount=1 declared, but no pods run until the operator is scaled up |
| S3 Bucket | logscale-s3-dr-primary-<region>-<account_id> | logscale-s3-dr-secondary-<region>-<account_id> |
| DR Mode | dr = "active" | dr = "standby" |
| Auto Rebalance | Enabled | Disabled |
| Node Groups | Per cluster_size (digest/ingress/ingest/ui/kafka) | Digest/ingress/Kafka node groups sized per cluster_size (no UI/ingest pods) |
| Replication Factor | Production value | 1 (overridden) |
| Humio Operator | 1 replica | 0 replicas until failover |

The dr variable accepts three values:

  • "active" - Primary cluster in a DR pair

  • "standby" - Secondary cluster in a DR pair (minimal capacity, operator scaled to 0)

  • "" (empty string) - Non-DR single cluster deployment (no DR infrastructure provisioned)