Operations Guide

This implementation bootstraps a standby LogScale cluster on EKS that can take over from a failed primary using LogScale's native bucket storage disaster recovery method, as documented in "Start a new LogScale cluster based on another with buckets". LogScale supports bootstrapping a fully independent cluster from the bucket storage of an existing cluster: the new cluster treats the source bucket as read-only and uses its own bucket for new writes.

Two clusters are managed via Terraform workspaces:

  • Primary (e.g., us-west-2): production, dr="active".

  • Secondary (e.g., us-east-2): standby, dr="standby", minimal capacity; reads the primary's S3 bucket using the exact same encryption key (pulled via TFE outputs) and keeps all LogScale pods scaled to zero until a failover/promotion is initiated.
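The workspace split above might be captured in tfvars files along these lines. This is an illustrative sketch: dr, aws_region, and the example regions come from this guide, but the exact file layout and any other values are assumptions.

```hcl
# primary.tfvars -- active production cluster (illustrative values)
aws_region = "us-west-2"
dr         = "active"

# secondary.tfvars -- standby cluster, minimal capacity
# aws_region = "us-east-2"
# dr         = "standby"
```

Selecting the matching Terraform workspace (primary or secondary) before applying keeps each cluster's state isolated.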

Upstream DR Procedure and This Implementation
| Upstream Requirement | This Implementation |
| --- | --- |
| Single node, empty data directory | Standby HumioCluster declares nodeCount=1; the Humio operator is scaled to 0 replicas until failover. When scaled up, a fresh pod starts with empty ephemeral storage. |
| Fresh Kafka cluster | Each EKS cluster runs its own independent Strimzi-managed Kafka cluster. |
| No existing snapshot | The standby cluster has never run LogScale; no local snapshot exists. |
| Empty target bucket | Dedicated S3 bucket per cluster (e.g., logscale-s3-dr-secondary-<region>-<account_id>). |
| S3_RECOVER_FROM_* env vars | Set automatically by Terraform from primary remote state outputs (S3_RECOVER_FROM_BUCKET, S3_RECOVER_FROM_REGION, etc.). |
| S3_RECOVER_FROM_REPLACE_* patterns | Set via tfvars (s3_recover_from_replace_region, s3_recover_from_replace_bucket) or auto-derived from remote state. |
| Encryption key from source cluster | Synchronized via TFE outputs or terraform_remote_state: the primary generates the key; the secondary reads it and creates an identical Kubernetes secret. |
| Extend to desired node count after recovery | Two-phase promotion: Phase 1 runs with a single digest pod; Phase 2 scales to the full production topology via terraform apply. |
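As a sketch of how the recovery source reaches the standby workspace, the two tfvars variables named above might be set explicitly when remote state derivation is not used. The bucket name and account ID below are illustrative only:

```hcl
# secondary.tfvars (illustrative) -- recovery source overrides; these are
# normally auto-derived from the primary workspace's remote state outputs
s3_recover_from_replace_region = "us-west-2"
s3_recover_from_replace_bucket = "logscale-s3-dr-primary-us-west-2-123456789012"
```

Terraform then maps these into the S3_RECOVER_FROM_* environment variables on the standby HumioCluster.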

Note

The upstream documentation states that the source bucket must be immutable from the point the disaster recovery process starts. In a real disaster scenario (unplanned failover), the primary cluster is already unreachable, so no new writes occur. For planned failover or cloning, the primary should be shut down gracefully before initiating recovery to ensure the global snapshot references no missing segments.

Key Design Decisions

Region flexibility:

The regions shown (us-west-2 and us-east-2) are examples only. You can deploy in any AWS regions supported by your organization. Update aws_region in your tfvars, the remote state configuration, and any region-specific references (e.g., Route53/ALB) to match your chosen regions.

Key features:

  • Automated encryption key synchronization (no hardcoding). Standby apply requires the primary key (TFE outputs or explicit value).

  • Cross-region S3 access via IRSA/IAM policies.

  • Alerts toggle automatically via ENABLE_ALERTS based on dr (true for active, false for standby).

  • Standby keeps the Humio operator scaled to 0; a Lambda (or manual action) scales the operator to 1 on failover. nodeCount is already set to 1 in the HumioCluster manifest; there is no automatic scale-down.

  • Automatic failback prevention: During failover, the Lambda swaps the primary Route53 health check FQDN to an unresolvable host (failover-locked.invalid), which makes Route53 treat the primary as permanently unhealthy. This prevents automatic DNS failback -- an operator must manually restore the health check FQDN to the original primary hostname after verifying primary readiness.

  • Manual, controlled promotion by changing dr and applying Terraform.
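The failback-prevention mechanism above can be pictured as a Terraform-managed Route53 health check whose FQDN the Lambda rewrites out-of-band. This is a hypothetical sketch; the resource name, hostname, and check parameters are assumptions, not the module's actual definitions:

```hcl
# Hypothetical sketch: health check targeting the primary's endpoint.
# During failover the Lambda rewrites this check's FQDN to
# "failover-locked.invalid", so Route53 treats the primary as permanently
# unhealthy and never fails DNS back automatically. An operator restores
# the original FQDN after verifying primary readiness.
resource "aws_route53_health_check" "primary" {
  fqdn              = "logscale.primary.example.com" # swapped by the Lambda
  port              = 443
  type              = "HTTPS"
  resource_path     = "/api/v1/status"
  failure_threshold = 3
  request_interval  = 30
}
```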

Key Capabilities
| Feature | Primary (Active) | Secondary (Standby) |
| --- | --- | --- |
| Region | var.aws_region (e.g., us-west-2) | var.aws_region (e.g., us-east-2) |
| Cluster Type | Advanced (full production) | Standby (Humio operator off) |
| Terraform Workspace | primary | secondary |
| Encryption Key | Generated on first deploy | Pulled from TFE outputs (required for standby apply) |
| Humio nodeCount | Digest count per cluster_size | nodeCount=1 declared, but no pods run until the operator is scaled up |
| S3 Bucket | logscale-s3-dr-primary-<region>-<account_id> | logscale-s3-dr-secondary-<region>-<account_id> |
| DR Mode | dr = "active" | dr = "standby" |
| Auto Rebalance | Enabled | Disabled |
| Node Groups | Per cluster_size (digest/ingress/ingest/ui/kafka) | Digest/ingress/Kafka node groups sized per cluster_size (no UI/ingest pods) |
| Replication Factor | Production value | 1 (overridden) |
| Humio Operator | 1 replica | 0 replicas until failover |

The dr variable accepts three values:

  • "active" - Primary cluster in a DR pair

  • "standby" - Secondary cluster in a DR pair (minimal capacity, operator scaled to 0)

  • "" (empty string) - Non-DR single cluster deployment (no DR infrastructure provisioned)