Network Architecture

The GCP DR infrastructure uses a regional VPC topology with firewalls, node pools, and label selectors to route traffic during normal operations and during failover.

GLB Architecture Overview:

[Figure: GCP DR - GLB Architecture]

Subnet / VPC Configuration

Each GCP region has a dedicated VPC with the following structure:

| Component | Configuration | Purpose |
| --- | --- | --- |
| VPC Network | Regional, managed by Terraform | Isolates primary and secondary clusters |
| Subnets | Configurable CIDR range per region (gcp_cidr_range) | Separate IP ranges per region |
| Cloud NAT | Per region, configured on Cloud Router | Outbound internet access from private nodes |
| Cloud DNS | Global zone (if manage_global_dns=true) | Global DNS records managed by primary only |

Firewall Rules

| Rule | Direction | Source | Destination | Protocol/Port | Purpose |
| --- | --- | --- | --- | --- | --- |
| Allow Internal | Ingress | var.gcp_cidr_range (default 10.128.0.0/20) | All | TCP/UDP 80-65535 | Intra-cluster communication |
| Allow GLB Health Checks | Ingress | 130.211.0.0/22, 35.191.0.0/16 | All | TCP 31036, 10256, 8080 | GCP health check probes (NodePort, kubelet, readiness) |
| Allow Subnet Proxy | Ingress | var.gcp_subnetwork_proxy_cidr_range | All | TCP 443, ICMP | Internal ingest LB (advanced cluster type only) |
| Allow Egress | Egress | All | 0.0.0.0/0 | All | Outbound internet via Cloud NAT |

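
As a rough illustration of the "Allow GLB Health Checks" rule, the sketch below checks whether a TCP probe would match it. The source ranges and ports are taken from the table; the function name is hypothetical, and this models only the rule's matching logic, not how GCP evaluates firewall rules.

```python
import ipaddress

# Source ranges and ports copied from the "Allow GLB Health Checks" rule.
HEALTH_CHECK_SOURCES = [
    ipaddress.ip_network("130.211.0.0/22"),
    ipaddress.ip_network("35.191.0.0/16"),
]
HEALTH_CHECK_PORTS = {31036, 10256, 8080}


def is_health_check_allowed(source_ip: str, port: int) -> bool:
    """Return True if a TCP probe from source_ip to port matches the rule."""
    addr = ipaddress.ip_address(source_ip)
    from_probe_range = any(addr in net for net in HEALTH_CHECK_SOURCES)
    return from_probe_range and port in HEALTH_CHECK_PORTS


print(is_health_check_allowed("35.191.10.4", 8080))  # inside a probe range, allowed port
print(is_health_check_allowed("203.0.113.9", 8080))  # outside both probe ranges
```
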
Node Pool Topology by DR Mode

Node pools are deployed based on logscale_cluster_type and provision_kafka_servers, not by var.dr mode. The standby cluster uses the same node pool configuration as active; the cost savings come from the humio-operator being scaled to 0 replicas (no LogScale pods running).

| Node Pool | Deployment Condition | Purpose |
| --- | --- | --- |
| Digest | Always deployed (no count condition) | Core LogScale processing (queries, segment management) |
| Kafka | var.provision_kafka_servers == true | Strimzi Kafka brokers for partition management |
| Ingest | logscale_cluster_type == "advanced" | Data ingestion and parsing workloads |
| UI | logscale_cluster_type in ["dedicated-ui", "advanced"] | User interface and API endpoints |

Note

GCP does not have an Ingress node pool. It uses the native GKE load balancer through a NodePort service (deploy_nginx_ingress = false), unlike AWS and Azure, which use nginx ingress controllers.
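
The deployment conditions in the node pool table can be read as a small decision function. The sketch below is only a Python restatement of those conditions; the function name is hypothetical, and the real logic lives in the Terraform count conditions, which are not shown here.

```python
def deployed_node_pools(logscale_cluster_type: str,
                        provision_kafka_servers: bool) -> list[str]:
    """Return the node pools implied by the documented deployment conditions."""
    pools = ["digest"]  # always deployed, no count condition
    if provision_kafka_servers:
        pools.append("kafka")
    if logscale_cluster_type == "advanced":
        pools.append("ingest")
    if logscale_cluster_type in ("dedicated-ui", "advanced"):
        pools.append("ui")
    return pools


print(deployed_node_pools("advanced", True))       # ['digest', 'kafka', 'ingest', 'ui']
print(deployed_node_pools("dedicated-ui", False))  # ['digest', 'ui']
```

Note that var.dr is not an input: the same pools are provisioned on both the active and the standby cluster.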

Additional cluster-level settings by DR mode:

| Component | Active (dr="active") | Standby (dr="standby") |
| --- | --- | --- |
| Humio operator | 1 replica | 0 replicas |
| HumioCluster nodeCount | cluster_size value | 1 (declared, not running) |
| HumioCluster nodePools | Full nodePool spec | null (prevents reconciliation loop) |
| Replication factor | Production value | 1 (overridden) |
| Auto rebalance | Enabled | Disabled |

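
To make the per-mode differences concrete, here is a minimal sketch that derives the settings in the table from the dr value. The function name, parameter names, and dictionary keys are all hypothetical; they only mirror the table rows and do not correspond to actual module variables.

```python
def cluster_settings(dr: str, cluster_size: int,
                     production_replication_factor: int,
                     node_pool_spec: dict) -> dict:
    """Return the cluster-level settings the table lists for a DR mode."""
    if dr not in ("active", "standby"):
        raise ValueError(f"unsupported dr mode: {dr!r}")
    standby = dr == "standby"
    return {
        "operator_replicas": 0 if standby else 1,
        "node_count": 1 if standby else cluster_size,  # declared, not running on standby
        "node_pools": None if standby else node_pool_spec,
        "replication_factor": 1 if standby else production_replication_factor,
        "auto_rebalance": not standby,
    }


# Illustrative inputs only; real values come from the deployment's configuration.
print(cluster_settings("standby", cluster_size=3,
                       production_replication_factor=2,
                       node_pool_spec={"ui": {}, "ingest": {}}))
```
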
Why Standby Has Minimal Running Workloads

The standby cluster is designed for minimal cost with rapid failover capability:

  1. No active LogScale workloads: During standby, the humio-operator is scaled to 0 replicas, so no LogScale pods run. Node pools are provisioned but idle.

  2. Recovery via snapshot: Failover recovery reads the global snapshot from the primary GCS bucket using a single digest pod. Ingest workloads only resume after promotion.

  3. On-demand scaling: When promoted to dr="active", the Cloud Function scales the operator to 1 replica, which then reconciles the HumioCluster CR.

  4. Resource efficiency: No idle LogScale pods consume compute resources, so standby cost is limited to Kafka brokers and infrastructure components.

What Runs on Standby
  • Kafka brokers: 3-5 replicas running. Required for LogScale partition management; keeping them running avoids 10-15 minutes of Kafka startup delay during failover.

  • NodePort service: Exposes port 8080 for GLB health checks. Uses DR-aware label selectors (app.kubernetes.io/name=humio, humio.com/feature=OperatorInternal) to target query-capable pods.

  • cert-manager: Running to maintain valid TLS certificates.

  • TopoLVM: Running for LVM volume provisioning.

  • Digest node pool: GKE nodes provisioned but no LogScale pods running until operator scales up.
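
The NodePort selector behaviour on standby can be sketched with Kubernetes' equality-based matching: a pod is selected only if every selector key is present on the pod with the same value. The selector below is the one quoted above; the pod names and the Kafka pod's labels are made up for illustration.

```python
# Selector quoted in the NodePort service description above.
SELECTOR = {
    "app.kubernetes.io/name": "humio",
    "humio.com/feature": "OperatorInternal",
}


def matching_pods(selector: dict, pods: dict) -> list:
    """Return names of pods whose labels contain every selector key/value pair."""
    return [
        name
        for name, labels in pods.items()
        if all(labels.get(key) == value for key, value in selector.items())
    ]


# Hypothetical standby cluster: only a Kafka pod is running, no LogScale pods.
standby_pods = {"kafka-0": {"app.kubernetes.io/name": "kafka"}}

# Hypothetical active cluster: one LogScale pod carries both labels.
active_pods = {
    "kafka-0": {"app.kubernetes.io/name": "kafka"},
    "logscale-core-0": {
        "app.kubernetes.io/name": "humio",
        "humio.com/feature": "OperatorInternal",
    },
}

print(matching_pods(SELECTOR, standby_pods))  # []
print(matching_pods(SELECTOR, active_pods))   # ['logscale-core-0']
```

With no matching pods, the service has no endpoints, which is why the standby backend does not serve requests until the operator scales up.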

Why nodePools = null on Standby

When dr="standby", the HumioCluster spec sets nodePools = null to prevent the humio-operator from entering a reconciliation loop. The shared logscale-kubernetes module generates nodePool specs for all pool types (digest, UI, ingest) with nodeCount=0 for pools not deployed on standby. The operator interprets these zero-count pools as stale status entries, cleaning them up each cycle and preventing the digest pod from being created.

Important: nodePools is tied to var.dr == "standby", NOT to dr_use_dedicated_routing. During two-phase promotion:

  • Phase 1 (dr="active", dr_use_dedicated_routing=false): nodePools are restored so UI/Ingest pods begin scaling up

  • Phase 2 (dr_use_dedicated_routing=true): Pool-specific selectors are enabled once pods are ready

Nulling nodePools during Phase 1 would cause a 503 outage in Phase 2 because selectors would update instantly but pods would take minutes to start.
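
The dependency between the two variables can be sketched as follows. This is a simplified model, not the module's code: the function names and dictionary keys are hypothetical, and it assumes the only unsafe combination is pool-specific selectors being enabled while nodePools is still null.

```python
def routing_state(dr: str, dr_use_dedicated_routing: bool) -> dict:
    """Model which parts of the configuration are rendered for a given input."""
    return {
        # nodePools depends only on var.dr, never on the routing flag.
        "node_pools_rendered": dr != "standby",
        "pool_specific_selectors": dr_use_dedicated_routing,
    }


def is_safe(state: dict) -> bool:
    """Pool-specific selectors need pool pods, which need a non-null nodePools."""
    return state["node_pools_rendered"] or not state["pool_specific_selectors"]


# Standby, then Phase 1, then Phase 2 of the promotion described above.
for dr, dedicated in [("standby", False), ("active", False), ("active", True)]:
    state = routing_state(dr, dedicated)
    print(dr, dedicated, state, "safe" if is_safe(state) else "unsafe")
```

Because nodePools is already restored in Phase 1, the UI and ingest pods have time to start before Phase 2 switches the selectors.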

Request Flow (Internet to LogScale)
  1. DNS Resolution: Client resolves <global-hostname>.<zone-name> to the global GLB IP

  2. Global Load Balancer: Routes to backend based on capacity_scaler and health checks:

    • Primary backend: capacity_scaler=1.0 (receives all traffic)

    • Secondary backend: capacity_scaler=0.0 (receives no traffic until failover)

  3. GKE NodePort Service: GLB connects to GKE NodePort service (port 8080 -> 31036) on the digest node pool instance group

  4. Label Selector Routing: NodePort service uses DR-aware selectors to route to appropriate pods:

    • Active: app.kubernetes.io/name=humio + humio.com/feature=OperatorInternal (or pool-specific selectors when dr_use_dedicated_routing=true)

    • Standby: Same selectors, but no pods match until operator scales up

  5. LogScale Pod: Handles request (queries, log ingestion, etc.)
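
Step 2 of the flow can be approximated with a short sketch: a backend is eligible for traffic only if it is healthy and its capacity_scaler is above zero. This is a simplification of the GLB's behaviour; the Backend class and function name are hypothetical, and the post-failover values are an assumption, since the text only states the pre-failover settings.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    capacity_scaler: float
    healthy: bool


def eligible_backends(backends: list) -> list:
    """Return names of backends that can receive traffic in this simplified model."""
    return [b.name for b in backends if b.healthy and b.capacity_scaler > 0.0]


# Normal operation, as described above. The standby has no LogScale pods behind
# port 8080, so it is modelled as unhealthy.
normal = [
    Backend("primary", capacity_scaler=1.0, healthy=True),
    Backend("secondary", capacity_scaler=0.0, healthy=False),
]

# Assumed state after failover: the scalers are swapped and the secondary is serving.
after_failover = [
    Backend("primary", capacity_scaler=0.0, healthy=False),
    Backend("secondary", capacity_scaler=1.0, healthy=True),
]

print(eligible_backends(normal))          # ['primary']
print(eligible_backends(after_failover))  # ['secondary']
```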