Lambda Function Internals

The DR failover Lambda is covered at a high level in DR Failover Lambda (module.dr-failover-lambda). This section provides the implementation-level details.

Execution Steps

When triggered by an SNS notification from the CloudWatch alarm, the Lambda executes the following steps:

1. Cooldown Check: Verifies that sufficient time has passed since the last failover (prevents flapping)
2. Pre-Failover Validation: Queries CloudWatch to confirm the primary has been failing for the configured duration (default: 180 seconds)
3. Health Check Verification: Re-checks the Route53 health status to confirm the primary is still unhealthy, since it may have recovered in the meantime
4. EKS Authentication: Generates a short-lived bearer token using an STS presigned URL (k8s-aws-v1 format)
5. Kubernetes Client Setup: Connects to the secondary EKS cluster using the generated token
6. Idempotency Check: Reads the current humio-operator replica count (skips if already scaled up)
7. TLS Secret Cleanup: Deletes the stale cluster TLS secret to prevent CA certificate mismatch errors (see Requirements)
8. Operator Scaling: Patches the humio-operator deployment from 0 → 1 replicas
9. Health Check FQDN Lock: Swaps the primary health check FQDN to failover-locked.invalid
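
As an illustration of the EKS Authentication step, the sketch below shows how a presigned URL is wrapped into a k8s-aws-v1 token. It is not the module's actual code: the function names are hypothetical, and it assumes the presigned URL (typically an STS GetCallerIdentity request signed with the cluster name in the x-k8s-aws-id header) has already been produced elsewhere.

```python
import base64

TOKEN_PREFIX = "k8s-aws-v1."


def encode_eks_token(presigned_url: str) -> str:
    """Wrap a presigned STS URL as an EKS bearer token (hypothetical helper)."""
    encoded = base64.urlsafe_b64encode(presigned_url.encode("utf-8")).decode("ascii")
    # The token body is unpadded base64url, so trailing '=' characters are stripped.
    return TOKEN_PREFIX + encoded.rstrip("=")


def decode_eks_token(token: str) -> str:
    """Inverse of encode_eks_token; handy when debugging authentication failures."""
    if not token.startswith(TOKEN_PREFIX):
        raise ValueError("not a k8s-aws-v1 token")
    body = token[len(TOKEN_PREFIX):]
    # Restore the padding that was stripped during encoding.
    padded = body + "=" * (-len(body) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")
```

The resulting string would be sent as the bearer token when connecting to the secondary cluster. Because the underlying presigned URL expires quickly, a token like this should be generated per invocation rather than cached.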

Key Point: The Lambda only scales the humio-operator deployment. The operator then reconciles the HumioCluster CR (which already declares nodeCount=1) and starts the LogScale pod. The Lambda does not modify the HumioCluster spec.
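
The idempotency check and the scaling patch can be pictured with a minimal sketch like the one below. The function name is hypothetical and the patch body is an assumption about how the replica change might be expressed; the point is only that the patch touches the deployment's replica count and nothing in the HumioCluster spec.

```python
from typing import Optional


def plan_operator_patch(current_replicas: int, target_replicas: int = 1) -> Optional[dict]:
    """Return a patch body for the humio-operator deployment, or None if no change is needed.

    Hypothetical helper: target_replicas mirrors TARGET_OPERATOR_REPLICAS.
    """
    if current_replicas >= target_replicas:
        # Already scaled up (for example, a repeated trigger): nothing to patch.
        return None
    # Only the replica count is changed; the HumioCluster CR is left untouched.
    return {"spec": {"replicas": target_replicas}}
```

With the official Kubernetes Python client, a body of this shape could be passed to a call such as `AppsV1Api.patch_namespaced_deployment_scale`; whether the Lambda uses that exact call is not stated here.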

Cooldown Behavior

The cooldown period (FAILOVER_COOLDOWN_SECONDS=300) is a module-internal Lambda environment variable, not exposed as a root-level tfvars variable. The cooldown timestamp is persisted to AWS SSM Parameter Store at /logscale-dr/<function-name>/last-failover-timestamp, so the cooldown survives function redeployments. The Lambda reads this value on startup, so cold starts do not bypass the cooldown.
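
The cooldown decision itself can be sketched as a pure function over the stored timestamp. This is an illustration only: the helper names are hypothetical, and it assumes the SSM parameter holds a Unix epoch timestamp in seconds, which this section does not specify.

```python
from typing import Optional


def cooldown_parameter_name(prefix: str, function_name: str) -> str:
    """Build the SSM parameter path described above (hypothetical helper)."""
    return f"{prefix.rstrip('/')}/{function_name}/last-failover-timestamp"


def cooldown_remaining(
    last_failover_ts: Optional[float],
    now: float,
    cooldown_seconds: float = 300.0,
) -> float:
    """Return the seconds left in the cooldown window, or 0.0 if failover may proceed.

    Assumes timestamps are Unix epoch seconds; None means no failover was recorded.
    """
    if last_failover_ts is None:
        return 0.0
    elapsed = now - last_failover_ts
    return max(0.0, cooldown_seconds - elapsed)
```

For example, with the default 300-second cooldown, a trigger arriving 100 seconds after the recorded failover would see 200 seconds remaining and would be skipped.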

Retry Logic

All Kubernetes API operations use exponential backoff with jitter:

Parameter	Default	Description
`MAX_RETRIES`	3	Maximum retry attempts per operation
`BASE_DELAY_SECONDS`	1.0	Initial delay before first retry
`MAX_DELAY_SECONDS`	30.0	Maximum delay cap between retries

Retryable errors: HTTP 429 (rate limit), 500, 502, 503, 504, connection errors

Non-retryable errors: HTTP 400, 401, 403, 404, 409, 422 (fail immediately)
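
A minimal sketch of this retry policy is shown below. It is not the Lambda's actual code: `ApiError` is a hypothetical stand-in for the Kubernetes client's HTTP exception, and the "full jitter" variant (a random delay between zero and the capped exponential value) is an assumption, since the text only says that jitter is applied.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

RETRYABLE_STATUS = frozenset({429, 500, 502, 503, 504})


class ApiError(Exception):
    """Hypothetical stand-in for the Kubernetes client's HTTP error type."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retryable(exc: Exception) -> bool:
    """Retry on the listed HTTP statuses and on connection errors only."""
    if isinstance(exc, ApiError):
        return exc.status in RETRYABLE_STATUS
    return isinstance(exc, ConnectionError)


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter delay for a zero-based retry attempt (assumed jitter variant)."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retries(op: Callable[[], T], max_retries: int = 3, base: float = 1.0,
                      cap: float = 30.0, sleep: Callable[[float], None] = time.sleep) -> T:
    """Run op, retrying retryable failures up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return op()
        except Exception as exc:
            # Non-retryable errors and the final failed attempt are re-raised as-is.
            if attempt == max_retries or not is_retryable(exc):
                raise
            sleep(backoff_delay(attempt, base, cap))
    raise AssertionError("unreachable")
```

With the defaults in the table, the delay ceilings for the three retries would be 1, 2, and 4 seconds, so the 30-second cap only matters if `MAX_RETRIES` or `BASE_DELAY_SECONDS` is raised.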

Environment Variables

shell

# Target cluster configuration
CLUSTER_NAME = <secondary-cluster>
CLUSTER_REGION = us-east-2
CLUSTER_NAMESPACE = logging
TARGET_OPERATOR_REPLICAS = 1
HUMIOCLUSTER_NAME = <cluster-name-prefix>-log
# Health check configuration
PRIMARY_HEALTH_CHECK_ID = <health-check-id>
SECONDARY_HEALTH_CHECK_ID = <health-check-id>
SKIP_SECONDARY_HEALTH_CHECK = true
# Pre-failover validation
PRE_FAILOVER_FAILURE_SECONDS = 180 # Must fail for 3 minutes before failover
FAILOVER_COOLDOWN_SECONDS = 300 # 5 minute cooldown (module-internal, not exposed as tfvar)
# SSM cooldown persistence
SSM_PARAMETER_PREFIX = /logscale-dr # SSM path prefix for cooldown state
# Full path: /logscale-dr/<function-name>/last-failover-timestamp
# Retry configuration
MAX_RETRIES = 3
BASE_DELAY_SECONDS = 1.0
MAX_DELAY_SECONDS = 30.0
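
Inside the handler, these variables might be read into a typed configuration object along the lines of the sketch below. The class and function names are hypothetical. The defaults for the validation, cooldown, and retry settings follow the values stated in this section; treating the cluster settings as required, and defaulting TARGET_OPERATOR_REPLICAS to 1 and SKIP_SECONDARY_HEALTH_CHECK to false, are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class FailoverConfig:
    cluster_name: str
    cluster_region: str
    namespace: str
    target_operator_replicas: int
    skip_secondary_health_check: bool
    pre_failover_failure_seconds: int
    cooldown_seconds: int
    max_retries: int
    base_delay_seconds: float
    max_delay_seconds: float


def load_config(env: Mapping[str, str] = os.environ) -> FailoverConfig:
    """Parse Lambda environment variables (hypothetical helper)."""
    return FailoverConfig(
        # Cluster settings are treated as required here: a missing key raises KeyError.
        cluster_name=env["CLUSTER_NAME"],
        cluster_region=env["CLUSTER_REGION"],
        namespace=env["CLUSTER_NAMESPACE"],
        # Assumed defaults, not confirmed by the module.
        target_operator_replicas=int(env.get("TARGET_OPERATOR_REPLICAS", "1")),
        skip_secondary_health_check=env.get("SKIP_SECONDARY_HEALTH_CHECK", "false").lower() == "true",
        # Defaults below match the values documented in this section.
        pre_failover_failure_seconds=int(env.get("PRE_FAILOVER_FAILURE_SECONDS", "180")),
        cooldown_seconds=int(env.get("FAILOVER_COOLDOWN_SECONDS", "300")),
        max_retries=int(env.get("MAX_RETRIES", "3")),
        base_delay_seconds=float(env.get("BASE_DELAY_SECONDS", "1.0")),
        max_delay_seconds=float(env.get("MAX_DELAY_SECONDS", "30.0")),
    )
```

Passing the environment as a mapping keeps the parser easy to exercise in unit tests without touching the real process environment.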

IAM Permissions

The Lambda IAM role requires these permissions:

Permission	Resource	Purpose
`eks:DescribeCluster`	Secondary EKS cluster	Get cluster endpoint and CA certificate
`route53:GetHealthCheckStatus`	Primary and secondary health checks	Validate health status before failover
`route53:UpdateHealthCheck`	Primary health check	Lock FQDN to `failover-locked.invalid`
`cloudwatch:GetMetricStatistics`	Route53 metrics (us-east-1)	Query consecutive failure duration
`kms:Decrypt`	Lambda KMS key	Decrypt environment variables
`ssm:PutParameter`, `ssm:GetParameter`	SSM Parameter Store (`/logscale-dr/*`)	Persist and read failover cooldown timestamp

EKS Access: The Lambda uses an EKS Access Entry (not the aws-auth ConfigMap) with AmazonEKSClusterAdminPolicy scoped to the logging namespace only.

Cross-region IAM policy: Uses the exact bucket ARN (arn:aws:s3:::<primary-bucket-name>), not wildcard patterns.

Lambda Configuration

Setting	Value
Timeout	Configurable (default: 60s)
Source	DR failover Lambda module (Python)
Runtime	Python 3.12
Memory	Configurable (default: 256 MB)
Log Retention	Configurable via `dr_failover_lambda_log_retention_days` (default: 7 days)
Handler	`dr_failover_handler.lambda_handler`
