Lambda Function Internals

The DR failover Lambda is covered at a high level in DR Failover Lambda (module.dr-failover-lambda). This section provides the implementation-level details.

Execution Steps

When triggered by an SNS notification from the CloudWatch alarm, the Lambda executes the following steps:

  1. Cooldown Check: Verifies that sufficient time has passed since the last failover (prevents flapping)

  2. Pre-Failover Validation: Queries CloudWatch to confirm the primary has been failing for the configured duration (default: 180 seconds)

  3. Health Check Verification: Re-checks the Route53 health status to confirm the primary is still unhealthy (it may have recovered since the alarm fired)

  4. EKS Authentication: Generates a short-lived bearer token from an STS presigned URL (k8s-aws-v1 format)

  5. Kubernetes Client Setup: Connects to the secondary EKS cluster using the generated token

  6. Idempotency Check: Reads the current humio-operator replica count (skips if already scaled up)

  7. TLS Secret Cleanup: Deletes the stale cluster TLS secret to prevent CA certificate mismatch errors (see Requirements)

  8. Operator Scaling: Patches the humio-operator deployment from 0 → 1 replicas

  9. Health Check FQDN Lock: Swaps the primary health check FQDN to failover-locked.invalid
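
Step 2 can be illustrated with a small pure function. This is a minimal sketch, not the module's code: it assumes the Lambda evaluates one-minute datapoints of the Route53 HealthCheckStatus metric (1 = healthy, 0 = unhealthy), shaped like entries of a GetMetricStatistics response with the Maximum statistic requested. The function name and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def primary_failed_long_enough(datapoints, now, required_seconds=180, period_seconds=60):
    """Return True only if the trailing window is fully covered by unhealthy datapoints.

    datapoints: list of dicts with "Timestamp" (aware datetime) and "Maximum" (float),
    an assumed shape modelled on CloudWatch GetMetricStatistics output.
    """
    window_start = now - timedelta(seconds=required_seconds)
    recent = [dp for dp in datapoints if dp["Timestamp"] >= window_start]

    # Too few datapoints means the failure duration cannot be confirmed.
    needed = required_seconds // period_seconds
    if len(recent) < needed:
        return False

    # A Maximum of 0 means no checker reported healthy during that period.
    return all(dp["Maximum"] == 0 for dp in recent)


now = datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)
failing = [
    {"Timestamp": now - timedelta(minutes=m), "Maximum": 0.0} for m in (3, 2, 1)
]
print(primary_failed_long_enough(failing, now))
```

Returning False when data is missing is a conservative choice for this sketch: an incomplete window is treated as "not confirmed" rather than as a failure.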

Key Point: The Lambda only scales the humio-operator deployment. The operator then reconciles the HumioCluster CR (which already declares nodeCount=1) and starts the LogScale pod. The Lambda does not modify the HumioCluster spec.
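
The idempotency check and the scaling patch (steps 6 and 8) reduce to one decision: compare the current replica count with the target and, only if needed, send a patch that touches nothing but spec.replicas. The helper below is a hypothetical sketch of that decision; the function name is not taken from the module.

```python
def plan_operator_scaling(current_replicas, target_replicas=1):
    """Return the patch body to apply, or None when the operator is already scaled up."""
    # A missing replica count is treated as 0 in this sketch.
    current = current_replicas or 0
    if current >= target_replicas:
        return None  # idempotent: nothing to do on a repeated invocation
    # Only the replica count is patched; the HumioCluster spec is never touched.
    return {"spec": {"replicas": target_replicas}}


print(plan_operator_scaling(0))  # patch needed
print(plan_operator_scaling(1))  # already scaled up
```

If the Lambda uses the official Kubernetes Python client (an assumption), a body like this could be passed to AppsV1Api.patch_namespaced_deployment_scale for the humio-operator deployment in the logging namespace.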

Cooldown Behavior

The cooldown period is set by FAILOVER_COOLDOWN_SECONDS (300 seconds), a module-internal Lambda environment variable that is not exposed as a root-level tfvars variable. The timestamp of the last failover is persisted to AWS SSM Parameter Store at /logscale-dr/<function-name>/last-failover-timestamp. The Lambda reads this value on startup, so the cooldown survives cold starts and function redeployments.
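
The cooldown check itself is simple arithmetic once the stored timestamp is available. The sketch below assumes the timestamp is stored as Unix epoch seconds (the storage format is not stated here) and uses hypothetical helper names; only the parameter path layout comes from this section.

```python
import time


def ssm_parameter_name(function_name, prefix="/logscale-dr"):
    """Build the SSM parameter path that holds the last failover timestamp."""
    return f"{prefix.rstrip('/')}/{function_name}/last-failover-timestamp"


def cooldown_remaining(last_failover_ts, now=None, cooldown_seconds=300):
    """Return the seconds left in the cooldown window; 0.0 means failover may proceed."""
    if last_failover_ts is None:
        return 0.0  # no failover recorded yet
    now = time.time() if now is None else now
    return max(0.0, cooldown_seconds - (now - last_failover_ts))


print(ssm_parameter_name("dr-failover"))
print(cooldown_remaining(1000.0, now=1100.0))
```

With boto3, the value could be read with the SSM client's get_parameter call and written back with put_parameter (Overwrite=True) after a successful failover; both match the IAM permissions listed for the Lambda role.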

Retry Logic

All Kubernetes API operations use exponential backoff with jitter:

| Parameter | Default | Description |
|---|---|---|
| MAX_RETRIES | 3 | Maximum retry attempts per operation |
| BASE_DELAY_SECONDS | 1.0 | Initial delay before first retry |
| MAX_DELAY_SECONDS | 30.0 | Maximum delay cap between retries |

Retryable errors: HTTP 429 (rate limit), 500, 502, 503, 504, and connection errors

Non-retryable errors: HTTP 400, 401, 403, 404, 409, 422 (fail immediately)
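
The retry policy can be sketched as follows. The jitter strategy is not specified in this section, so the example assumes "full jitter" (a random delay between 0 and the capped exponential value). ApiError is a hypothetical stand-in for whatever exception the Kubernetes client raises; with the official Python client that would typically be ApiException, which also carries a status attribute.

```python
import random
import time

RETRYABLE_STATUS = {429, 500, 502, 503, 504}


class ApiError(Exception):
    """Hypothetical stand-in for a Kubernetes API error carrying an HTTP status."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random.random):
    """Full-jitter delay: random value in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retry(operation, max_retries=3, base=1.0, cap=30.0,
                    sleep=time.sleep, rng=random.random):
    """Run operation(), retrying retryable failures up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except ApiError as exc:
            # 400, 401, 403, 404, 409, 422 and other statuses fail immediately.
            if exc.status not in RETRYABLE_STATUS or attempt == max_retries:
                raise
        except ConnectionError:
            if attempt == max_retries:
                raise
        sleep(backoff_delay(attempt, base, cap, rng))
```

Passing sleep and rng as parameters keeps the function testable without real delays; the defaults mirror MAX_RETRIES, BASE_DELAY_SECONDS, and MAX_DELAY_SECONDS.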

Environment Variables

# Target cluster configuration
CLUSTER_NAME = <secondary-cluster>
CLUSTER_REGION = us-east-2
CLUSTER_NAMESPACE = logging
TARGET_OPERATOR_REPLICAS = 1
HUMIOCLUSTER_NAME = <cluster-name-prefix>-log
# Health check configuration
PRIMARY_HEALTH_CHECK_ID = <health-check-id>
SECONDARY_HEALTH_CHECK_ID = <health-check-id>
SKIP_SECONDARY_HEALTH_CHECK = true
# Pre-failover validation
PRE_FAILOVER_FAILURE_SECONDS = 180 # Must fail for 3 minutes before failover
FAILOVER_COOLDOWN_SECONDS = 300 # 5 minute cooldown (module-internal, not exposed as tfvar)
# SSM cooldown persistence
SSM_PARAMETER_PREFIX = /logscale-dr # SSM path prefix for cooldown state
# Full path: /logscale-dr/<function-name>/last-failover-timestamp
# Retry configuration
MAX_RETRIES = 3
BASE_DELAY_SECONDS = 1.0
MAX_DELAY_SECONDS = 30.0
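
A handler would typically parse these variables once at startup. The sketch below is illustrative only: the FailoverConfig class and load_config function are hypothetical names, and the fallback values for TARGET_OPERATOR_REPLICAS and SKIP_SECONDARY_HEALTH_CHECK are assumptions. The remaining defaults are the ones documented in this section.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverConfig:
    cluster_name: str
    cluster_region: str
    namespace: str
    target_operator_replicas: int
    skip_secondary_health_check: bool
    pre_failover_failure_seconds: int
    cooldown_seconds: int
    max_retries: int
    base_delay_seconds: float
    max_delay_seconds: float


def load_config(env=None):
    """Parse Lambda environment variables into a typed, immutable config."""
    env = os.environ if env is None else env
    return FailoverConfig(
        # Required: a missing key raises KeyError instead of guessing a target.
        cluster_name=env["CLUSTER_NAME"],
        cluster_region=env["CLUSTER_REGION"],
        namespace=env["CLUSTER_NAMESPACE"],
        # Assumed fallbacks (not documented defaults).
        target_operator_replicas=int(env.get("TARGET_OPERATOR_REPLICAS", "1")),
        skip_secondary_health_check=env.get("SKIP_SECONDARY_HEALTH_CHECK", "false").lower() == "true",
        # Documented defaults.
        pre_failover_failure_seconds=int(env.get("PRE_FAILOVER_FAILURE_SECONDS", "180")),
        cooldown_seconds=int(env.get("FAILOVER_COOLDOWN_SECONDS", "300")),
        max_retries=int(env.get("MAX_RETRIES", "3")),
        base_delay_seconds=float(env.get("BASE_DELAY_SECONDS", "1.0")),
        max_delay_seconds=float(env.get("MAX_DELAY_SECONDS", "30.0")),
    )
```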

IAM Permissions

The Lambda IAM role requires these permissions:

| Permission | Resource | Purpose |
|---|---|---|
| eks:DescribeCluster | Secondary EKS cluster | Get cluster endpoint and CA certificate |
| route53:GetHealthCheckStatus | Primary and secondary health checks | Validate health status before failover |
| route53:UpdateHealthCheck | Primary health check | Lock FQDN to failover-locked.invalid |
| cloudwatch:GetMetricStatistics | Route53 metrics (us-east-1) | Query consecutive failure duration |
| kms:Decrypt | Lambda KMS key | Decrypt environment variables |
| ssm:PutParameter, ssm:GetParameter | SSM Parameter Store (/logscale-dr/*) | Persist and read failover cooldown timestamp |

EKS Access: The Lambda uses an EKS Access Entry (not aws-auth ConfigMap) with AmazonEKSClusterAdminPolicy scoped to the logging namespace only.
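
The Access Entry maps the Lambda's IAM role to Kubernetes permissions; the role proves its identity with the k8s-aws-v1 bearer token from step 4. In that format, the token is the prefix k8s-aws-v1. followed by the base64url encoding (padding stripped) of a presigned STS GetCallerIdentity URL whose signature covers an x-k8s-aws-id header naming the cluster. The sketch below shows only the encoding step with hypothetical helper names. Producing the presigned URL (for example with boto3) is left out, and the URL in the usage lines is a shortened placeholder, not a real signed request.

```python
import base64

TOKEN_PREFIX = "k8s-aws-v1."


def encode_eks_token(presigned_url):
    """Wrap a presigned STS URL into a k8s-aws-v1 bearer token."""
    encoded = base64.urlsafe_b64encode(presigned_url.encode("utf-8")).decode("ascii")
    # The token format omits base64 padding.
    return TOKEN_PREFIX + encoded.rstrip("=")


def decode_eks_token(token):
    """Recover the presigned URL from a token (useful when debugging)."""
    if not token.startswith(TOKEN_PREFIX):
        raise ValueError("not a k8s-aws-v1 token")
    body = token[len(TOKEN_PREFIX):]
    padded = body + "=" * (-len(body) % 4)  # restore stripped padding
    return base64.urlsafe_b64decode(padded).decode("utf-8")


# Placeholder URL for illustration; a real one carries SigV4 query parameters.
url = "https://sts.us-east-2.amazonaws.com/?Action=GetCallerIdentity&Version=2011-06-15"
token = encode_eks_token(url)
print(token.startswith(TOKEN_PREFIX), decode_eks_token(token) == url)
```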

Cross-region IAM policy: Uses exact bucket ARN (arn:aws:s3:::<primary-bucket-name>), not wildcard patterns.

Lambda Configuration

| Setting | Value |
|---|---|
| Timeout | Configurable (default: 60s) |
| Source | DR failover Lambda module (Python) |
| Runtime | Python 3.12 |
| Memory | Configurable (default: 256 MB) |
| Log Retention | 7 days (via dr_failover_lambda_log_retention_days, default=7) |
| Handler | dr_failover_handler.lambda_handler |