Lambda Function Internals
The DR failover Lambda is covered at a high level in
DR Failover Lambda (module.dr-failover-lambda).
This section provides the implementation-level details.
Execution Steps
When triggered by an SNS notification from the CloudWatch alarm, the Lambda executes the following steps:
Cooldown Check: Verifies sufficient time has passed since the last failover (prevents flapping)
Pre-Failover Validation: Queries CloudWatch to confirm primary has been failing for the configured duration (default: 180 seconds)
Health Check Verification: Re-checks the Route53 health status to confirm the primary is still unhealthy (it may have recovered since the alarm fired)
EKS Authentication: Generates a short-lived bearer token using an STS presigned URL (k8s-aws-v1 format)
Kubernetes Client Setup: Connects to the secondary EKS cluster using the generated token
Idempotency Check: Reads current humio-operator replica count (skips if already scaled up)
TLS Secret Cleanup: Deletes the stale cluster TLS secret to prevent CA certificate mismatch errors (see Requirements)
Operator Scaling: Patches the humio-operator deployment from 0 → 1 replicas
Health Check FQDN Lock: Swaps primary health check FQDN to
failover-locked.invalid
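The token format used in the EKS Authentication step can be sketched as follows. A k8s-aws-v1 token is the literal prefix `k8s-aws-v1.` followed by the base64url encoding of a presigned STS GetCallerIdentity URL, with the trailing padding removed. The function names and the example URL are illustrative and not taken from the handler. Creating the presigned URL itself (which must sign the `x-k8s-aws-id` header carrying the cluster name) is omitted here.

```python
import base64

TOKEN_PREFIX = "k8s-aws-v1."


def encode_eks_token(presigned_url: str) -> str:
    """Wrap an already-presigned STS URL in the k8s-aws-v1 token format."""
    encoded = base64.urlsafe_b64encode(presigned_url.encode("utf-8")).decode("ascii")
    # EKS expects the base64url payload without '=' padding.
    return TOKEN_PREFIX + encoded.rstrip("=")


def decode_eks_token(token: str) -> str:
    """Recover the presigned URL from a token (useful when debugging)."""
    if not token.startswith(TOKEN_PREFIX):
        raise ValueError("not a k8s-aws-v1 token")
    body = token[len(TOKEN_PREFIX):]
    # Restore the padding that encode_eks_token stripped.
    padded = body + "=" * (-len(body) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


# Placeholder URL for illustration only; a real one carries SigV4 query parameters.
example_url = "https://sts.us-east-2.amazonaws.com/?Action=GetCallerIdentity&Version=2011-06-15"
example_token = encode_eks_token(example_url)
```

The resulting string is passed as the bearer token when the Kubernetes client connects to the secondary cluster.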
Key Point: The Lambda only scales the humio-operator deployment. The
operator then reconciles the HumioCluster CR (which already declares
nodeCount=1) and starts the LogScale pod. The
Lambda does not modify the HumioCluster spec.
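The idempotency check and the scaling patch reduce to a small decision, sketched here. `plan_operator_scale` is a hypothetical helper, not a function from the handler, and the patch body assumes the standard Deployment `spec.replicas` field. The real code may patch the scale subresource instead.

```python
from typing import Optional


def plan_operator_scale(current_replicas: Optional[int],
                        target_replicas: int = 1) -> Optional[dict]:
    """Return a Deployment patch body, or None when no change is needed.

    current_replicas may be None when the Deployment omits spec.replicas.
    """
    current = current_replicas or 0
    if current >= target_replicas:
        # Already scaled up: a repeated invocation becomes a no-op.
        return None
    # Only the operator Deployment is touched; the HumioCluster CR is left alone.
    return {"spec": {"replicas": target_replicas}}
```

A `None` result lets a repeated invocation exit early without touching the cluster.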
Cooldown Behavior
The cooldown period (FAILOVER_COOLDOWN_SECONDS=300) is a
module-internal Lambda environment variable, not exposed as a root-level
tfvars variable. The cooldown timestamp is persisted to AWS SSM Parameter
Store at
/logscale-dr/<function-name>/last-failover-timestamp,
ensuring the cooldown survives Lambda cold starts and function
redeployments. The Lambda reads this value on startup, so a cold start does not
bypass the cooldown.
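A minimal sketch of the cooldown logic, assuming the timestamp is stored as Unix epoch seconds (the storage format is not stated in this section). Both helper names are hypothetical. The SSM read and write calls are left out so that the logic stays testable without AWS access.

```python
from typing import Optional


def cooldown_parameter_name(prefix: str, function_name: str) -> str:
    """Build the SSM parameter path that holds the last failover timestamp."""
    return f"{prefix.rstrip('/')}/{function_name}/last-failover-timestamp"


def cooldown_remaining(last_failover_ts: Optional[float],
                       now: float,
                       cooldown_seconds: int = 300) -> float:
    """Seconds left in the cooldown window; 0.0 means failover may proceed."""
    if last_failover_ts is None:
        # No parameter stored yet, so no failover has been recorded.
        return 0.0
    elapsed = now - last_failover_ts
    return max(0.0, cooldown_seconds - elapsed)
```

With the defaults above, a failover recorded 100 seconds ago leaves 200 seconds of cooldown, and one recorded 400 seconds ago leaves none.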
Retry Logic
All Kubernetes API operations use exponential backoff with jitter:
| Parameter | Default | Description |
|---|---|---|
| MAX_RETRIES | 3 | Maximum retry attempts per operation |
| BASE_DELAY_SECONDS | 1.0 | Initial delay before first retry |
| MAX_DELAY_SECONDS | 30.0 | Maximum delay cap between retries |
Retryable errors: HTTP 429 (rate limit), 500, 502, 503, 504, connection errors
Non-retryable errors: HTTP 400, 401, 403, 404, 409, 422 (fail immediately)
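The retry policy above can be sketched as follows. This is an illustration under stated assumptions, not the handler's code: `ApiError` is a hypothetical stand-in for the Kubernetes client exception, the jitter variant is "full jitter" (a uniform draw between 0 and the capped exponential delay), and MAX_RETRIES is read as retries in addition to the initial attempt.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

RETRYABLE_STATUS = frozenset({429, 500, 502, 503, 504})


class ApiError(Exception):
    """Hypothetical stand-in for a client error that carries an HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter delay for a 0-based retry attempt, capped at `cap` seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(op: Callable[[], T],
                      max_retries: int = 3,
                      base: float = 1.0,
                      cap: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run op(), retrying retryable failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return op()
        except ApiError as exc:
            # 400/401/403/404/409/422 and any other status fail immediately.
            if exc.status not in RETRYABLE_STATUS or attempt == max_retries:
                raise
        except ConnectionError:
            if attempt == max_retries:
                raise
        sleep(backoff_delay(attempt, base, cap))
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

Injecting `sleep` keeps the function testable without real delays; in the Lambda it would default to `time.sleep`.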
Environment Variables
# Target cluster configuration
CLUSTER_NAME = <secondary-cluster>
CLUSTER_REGION = us-east-2
CLUSTER_NAMESPACE = logging
TARGET_OPERATOR_REPLICAS = 1
HUMIOCLUSTER_NAME = <cluster-name-prefix>-log
# Health check configuration
PRIMARY_HEALTH_CHECK_ID = <health-check-id>
SECONDARY_HEALTH_CHECK_ID = <health-check-id>
SKIP_SECONDARY_HEALTH_CHECK = true
# Pre-failover validation
PRE_FAILOVER_FAILURE_SECONDS = 180 # Must fail for 3 minutes before failover
FAILOVER_COOLDOWN_SECONDS = 300 # 5 minute cooldown (module-internal, not exposed as tfvar)
# SSM cooldown persistence
SSM_PARAMETER_PREFIX = /logscale-dr # SSM path prefix for cooldown state
# Full path: /logscale-dr/<function-name>/last-failover-timestamp
# Retry configuration
MAX_RETRIES = 3
BASE_DELAY_SECONDS = 1.0
MAX_DELAY_SECONDS = 30.0
IAM Permissions
The Lambda IAM role requires these permissions:
| Permission | Resource | Purpose |
|---|---|---|
| eks:DescribeCluster | Secondary EKS cluster | Get cluster endpoint and CA certificate |
| route53:GetHealthCheckStatus | Primary and secondary health checks | Validate health status before failover |
| route53:UpdateHealthCheck | Primary health check | Lock FQDN to failover-locked.invalid |
| cloudwatch:GetMetricStatistics | Route53 metrics (us-east-1) | Query consecutive failure duration |
| kms:Decrypt | Lambda KMS key | Decrypt environment variables |
| ssm:PutParameter, ssm:GetParameter | SSM Parameter Store (/logscale-dr/*) | Persist and read failover cooldown timestamp |
EKS Access: The Lambda uses an EKS Access Entry (not aws-auth ConfigMap)
with AmazonEKSClusterAdminPolicy scoped to the logging
namespace only.
Cross-region IAM policy: Uses exact bucket ARN
(arn:aws:s3:::<primary-bucket-name>), not
wildcard patterns.
Lambda Configuration
| Setting | Value |
|---|---|
| Timeout | Configurable (default: 60s) |
| Source | DR failover Lambda module (Python) |
| Runtime | Python 3.12 |
| Memory | Configurable (default: 256 MB) |
| Log Retention | 7 days (via dr_failover_lambda_log_retention_days, default=7) |
| Handler | dr_failover_handler.lambda_handler |