Known Issues and Recommendations

This section lists known issues and recommended mitigations for disaster recovery (DR) operations.

Issue 1: Secondary Health Check Uses TCP Only

Severity: MEDIUM

Location: modules/oci/global-dns/main.tf

When use_external_health_check=true, the repo can create an optional secondary monitor that checks TCP:8080 reachability. This is intended as a readiness/visibility signal and is not an application-level health check.

Impact:

  • TCP:8080 can appear healthy even if LogScale is not fully functional.

  • The DNS steering policy does not use this secondary TCP monitor for failover decisions.

Recommendation: Treat the secondary TCP monitor as "reachability only". For application health, use /api/v1/status.
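
A sketch of gating on the application endpoint rather than raw reachability. The exact /api/v1/status response shape assumed here (a JSON object containing a "status" field, e.g. {"status":"OK",...}) is an assumption, as is the helper name; adjust to what your LogScale version actually returns:

```shell
# Application-level probe (placeholder hostname):
#   curl -fsS "https://<global-fqdn>/api/v1/status"
# Gate on the response body, not on a successful TCP connect.
app_healthy() {
  case "$1" in
    *'"status":"OK"'*) return 0 ;;  # application reports healthy
    *)                 return 1 ;;  # reachable is not the same as healthy
  esac
}
```

This is the distinction the secondary TCP monitor cannot make: a TCP:8080 connect succeeds as soon as the port is open, regardless of what /api/v1/status would report.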

Issue 2: HumioCluster Shows "License Error" on Standby Cluster

Severity: LOW (Informational)

In dr="standby" mode, transient or stale HumioCluster status errors such as "license error" or "connection refused" are expected.

Root Cause: The standby design keeps humio-operator scaled to 0 replicas until failover (so no Humio pods are running). The HumioCluster status can lag behind and surface errors that would normally be handled by the operator.

What to do: No action is required. The error clears automatically once humio-operator is scaled up during failover.

Verification:

```shell
kubectl --context oci-secondary -n logging get deploy humio-operator
# Expected in standby: replicas: 0
```

Issue 3: DNS-01 Challenges Fail With "PEM data was not found in buffer"

Severity: HIGH

Location: modules/kubernetes/cert-manager-oci-webhook/main.tf (kubernetes_secret.oci_profile)

Symptoms:

DNS-01 challenges fail and cert-manager logs show an OCI client error similar to:

bad configuration: PEM data was not found in buffer

Root Cause: The OCI credentials Secret (oci-dns-credentials) must contain raw values. If the Secret values are accidentally double-base64 encoded, the webhook reads the private key as base64 text rather than PEM content.

Why this happens: Kubernetes Secrets store data as base64, and Terraform's kubernetes_secret.data already base64-encodes provided values. Do not wrap these fields with base64encode().
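
The failure mode can be reproduced locally with plain base64 (a sketch; the PEM line is a stand-in for a real key):

```shell
# Minimal local demonstration of the double-encoding failure mode.
pem='-----BEGIN PRIVATE KEY-----'
once=$(printf '%s' "$pem" | base64)   # the one encoding Kubernetes applies at rest
twice=$(printf '%s' "$once" | base64) # the accidental extra base64encode() wrap
# One round of encoding decodes back to PEM:
printf '%s' "$once" | base64 -d; echo
# Two rounds decode to base64 text, which the webhook cannot parse as PEM:
printf '%s' "$twice" | base64 -d; echo
```

The second decode yields "LS0t…" text rather than a "-----BEGIN" header, which is exactly what the webhook then rejects with "PEM data was not found in buffer".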

Verification (secret must decode to PEM):

```shell
kubectl --context oci-primary -n logging-cert get secret oci-dns-credentials \
  -o jsonpath='{.data.privateKey}' | base64 -d | head -c 50
# Expected prefix: -----BEGIN
```

Fix: Follow these steps:

  1. Re-apply Terraform: terraform apply -target=module.cert-manager-oci-webhook.

  2. Restart the webhook: kubectl --context <cluster> -n logging-cert rollout restart deploy cert-manager-webhook-oci.

  3. Re-trigger issuance (if needed) by deleting the stuck Certificate/Challenge and letting cert-manager recreate it.
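
The steps above can be sketched as commands; the cluster context, namespace, and the name of the stuck object are placeholders:

```shell
# 1. Recreate the credentials Secret with raw (single-encoded) values:
terraform apply -target=module.cert-manager-oci-webhook
# 2. Restart the webhook so it re-reads the Secret:
kubectl --context <cluster> -n logging-cert rollout restart deploy cert-manager-webhook-oci
# 3. If issuance is still stuck, delete the stale object and let
#    cert-manager recreate it:
kubectl --context <cluster> -n <namespace> delete certificate <stuck-certificate>
```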

Issue 4: OCI Load Balancer Shows CRITICAL/INVALID_STATUS_CODE

Severity: HIGH

Location: main.tf (nginx_ingress_sets), modules/oci/core/main.tf (LB/worker NSG rules)

Symptoms:

  • The public LogScale URL times out or is unreachable.

  • OCI LB backend set health shows many backends as CRITICAL, sometimes with health-check-status: INVALID_STATUS_CODE.

Common causes:

  • LB created in the wrong subnet/VCN because the nginx-ingress Service did not specify the subnet via:

    • service.beta.kubernetes.io/oci-load-balancer-subnet1

  • Missing NodePort/health-check connectivity between the LB and worker nodes (NSG rules must allow NodePort range 30000-32767 and health check port 10256).

Important note on "CRITICAL" backends (can be expected):

  • This repo sets nginx-ingress externalTrafficPolicy: Local so only nodes with ready ingress endpoints answer health checks.

  • In that mode, it is normal for nodes without ready ingress endpoints to fail the health check (even if the service is working overall). The LB should still route to healthy backends.
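
One way to see which nodes should pass the LB health check is to list the nodes that actually host a ready ingress pod (a sketch; the label selector assumes the standard ingress-nginx chart labels):

```shell
# With externalTrafficPolicy: Local, only nodes running a ready ingress pod
# answer the LB health check on port 10256.
kubectl --context <cluster> -n logging-ingress get pods -o wide \
  -l app.kubernetes.io/name=ingress-nginx
# Nodes absent from this output are expected to show CRITICAL in the backend set.
```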

Verification:

```shell
# Confirm the nginx-ingress Service is pinned to the expected subnet (prevents wrong-VCN LBs).
# Note: dots inside the annotation key must be escaped in jsonpath.
kubectl --context <cluster> -n logging-ingress get svc <nginx-ingress-svc> \
  -o jsonpath='{.metadata.annotations.service\.beta\.kubernetes\.io/oci-load-balancer-subnet1}{"\n"}'
```

Fix: Ensure that you are on a revision that includes:

  • nginx_ingress_sets subnet annotation in main.tf

  • NodePort + health check NSG rules in modules/oci/core/main.tf

Issue 5: Let's Encrypt Rate Limiting Blocks Re-Issuance

Severity: HIGH (during incident)

When re-issuing the same certificate too frequently, Let's Encrypt rate limits can prevent issuance and prolong downtime.

Mitigation: Restore the last-known-good TLS Secret from certs-backup/ (stop-gap until issuance succeeds again):

```shell
kubectl --context oci-secondary -n logging apply -f certs-backup/<global-fqdn>-tls-secret.yaml
kubectl --context oci-secondary -n logging get secret <global-fqdn>
```
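
To keep this stop-gap available, the backup can be refreshed after each successful issuance (a sketch; it reuses the secret name and file path from the restore command above):

```shell
# Snapshot the known-good TLS Secret into certs-backup/ while issuance works:
kubectl --context oci-secondary -n logging get secret <global-fqdn> \
  -o yaml > certs-backup/<global-fqdn>-tls-secret.yaml
```
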

Issue 6: Steering Policy Answer IP Drift

Severity: MEDIUM

Location: modules/oci/global-dns/main.tf (lifecycle.ignore_changes = [answers])

Terraform intentionally ignores changes to the steering policy answers to avoid unnecessary replacements caused by dynamic LB IP discovery.

Impact:

  • If the LB IP truly changes (e.g., LB recreation), Terraform might not automatically update the steering policy answers.

  • Emergency/manual answer edits are less likely to be reverted, but rule changes are still managed by Terraform.
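
Before remediating, it can help to confirm the answers actually drifted (a sketch; the Service name is a placeholder):

```shell
# Compare the IP that DNS currently answers with ...
dig +short <global-fqdn>
# ... the LB's actual public IP as reported by the Service:
kubectl --context oci-primary -n logging-ingress get svc <nginx-ingress-svc> \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}{"\n"}'
# If the two differ, the steering policy answers are stale.
```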

Remediation (when IP actually changed):

```shell
terraform taint 'module.global-dns[0].oci_dns_steering_policy.logscale_global_failover[0]'
terraform apply
```
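
On Terraform 0.15.2 and later, where taint is deprecated, the same forced replacement can be done in a single plan/apply:

```shell
terraform apply -replace='module.global-dns[0].oci_dns_steering_policy.logscale_global_failover[0]'
```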