Known Issues and Recommendations
This section lists some known issues and recommended mitigations for DR operations.
Issue 1: Secondary Health Check Uses TCP Only
Severity: MEDIUM
Location: modules/oci/global-dns/main.tf
When use_external_health_check=true, the repo can create an optional secondary monitor that checks TCP:8080 reachability. This is intended as a readiness/visibility signal and is not an
application-level health check.
Impact:
TCP:8080 can appear healthy even if LogScale is not fully functional.
The DNS steering policy does not use this secondary TCP monitor for failover decisions.
Recommendation: Treat the secondary TCP monitor as "reachability only".
For application health, use /api/v1/status.
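If you want OCI-side visibility into application health, an HTTP monitor against /api/v1/status can be sketched in Terraform. This is a sketch only: the resource shape follows the OCI provider's oci_health_checks_http_monitor, and the variable names (var.compartment_ocid, var.logscale_lb_ip) are assumptions, not names from this repo.

```hcl
# Sketch: application-level HTTP monitor for LogScale (not a TCP check).
# Attribute names follow the OCI Terraform provider's
# oci_health_checks_http_monitor resource; variables are placeholders.
resource "oci_health_checks_http_monitor" "logscale_app_health" {
  compartment_id      = var.compartment_ocid  # assumed variable
  display_name        = "logscale-app-health"
  targets             = [var.logscale_lb_ip]  # assumed variable
  protocol            = "HTTPS"
  port                = 443
  path                = "/api/v1/status"      # LogScale status endpoint
  method              = "GET"
  interval_in_seconds = 30
  timeout_in_seconds  = 10
  is_enabled          = true
}
```

Note that, as with the TCP monitor, this would still be visibility only unless the steering policy is wired to consume it.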
Issue 2: HumioCluster Shows "License Error" on Standby Cluster
Severity: LOW (Informational)
In dr="standby" mode, it is expected to see transient/stale HumioCluster
status errors such as "license error" or "connection refused" messages.
Root Cause: The standby design keeps humio-operator scaled to 0 replicas until failover (so no Humio pods are running). The HumioCluster status can lag behind and surface errors that would normally be handled by the operator.
What to do: No action is required. The error clears automatically once humio-operator is scaled up during failover.
Verification:
kubectl --context oci-secondary -n logging get deploy humio-operator
# Expected in standby: replicas: 0
Issue 3: DNS-01 Challenges Fail With "PEM data was not found in buffer"
Severity: HIGH
Location: modules/kubernetes/cert-manager-oci-webhook/main.tf (kubernetes_secret.oci_profile)
Symptoms:
DNS-01 challenges fail and cert-manager logs show an OCI client error similar to:
bad configuration: PEM data was not found in buffer
Root Cause: The OCI credentials Secret (oci-dns-credentials) must contain raw values. If the Secret values are accidentally double-base64 encoded, the webhook reads the private key as base64 text rather than PEM content.
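The double-encoding failure mode can be reproduced locally with nothing but base64 (illustrative only; the PEM string below is a stand-in for a real key):

```shell
# Simulate the pitfall: a PEM value that gets base64-encoded twice.
pem='-----BEGIN PRIVATE KEY-----'
once=$(printf '%s' "$pem" | base64)    # what Kubernetes stores normally
twice=$(printf '%s' "$once" | base64)  # accidental double encoding

# Correctly stored: one decode recovers PEM content.
printf '%s' "$once" | base64 -d; echo
# Double-encoded: one decode yields base64 text, not PEM,
# which is exactly what the webhook then fails to parse.
printf '%s' "$twice" | base64 -d; echo
```

The first decode prints the PEM prefix; the second prints base64 text, matching the "PEM data was not found in buffer" symptom.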
Why this happens: Kubernetes Secrets store data as base64, and Terraform's kubernetes_secret.data already base64-encodes provided values. Do not wrap these fields with base64encode().
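In Terraform terms, that means passing the PEM string through untouched. A minimal sketch (the variable name var.oci_private_key_pem is an assumption; the Secret name and namespace match the ones referenced above):

```hcl
# Sketch: the kubernetes provider base64-encodes `data` values itself,
# so pass raw strings here.
resource "kubernetes_secret" "oci_profile" {
  metadata {
    name      = "oci-dns-credentials"
    namespace = "logging-cert"
  }
  data = {
    # CORRECT: raw PEM string; the provider encodes it exactly once.
    privateKey = var.oci_private_key_pem  # assumed variable
    # WRONG: base64encode(var.oci_private_key_pem) would double-encode.
  }
}
```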
Verification (secret must decode to PEM):
kubectl --context oci-primary -n logging-cert get secret oci-dns-credentials \
-o jsonpath='{.data.privateKey}' | base64 -d | head -c 50
# Expected prefix: -----BEGIN
Fix: Follow these steps:
Re-apply Terraform: terraform apply -target=module.cert-manager-oci-webhook.
Restart the webhook: kubectl --context <cluster> -n logging-cert rollout restart deploy cert-manager-webhook-oci.
Re-trigger issuance (if needed) by deleting the stuck Certificate/Challenge and letting cert-manager recreate it.
Issue 4: OCI Load Balancer Shows CRITICAL/INVALID_STATUS_CODE
Severity: HIGH
Location: main.tf (nginx_ingress_sets), modules/oci/core/main.tf (LB/worker NSG rules)
Symptoms:
The public LogScale URL times out or is unreachable.
OCI LB backend set health shows many backends as CRITICAL, sometimes with health-check-status: INVALID_STATUS_CODE.
Common causes:
LB created in the wrong subnet/VCN because the nginx-ingress Service did not specify the subnet via service.beta.kubernetes.io/oci-load-balancer-subnet1.
Missing NodePort/health-check connectivity between the LB and worker nodes (NSG rules must allow NodePort range 30000-32767 and health check port 10256).
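The Service-side settings involved can be sketched as the following YAML. The annotation key is the real OCI cloud-provider annotation; the Service name and subnet OCID are placeholders, not values from this repo:

```yaml
# Sketch: nginx-ingress Service pinned to a specific LB subnet.
apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress-controller  # name assumed
  namespace: logging-ingress
  annotations:
    # Real OCI cloud-provider annotation; the OCID is a placeholder.
    service.beta.kubernetes.io/oci-load-balancer-subnet1: ocid1.subnet.oc1..example
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local  # see the note below on expected CRITICAL backends
  ports:
    - name: https
      port: 443
      targetPort: https
```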
Important note on "CRITICAL" backends (can be expected):
This repo sets nginx-ingress externalTrafficPolicy: Local, so only nodes with ready ingress endpoints answer health checks. In that mode, it is normal for nodes without ready ingress endpoints to fail the health check (even if the service is working overall). The LB should still route to healthy backends.
Verification:
# Confirm the nginx-ingress Service is pinned to the expected subnet (prevents wrong-VCN LBs)
kubectl --context <cluster> -n logging-ingress get svc <nginx-ingress-svc> \
  -o jsonpath='{.metadata.annotations.service\.beta\.kubernetes\.io/oci-load-balancer-subnet1}{"\n"}'
Fix: Ensure that you are on a revision that includes:
The nginx_ingress_sets subnet annotation in main.tf
NodePort and health-check NSG rules in modules/oci/core/main.tf
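The NSG side can be sketched as below. This follows the OCI provider's oci_core_network_security_group_security_rule resource shape; the NSG variable names are assumptions, not the repo's actual identifiers:

```hcl
# Sketch: allow LB -> worker traffic for the NodePort range and the
# kube-proxy health check port. NSG IDs are placeholders.
resource "oci_core_network_security_group_security_rule" "lb_to_workers_nodeports" {
  network_security_group_id = var.worker_nsg_id  # assumed variable
  direction                 = "INGRESS"
  protocol                  = "6"                # TCP
  source_type               = "NETWORK_SECURITY_GROUP"
  source                    = var.lb_nsg_id      # assumed variable
  tcp_options {
    destination_port_range {
      min = 30000
      max = 32767
    }
  }
}

resource "oci_core_network_security_group_security_rule" "lb_to_workers_healthcheck" {
  network_security_group_id = var.worker_nsg_id
  direction                 = "INGRESS"
  protocol                  = "6"
  source_type               = "NETWORK_SECURITY_GROUP"
  source                    = var.lb_nsg_id
  tcp_options {
    destination_port_range {
      min = 10256  # kube-proxy health check port
      max = 10256
    }
  }
}
```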
Issue 5: Let's Encrypt Rate Limiting Blocks Re-Issuance
Severity: HIGH (during incident)
When re-issuing the same certificate too frequently, Let's Encrypt rate limits can prevent issuance and prolong downtime.
Mitigation: Restore the last-known-good TLS Secret from certs-backup/ (stop-gap until issuance succeeds again):
kubectl --context oci-secondary -n logging apply -f certs-backup/<global-fqdn>-tls-secret.yaml
kubectl --context oci-secondary -n logging get secret <global-fqdn>
Issue 6: Steering Policy Answer IP Drift
Severity: MEDIUM
Location: modules/oci/global-dns/main.tf (lifecycle.ignore_changes = [answers])
Terraform intentionally ignores changes to the steering policy answers to avoid unnecessary replacements caused by dynamic LB IP discovery.
Impact:
If the LB IP truly changes (e.g., LB recreation), Terraform might not automatically update the steering policy answers.
Emergency/manual answer edits are less likely to be reverted, but rule changes are still managed by Terraform.
Remediation (when IP actually changed):
terraform taint 'module.global-dns[0].oci_dns_steering_policy.logscale_global_failover[0]'
terraform apply