Monitoring Tips

Admins should ideally monitor the following to keep clusters running efficiently:

  • Monitor for failed ingest requests. These indicate that your cluster is having trouble delivering data to Kafka. You can detect them by monitoring the LogScale debug logs.
  • Monitor for large event-latency metric values. This enables you to determine if your LogScale cluster is having trouble keeping up with processing of data already in Kafka.
  • Monitor for large ingest-queue-lowest-offset-lag metric values. This enables you to detect if your cluster is building up a large backlog of data in Kafka because of problems with local replication of segments and with uploads of segments to bucket storage.
  • Configure Kafka's retention so that Kafka cannot run out of disk space, even if LogScale doesn't clean up old events. This prevents the Kafka disks from filling up, which would cause the Kafka cluster to fail. Recovering from such a failure would require a Kafka reset, and any data not yet processed by LogScale would be lost.
  • Monitor disk usage on your Kafka nodes to ensure they're not close to their retention limit. This enables you to avoid losing data because Kafka's retention deletes it before LogScale has processed it. This situation can also be detected using the ingest-queue-lowest-offset-lag metric.
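
The threshold checks described in the list above can be sketched in a few lines of Python. This is only an illustration, not part of LogScale: the function names, the threshold values, and the assumption that event-latency is reported in milliseconds are all hypothetical, and how you collect the metric values from your cluster is left out.

```python
import shutil
from typing import List

# Hypothetical thresholds; tune them to your own cluster and Kafka retention.
MAX_EVENT_LATENCY_MS = 30_000      # assumes event-latency is in milliseconds
MAX_OFFSET_LAG = 1_000_000         # assumed limit for ingest-queue-lowest-offset-lag
MAX_DISK_USED_FRACTION = 0.80      # warn when a Kafka disk is more than 80% full


def disk_used_fraction(path: str) -> float:
    """Return the fraction (0.0 to 1.0) of the filesystem at `path` in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_cluster(event_latency_ms: float,
                  offset_lag: float,
                  kafka_disk_fraction: float) -> List[str]:
    """Compare the supplied values to the thresholds and return warnings."""
    warnings = []
    if event_latency_ms > MAX_EVENT_LATENCY_MS:
        warnings.append("event-latency is high: processing may be falling behind")
    if offset_lag > MAX_OFFSET_LAG:
        warnings.append("ingest-queue-lowest-offset-lag is high: backlog in Kafka")
    if kafka_disk_fraction > MAX_DISK_USED_FRACTION:
        warnings.append("Kafka disk usage is high: check retention settings")
    return warnings
```

On a Kafka node you might call check_cluster(latency, lag, disk_used_fraction(kafka_data_dir)), where kafka_data_dir is the node's Kafka data directory, and forward any returned warnings to your alerting system.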

See also the Capacity Planning section, which has tips on other aspects of your cluster that are worth monitoring, and the LogScale repository schema guide, which has full information on all metrics.