Troubleshooting: Disks Filling Up

Condition or Error

Disks used by LogScale fill up with data

LogScale runs out of disk space

Disk space usage increases and space is not recovered

Log shippers may see HTTP 404 errors if nodes have failed

LogScale may reject ingestion with HTTP 502 errors

Causes

  • Data in LogScale is stored in segments. Segments are written to the configured primary storage location in two situations:

    1. When the segments are created

    2. When they are downloaded from bucket storage to serve a query.

    In some cases, LogScale's local disks fill up before segments can be deleted fast enough, or the configuration has been set incorrectly.

    To confirm the disk usage situation:

    1. Check the Primary Disk Usage graph in the LogScale Insights package.

      Figure 1. Graph of Disks Filling Up


    2. Use df to check the disk space:

      shell
      $ df
      Filesystem     1K-blocks     Used Available Use% Mounted on
      udev             1967912        0   1967912   0% /dev
      tmpfs             399508     1640    397868   1% /run
      /dev/sda5       19992176  9021684   9931900  48% /
      tmpfs            1997540        0   1997540   0% /dev/shm
      tmpfs               5120        4      5116   1% /run/lock
      tmpfs            1997540        0   1997540   0% /sys/fs/cgroup
      /dev/sda1         523248        4    523244   1% /boot/efi
      /dev/sdb1       20510332   557992  18903816   3% /kafka
      /dev/sdc1       19992176 13588164   5365420  72% /humio

      In this example, LogScale data is mounted at /humio and is 72% used. Disk space usage above roughly 85% usually indicates that disk space is being exhausted.
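      A quick check along these lines can flag the condition before it becomes critical. This is a sketch, not part of LogScale: the 85% threshold is the rule of thumb above, and the mount point to check is whatever holds your LogScale data directory.

```shell
#!/bin/sh
# check_usage MOUNT: warn when the given filesystem exceeds a usage
# threshold. The 85% threshold matches the rule of thumb above.
check_usage() {
  mount="$1"
  threshold=85
  # df -P emits one stable, POSIX-format line per filesystem;
  # column 5 is the usage percentage (e.g. "72%").
  usage=$(df -P "$mount" | awk 'NR==2 { gsub("%", "", $5); print $5 }')
  if [ "$usage" -ge "$threshold" ]; then
    echo "WARNING: $mount is ${usage}% full"
  else
    echo "OK: $mount is ${usage}% full"
  fi
}

# Example: check the root filesystem; in practice, point this at the
# LogScale data mount (e.g. /humio in the df output above).
check_usage /
```

      Run periodically (e.g. from cron) this gives an early warning well before LogScale itself starts rejecting ingestion.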

      Further diagnosis of the issue depends on the storage configuration:

      • If secondary storage is NOT enabled, check that LOCAL_STORAGE_MIN_AGE_DAYS and LOCAL_STORAGE_PERCENTAGE are set to sensible values. For example:

        ini
        LOCAL_STORAGE_PERCENTAGE=80
        LOCAL_STORAGE_MIN_AGE_DAYS=0

        Important

        These configurations are only valid if bucket storage has been configured.

      • If bucket storage and secondary storage are enabled, check the values set for PRIMARY_STORAGE_PERCENTAGE and LOCAL_STORAGE_PERCENTAGE. LogScale fills the primary storage up to the limit specified by PRIMARY_STORAGE_PERCENTAGE; beyond that, the oldest segments (by when they were ingested, not when they were last used) are moved to secondary storage. Once the secondary disk fills to LOCAL_STORAGE_PERCENTAGE, LogScale starts deleting files from it, least-recently used first.
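        As a point of reference, a combined primary/secondary configuration might look like the sketch below. The percentages and the secondary directory path are illustrative values, not recommendations; tune them for your cluster.

```ini
# Illustrative values only; tune for your cluster.
# Fill the primary data disk to at most 80% before moving
# segments to secondary storage.
PRIMARY_STORAGE_PERCENTAGE=80
# Start deleting least-recently-used local files once the
# secondary disk reaches 90%.
LOCAL_STORAGE_PERCENTAGE=90
# Mount point of the secondary storage disk (illustrative path).
SECONDARY_DATA_DIRECTORY=/secondary/humio-data
```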

Solutions

  • Resolving the issue if LogScale nodes are up:

    • Identify heavy repositories and add retention to trim data.

      In the short term, we need to remove data to stop LogScale disks from filling up.

      To do this, run this query in the humio repository to identify the heaviest repositories:

      syslog
      class = "*c.h.r.RetentionJob*" "Retention-stats for 'dataspace'="
      | timechart(dataspace, function={max(before_compressed)},unit=bytes,span=30min)

      This shows how much storage each of your repositories uses. Target the largest repositories and add retention to those where appropriate. To add retention, go to Repository > Settings > Data Retention and add either a time limit or a storage size limit lower than the current setting.
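      A grouped variant of the same query can rank repositories by size instead of charting over time. This is a sketch built from the same class filter and fields as above; it assumes max(before_compressed) emits its result in the default _max field:

```syslog
class = "*c.h.r.RetentionJob*" "Retention-stats for 'dataspace'="
| groupBy(dataspace, function=max(before_compressed))
| sort(_max, order=desc)
```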

    • Kill all queries, or kill the most resource-intensive queries, for a short period of time to allow disk utilization to come down.

    • Temporarily disable the node(s) with the highest disk utilization for a short period of time to allow disk usage to come down. This is where the chart in the Primary Disk Usage widget comes in handy. It will show values per node.

    • Check your LogScale version. Improvements to disk utilization management shipped in Humio Server 1.30.1 LTS (2021-10-01), Humio Server 1.31.0 GA (2021-09-27), and Humio Server 1.32.0 LTS (2021-10-26), with each subsequent version offering further improvements. For example, v1.31 improved handling of local disk space relative to LOCAL_STORAGE_MIN_AGE_DAYS: previously the local disk could overflow while respecting that setting, whereas LogScale can now delete the oldest local segments that are present in bucket storage, even when they fall within that time range.