Troubleshooting: Disks Filling Up
Condition or Error
Disks used by LogScale fill up with data
LogScale runs out of disk space
Disk space usage increases and space is not recovered
Log shippers may see HTTP 404 errors if nodes have failed
LogScale may reject ingestion with HTTP 502 errors
Causes
Data within LogScale is stored in segments. Segments are written to the configured primary storage location in two situations:
When the segments are created
When they are downloaded from bucket storage to serve a query
In some cases, LogScale's local disks fill up because segments cannot be deleted fast enough, or because the storage configuration has been set incorrectly.
To confirm the disk usage situation:
You can check the Primary Disk Usage graph within the LogScale Insights package.
Figure 1. Graph of Disks Filling Up
Use df to check the disk space:
Filesystem     1K-blocks     Used Available Use% Mounted on
udev             1967912        0   1967912   0% /dev
tmpfs             399508     1640    397868   1% /run
/dev/sda5       19992176  9021684   9931900  48% /
tmpfs            1997540        0   1997540   0% /dev/shm
tmpfs               5120        4      5116   1% /run/lock
tmpfs            1997540        0   1997540   0% /sys/fs/cgroup
/dev/sda1         523248        4    523244   1% /boot/efi
/dev/sdb1       20510332   557992  18903816   3% /kafka
/dev/sdc1       19992176 13588164   5365420  72% /humio
In this case LogScale data is mounted at /humio and is 72% in use. Disk space usage above 85% generally indicates that disk space is being exhausted.
Further diagnosis of the issue depends on the storage configuration:
If secondary storage is NOT enabled, check that LOCAL_STORAGE_MIN_AGE_DAYS and LOCAL_STORAGE_PERCENTAGE are set to sensible values. For example:
LOCAL_STORAGE_PERCENTAGE=80
LOCAL_STORAGE_MIN_AGE_DAYS=0
Important
These configurations are only valid if bucket storage has been configured.
If bucket storage and secondary storage are enabled, check the values set for PRIMARY_STORAGE_PERCENTAGE and LOCAL_STORAGE_PERCENTAGE. LogScale fills primary storage up to the limit specified by PRIMARY_STORAGE_PERCENTAGE; beyond that, the oldest segments (in terms of when they were ingested, not when they were last used) are moved to secondary storage. Once secondary storage fills to LOCAL_STORAGE_PERCENTAGE, LogScale starts deleting files from the secondary disk in least-recently-used order.
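As a point of reference, a minimal sketch of those two settings (the values are illustrative, not recommendations, and the secondary storage location itself is configured separately):
PRIMARY_STORAGE_PERCENTAGE=80   # fill primary storage to 80%, then move the oldest segments to secondary
LOCAL_STORAGE_PERCENTAGE=85     # once secondary storage reaches 85%, delete least-recently-used files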
Solutions
Resolving the issue if LogScale nodes are up:
Identify heavy repositories and add retention to trim data.
In the short term, we need to remove data to stop LogScale disks from filling up.
To do this, run the following query in the humio repository to see which repositories are the heaviest:
class = "*c.h.r.RetentionJob*" "Retention-stats for 'dataspace'="
| timechart(dataspace, function={max(before_compressed)}, unit=bytes, span=30min)
This shows how much storage each of your repositories is using; a ranked variant of the query is sketched below. Target the largest repositories and add retention to them where appropriate. To add retention, go to Repository → → and add either a time limit or a storage size limit lower than the current setting.
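If a ranked list is easier to read than a time chart, a variant of the same query (a sketch using the same fields; adjust the limit as needed) groups the results by repository and sorts them by size:
class = "*c.h.r.RetentionJob*" "Retention-stats for 'dataspace'="
| groupBy(dataspace, function=max(before_compressed))
| sort(_max, order=desc, limit=20)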
Kill all queries, or kill the most resource-intensive queries, for a short period of time to allow disk utilization to come down.
Temporarily disable the node(s) with the highest disk utilization for a short period of time to allow disk usage to come down. This is where the chart in the Primary Disk Usage widget comes in handy, as it shows values per node.
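If the LogScale Insights package is not available, a quick shell sketch can give a similar per-node view; the hostnames are placeholders and the /humio mount point is taken from the df example above:
# Compare LogScale data-disk usage across cluster nodes (hostnames are placeholders)
for host in logscale-node1 logscale-node2 logscale-node3; do
  echo "== $host =="
  ssh "$host" df -h /humio
done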
Check your LogScale version. Improvements to disk utilization management were introduced in Humio Server 1.30.1 LTS (2021-10-01), Humio Server 1.31.0 GA (2021-09-27) and Humio Server 1.32.0 LTS (2021-10-26), with each subsequent version offering further improvements. Version 1.31 introduced improved handling of local disk space relative to LOCAL_STORAGE_MIN_AGE_DAYS: previously, the local disk could overflow while respecting that setting; LogScale can now delete the oldest local segments that are already present in bucket storage, even when they fall within that time range.
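As an illustration of that behaviour (values are illustrative only), a configuration such as the following asks LogScale to keep roughly a day of segments on local disk; on 1.31 and later, segments inside that window which already exist in bucket storage can still be deleted locally, oldest first, if the disk would otherwise exceed the percentage limit:
LOCAL_STORAGE_PERCENTAGE=80   # target ceiling for local disk usage
LOCAL_STORAGE_MIN_AGE_DAYS=1  # prefer to keep the last day of segments on local disk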