Backfilling Data
Humio handles the ingestion of non-current data in two ways: the
humioBackfill tag, which is applied automatically, and manual backfill of
historical events. Normally, when log shippers send events from their
sources, the events arrive shortly after being produced and roughly in
order. When backfilling old data, however, old and new events are mixed,
which breaks that sequential order. The result is data segments that span
very broad time intervals, which makes querying less efficient because a
search has to scan data outside the requested time range.
Using humioBackfill
To help avoid this problem, Humio automatically adds the humioBackfill tag
at ingest to all events that are older than 24 hours (the
MAX_HOURS_SEGMENT_OPEN setting). With humioBackfill set, only the
backfilled data is affected. In addition, when the backfill tag is set,
segments are not closed early, even if their time span exceeds 24 hours.
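To predict which events in a file will receive the tag, the age rule can be written as a small shell check. This is an illustration of the documented rule only, not Humio's implementation: the function name is hypothetical, timestamps are assumed to be epoch seconds, and the shell variable MAX_HOURS_SEGMENT_OPEN is assumed to mirror the server setting (defaulting to 24).

```shell
# Illustration only: mirrors the documented rule, not Humio's actual code.
# Succeeds (exit status 0) when an event would be old enough to get the tag.
would_get_backfill_tag() {
  local event_epoch="$1" now_epoch="$2"
  # Assumption: the threshold in hours, defaulting to the documented 24.
  local max_hours="${MAX_HOURS_SEGMENT_OPEN:-24}"
  local age_seconds=$(( now_epoch - event_epoch ))
  [ "$age_seconds" -gt $(( max_hours * 3600 )) ]
}
```

For example, `would_get_backfill_tag "$event_epoch" "$(date +%s)" && echo "backfill"` reports whether an event with the given timestamp falls on the backfill side of the threshold right now.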
Sending Historical Data (Backfilling Events)
Suppose you have files on disk with events from the last month and want to send them to Humio alongside the "live" events.
To get good performance from the resulting data inside Humio, and to keep the backfilled data from interfering with the live events already flowing, observe the following rules:
Ship the historical events in order of their timestamps, or as close to that order as possible. Deviations of a few seconds have little consequence, whereas deviations of hours or days are suboptimal.
If you ship data in parallel (for example, by running multiple Filebeat instances), make each stream visible to Humio by giving it a distinct tag, so that it does not overlap the other historical streams.
If these guidelines are not followed, the likely result is an increase in the number of segment files and much higher I/O usage when searching time spans that overlap the historical events or the live events that were ingested while the backfill was active. The segment files are likely to become large and to have overlapping time spans, so a large number of files must be searched even for a short time interval.
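If a historical file is not already in timestamp order, it can be sorted before shipping. The following is a minimal sketch, assuming every line is a single event that starts with an ISO-8601 UTC timestamp (for example 2019-06-17T08:15:00Z), so that byte order equals time order. The function name is hypothetical, and multi-line events such as stack traces would need different handling.

```shell
# Hypothetical helper: write a copy of a log file with its lines ordered by timestamp.
# Assumes one event per line, each starting with an ISO-8601 UTC timestamp.
sort_by_timestamp() {
  local input="$1" output="$2"
  # LC_ALL=C gives plain byte-wise comparison, which matches time order for ISO-8601.
  # -s keeps lines with identical timestamps in their original relative order.
  # -k1,1 compares only the first whitespace-delimited field (the timestamp).
  LC_ALL=C sort -s -k1,1 "$input" > "$output"
}
```

Calling `sort_by_timestamp myapp.2019-06-17.log myapp.2019-06-17.sorted.log` leaves the original file untouched and produces a sorted copy to point the shipper at.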
Example: Using Filebeat to Ship Historical Events
As an example, say you have one 10 GB log file for each day of the last
month. You want to send all of them into Humio in parallel while a stream
of live events is already flowing. In this case, run one instance of your
shipper (here, Filebeat) for each file. Each instance needs its own
configuration file that sets a distinct tag. One option is to use the name
of the file being backfilled as the tag value, which you can do by making
the @source field that Filebeat sets a tag in the Humio parser. Better
yet, add or extend the fields section in the Filebeat config, as in the
following example, which uses the date of the file as the tag value:
filebeat:
  inputs:
    - paths:
        - /var/log/need-backfilling/myapp.2019-06-17.log
      # the section that adds the backfill tag:
      fields:
        "@humioBackfill": "2019-06-17"
        "@tags": ["@type", "@humioBackfill"]
queue.mem:
  events: 8000
  flush.min_events: 200
  flush.timeout: 1s
output:
  elasticsearch:
    hosts: ["https://$HUMIO_HOST:443/api/v1/ingest/elastic-bulk"]
    username: $SENDER
    password: $INGEST_TOKEN
    compression_level: 5
    bulk_max_size: 200
    worker: 4
# Don't raise bulk_max_size much: 100 - 300 is the appropriate range.
# While doing so may increase ingest throughput, it has a negative impact on search performance of the resulting events in Humio.
Each Filebeat instance needs its own fresh directory and its own configuration file. For backfilling purposes, Filebeat should not run as a daemon but in run-once mode. Here is an example command line that launches Filebeat in this mode. It assumes that you have placed the (distinct) filebeat.yml config file in the config directory named below.
# Removing the registry will make Filebeat ship from scratch again.
rm -rf /path/to/filebeat-instance-dir/registry
filebeat -e --once \
  --path.config /path/to/filebeat-instance-dir/ \
  --path.data /path/to/filebeat-instance-dir/ \
  --path.home /path/to/filebeat-instance-dir/ \
  --path.logs /path/to/filebeat-instance-dir/logs
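With one file per day, setting up each instance by hand gets repetitive. The sketch below shows one way to script it, using the file's base name as the tag value. The function names and the directory layout are illustrative assumptions, not part of Humio or Filebeat. It also assumes that HUMIO_HOST, SENDER, and INGEST_TOKEN are exported in the environment and relies on Filebeat resolving ${VAR} references in its config at startup, so the token is not written to disk.

```shell
# Sketch: prepare and run one Filebeat instance per historical file.
# Function names and directory layout are illustrative assumptions.

# Write a filebeat.yml for one file, tagged with that file's base name.
write_backfill_config() {
  local log_file="$1" instance_dir="$2"
  local tag
  tag="$(basename "$log_file")"
  mkdir -p "$instance_dir"
  # \${...} is left for Filebeat to resolve from its environment at startup;
  # ${log_file} and ${tag} are filled in by the shell right now.
  cat > "$instance_dir/filebeat.yml" <<EOF
filebeat:
  inputs:
    - paths:
        - ${log_file}
      fields:
        "@humioBackfill": "${tag}"
        "@tags": ["@type", "@humioBackfill"]
queue.mem:
  events: 8000
  flush.min_events: 200
  flush.timeout: 1s
output:
  elasticsearch:
    hosts: ["https://\${HUMIO_HOST}:443/api/v1/ingest/elastic-bulk"]
    username: \${SENDER}
    password: \${INGEST_TOKEN}
    compression_level: 5
    bulk_max_size: 200
    worker: 4
EOF
}

# Run one instance in run-once mode, with the same flags as the command above.
run_backfill_instance() {
  local instance_dir="$1"
  rm -rf "$instance_dir/registry"
  filebeat -e --once \
    --path.config "$instance_dir" \
    --path.data "$instance_dir" \
    --path.home "$instance_dir" \
    --path.logs "$instance_dir/logs"
}

# Prepare and launch every *.log file in a directory in parallel, then wait.
backfill_all() {
  local source_dir="$1" work_dir="$2" log_file instance_dir
  for log_file in "$source_dir"/*.log; do
    [ -e "$log_file" ] || continue   # skip when the glob matches nothing
    instance_dir="$work_dir/$(basename "$log_file" .log)"
    write_backfill_config "$log_file" "$instance_dir"
    run_backfill_instance "$instance_dir" &
  done
  wait
}
```

After sourcing these functions, `backfill_all /var/log/need-backfilling /path/to/backfill-work` would create one instance directory per file and start all shippers at once. Because every instance gets a distinct tag value, the parallel streams stay separate inside Humio.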