There are different ways to handle the ingestion of non-current data in
Humio: the automatic humioBackfill tag, and manual backfill of
historical events. Normally, when log shippers send events from their
sources, the events arrive shortly after being produced and roughly in
order. When backfilling old data, however, old and new events are
mixed, which breaks this sequential order. The resulting data segments
span very broad time intervals, which makes queries less efficient,
because data outside the queried interval is scanned unnecessarily.
To help avoid this problem, Humio automatically adds the tag
humioBackfill at ingest to all events that are older than 24 hours.
Because these events carry the tag humioBackfill, only the backfilled
data will be affected. Also, if the backfill tag is set, segments will
not be closed early, even if the timespan is larger than 24h.
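To estimate in advance whether a given event would fall under the 24-hour rule described above, a small check like the following can help. It is only an illustrative sketch and not Humio's implementation: the function name `is_older_than_24h` is hypothetical, timestamps are assumed to be Unix epoch seconds, and the comparison point is assumed to be the time the check runs (or an explicit second argument).

```shell
# Hypothetical helper, not part of Humio: succeeds (exit status 0) when the
# event timestamp is more than 24 hours before the reference time.
# Usage: is_older_than_24h EVENT_EPOCH_SECONDS [NOW_EPOCH_SECONDS]
is_older_than_24h() {
  local event_epoch="$1"
  # Assumption: default the reference time to the current time.
  local now_epoch="${2:-$(date +%s)}"
  # 86400 seconds = 24 hours
  [ $(( now_epoch - event_epoch )) -gt 86400 ]
}
```

For example, `is_older_than_24h 1560729600 && echo "would be treated as backfill"` tests a single event timestamp against the current time.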
Sending Historical Data (Backfilling Events)
You may need to backfill when, for example, you have files on disk with events from the last month and want them sent to Humio along with the "live" events.
Observe the following rules so that the resulting data performs well inside Humio and the backfilled data does not interfere with the live events already flowing.
Make sure to ship the historical events in order by their timestamp, or at least as close as possible to this ordering. Events that are a few seconds out of order have little consequence, whereas hours or days are suboptimal.
If shipping data in parallel (for example, running multiple Filebeat instances), make those streams visible to Humio by giving each stream a distinct tag, so that it does not overlap the other historical streams.
If those guidelines are not followed, the result is likely to be an increase in the number of segment files and a much higher IO usage when searching time spans that overlap the historical events or the live events that were ingested while the backfill was active. The segment files are likely to get large and have overlapping time spans, leading to a large number of files being searched even when searching a short time interval.
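If a historical file is not already in timestamp order, it can be sorted before shipping. The following is a minimal sketch under stated assumptions: each event occupies exactly one line, and each line begins with an ISO-8601 timestamp in a single consistent format and time zone (so that text order equals time order). The function name `sort_events_by_timestamp` is hypothetical.

```shell
# Hypothetical helper: print the events of a log file ordered by their
# leading timestamp field.
# Assumptions: one event per line; the first blank-separated field is an
# ISO-8601 timestamp in one consistent format and time zone.
sort_events_by_timestamp() {
  # LC_ALL=C gives plain byte-wise comparison; -s keeps the original
  # relative order of events that share the same timestamp.
  LC_ALL=C sort -s -k1,1 -- "$1"
}
```

A typical use would be `sort_events_by_timestamp myapp.2019-06-17.log > myapp.2019-06-17.sorted.log`, then pointing the shipper at the sorted file. Multi-line events (such as stack traces) would need a different approach.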
Example: Using Filebeat to Ship Historical Events
As an example, let's say you have one file of 10 GB of logs for each day
in the last month. You want to send all of them in parallel into Humio,
and there is already a stream of live events flowing. You should then
run one instance of the desired shipper (here, Filebeat)
for each file. Each shipper needs a configuration file that sets a
distinct tag. Let's use the filename being backfilled as the tag value.
For Filebeat this can be accomplished by making the
@source field that is set by Filebeat a tag in the
parser in Humio. Or better yet, you can add or extend the
fields section in the config:
    filebeat:
      inputs:
        - paths:
            - /var/log/need-backfilling/myapp.2019-06-17.log
          # the section that adds the backfill tag:
          fields:
            "@humioBackfill": "2019-06-17"
            "@tags": ["@type", "@humioBackfill"]
    queue.mem:
      events: 8000
      flush.min_events: 200
      flush.timeout: 1s
    output:
      elasticsearch:
        hosts: ["https://$HUMIO_HOST:443/api/v1/ingest/elastic-bulk"]
        username: $SENDER
        password: $INGEST_TOKEN
        compression_level: 5
        bulk_max_size: 200
        worker: 4
        # Don't raise bulk_max_size much: 100 - 300 is the appropriate range.
        # While doing so may increase throughput of ingest it has a negative
        # impact on search performance of the resulting events in Humio.
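Since each shipper needs its own configuration file with a distinct tag, the per-file configs could be generated rather than written by hand. The sketch below is one possible way to do that. The function name `write_backfill_config` is hypothetical, the tag value is assumed to be the base name of the file being backfilled, and all other settings are copied from the example config above, with `$HUMIO_HOST`, `$SENDER` and `$INGEST_TOKEN` left in place as placeholders.

```shell
# Hypothetical helper: write a filebeat.yml for one historical log file into
# its own instance directory.
# Usage: write_backfill_config /path/to/logfile /path/to/filebeat-instance-dir
write_backfill_config() {
  local logfile="$1"
  local instance_dir="$2"
  local tag
  # Assumption: use the name of the file being backfilled as the tag value.
  tag="$(basename "$logfile")"
  mkdir -p "$instance_dir"
  # \$ keeps the placeholders literal; $logfile and $tag are filled in here.
  cat > "$instance_dir/filebeat.yml" <<EOF
filebeat:
  inputs:
    - paths:
        - $logfile
      # the section that adds the backfill tag:
      fields:
        "@humioBackfill": "$tag"
        "@tags": ["@type", "@humioBackfill"]
queue.mem:
  events: 8000
  flush.min_events: 200
  flush.timeout: 1s
output:
  elasticsearch:
    hosts: ["https://\$HUMIO_HOST:443/api/v1/ingest/elastic-bulk"]
    username: \$SENDER
    password: \$INGEST_TOKEN
    compression_level: 5
    bulk_max_size: 200
    worker: 4
EOF
}
```

Calling `write_backfill_config /var/log/need-backfilling/myapp.2019-06-17.log /path/to/filebeat-instance-dir` once per file, each time with a different instance directory, would give every stream its own config and its own tag value.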
Filebeat needs a fresh directory for each instance and a separate configuration file. For backfilling purposes, Filebeat should not run as a daemon but in the run-once mode. Here is an example command line to launch Filebeat in this mode. It assumes that you have placed the (distinct) filebeat.yml config file in the config directory named below.
    # Removing the registry will make filebeat ship from scratch again.
    rm -rf /path/to/filebeat-instance-dir/registry
    filebeat -e --once \
      --path.config /path/to/filebeat-instance-dir/ \
      --path.data /path/to/filebeat-instance-dir/ \
      --path.home /path/to/filebeat-instance-dir/ \
      --path.logs /path/to/filebeat-instance-dir/logs
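When backfilling several files in parallel, the command above has to be started once per instance directory. The following sketch wraps it in two hypothetical functions, `run_backfill_instance` and `run_all_backfills`, which start one run-once Filebeat per directory in the background and wait for all of them. The `FILEBEAT_BIN` variable is an assumption introduced here to allow pointing at a different executable; it defaults to `filebeat`.

```shell
# Hypothetical helper: run one Filebeat instance in run-once mode using the
# given directory for its config, data, home and logs.
run_backfill_instance() {
  local dir="$1"
  # Removing the registry will make filebeat ship from scratch again.
  rm -rf "$dir/registry"
  "${FILEBEAT_BIN:-filebeat}" -e --once \
    --path.config "$dir/" \
    --path.data "$dir/" \
    --path.home "$dir/" \
    --path.logs "$dir/logs"
}

# Hypothetical helper: start one instance per directory in parallel, then
# wait for all of them. Returns non-zero if any instance failed.
# Note: instance directory paths must not contain whitespace.
run_all_backfills() {
  local dir pid
  local pids=""
  local status=0
  for dir in "$@"; do
    run_backfill_instance "$dir" &
    pids="$pids $!"
  done
  for pid in $pids; do
    wait "$pid" || status=1
  done
  return "$status"
}
```

For example, `run_all_backfills /path/to/instance-a /path/to/instance-b` would ship two historical files concurrently, assuming each directory already contains its own filebeat.yml with a distinct tag.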