There are different ways to handle the ingestion of non-current data in
LogScale: automatic tagging with the humioBackfill tag, and
manual backfill of historical events. Normally, when log shippers send
events from their sources, the events arrive shortly after being produced
and roughly in order. When backfilling old data, however, old and new
events will be mixed, which changes the sequential order. This leads to
data segments that span very broad time intervals, which makes querying
less effective, as unnecessary data is scanned.
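The effect described above can be illustrated with a toy model. This is only a sketch, not LogScale's actual storage format: it assumes that events are grouped into fixed-size segments in arrival order and that a query must scan every segment whose time span overlaps the queried interval. The names `build_segments` and `segments_scanned` are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (earliest, latest) event time in the segment, in hours


def build_segments(arrivals: List[int], size: int) -> List[Segment]:
    """Group events into segments in arrival order and record each segment's time span."""
    chunks = [arrivals[i:i + size] for i in range(0, len(arrivals), size)]
    return [(min(c), max(c)) for c in chunks]


def segments_scanned(segments: List[Segment], start: int, end: int) -> int:
    """Count the segments whose time span overlaps the query interval [start, end]."""
    return sum(1 for lo, hi in segments if lo <= end and hi >= start)


old = list(range(0, 12))       # 12 historical events, hours 0..11
live = list(range(100, 112))   # 12 live events, hours 100..111

in_order = old + live                                  # each run arrives in order
mixed = [t for pair in zip(old, live) for t in pair]   # old and new interleaved

query = (104, 105)  # a short interval inside the live data
print(segments_scanned(build_segments(in_order, 4), *query))  # 1
print(segments_scanned(build_segments(mixed, 4), *query))     # 4
```

In the interleaved case every segment spans roughly a hundred hours, so a two-hour query touches four segments instead of one.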
To help avoid this problem, LogScale automatically adds the
humioBackfill tag at ingest to all events
that are older than 24 hours. Because these events carry the
humioBackfill tag, only the
backfilled data will be affected. Also, if the backfill tag is set,
segments will not be closed early, even if the timespan is larger than
Sending Historical Data (Backfilling Events)
You may need to send historical data to LogScale, for example if you have files on disk with events from the last month and want them sent to LogScale along with the "live" events.
There are some rules you should observe for optimal performance of the resulting data inside LogScale, so that the backfilled data does not interfere with the live events already flowing.
Make sure to ship the historical events in order by their timestamp, or at least as close as possible to this ordering. Events that are a few seconds out of order have little consequence, whereas hours or days is suboptimal.
If shipping data in parallel (for example, running multiple Filebeat instances), make those streams distinguishable to LogScale by using a distinct tag for each stream, so that they do not overlap the other historical streams.
If those guidelines are not followed, the result is likely to be an increase in the number of segment files and a much higher IO usage when searching time spans that overlap the historical events or the live events that were ingested while the backfill was active. The segment files are likely to get large and have overlapping time spans, leading to a large number of files being searched even when searching a short time interval.
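The ordering rule above can be checked before shipping. The following is a minimal sketch, assuming each log line starts with an ISO 8601 timestamp followed by a space (an assumption about your log format, not something LogScale requires). The helper names `parse_ts`, `max_disorder_seconds` and `sort_by_timestamp` are hypothetical.

```python
from datetime import datetime
from typing import Iterable, List, Optional


def parse_ts(line: str) -> datetime:
    """Parse the leading timestamp; assumes 'YYYY-MM-DDTHH:MM:SS message...'."""
    return datetime.fromisoformat(line.split(" ", 1)[0])


def max_disorder_seconds(lines: Iterable[str]) -> float:
    """Return the largest lag, in seconds, of any event behind the newest event seen before it."""
    worst = 0.0
    newest: Optional[datetime] = None
    for line in lines:
        ts = parse_ts(line)
        if newest is not None and ts < newest:
            worst = max(worst, (newest - ts).total_seconds())
        if newest is None or ts > newest:
            newest = ts
    return worst


def sort_by_timestamp(lines: Iterable[str]) -> List[str]:
    """Return the lines ordered by timestamp; ties keep their original order."""
    return sorted(lines, key=parse_ts)


events = [
    "2019-06-17T10:00:00 started",
    "2019-06-17T10:00:05 request served",
    "2019-06-17T08:00:00 late arrival",
]
print(max_disorder_seconds(events))                     # 7205.0
print(max_disorder_seconds(sort_by_timestamp(events)))  # 0.0
```

A result of a few seconds is harmless. A result measured in hours or days suggests sorting the file, or splitting it into separately tagged streams, before shipping.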
Example: Using Filebeat to Ship Historical Events
As an example, let's say you have one file of 10 GB of logs for each day
in the last month. You want to send all of them in parallel into
LogScale, and there is already a stream of live events flowing. In this
case you should run one instance of the desired shipper (in this case,
Filebeat) for each file. Each shipper needs a configuration file that
sets a distinct tag. Let's use the filename being backfilled as the tag
value. For Filebeat this can be accomplished by making the
@source field that is set by Filebeat a tag in the
parser in LogScale. Or better yet, you can add or extend the
fields section in the config:
filebeat:
  inputs:
    - paths:
        - /var/log/need-backfilling/myapp.2019-06-17.log
      # the section that adds the backfill tag:
      fields:
        "@humioBackfill": "2019-06-17"
        "@tags": ["@type", "@humioBackfill"]

queue.mem:
  events: 8000
  flush.min_events: 200
  flush.timeout: 1s

output:
  elasticsearch:
    hosts: ["https://$HUMIO_HOST:443/api/v1/ingest/elastic-bulk"]
    username: $SENDER
    password: $INGEST_TOKEN
    compression_level: 5
    bulk_max_size: 200
    worker: 4
# Don't raise bulk_max_size much: 100 - 300 is the appropriate range.
# While doing so may increase throughput of ingest it has a negative impact on search performance of the resulting events in LogScale.
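With one file per day for a month, writing each configuration by hand is tedious. The sketch below generates one instance directory and one filebeat.yml per log file. It assumes the file name is used as the tag value, as suggested above (the sample configuration uses only the date part). The names `render_config` and `write_instance_dirs` and the directory layout are hypothetical, and `$HUMIO_HOST`, `$SENDER` and `$INGEST_TOKEN` remain placeholders that you must fill in.

```python
from pathlib import Path
from typing import Iterable, List

# The template mirrors the sample configuration; only the log path and tag vary.
CONFIG_TEMPLATE = """\
filebeat:
  inputs:
    - paths:
        - {log_path}
      fields:
        "@humioBackfill": "{tag}"
        "@tags": ["@type", "@humioBackfill"]

queue.mem:
  events: 8000
  flush.min_events: 200
  flush.timeout: 1s

output:
  elasticsearch:
    hosts: ["https://$HUMIO_HOST:443/api/v1/ingest/elastic-bulk"]
    username: $SENDER
    password: $INGEST_TOKEN
    compression_level: 5
    bulk_max_size: 200
    worker: 4
"""


def render_config(log_path: str) -> str:
    """Return a filebeat.yml body that tags the stream with the log file's name."""
    return CONFIG_TEMPLATE.format(log_path=log_path, tag=Path(log_path).name)


def write_instance_dirs(log_paths: Iterable[str], base_dir: str) -> List[Path]:
    """Create one directory per log file, each holding its own filebeat.yml."""
    created = []
    for log_path in log_paths:
        instance_dir = Path(base_dir) / Path(log_path).name
        instance_dir.mkdir(parents=True, exist_ok=True)
        (instance_dir / "filebeat.yml").write_text(render_config(log_path))
        created.append(instance_dir)
    return created
```

Because each directory is named after its log file, two files never share a tag or a Filebeat data directory.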
Filebeat needs a fresh directory for each instance and a separate configuration file. For backfilling purposes, Filebeat should not run as a daemon but in the run-once mode. Here is an example command line to launch Filebeat in this mode. It assumes that you have placed the (distinct) filebeat.yml config file in the config directory named below.
# Removing the registry will make filebeat ship from scratch again.
rm -rf /path/to/filebeat-instance-dir/registry
filebeat -e --once \
  --path.config /path/to/filebeat-instance-dir/ \
  --path.data /path/to/filebeat-instance-dir/ \
  --path.home /path/to/filebeat-instance-dir/ \
  --path.logs /path/to/filebeat-instance-dir/logs
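To run several instances in parallel, the same command can be launched once per instance directory. The following is a minimal sketch, assuming `filebeat` is on the PATH and each directory already contains its own filebeat.yml. The helper names `filebeat_command` and `run_backfill` are hypothetical; the flags are the ones used in the command above.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Iterable, List


def filebeat_command(instance_dir: str) -> List[str]:
    """Build the run-once command line for one instance directory."""
    d = str(instance_dir)
    return [
        "filebeat", "-e", "--once",
        "--path.config", d,
        "--path.data", d,
        "--path.home", d,
        "--path.logs", str(Path(d) / "logs"),
    ]


def run_backfill(instance_dirs: Iterable[str]) -> List[int]:
    """Start one Filebeat per directory, wait for all of them, and return their exit codes."""
    procs = []
    for d in instance_dirs:
        # Removing the registry makes Filebeat ship the file from scratch again.
        shutil.rmtree(Path(d) / "registry", ignore_errors=True)
        procs.append(subprocess.Popen(filebeat_command(d)))
    return [p.wait() for p in procs]
```

Calling `run_backfill(["/path/to/instance-a", "/path/to/instance-b"])` would start both shippers at once. A non-zero exit code in the returned list indicates an instance that needs to be inspected and possibly rerun.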