Backfilling Data
There are two ways to handle the ingestion of non-current data in
LogScale: the humioBackfill tag and manual backfill of historical
events. Normally, when log shippers send events from their sources, the
events arrive shortly after being produced and roughly in order. When
backfilling old data, however, old and new events are mixed, which
breaks the sequential order. This leads to data segments that span very
broad time intervals, which makes querying less efficient because
unnecessary data is scanned.
Using humioBackfill
To help avoid this problem, LogScale automatically adds the
humioBackfill tag at ingest to all events that are older than 24 hours
(MAX_HOURS_SEGMENT_OPEN). With humioBackfill, only the backfilled data
is affected. Also, if the backfill tag is set, segments are not closed
early, even if their time span is larger than 24 hours.
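The tagging rule can be illustrated with a short Python sketch. It is not LogScale's implementation, only a model of the rule as described above. The function name would_get_backfill_tag and the max_hours parameter are hypothetical, and the default of 24 simply mirrors the threshold mentioned in the text.

```python
from datetime import datetime, timedelta, timezone


def would_get_backfill_tag(event_time: datetime, ingest_time: datetime,
                           max_hours: int = 24) -> bool:
    """Model of the rule in the text: an event older than max_hours
    at ingest time would receive the humioBackfill tag."""
    return ingest_time - event_time > timedelta(hours=max_hours)


now = datetime(2019, 7, 17, 12, 0, tzinfo=timezone.utc)
print(would_get_backfill_tag(now - timedelta(minutes=5), now))  # a live event
print(would_get_backfill_tag(now - timedelta(days=30), now))    # a month-old event
```

Under this model, the live event is left untagged and the month-old event is tagged.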
Sending Historical Data (Backfilling Events)
Manual backfill applies when, for example, you have files on disk with events from the last month and want them sent to LogScale along with the "live" events.
For optimal performance of the resulting data inside LogScale, and so that the backfilled data does not interfere with the live events already flowing, observe the following rules:
Ship the historical events in order of their timestamps, or as close to that order as possible. Deviations of a few seconds have little consequence, whereas deviations of hours or days are suboptimal.
If you ship data in parallel (for example, by running multiple Filebeat instances), make those streams visible to LogScale by using a distinct tag for each stream, so that they do not overlap the other historical streams.
If these guidelines are not followed, the likely result is an increase in the number of segment files and much higher I/O usage when searching time spans that overlap the historical events or the live events that were ingested while the backfill was active. The segment files are likely to become large and have overlapping time spans, so a large number of files must be searched even for a short time interval.
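Before shipping a file, it may help to check how far its events stray from timestamp order. The following Python sketch reads the lines once and reports the largest backward jump in time. It assumes that every line starts with an ISO-8601 timestamp followed by a space, which is an assumption about the log format and not something LogScale requires. The helper names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def parse_timestamp(line: str) -> datetime:
    """Parse the leading ISO-8601 timestamp of a log line (assumed format)."""
    return datetime.fromisoformat(line.split(" ", 1)[0])


def max_backward_jump(lines: Iterable[str]) -> timedelta:
    """Return the largest amount by which any event is older than
    the newest event seen before it. Zero means perfectly ordered."""
    newest = None
    worst = timedelta(0)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        ts = parse_timestamp(line)
        if newest is None or ts > newest:
            newest = ts
        else:
            worst = max(worst, newest - ts)
    return worst


def sort_events(lines: Iterable[str]) -> List[str]:
    """Sort events by timestamp in memory; only suitable for small files."""
    return sorted((l for l in lines if l.strip()), key=parse_timestamp)


sample = [
    "2019-06-17T10:00:00 started",
    "2019-06-17T10:00:05 request served",
    "2019-06-17T10:00:03 request received",
    "2019-06-17T08:00:05 late arrival",
]
print(max_backward_jump(sample))
```

A result of a few seconds is harmless according to the rules above, while a result of hours or days suggests sorting the file first. Because sort_events holds everything in memory, a file of many gigabytes would need an external sort instead.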
Example: Using Filebeat to Ship Historical Events
As an example, let's say you have one file of 10 GB of logs for each day
in the last month. You want to send all of them in parallel into
LogScale, and there is already a stream of live events flowing. In this
case, you should run one instance of the desired shipper (here,
Filebeat) for each file. Each shipper needs a configuration file that
sets a distinct tag. Let's derive the tag value from the name of the
file being backfilled. For Filebeat, this can be accomplished by making
the @source field that is set by Filebeat a tag in the parser in
LogScale. Or, better yet, you can add or extend the fields section in
the config:
filebeat:
  inputs:
    - paths:
        - /var/log/need-backfilling/myapp.2019-06-17.log
      # the section that adds the backfill tag:
      fields:
        "@humioBackfill": "2019-06-17"
        "@tags": ["@type", "@humioBackfill"]
queue.mem:
  events: 8000
  flush.min_events: 200
  flush.timeout: 1s
output:
  elasticsearch:
    hosts: ["https://$HUMIO_HOST:443/api/v1/ingest/elastic-bulk"]
    username: $SENDER
    password: $INGEST_TOKEN
    compression_level: 5
    bulk_max_size: 200
    worker: 4
    # Don't raise bulk_max_size much: 100 - 300 is the appropriate range.
    # While doing so may increase throughput of ingest, it has a negative impact on search performance of the resulting events in LogScale.
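With one file per day for a month, writing each configuration by hand is tedious, so a small generator may help. The Python sketch below creates one instance directory per log file and writes a filebeat.yml into it, based on the configuration above. The function names, the directory layout, and the choice of the file name as the tag value are all assumptions for illustration. $HUMIO_HOST, $SENDER, and $INGEST_TOKEN are the placeholders from the example and are left untouched.

```python
from pathlib import Path

# Template mirroring the example configuration. The $-prefixed values are
# the placeholders from that example and must still be filled in.
CONFIG_TEMPLATE = """\
filebeat:
  inputs:
    - paths:
        - {log_path}
      fields:
        "@humioBackfill": "{tag}"
        "@tags": ["@type", "@humioBackfill"]
queue.mem:
  events: 8000
  flush.min_events: 200
  flush.timeout: 1s
output:
  elasticsearch:
    hosts: ["https://$HUMIO_HOST:443/api/v1/ingest/elastic-bulk"]
    username: $SENDER
    password: $INGEST_TOKEN
    compression_level: 5
    bulk_max_size: 200
    worker: 4
"""


def render_config(log_path: Path, tag: str) -> str:
    """Fill in the log file path and the distinct backfill tag."""
    return CONFIG_TEMPLATE.format(log_path=str(log_path), tag=tag)


def write_instance_dirs(log_files, base_dir: Path):
    """Create one directory with its own filebeat.yml per log file.

    Assumes the log file names are unique, because the name is used both
    as the directory name and as the tag value (a hypothetical choice).
    """
    instance_dirs = []
    for log_file in sorted(Path(p) for p in log_files):
        instance_dir = Path(base_dir) / log_file.name
        instance_dir.mkdir(parents=True, exist_ok=True)
        config = render_config(log_file, tag=log_file.name)
        (instance_dir / "filebeat.yml").write_text(config)
        instance_dirs.append(instance_dir)
    return instance_dirs
```

Each resulting directory can then serve as the fresh, per-instance directory that Filebeat needs.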
Filebeat needs a fresh directory and a separate configuration file for
each instance. For backfilling purposes, Filebeat should not run as a
daemon but in run-once mode. Below is an example command line to launch
Filebeat in run-once mode. It assumes that you have placed the
(distinct) filebeat.yml config file in the config directory named below.
# Removing the registry will make filebeat ship from scratch again.
rm -rf /path/to/filebeat-instance-dir/registry
filebeat -e --once \
  --path.config /path/to/filebeat-instance-dir/ \
  --path.data /path/to/filebeat-instance-dir/ \
  --path.home /path/to/filebeat-instance-dir/ \
  --path.logs /path/to/filebeat-instance-dir/logs
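To run one such instance per file in parallel, the command line above can be wrapped in a small launcher. The following Python sketch builds the same command for a given instance directory, removes the registry first as in the example, starts all instances, and waits for them to finish. The function names are hypothetical, and the sketch assumes that filebeat is on the PATH and that each directory already contains its own filebeat.yml.

```python
import shutil
import subprocess
from pathlib import Path


def filebeat_command(instance_dir: Path) -> list:
    """Build the run-once command line for one instance directory."""
    d = str(instance_dir)
    return [
        "filebeat", "-e", "--once",
        "--path.config", d,
        "--path.data", d,
        "--path.home", d,
        "--path.logs", str(Path(instance_dir) / "logs"),
    ]


def run_backfill(instance_dirs) -> dict:
    """Start every instance in parallel, then wait for all of them.

    Returns the exit code of each instance, keyed by its directory.
    """
    processes = {}
    for instance_dir in instance_dirs:
        # Removing the registry makes Filebeat ship from scratch again.
        shutil.rmtree(Path(instance_dir) / "registry", ignore_errors=True)
        command = filebeat_command(Path(instance_dir))
        processes[str(instance_dir)] = subprocess.Popen(command)
    return {name: proc.wait() for name, proc in processes.items()}
```

A non-zero exit code in the returned mapping indicates an instance whose output should be inspected before that file is considered backfilled.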