Backfilling Data

There are two ways to handle the ingestion of non-current data in LogScale: the humioBackfill tag, and manual backfill of historical events. Normally, when log shippers send events from their sources, the events arrive shortly after being produced and roughly in order. When backfilling old data, however, old and new events are mixed, which breaks this sequential order. This leads to data segments that span very broad time intervals, which makes querying less efficient, because data outside the queried time range has to be scanned.

Using humioBackfill

To help avoid this problem, LogScale automatically adds the humioBackfill tag at ingest to all events that are older than 24 hours (MAX_HOURS_SEGMENT_OPEN). Because the tag separates these events from the live ones, only the backfilled data is affected. In addition, when the backfill tag is set, segments are not closed early, even if they span more than 24 hours.
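
The age rule above can be illustrated with a small shell sketch. It is not LogScale's implementation, only a restatement of the "older than 24 hours" condition; the function name is_backfill_candidate and its arguments are hypothetical, and the 24-hour default simply mirrors the figure quoted above.

```shell
#!/bin/sh
# Illustration only: restates the "older than 24 hours" rule from the text.
# is_backfill_candidate EVENT_EPOCH NOW_EPOCH [MAX_HOURS]
# Succeeds (exit status 0) when the event is older than MAX_HOURS
# (default 24, an assumption mirroring the value quoted in the text).
is_backfill_candidate() {
  ev_epoch=$1
  now_epoch=$2
  max_hours=${3:-24}
  # Compare the event's age in seconds with the threshold in seconds.
  [ $(( now_epoch - ev_epoch )) -gt $(( max_hours * 3600 )) ]
}
```

For instance, is_backfill_candidate "$event_ts" "$(date +%s)" succeeds for an event whose epoch timestamp is more than 86400 seconds in the past.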

Sending Historical Data (Backfilling Events)

For example, you may have files on disk with events from the last month and want to send them to LogScale alongside the "live" events.

To get good performance from the resulting data inside LogScale, and to keep the backfilled data from interfering with the live events already flowing, observe the following rules.

  • Ship the historical events in timestamp order, or as close to that order as possible. Being off by a few seconds has little consequence, whereas being off by hours or days is sub-optimal.

  • If you ship data in parallel (for example, by running multiple Filebeat instances), make the streams distinguishable to LogScale by using a distinct tag for each stream, so that they do not overlap the other historical streams.

If these guidelines are not followed, the result is likely to be an increase in the number of segment files and much higher IO usage when searching time spans that overlap the historical events or the live events that were ingested while the backfill was active. The segment files are likely to get large and have overlapping time spans, so a large number of files is searched even when the search covers only a short time interval.
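
The first rule can be checked and enforced before shipping. The following is a minimal sketch, assuming every event is a single line that begins with an ISO-8601 timestamp in one consistent format and time zone, so that byte-wise order equals time order. Multi-line events (such as stack traces) would need different handling. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: order a log file by its leading timestamp before backfilling.
# Assumption: one event per line, each starting with an ISO-8601 timestamp
# in a single format and time zone.

# Succeeds when the file is already ordered by its first field.
is_time_ordered() {
  LC_ALL=C sort -c -k1,1 "$1" 2>/dev/null
}

# Prints the file ordered by its first field; -s keeps the original
# order of events that share the same timestamp.
sort_by_timestamp() {
  LC_ALL=C sort -s -k1,1 "$1"
}
```

A file that fails is_time_ordered could be rewritten with sort_by_timestamp myapp.log > myapp.sorted.log and the sorted copy shipped instead.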

Example: Using Filebeat to Ship Historical Events

As an example, let's say you have one file of 10 GB of logs for each day in the last month. You want to send all of them in parallel into LogScale, and there is already a stream of live events flowing. In this case, you should run one instance of the desired shipper (here, Filebeat) for each file. Each shipper needs a configuration file that sets a distinct tag. Let's use the filename being backfilled as the tag value. For Filebeat, this can be accomplished by turning the @source field, which Filebeat sets, into a tag in the LogScale parser. Better yet, you can add or extend the fields section in the configuration:

yaml
filebeat:
  inputs:
    - paths:
        - /var/log/need-backfilling/myapp.2019-06-17.log
# the section that adds the backfill tag:
      fields:
        "@humioBackfill": "2019-06-17"
        "@tags": ["@type", "@humioBackfill"]
queue.mem:
  events: 8000
  flush.min_events: 200
  flush.timeout: 1s
output:
  elasticsearch:
    hosts: ["https://$HUMIO_HOST:443/api/v1/ingest/elastic-bulk"]
    username: $SENDER
    password: $INGEST_TOKEN
    compression_level: 5
    bulk_max_size: 200
    worker: 4
# Don't raise bulk_max_size much: 100 - 300 is the appropriate range.
# Raising it may increase ingest throughput, but it has a negative impact on the search performance of the resulting events in LogScale.

Filebeat needs a fresh directory for each instance and a separate configuration file. For backfilling purposes, Filebeat should not run as a daemon but in the run-once mode. Below is an example command line to launch Filebeat in run-once mode. It assumes that you have placed the (distinct) filebeat.yml config file in the config directory named below.

shell
# Removing the registry will make filebeat ship from scratch again.
rm -rf /path/to/filebeat-instance-dir/registry
filebeat -e --once \
   --path.config /path/to/filebeat-instance-dir/  \
   --path.data   /path/to/filebeat-instance-dir/  \
   --path.home   /path/to/filebeat-instance-dir/  \
   --path.logs   /path/to/filebeat-instance-dir/logs
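
With one file per day, preparing the per-instance directories by hand is tedious. The sketch below shows one way this might be scripted. It rests on several assumptions: a hypothetical template file, based on the filebeat.yml above, in which the log path and tag value have been replaced by the placeholders __LOGFILE__ and __TAG__ (these are not Filebeat features); file names of the form myapp.2019-06-17.log; and paths that contain neither | nor &. All function names are hypothetical.

```shell
#!/bin/sh
# Sketch: one Filebeat instance directory and config per historical file.

# Derive the tag value from a name such as myapp.2019-06-17.log
# by dropping the .log suffix and everything up to the last dot.
backfill_tag_for() {
  basename "$1" .log | sed 's/^.*\.//'
}

# prepare_instance LOGFILE TEMPLATE BASE_DIR
# Creates BASE_DIR/<tag>/ with a filebeat.yml rendered from TEMPLATE
# (hypothetical placeholders __LOGFILE__ and __TAG__), then prints the directory.
prepare_instance() {
  p_log=$1
  p_tpl=$2
  p_tag=$(backfill_tag_for "$p_log")
  p_dir="$3/$p_tag"
  mkdir -p "$p_dir/logs" || return 1
  sed -e "s|__LOGFILE__|$p_log|" -e "s|__TAG__|$p_tag|" "$p_tpl" \
    > "$p_dir/filebeat.yml" || return 1
  printf '%s\n' "$p_dir"
}

# launch_backfill TEMPLATE BASE_DIR LOGFILE...
# Starts one run-once Filebeat per file in the background and waits for all.
launch_backfill() {
  l_tpl=$1
  l_base=$2
  shift 2
  for l_log in "$@"; do
    l_dir=$(prepare_instance "$l_log" "$l_tpl" "$l_base") || return 1
    filebeat -e --once \
      --path.config "$l_dir" --path.data "$l_dir" \
      --path.home "$l_dir" --path.logs "$l_dir/logs" &
  done
  wait
}
```

It could then be invoked as launch_backfill ./filebeat.template.yml /path/to/instances /var/log/need-backfilling/myapp.*.log. Each instance gets its own directory and its own tag value, which matches the guideline of keeping parallel streams distinct.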