File Source

Overview

The File Source enables the Falcon LogScale Collector to ingest log data from files on the local filesystem. This source type monitors specified files for new content and automatically ingests log entries into LogScale. The collector maintains position tracking to ensure reliable, resumable data collection without duplicating events.

The File Source is ideal for collecting file-based logs, such as web server logs, application logs, and system logs.

How it works

The File Source operates by monitoring files for changes and reading new content as it's written. Key features include:

  • Position Tracking: The Collector maintains checkpoints of file positions, allowing it to resume reading from where it left off after restarts or interruptions

  • File Rotation Support: Automatically handles log rotation, detecting when files are rotated and continuing to read from new files

  • Multiline Support: Can combine multiple consecutive lines into single events based on configurable patterns

  • Glob Pattern Matching: Supports wildcards and patterns to monitor multiple files simultaneously

File Discovery and Scanning

The File Source treats directories and files differently:

  • Directories: Scanned every 10 seconds for new directories

  • Files: Scanned every 100 milliseconds within discovered directories

  • Linux Systems: Use inotify for event-driven file change detection

  • Other Platforms: Use polling to detect file changes

File Processing

When a new file is discovered, it goes through this sequence:

Scan → Fingerprint → Identify → Track → Read → Process

Each stage is described in more detail below:

  • Fingerprinting: Each file undergoes content-based checksumming (256 bytes to 4KB) to create a unique identity. For compressed files (gzip/bzip2), the header is decompressed and the inner content is fingerprinted. This identity persists across file moves, renames, and rotations, preventing duplicate processing.

  • File Identity: Each new file with an unseen fingerprint receives a unique UUID. Files with identical content are detected as duplicates, with lexicographically smaller paths taking priority.

  • Reading Strategy: The Collector first reads up to 4KB to quickly determine the file's format and identity. It then reads the remaining unread data in chunks of up to MaxBatchSize until the queue is full.

  • File Updates: Modifications are detected within 100ms. New data is read incrementally and the tracked offset is updated. If a file's size decreases, the change is ignored. If the file is later updated with new content, the update triggers a new fingerprint, which creates a new file identity and resets the tracked offset.

  • Inactivity Monitoring: Files are monitored for inactivity (configurable timeout, default 60 seconds). On timeout, if the file hasn't grown, handles close and the file enters passive monitoring until modified.

  • Acknowledgement: File offset tracking compares bytes read vs. successfully delivered. For memory queues: ACK on endpoint 200 OK response. For disk queues: ACK when written to disk. Checkpoints persist across collector restarts.
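
The fingerprint size described above corresponds to the fingerprintBytes option in the parameter table. As a minimal sketch, assuming a hypothetical source name and path, the option can be set explicitly. The value shown is the documented default, so this fragment only makes the setting visible and is not expected to change behavior:

yaml
sources:
  fingerprint_example:             # hypothetical source name
    type: file
    include:
      - "/var/log/myapp/app.log"   # hypothetical path
    # Bytes read from each file to build its fingerprint (documented default)
    fingerprintBytes: 4096
    sink: logscale_sink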

Prerequisites

Before configuring the File Source, ensure that you have:

  • Read permissions on the files and directories to be monitored

  • A configured sink (destination) for the collected events

Configuration

Prerequisites

First, define a sink that will receive the collected events:

yaml
sinks:
  logscale_sink:
    type: logscale
    url: "https://cloud.humio.com/"
    token: "${LOGSCALE_TOKEN}"
    queue:
      type: memory
      maxLimitInMB: 64

Example 1: Basic Configuration

Here's a simple configuration for collecting Apache access logs:

yaml
sources:
  apache_access_logs:
    type: file
    include:
      - "/var/log/apache2/access.log"
    sink: logscale_sink

Example 2: Advanced Configuration with Multiple Features

yaml
sources:
  application_logs:
    type: file
    
    # Include multiple log files
    include:
      - "/var/log/myapp/*.log"
      - "/var/log/myapp/archive/app-*.log"
    
    # Exclude specific files or patterns
    exclude:
      - "/var/log/myapp/*.log.1"
      - "/var/log/myapp/debug.log"
    
    # Exclude files with specific extensions
    excludeExtensions:
      - "gz"
      - "zip"
    
    # Configure multiline parsing for stack traces
    multiLineBeginsWith: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
    
    # Specify file encoding
    encoding: UTF-8
    
    sink: logscale_sink

Example 3: Multiline Java Stack Traces

yaml
sources:
  java_logs:
    type: file
    include:
      - "/var/log/java-app/application.log"
    
    # Match lines starting with timestamp as new events
    multiLineBeginsWith: '^20\d{2}-\d{2}-\d{2}'
    
    sink: logscale_sink

Example 4: Continuation Pattern

yaml
sources:
  continued_logs:
    type: file
    include:
      - "/var/log/app/messages.log"
    
    # Lines starting with whitespace continue the previous event
    multiLineContinuesWith: '^\s+'
    
    sink: logscale_sink

Example 5: Complete Configuration

yaml
sinks:
  logscale_sink:
    type: logscale
    url: "https://cloud.humio.com/"
    token: "${LOGSCALE_TOKEN}"
    queue:
      type: memory
      maxLimitInMB: 64

sources:
  # Apache access logs
  apache_logs:
    type: file
    include:
      - "/var/log/apache2/access.log"
    parser: "apache_combined"
    transforms:
      - type: static_fields
        fields:
          log_type: "apache_access"
          server: "${HOSTNAME}"
    sink: logscale_sink
  
  # Application logs with multiline support
  app_logs:
    type: file
    include:
      - "/var/log/myapp/*.log"
    exclude:
      - "/var/log/myapp/debug.log"
    multiLineBeginsWith: '^20\d{2}-\d{2}-\d{2}'
    encoding: UTF-8
    transforms:
      - type: static_fields
        fields:
          application: "myapp"
          environment: "${ENV}"
    sink: logscale_sink

Multiline Event Handling

The File Source provides three approaches for handling multiline events:

1. Begin Pattern (multiLineBeginsWith)

Identifies lines that start a new event. Lines not matching the pattern are appended to the previous event.

Example: Java logs with timestamps

yaml
multiLineBeginsWith: '^20\d{2}-\d{2}-\d{2}'

2. Continuation Pattern (multiLineContinuesWith)

Identifies lines that continue the previous event. Lines matching the pattern are appended; lines not matching start a new event.

Example: Lines starting with whitespace

yaml
multiLineContinuesWith: '^\s+'

3. End Pattern (multiLineEndsWith)

Identifies lines that end an event. Lines matching the pattern complete the current event; subsequent lines start a new event.

Example: Lines in which a token is followed by ' F' end the event

yaml
multiLineEndsWith: '\S+ F'
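
For symmetry with Examples 3 and 4, an end-pattern source might look like the sketch below. The source name and path are hypothetical, and the pattern is the one shown in the parameter table for multiLineEndsWith. It assumes each line starts with two whitespace-separated tokens followed by an F marker on the final line of an event:

yaml
sources:
  end_marked_logs:                  # hypothetical source name
    type: file
    include:
      - "/var/log/app/partial.log"  # hypothetical path
    # A line matching this pattern completes the current event
    multiLineEndsWith: '^\S+ \S+ F'
    sink: logscale_sink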

File Rotation Support

The Falcon LogScale Collector strives to support all kinds of file rotation.

  • The Collector fingerprints files larger than 256 bytes and increases the fingerprint block size up to 4096 bytes, as applicable.

  • The Collector supports rotation using the following methods:

    • rename

    • compression

    • truncation

    Files rotated by rename or compression are detected as duplicates. Compressed files are considered static. Renamed files keep their fingerprints, and further updates to them are supported. When a file is truncated, the read offset is set to the new size, which may be 0 or non-zero.

    If a file is truncated and then quickly updated, the read offset depends on the time between the write and the processing of the event.
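
As a sketch of how rename and compression rotation might be covered in practice, a single include pattern can match both the active file and its rotated copies. The source name and paths are hypothetical, and the sketch assumes the rotated files keep the app.log prefix (for example app.log.1 or app.log.2.gz). Based on the behavior described above, rotated copies that were already read should be recognized by fingerprint rather than ingested again:

yaml
sources:
  rotated_logs:                    # hypothetical source name
    type: file
    include:
      # Matches app.log plus rotated copies such as app.log.1 and app.log.2.gz
      - "/var/log/myapp/app.log*"
    sink: logscale_sink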

Read Compressed Files

The Falcon LogScale Collector supports reading gzip and bzip2 compressed files.

If gzip or bzip2 compressed files are matched by the configured include patterns, they are automatically detected (using the magic number at the beginning of the file), decompressed, and ingested.

By default, files with the following extensions will be ignored/skipped even if they match a configured include pattern:

  • .xz

  • .tgz

  • .z

  • .zip

  • .7z

File extensions to ignore/skip can be configured with the excludeExtensions config option. The default is:

yaml
excludeExtensions: ["xz", "tgz", "z", "zip", "7z"]

To override the default, set excludeExtensions to an empty array. Files with these extensions are then ingested, but they are not decompressed first. For example:

yaml
excludeExtensions: []

This effectively sends such files in their compressed format.

To exclude gzip and bzip2 files in addition to the default extensions, use the following setting (provided the compressed files are named *.gz and *.bz2):

yaml
excludeExtensions: ["xz", "tgz", "z", "zip", "7z", "gz", "bz2"]

Best Practices

File Selection

  • Use specific paths when possible to minimize overhead

  • Leverage exclude and excludeExtensions to filter unwanted files

  • Be cautious with broad glob patterns that might match many files

Performance Optimization

  • The queue size is typically smaller (64 MB) for file sources since data persists on disk

  • Adjust inactivityTimeout based on your log writing patterns

  • Monitor collector resource usage when tracking many files
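
For example, a source whose files receive writes only in occasional bursts might use a longer inactivityTimeout so that file handles are not closed and reopened between bursts. This is a sketch only: the source name, path, and the value of 300 seconds are illustrative assumptions, not recommendations:

yaml
sources:
  bursty_logs:                     # hypothetical source name
    type: file
    include:
      - "/var/log/batch/job.log"   # hypothetical path
    # Seconds of inactivity before the file handle is closed (default is 60)
    inactivityTimeout: 300
    sink: logscale_sink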

Multiline Configuration

  • Choose the appropriate multiline strategy for your log format

  • Test regular expressions thoroughly to avoid incorrect event boundaries

  • Consider performance impact of complex regex patterns

Security

  • Ensure the collector has appropriate read permissions

  • Use environment variables for sensitive configuration values

  • Regularly rotate and archive old log files

Monitoring and Troubleshooting

Monitoring Collector Status

  • Check collector logs for file monitoring status

  • Monitor checkpoint advancement to ensure progress

  • Track lag between file writes and LogScale ingestion

  • Set up alerts for collection failures or stalls

Common Issues and Solutions

Issue | Symptom | Potential Causes and Solutions
No events collected | Files exist but no data appears in LogScale | Verify file permissions, check include/exclude patterns, review collector logs
Duplicate events | Same events appear multiple times | Check for multiple collectors monitoring the same files, verify checkpoint storage
Missing events | Some log entries don't appear | Review multiline configuration, check for file rotation issues
High resource usage | Collector consumes excessive CPU/memory | Reduce number of monitored files, optimize glob patterns, adjust polling intervals
Empty @rawstring | Ingested events have an empty @rawstring field | .CSV files used as input exceed 16 MB in size, and use CRLF as newline delimiter
Unexpected events | Unexpected events are ingested when .CSV files are used | The File Source is designed for plain text log files where new lines are appended incrementally at the end of the file. CSV files may not behave as expected because they are often updated in the middle rather than appended to.

Configuration Parameters

Table: File source

Parameter | Type | Required | Default Value | Description
encoding | fileencoding | optional[a] | | Specifies the character encoding used for source files, ensuring correct text interpretation. (added in 1.10) Values: UTF-16BE, UTF-16LE, UTF-8
exclude | array of strings | optional[a] | | Specify the file paths or patterns to exclude when collecting logs. Some file extensions are automatically ignored even if they match an include pattern: xz, tgz, z, zip, 7z. Note: to include these files, set excludeExtensions to an empty array. This will have the side effect that files will not be decompressed before reading.
excludeExtensions | array of strings | optional[a] | ['xz', 'tgz', 'z', 'zip', '7z'] | Specify the file extensions to exclude when collecting data. Some file extensions are automatically ignored even if they match an include pattern: xz, tgz, z, zip, 7z. Note: to include these files, set excludeExtensions to an empty array. This will have the side effect that files will not be decompressed before ingest.
fingerprintBytes | integer | optional[a] | 4096 | Specifies the number of bytes to read from a source file to create a fingerprint for identification. (added in 1.10)
inactivityTimeout | integer | optional[a] | 60 | Specify the period of inactivity in seconds for a file to be monitored before its file descriptor is closed to free system resources. When the file changes, it is reopened and the timeout restarts.
include | array of strings | required | | Specify the file paths to include when collecting data. This field supports environment variable expansions. To use an environment variable, reference it using the syntax ${VAR}, where VAR is the name of the variable. The {}-braces may be omitted; however, in that case the variable name can only contain: [a-z], [A-Z], [0-9] and "_".
multiLineBeginsWith | string | optional[a] | |

The file input can concatenate consecutive lines to create multiline events. When this regular expression is configured, the collector uses it to find the beginning of each new multiline event.

Example: For multiline events that begin with a date starting with a year such as 2022, use:

yaml
multiLineBeginsWith: ^20\d{2}-

In this case, every line that doesn't match the pattern gets appended to the latest line that did. multiLineBeginsWith does not look for a continuation pattern that continues a multiline event.

Note: This option is mutually exclusive with multiLineEndsWith. Only one of these options can be configured for a file source at a time.

multiLineContinuesWith | string | optional[a] | |

The file input can concatenate consecutive lines to create multiline events. When this regular expression is configured, the collector uses it to find lines that continue a multiline event. Lines starting with whitespace are often continuations of the previous line. For example, to concatenate lines starting with whitespace (instead of starting at column 0):

yaml
multiLineContinuesWith: ^\s+

In this case, every line that matches the pattern gets appended to the previous line that didn't. multiLineContinuesWith does not look for a beginning pattern that begins a multiline event.

multiLineEndsWith | string | optional[a] | |

The file input can concatenate consecutive lines to create multiline events. When this regular expression is configured, the collector uses it to find the end of each multiline event.

Example: For all multiline events ending with ' F ':

yaml
multiLineEndsWith: ^\S+ \S+ F

This will create a single multiline event for the following case:

yaml
<timestamp> the first part of the message
 <timestamp> the second part of the message
 <timestamp> the rest of the message

While this will create a single line event:

yaml
<timestamp> this is the entire message

Every line that doesn't match the pattern is combined with the lines that follow it, up to and including the next line that does match.

Note: This option is mutually exclusive with multiLineBeginsWith. Only one of these options can be configured for a file source at a time.

parser | string | optional[a] | | Specify the parser name in LogScale to use for parsing the logs. If you installed LogScale through a package manager, you can specify the type of logs to be displayed on the search page, for example linux/system-logs:linux/system-logs. If a parser is assigned to the ingest token being used, that parser will be ignored.
sink | string | required | | Name of the configured sink that will receive the collected events.
transforms | transform | optional[a] | | Specify transforms to use for this source. See All Sources: How to Use Transforms for information on how to use transforms.
type | file | required | | The sources block defines the sources of data that the collector sends to Falcon LogScale. For a File Source, set type to file.

[a] Optional parameters use their default value unless explicitly set.
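
To illustrate two of the parameters above together, the sketch below uses environment variable expansion in include and sets encoding explicitly. LOG_DIR is a hypothetical environment variable, and the source name and file pattern are likewise assumptions made for illustration:

yaml
sources:
  utf16_logs:                # hypothetical source name
    type: file
    include:
      # ${LOG_DIR} is expanded from the collector's environment
      - "${LOG_DIR}/*.log"
    # One of the documented values: UTF-16BE, UTF-16LE, UTF-8
    encoding: UTF-16LE
    sink: logscale_sink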