File Source

Overview

The File Source enables the Falcon LogScale Collector to ingest log data from files on the local filesystem. This source type monitors specified files for new content and automatically ingests log entries into LogScale. The collector maintains position tracking to ensure reliable, resumable data collection without duplicating events.

The File Source is ideal for collecting logs from applications that write to local files, such as web servers, application logs, system logs, and other file-based logging systems.

How it works

The File Source operates by monitoring files for changes and reading new content as it's written. Key features include:

Position Tracking: The Collector maintains checkpoints of file positions, allowing it to resume reading from where it left off after restarts or interruptions
File Rotation Support: Automatically handles log rotation, detecting when files are rotated and continuing to read from new files
Multiline Support: Can combine multiple consecutive lines into single events based on configurable patterns
Glob Pattern Matching: Supports wildcards and patterns to monitor multiple files simultaneously
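
The position-tracking idea can be pictured with a short Python sketch. It is an illustration only, not the Collector's implementation: the function name read_new_content and the in-memory checkpoints dictionary are hypothetical, and the sketch keys checkpoints by path, whereas the Collector tracks files by a content-based identity and persists its checkpoints.

```python
import os
import tempfile


def read_new_content(path, checkpoints):
    """Return the bytes appended to `path` since the last call.

    `checkpoints` is a hypothetical in-memory dict mapping path -> byte offset.
    The offset is advanced only by the number of bytes actually read.
    """
    offset = checkpoints.get(path, 0)
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    checkpoints[path] = offset + len(data)
    return data


if __name__ == "__main__":
    checkpoints = {}
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "app.log")
        with open(log, "wb") as f:
            f.write(b"first\n")
        print(read_new_content(log, checkpoints))  # b'first\n'
        with open(log, "ab") as f:
            f.write(b"second\n")
        # Only the appended bytes are returned on the second read.
        print(read_new_content(log, checkpoints))  # b'second\n'
```

Because the offset survives between calls, a reader that restarts with the same checkpoints continues where it stopped instead of re-reading the file.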

File Discovery and Scanning

The File Source treats directories and files differently:

Directories: Scanned every 10 seconds for new directories
Files: Scanned every 100 milliseconds within discovered directories
Linux Systems: Use inotify for event-driven file change detection
Other Platforms: Use polling to detect file changes

File Processing

When a new file is discovered, it goes through this sequence:

Scan → Fingerprint → Identify → Track → Read → Process

The steps in more detail:

Fingerprinting: Each file undergoes content-based checksumming (256 bytes to 4KB) to create a unique identity. For compressed files (gzip/bzip2), the header is decompressed and the inner content is fingerprinted. This identity persists across file moves, renames, and rotations, preventing duplicate processing.
File Identity: Each new file with an unseen fingerprint receives a unique UUID. Files with identical content are detected as duplicates, with lexicographically smaller paths taking priority.
Reading Strategy: The Collector first reads up to 4KB of a file to quickly determine its format and identity. It then reads the remaining unread data in chunks of up to MaxBatchSize until the queue is full.
File Updates: Modifications are detected within 100ms. New data is read incrementally and the tracked offset is updated. If a file's size decreases, the change is ignored. If later updated with new content, it triggers a new fingerprint, creating a new file identity and resetting the tracked offset.
Inactivity Monitoring: Files are monitored for inactivity (configurable timeout, default 60 seconds). On timeout, if the file hasn't grown, handles close and the file enters passive monitoring until modified.
Acknowledgement: File offset tracking compares bytes read vs. successfully delivered. For memory queues: ACK on endpoint 200 OK response. For disk queues: ACK when written to disk. Checkpoints persist across collector restarts.
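
As a rough illustration of the fingerprinting step, the following Python sketch hashes the leading bytes of a file. Several details are assumptions: the document does not name the checksum (SHA-256 is used here), the treatment of a file of exactly 256 bytes is a guess, and the sketch does not model how the Collector grows the fingerprint block as a file grows. The function name fingerprint is hypothetical.

```python
import hashlib
import os
import tempfile

MIN_FINGERPRINT_BYTES = 256   # lower bound mentioned in the text
MAX_FINGERPRINT_BYTES = 4096  # upper bound mentioned in the text (4KB)


def fingerprint(path):
    """Return a hex digest of the file's leading bytes, or None if it is too small.

    SHA-256 is an assumption; the real checksum algorithm is not documented here.
    """
    with open(path, "rb") as f:
        head = f.read(MAX_FINGERPRINT_BYTES)
    if len(head) < MIN_FINGERPRINT_BYTES:
        return None  # not enough content to identify the file yet
    return hashlib.sha256(head).hexdigest()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "app.log")
        rotated = os.path.join(tmp, "app.log.1")
        with open(original, "wb") as f:
            f.write(b"x" * 300)
        before = fingerprint(original)
        os.rename(original, rotated)
        # The identity depends on content, not on the path.
        print(before == fingerprint(rotated))  # True
```

Since the digest is computed from content only, a rename or move leaves the identity unchanged, which is what lets a rotated file be recognized rather than read again.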

Prerequisites

Before configuring the File Source, ensure that you have:

Read permissions on the files and directories to be monitored
A configured sink (destination) for the collected events

Configuration

Prerequisites

First, define a sink that will receive the collected events:

yaml

sinks:
  logscale_sink:
    type: logscale
    url: "https://cloud.humio.com/"
    token: "${LOGSCALE_TOKEN}"
    queue:
      type: memory
      maxLimitInMB: 64

Example 1: Basic Configuration

Here's a simple configuration for collecting Apache access logs:

yaml

sources:
  apache_access_logs:
    type: file
    include:
      - "/var/log/apache2/access.log"
    sink: logscale_sink

Example 2: Advanced Configuration with Multiple Features

yaml

sources:
  application_logs:
    type: file
    
    # Include multiple log files
    include:
      - "/var/log/myapp/*.log"
      - "/var/log/myapp/archive/app-*.log"
    
    # Exclude specific files or patterns
    exclude:
      - "/var/log/myapp/*.log.1"
      - "/var/log/myapp/debug.log"
    
    # Exclude files with specific extensions
    excludeExtensions:
      - "gz"
      - "zip"
    
    # Configure multiline parsing for stack traces
    multiLineBeginsWith: '^[0-9]{4}-[0-9]{2}-[0-9]{2}'
    
    # Specify file encoding
    encoding: UTF-8
    
    sink: logscale_sink

Example 3: Multiline Java Stack Traces

yaml

sources:
  java_logs:
    type: file
    include:
      - "/var/log/java-app/application.log"
    
    # Match lines starting with timestamp as new events
    multiLineBeginsWith: '^20\d{2}-\d{2}-\d{2}'
    
    sink: logscale_sink

Example 4: Continuation Pattern

yaml

sources:
  continued_logs:
    type: file
    include:
      - "/var/log/app/messages.log"
    
    # Lines starting with whitespace continue the previous event
    multiLineContinuesWith: '^\s+'
    
    sink: logscale_sink

Example 5: Complete Configuration

yaml

sinks:
  logscale_sink:
    type: logscale
    url: "https://cloud.humio.com/"
    token: "${LOGSCALE_TOKEN}"
    queue:
      type: memory
      maxLimitInMB: 64

sources:
  # Apache access logs
  apache_logs:
    type: file
    include:
      - "/var/log/apache2/access.log"
    parser: "apache_combined"
    transforms:
      - type: static_fields
        fields:
          log_type: "apache_access"
          server: "${HOSTNAME}"
    sink: logscale_sink
  
  # Application logs with multiline support
  app_logs:
    type: file
    include:
      - "/var/log/myapp/*.log"
    exclude:
      - "/var/log/myapp/debug.log"
    multiLineBeginsWith: '^20\d{2}-\d{2}-\d{2}'
    encoding: UTF-8
    transforms:
      - type: static_fields
        fields:
          application: "myapp"
          environment: "${ENV}"
    sink: logscale_sink

Multiline Event Handling

The File Source provides three approaches for handling multiline events:

1. Begin Pattern (multiLineBeginsWith)

Identifies lines that start a new event. Lines not matching the pattern are appended to the previous event.

Example: Java logs with timestamps

yaml

multiLineBeginsWith: '^20\d{2}-\d{2}-\d{2}'
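
To make the begin-pattern behavior concrete, here is a minimal Python sketch of the grouping logic. It is not the Collector's code: the function name group_begins_with is hypothetical, joining lines with a newline is an assumption, and so is the choice to treat leading lines that appear before the first match as an event of their own.

```python
import re


def group_begins_with(lines, pattern):
    """Group lines into events; a line matching `pattern` starts a new event."""
    begins = re.compile(pattern)
    events = []
    for line in lines:
        if begins.search(line) or not events:
            events.append(line)          # start a new event
        else:
            events[-1] += "\n" + line    # append to the previous event
    return events


if __name__ == "__main__":
    sample = [
        "2024-01-15 10:00:00 ERROR Unhandled exception",
        "java.lang.IllegalStateException: boom",
        "    at com.example.App.main(App.java:10)",
        "2024-01-15 10:00:01 INFO Recovered",
    ]
    for event in group_begins_with(sample, r"^20\d{2}-\d{2}-\d{2}"):
        print(repr(event))
```

With the sample input, the stack trace lines are folded into the first event and the second timestamped line starts a new one, giving two events in total.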

2. Continuation Pattern (multiLineContinuesWith)

Identifies lines that continue the previous event. Lines matching the pattern are appended; lines not matching start a new event.

Example: Lines starting with whitespace

yaml

multiLineContinuesWith: '^\s+'
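
The continuation-pattern behavior can be sketched in Python in the same spirit. The function name group_continues_with is hypothetical, and joining lines with a newline is an assumption rather than documented Collector behavior.

```python
import re


def group_continues_with(lines, pattern):
    """Group lines into events; a line matching `pattern` continues the previous event."""
    continues = re.compile(pattern)
    events = []
    for line in lines:
        if events and continues.search(line):
            events[-1] += "\n" + line  # continuation of the previous event
        else:
            events.append(line)        # anything else starts a new event
    return events


if __name__ == "__main__":
    sample = [
        "ERROR request failed",
        "    at handler (server.js:12)",
        "    at dispatch (server.js:40)",
        "INFO request ok",
    ]
    for event in group_continues_with(sample, r"^\s+"):
        print(repr(event))
```

Here the two indented lines are attached to the ERROR line, and the unindented INFO line begins a second event.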

3. End Pattern (multiLineEndsWith)

Identifies lines that end an event. Lines matching the pattern complete the current event; subsequent lines start a new event.

Example: Lines ending with 'F'

yaml

multiLineEndsWith: '\S+ F'
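
A minimal Python sketch of the end-pattern logic follows. The function name group_ends_with is hypothetical, the sample lines (with P and F markers after a stream name) are invented for illustration, and the handling of trailing lines that never see an end marker is an assumption: the sketch simply returns them as pending. The anchored pattern used here is the one shown for multiLineEndsWith in the Configuration Parameters table.

```python
import re


def group_ends_with(lines, pattern):
    """Group lines into events; a line matching `pattern` completes the current event.

    Returns (events, pending), where `pending` holds trailing lines that have
    not yet been closed by a matching line.
    """
    ends = re.compile(pattern)
    events = []
    current = []
    for line in lines:
        current.append(line)
        if ends.search(line):
            events.append("\n".join(current))
            current = []
    return events, current


if __name__ == "__main__":
    sample = [
        "2024-01-15T10:00:00Z stdout P the first part of the message",
        "2024-01-15T10:00:00Z stdout P the second part of the message",
        "2024-01-15T10:00:00Z stdout F the rest of the message",
        "2024-01-15T10:00:01Z stdout F this is the entire message",
    ]
    events, pending = group_ends_with(sample, r"^\S+ \S+ F")
    print(len(events), len(pending))  # 2 0
```

The first three lines form one event because only the third matches the end pattern; the fourth line matches on its own and becomes a single-line event.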

File Rotation Support

The Falcon LogScale Collector strives to support all kinds of file rotation.

The Collector fingerprints files larger than 256 bytes and increases the fingerprint block size up to 4096 bytes, as applicable.
The Collector supports rotation using the following methods:
- rename
- compression
- truncation
Files rotated by rename or compression are detected as duplicates. Compressed files are considered static. Renamed files keep their fingerprints, and further updates to them are supported. When a file is truncated, the read offset is set to the new size, which may be zero or non-zero.
In the situation where the file is truncated followed by a quick update, the read offset depends on the time between the write and the processing of the event.
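
The truncation rule can be written down as a small Python function. This is a sketch of the rule as stated in the text, not the Collector's code, and the function name next_offset is hypothetical.

```python
def next_offset(tracked_offset, current_size):
    """Return the offset to continue reading from, given the file's current size.

    If the file is now smaller than the tracked offset it has been truncated,
    and the offset moves to the new size (which may be zero or non-zero).
    Otherwise the tracked offset is kept and any new bytes lie after it.
    """
    if current_size < tracked_offset:
        return current_size
    return tracked_offset


if __name__ == "__main__":
    print(next_offset(1000, 1500))  # 1000: file grew, 500 new bytes to read
    print(next_offset(1000, 0))     # 0: truncated to empty
    print(next_offset(1000, 200))   # 200: truncated, then already rewritten
```

The last case shows why timing matters: if new data is written before the truncation is processed, the observed size is already non-zero, so those bytes fall before the new offset.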

Read Compressed Files

The Falcon LogScale Collector supports reading gzip and bzip2 compressed files.

If gzip or bzip2 compressed files are matched by the configured include patterns, these will be auto detected as gzip/bzip2 files (using the magic number at the beginning of the file), decompressed and ingested.
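
Detection by magic number can be illustrated with Python's standard library. The gzip signature (bytes 0x1f 0x8b) and the bzip2 signature (the ASCII bytes BZh) are properties of the file formats; the function names detect_compression and open_maybe_compressed are hypothetical and do not reflect the Collector's internals.

```python
import bz2
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member
BZIP2_MAGIC = b"BZh"      # first three bytes of a bzip2 stream


def detect_compression(path):
    """Return "gzip", "bzip2", or None based on the file's leading bytes."""
    with open(path, "rb") as f:
        head = f.read(3)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    return None


def open_maybe_compressed(path):
    """Open `path` for binary reading, decompressing transparently if needed."""
    kind = detect_compression(path)
    if kind == "gzip":
        return gzip.open(path, "rb")
    if kind == "bzip2":
        return bz2.open(path, "rb")
    return open(path, "rb")
```

Because the check looks at content rather than the file name, a compressed file is recognized even when it carries no .gz or .bz2 extension.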

By default, files with the following extensions will be ignored/skipped even if they match a configured include pattern:

.xz
.tgz
.z
.zip
.7z

File extensions to ignore/skip can be configured with the excludeExtensions config option. The default is:

yaml

excludeExtensions: ["xz", "tgz", "z", "zip", "7z"]

To override the default, set excludeExtensions to an empty array. Files with these extensions are then ingested without being decompressed first, which effectively sends them in their compressed format. For example:

yaml

excludeExtensions: []

To exclude gzip and bzip2 files in addition to the default excluded extensions, use the following setting (provided the compressed files are named *.gz and *.bz2):

yaml

excludeExtensions: ["xz", "tgz", "z", "zip", "7z", "gz", "bz2"]
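
The effect of an extension list can be previewed with a short Python sketch. It is only an approximation of the filtering described above: the function name is_excluded is hypothetical, and comparing only the final extension with an exact, case-sensitive match is an assumption about how the Collector evaluates excludeExtensions.

```python
import os

# Default list as given in the text.
DEFAULT_EXCLUDE_EXTENSIONS = ("xz", "tgz", "z", "zip", "7z")


def is_excluded(path, exclude_extensions=DEFAULT_EXCLUDE_EXTENSIONS):
    """Return True if the final extension of `path` is in `exclude_extensions`."""
    ext = os.path.splitext(path)[1].lstrip(".")
    return ext in exclude_extensions


if __name__ == "__main__":
    print(is_excluded("/var/log/myapp/archive.zip"))  # True
    print(is_excluded("/var/log/myapp/app.log.gz"))   # False
    extended = DEFAULT_EXCLUDE_EXTENSIONS + ("gz", "bz2")
    print(is_excluded("/var/log/myapp/app.log.gz", extended))  # True
```

Passing an empty sequence corresponds to excludeExtensions: [], in which case no file is skipped by extension.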

Best Practices

File Selection

Use specific paths when possible to minimize overhead
Leverage exclude and excludeExtensions to filter unwanted files
Be cautious with broad glob patterns that might match many files
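
Before deploying a broad pattern, it can help to preview what it would match on the host. The Python sketch below does this with the standard glob and fnmatch modules. It assumes that their matching rules are close enough to the Collector's for a sanity check, which is not guaranteed; in particular, fnmatch lets * match path separators. The function name preview_matches is hypothetical.

```python
import fnmatch
import glob


def preview_matches(include, exclude=()):
    """Return the sorted paths matched by `include` globs and not by `exclude` patterns."""
    matched = set()
    for pattern in include:
        matched.update(glob.glob(pattern))
    return sorted(
        path
        for path in matched
        if not any(fnmatch.fnmatch(path, ex) for ex in exclude)
    )


if __name__ == "__main__":
    files = preview_matches(
        include=["/var/log/myapp/*.log"],
        exclude=["/var/log/myapp/debug.log"],
    )
    print(len(files), "file(s) would be monitored")
    for path in files:
        print(path)
```

If the count is much larger than expected, narrowing the include pattern or adding exclude entries is usually cheaper than letting the collector track every match.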

Performance Optimization

The queue size is typically smaller (64 MB) for file sources since data persists on disk
Adjust inactivityTimeout based on your log writing patterns
Monitor collector resource usage when tracking many files

Multiline Configuration

Choose the appropriate multiline strategy for your log format
Test regular expressions thoroughly to avoid incorrect event boundaries
Consider performance impact of complex regex patterns

Security

Ensure the collector has appropriate read permissions
Use environment variables for sensitive configuration values
Regularly rotate and archive old log files

Monitoring and Troubleshooting

Monitoring Collector Status

Check collector logs for file monitoring status
Monitor checkpoint advancement to ensure progress
Track lag between file writes and LogScale ingestion
Set up alerts for collection failures or stalls

Common Issues and Solutions

Issue	Symptom	Potential Causes and Solutions
No events collected	Files exist but no data appears in LogScale	Verify file permissions, check include/exclude patterns, review collector logs
Duplicate events	Same events appear multiple times	Check for multiple collectors monitoring the same files, verify checkpoint storage
Missing events	Some log entries don't appear	Review multiline configuration, check for file rotation issues
High resource usage	Collector consumes excessive CPU/memory	Reduce number of monitored files, optimize glob patterns, adjust polling intervals
Empty `@rawstring`	Ingested events have an empty `@rawstring` field	.CSV files used as input exceed 16 MB in size, and use CRLF as newline delimiter
Unexpected events	Unexpected events are ingested when .CSV files are used	The File Source is designed for plain text log files where new lines are appended incrementally at the end of the file. CSV files may not behave as expected because they are often updated in the middle rather than appended to.

Configuration Parameters

Table: File source

Parameter	Type	Required	Default Value	Description
`encoding`	fileencoding	optional^[a]		Specifies the character encoding used for source files, ensuring correct text interpretation. (added in 1.10)
			Values
			`UTF-16BE`
			`UTF-16LE`
			`UTF-8`
`exclude`	array of strings	optional^[a]		Specify the file paths or patterns to exclude when collecting logs. Some file extensions are automatically ignored even if they match an include pattern: `xz`, `tgz`, `z`, `zip`, `7z`. Note: to include these files, set `excludeExtensions` to an empty array. This will have the side effect that files will not be decompressed before reading.
`excludeExtensions`	array of strings	optional^[a]	`['xz', 'tgz', 'z', 'zip', '7z']`	Specify the file extensions to exclude when collecting data. Some file extensions are automatically ignored even if they match an include pattern: `xz`, `tgz`, `z`, `zip`, `7z`. Note: to include these files, set `excludeExtensions` to an empty array. This will have the side effect that files will not be decompressed before ingest.
`fingerprintBytes`	integer	optional^[a]	`4096`	Specifies number of bytes to read from a source file to create a fingerprint for identification. (added in 1.10)
`inactivityTimeout`	integer	optional^[a]	`60`	Specify the period of inactivity in seconds for a file to be monitored before its file descriptor is closed to free system resources. When the file changes, it is reopened and the timeout restarts.
`include`	array of strings	required		Specify the file paths to include when collecting data. This field supports environment variable expansions. To use an environment variable, reference it using the syntax `${VAR}`, where VAR is the name of the variable. The {} braces may be omitted, but in that case the variable name can contain only: [a-z], [A-Z], [0-9] and "_".
`multiLineBeginsWith`	string	optional^[a]		The file input can concatenate consecutive lines to create multiline events. When this regular expression is configured, the collector uses it to find the beginning of each new multiline event. Example: for multiline events that begin with a date, e.g. `2022`, use: `multiLineBeginsWith: ^20\d{2}-` In this case, every line that doesn't match the pattern is appended to the latest line that did. `multiLineBeginsWith` does not look for a continuation pattern that continues a multiline event. Note: This option is mutually exclusive with `multiLineEndsWith`. Only one of these options can be configured for a file source at a time.
`multiLineContinuesWith`	string	optional^[a]		The file input can concatenate consecutive lines to create multiline events. When this regular expression is configured, the collector uses it to find the continuation of multiline events. Lines starting with whitespace are often continuations of the previous line. For example, to concatenate lines starting with whitespace (instead of starting at column 0): `multiLineContinuesWith: ^\s+` In this case, every line that matches the pattern is appended to the previous line that didn't. `multiLineContinuesWith` does not look for a beginning pattern that begins a multiline event.
`multiLineEndsWith`	string	optional^[a]		The file input can concatenate consecutive lines to create multiline events. When this regular expression is configured, the collector uses it to find the ending of multiline events. Example: for all multiline events ending with ' F ': `multiLineEndsWith: ^\S+ \S+ F` This will create a single multiline event for the following case: `<timestamp> the first part of the message <timestamp> the second part of the message <timestamp> the rest of the message` While this will create a single line event: `<timestamp> this is the entire message` Every line that doesn't match the pattern is held and combined with the next line that does. Note: This option is mutually exclusive with `multiLineBeginsWith`. Only one of these options can be configured for a file source at a time.
`parser`	string	optional^[a]		Specify the name of the LogScale parser to use for parsing the logs. If you installed LogScale through a package manager, you can specify the type of logs to be displayed on the search page, for example linux/system-logs:linux/system-logs. If a parser is assigned to the ingest token being used, that parser will be ignored.
`sink`	string	required		Name of the configured sink that will receive the collected events.
`transforms`	transform	optional^[a]		Specify transforms to use for this source. See All Sources: How to Use Transforms for information on how to use transforms.
`type`	file	required		Specifies the type of source. Set to `file` to use the File Source. The sources block defines the sources of data that the collector sends to Falcon LogScale.
^[a]Optional parameters use their default value unless explicitly set.
