Data Ingestion Overview

Setting up data ingestion is an essential step in configuring and using LogScale. How an organization approaches it depends on the organization's needs and its tools, platforms, applications, and the data itself. Whichever option is selected, the process is designed to bring log data in as quickly as possible and with as few blockers as possible, using several key tools and components. LogScale then queues and stores the data, ready to be queried. After installing LogScale, logs are brought into a centralized repository using a method the user selects.

Logs are ingested according to the user's needs and the data itself; the ingest method depends on factors such as operating system (OS), log format, data source(s), and more. The options for ingest include:

  • AWS S3 Bucket(s)

  • Azure Feed(s)

  • Falcon LogScale Collector and Other Log Shippers

  • LogScale Ingest Tokens

  • LogScale API

  • Falcon Data Replicator (FDR) Feeds

Note

It's important to note that LogScale is designed primarily for live data. Historical, and therefore static, data has a different set of considerations and requirements; see Backfilling Data | Falcon LogScale Cloud 1.208.0-1.213.0.

Ingest Methods

LogScale supports a variety of ingest methods and associated data, including the following:

  • Falcon LogScale Collector and Other Log Shippers

  • AWS S3 Bucket(s)

  • Azure Feed(s)

  • LogScale Ingest Tokens

  • LogScale API

  • Falcon Data Replicator (FDR) Feeds

Falcon LogScale Collector and Other Log Shippers

Log shippers are system tools that gather data from a server and send it to LogScale for analysis. They are built to support seamless data transfer and to account for common problems that impact reliability and consistent performance. A user's application writes logs to a log file; the log shipper then reads and pre-processes the data before shipping it using one of LogScale's Ingest APIs.
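
To make this flow concrete, the sketch below shows the bare mechanics of a log shipper: tail a file, batch new lines, and POST them to a LogScale ingest endpoint. This is not the Falcon LogScale Collector, just an illustrative Python sketch; the endpoint path, token placeholder, and batch size are assumptions, and a production shipper also handles buffering, retries, and checkpointing.

```python
# Minimal illustration of what a log shipper does: read new lines from a
# log file, batch them, and ship them to a LogScale ingest endpoint.
# The URL, token, and payload shape below are assumptions for the sketch;
# consult the Ingest API documentation for the exact endpoint and format.
import time
import requests

LOGSCALE_URL = "https://<your-logscale-host>/api/v1/ingest/humio-unstructured"  # assumed endpoint
INGEST_TOKEN = "<repository-ingest-token>"
LOG_FILE = "/var/log/myapp/app.log"

def ship(batch):
    """Send a batch of raw log lines to LogScale."""
    payload = [{"messages": batch}]
    resp = requests.post(
        LOGSCALE_URL,
        json=payload,
        headers={"Authorization": f"Bearer {INGEST_TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()

def tail_and_ship():
    batch = []
    with open(LOG_FILE, "r") as f:
        f.seek(0, 2)  # start at the end of the file, like `tail -f`
        while True:
            line = f.readline()
            if line:
                batch.append(line.rstrip("\n"))
                if len(batch) >= 100:  # ship in batches to reduce overhead
                    ship(batch)
                    batch = []
            else:
                if batch:
                    ship(batch)
                    batch = []
                time.sleep(1)

if __name__ == "__main__":
    tail_and_ship()
```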

The Falcon LogScale Collector supports multiple data sources, where a data source is a point from which data is collected. Each data source offers different capabilities, all geared toward the unique needs of users and their respective organizations.

Falcon LogScale Collector currently supports the following inputs or data sources:

  • Collecting Events from Files

  • Windows Events

  • Syslog Receiver

  • Exec Input

  • SystemD Logs on Linux

  • macOS Unified Logs

Third-Party Log Shippers

Third-party log shippers allow users to transfer log files and metrics reliably. The benefits include retransmission if a transfer fails, and batched messages.

Note

Third-party log shippers are generally recommended ONLY when a user's data is not supported by the Falcon LogScale Collector, or the user's toolset already includes another tool that performs similar tasks.

Ingest API

LogScale's Ingest API can be useful in a variety of cases where other ingest methods are restricted or unavailable. Use cases are defined more fully within the documentation.

Elasticsearch Bulk API

Elasticsearch's bulk API makes it possible to perform many index/delete operations in a single API call, greatly increasing the indexing speed.
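
To illustrate the format, a bulk request body is newline-delimited JSON in which each action line (for example, an index action) is followed by the document itself. The Python sketch below builds such a body and sends it with an ingest token; the endpoint path is an assumption for the example, so confirm it against the Ingest API documentation.

```python
# Sketch of an Elasticsearch-style bulk request: newline-delimited JSON
# where each "index" action line is followed by the document to index.
# The endpoint path and token below are assumptions for the example.
import json
import requests

BULK_URL = "https://<your-logscale-host>/api/v1/ingest/elastic-bulk"  # assumed path
INGEST_TOKEN = "<repository-ingest-token>"

events = [
    {"@timestamp": "2024-01-01T12:00:00Z", "message": "user login", "user": "alice"},
    {"@timestamp": "2024-01-01T12:00:01Z", "message": "user logout", "user": "alice"},
]

# Build the NDJSON body: one action line plus one document line per event.
lines = []
for event in events:
    lines.append(json.dumps({"index": {}}))
    lines.append(json.dumps(event))
body = "\n".join(lines) + "\n"  # bulk bodies must end with a newline

resp = requests.post(
    BULK_URL,
    data=body,
    headers={
        "Authorization": f"Bearer {INGEST_TOKEN}",
        "Content-Type": "application/x-ndjson",
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.status_code)
```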

Note

The Ingest API and Elasticsearch Bulk API are generally recommended ONLY when a user is attempting to integrate logging with one of their own internal tools or software, and the Falcon LogScale Collector is therefore redundant.

Amazon Web Services (AWS) Simple Storage Service (S3) Bucket(s)

AWS log data represents a powerful opportunity to gain insight into an organization's data. LogScale allows users to ingest and manage AWS log types from S3 buckets via Amazon Simple Queue Service (SQS), then generate alarms, alerts, and queries.

Common AWS Services
  • AWS VPC Flow Logs

  • CloudTrail

  • CloudWatch

Logs from AWS sources are delivered to S3 buckets, and LogScale is notified of new objects via SQS. Ingest via the SQS queue continues at scale, with an ingest schedule that reflects the number of incoming messages. This configuration does have some latency, which depends on factors from both the originator of the event log and from the user. If reconfiguration is necessary, scaling is reset accordingly, which is another consideration for latency.
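
No user code is required for this: LogScale performs the SQS polling and S3 fetching itself once the feed is configured. The Python sketch below, using boto3 with a placeholder queue URL, only illustrates the underlying pattern of receiving a standard S3 event notification from SQS and fetching the referenced object.

```python
# Illustration of the SQS + S3 pattern that LogScale uses internally once an
# AWS feed is configured: S3 publishes an event notification to SQS when a
# new log object lands, and the consumer fetches that object from the bucket.
import json
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/logscale-ingest"  # example queue
sqs = boto3.client("sqs")
s3 = boto3.client("s3")

def poll_once():
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        notification = json.loads(msg["Body"])
        # Standard S3 event notifications carry the bucket name and object key.
        for record in notification.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            obj = s3.get_object(Bucket=bucket, Key=key)
            log_data = obj["Body"].read()
            print(f"fetched {len(log_data)} bytes from s3://{bucket}/{key}")
        # Delete the message once the object has been processed.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    poll_once()
```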

Prerequisites
  • AWS access and proficiency

  • A log source that is properly configured according to the documentation

  • Access to LogScale

  • Appropriate permissions

For more information, see the documentation here: https://library.humio.com/falcon-logscale-cloud/ingesting-data-aws-feeds.html

Azure Feed(s)

LogScale continuously polls the Azure Event Hub, ingests data, and scales the ingest process based on the number of partitions configured in the event hub. LogScale users can ingest and manage logs from Azure Event Hubs and then leverage the results with queries, alerts, and alarms. Latency does occur between an event happening and its general availability, both from Azure Monitor itself and from the user's configuration.
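
As with the AWS feeds, LogScale performs this polling itself once the Azure feed is configured. The Python sketch below, with placeholder connection strings, only illustrates the underlying pattern, and shows why the Blob Storage prerequisite listed below exists: the consumer checkpoints its position in each Event Hub partition to a blob container.

```python
# Illustration of the Event Hub consumption pattern: the consumer polls the
# hub, scaling with the number of partitions, and a Blob Storage container
# is used to checkpoint how far each partition has been read.
# Requires the azure-eventhub and azure-eventhub-checkpointstoreblob packages;
# all connection strings and names below are placeholders.
from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

checkpoint_store = BlobCheckpointStore.from_connection_string(
    "<storage-account-connection-string>",
    container_name="logscale-checkpoints",
)

client = EventHubConsumerClient.from_connection_string(
    "<event-hub-namespace-connection-string>",
    consumer_group="$Default",
    eventhub_name="<event-hub-name>",
    checkpoint_store=checkpoint_store,
)

def on_event(partition_context, event):
    # Each event body holds exported log records (for example, from Azure Monitor).
    print(partition_context.partition_id, event.body_as_str())
    partition_context.update_checkpoint(event)  # persist progress to Blob Storage

with client:
    # starting_position="-1" reads each partition from the beginning.
    client.receive(on_event=on_event, starting_position="-1")
```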

Data Sources
  • Microsoft Defender

    Microsoft Defender products and services

  • Azure Monitor

    Azure Monitor documentation

  • Microsoft Entra ID

    Microsoft Entra ID documentation

Prerequisites

The following are prerequisites for this configuration. Users must have:

  • An Event Hub with data. For more information, see Microsoft's documentation: Create an event hub using the Azure portal

  • A Storage Account with Blob storage (new or existing). For more information, see Microsoft's documentation: Introduction to Azure Blob Storage

  • Falcon LogScale access

  • Azure permissions

  • Read access to Event Hub

  • Read and write access to Blob Storage

LogScale Ingest Tokens

Ingest tokens are unique strings that work in conjunction with endpoints to identify a repository and allow users to send data to that repository with appropriate authentication.

The ingest token allows LogScale to identify the repository that the data will be ingested into, and the parser that will be used to extract the fields and data from the original log files.

Important

Ingest tokens are tied to a repository, not a user. This provides a better way of managing access control and is more convenient for most use cases. For example, if a user leaves the organization or project, you do not need to re-provision all agents that send data with a new token. You also do not have to create fake user accounts. However, it's important to remember that because ingest tokens are tied to a repository and not a user, ingest tokens can't be used to query LogScale, read data, or log in.

Repositories can also have multiple ingest tokens. This helps route data and associate parsers, and since ingest tokens are tied to the repository rather than to a specific user, they offer better access management.
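
Because the token alone identifies the target repository and its assigned parser, the same client code can route identical payloads to different repositories simply by switching tokens. A minimal Python sketch, with an assumed unstructured ingest endpoint and placeholder tokens:

```python
# The same payload is routed to different repositories (and handled by the
# parser assigned to each token) simply by switching the ingest token.
# The endpoint path and tokens are placeholders for this sketch.
import requests

URL = "https://<your-logscale-host>/api/v1/ingest/humio-unstructured"  # assumed endpoint
payload = [{"messages": ["2024-01-01T12:00:00Z app started"]}]

for token in ("<web-repo-ingest-token>", "<infra-repo-ingest-token>"):
    resp = requests.post(URL, json=payload,
                         headers={"Authorization": f"Bearer {token}"}, timeout=10)
    resp.raise_for_status()
```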

Ingest tokens can also be used for different ingest methods. For more information, see LogScale Third-Party Log Shippers and Ingest API | Falcon LogScale APIs 1.118.0-1.216.0

To create a Repository Ingest Token, see Ingest Tokens | Falcon LogScale Cloud 1.208.0-1.217.0

LogScale Ingest API

LogScale's Ingest API provides a way to ingest data with unique requirements or from unique situations. Users can think of it as an alternative to the Falcon LogScale Collector, with several use cases:

  • When Falcon LogScale Collector is not currently supported, either due to system or platform requirements

  • When users don't have control over the formatting of event messaging

  • When users require backward compatibility with Splunk tools/scripts/collectors

  • When users have existing OpenTelemetry feeds to ingest into LogScale

  • When users need compatibility with tools that use Elasticsearch's Bulk API

Ingest APIs, ingest tokens, and HTTP endpoints work in conjunction with one another so that data is ingested securely and can be used confidently and consistently. Data can be ingested from other data sources or tools, directly from the output of other databases, or as part of a customer's custom application. The Ingest API can be used directly or through one of LogScale's APIs or software libraries.
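
As a sketch of direct use, the Python example below sends structured events, each carrying tags and attributes along with a timestamp, to an ingest endpoint authenticated with a repository ingest token. The endpoint path and payload shape are assumptions based on the structured ingest format; confirm both against the Ingest API reference before relying on them.

```python
# Sketch of calling the Ingest API directly with structured events:
# each item carries tags (used to route events to datasources) and a list
# of events with a timestamp and arbitrary attributes.
# Confirm the exact endpoint path and payload shape in the Ingest API reference.
import requests

URL = "https://<your-logscale-host>/api/v1/ingest/humio-structured"  # assumed endpoint
INGEST_TOKEN = "<repository-ingest-token>"

payload = [
    {
        "tags": {"host": "server-01", "source": "custom-app"},
        "events": [
            {
                "timestamp": "2024-01-01T12:00:00+00:00",
                "attributes": {"action": "login", "user": "alice", "status": "ok"},
            }
        ],
    }
]

resp = requests.post(
    URL,
    json=payload,
    headers={"Authorization": f"Bearer {INGEST_TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
```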

For more information about ingest tokens, see the documentation here: Ingest Tokens | Falcon LogScale Cloud 1.208.0-1.217.0

See the Ingest API reference page for more information. For a list of supported software, see the Software Libraries in the Appendix.

  • Application Programming Interfaces (APIs) | Falcon LogScale APIs 1.118.0-1.215.0

  • Ingest API | Falcon LogScale APIs 1.118.0-1.216.0

Falcon Data Replicator (FDR) Feeds

FDR feeds are a service that provides events and third-party data in JSON format using Amazon S3 and Simple Queue Service (SQS). Unlike the other options offered, FDR feeds perform regular, scheduled data transfers rather than supporting a continuous data stream. FDR feeds can also be used to store and access data with unique requirements, send Falcon data to other tools, and more.

Considerations

There are several considerations to examine when ingesting data using FDR feeds, including:

  • Users must have a subscription to both Falcon Data Replicator and Falcon Insight XDR

  • FDR feeds do not include all CrowdStrike API data and do not replicate all CrowdStrike data

  • FDR is a data source that transfers data on a regular basis; there is no real-time option, which might not be sufficient for a user's needs.

Users who employ an FDR feed receive compressed, batched data delivered to an S3 bucket on a scheduled basis and are notified via an SQS queue. Batch sizes can vary, and batch data more than 7 days old is automatically deleted from S3 buckets provided by CrowdStrike (customer-owned buckets aren't subject to this retention period). The data a user can retrieve therefore depends on these factors and the retention period.
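
Consuming an FDR feed follows the same SQS-plus-S3 pattern described for the AWS feeds: each SQS message announces a completed batch and lists the gzipped, newline-delimited JSON files to download. The Python sketch below is illustrative only; CrowdStrike provides the queue and credentials, and the message fields assumed here (bucket, files, path) should be confirmed against the FDR documentation.

```python
# Illustrative FDR consumer: each SQS notification announces a batch and
# lists gzipped, newline-delimited JSON files to download from S3.
# The message fields ("bucket", "files", "path") are assumptions for this
# sketch; confirm the exact schema in the FDR documentation.
import gzip
import json
import boto3

QUEUE_URL = "<fdr-sqs-queue-url-provided-by-crowdstrike>"
sqs = boto3.client("sqs")
s3 = boto3.client("s3")

def consume_once():
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        batch = json.loads(msg["Body"])
        bucket = batch["bucket"]
        for f in batch["files"]:
            obj = s3.get_object(Bucket=bucket, Key=f["path"])
            # Each file is gzip-compressed, newline-delimited JSON events.
            for line in gzip.decompress(obj["Body"].read()).splitlines():
                event = json.loads(line)
                # Forward the event to LogScale or another destination here.
                print(event.get("event_simpleName", "<event>"))
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    consume_once()
```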

For more information, see the documentation here: https://falcon.crowdstrike.com/documentation/page/fa572b1c/falcon-data-replicator

Parsing Data

Logs that are sent to LogScale for ingestion must be parsed before being stored in a repository, which is achieved with parsers. Parsers enrich the incoming data and are composed of both a script and parser settings. The script is written in CrowdStrike Query Language (CQL) and defines how incoming events are transformed so that they become searchable (or more searchable) events.

Parser settings configure and adjust certain actions, such as assigning events to a particular datasource (Event Tags) or removing certain fields to optimize operational costs. When data is parsed, it is put on a Kafka ingest queue and an acknowledgement is returned in the response to the client.

Parsers take data and extract fields that are then stored along with the original text. Data can be structured or unstructured, and users must specify what parser is to be used and in which repository the data is to be stored.

Parser Types

There are three options for a parser:

  • Pre-built parsers

  • Built-in parsers

  • Custom parsers

Pre-built Parsers

A number of parsers are built into LogScale and automatically available; these include parsers for syslog, key-value, and JSON formats (see Built-in Parsers below). For ingesting third-party data, LogScale also supports a number of packages that include pre-built parsers, dashboards, and corresponding widgets and alerts.

These packages provide integration with third-party systems and enable you to ingest and parse data without having to write a custom parser. Those parsers normalize ingested data to the CrowdStrike Parsing Standard (CPS, more commonly referred to as Pa-Sta), a common set of fields that allows data from different third-party systems, such as firewalls or security systems, to be normalized so that it can be queried in a consistent and common form.

Built-in Parsers

Built-in parsers are provided by LogScale to parse common log formats, especially widely-used formats like accesslog, which is used for web servers like Apache and Nginx.

LogScale provides the following parsers:

  • accesslog

  • audit-log

  • corelight-es

  • corelight-json

  • json

  • json-for-action

  • kv

  • kv-generic

  • kv-millis

  • serilog-jsonformatter

  • syslog

  • syslog-utc

  • zeek-json

Custom Parsers

Custom parsers are the recommended solution for users who have not found a suitable option among the pre-built and built-in parsers already available in LogScale.

Parser Script Goals

The goals for a parser script are to:

  • Extract the correct timestamp from the event

  • Set the fields you want to use frequently in your searches