Ingestion: Ingest Phase

Data sent to LogScale is received by the ingest layer, which handles the incoming request. The data is first matched and validated against the given ingest protocol and then turned into log events. Typically, users create parsers that are applied to structure and enrich the incoming data. Once the data is parsed, it is placed on a Kafka ingest queue and an acknowledgement is returned in the response to the client. The ingest phase performs the following steps:

  • Validating the input

  • Extracting timestamps, or adding them if not available (see the sketch after this list)

  • Parsing the data using a user-defined parser to extract fields or reformat data

  • Placing completed events on the Kafka ingest queue
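
For example, the timestamp step can be expressed directly in a parser. The following is a minimal sketch in the LogScale Query Language; the leading ISO 8601 timestamp format and the field name ts are assumptions made for illustration:

// Extract a leading timestamp into a field, then use it as the event time
/^(?<ts>\S+)/
| parseTimestamp(format="yyyy-MM-dd'T'HH:mm:ss.SSSXXX", field=ts)

If no timestamp can be extracted, the event keeps the time at which the cluster ingested it.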

Parsers During Ingestion

Whether you are ingesting structured or formatted data in which the fields and information have already been identified, or raw text log lines from which the information needs to be extracted, the role of the parser is to extract, format, and enrich the incoming data stream for storage.

Parsers within LogScale allow for the following operations during ingestion:

  • Identify specific fields according to the source data type

  • Identify metadata fields such as the timestamp and translate them to the LogScale standard

  • Augment the information, for example normalizing fields to a standard format or resolving IP addresses

  • Assign key fields to a standardized format to allow data from different source formats to be queried using the same field names

The process of parsing is one of enriching the data. Most log data is free text, but storing the information in fixed fields improves the ability to query and process the information during search. LogScale always stores the original raw text along with any extracted field data.
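
For example, assuming fields named status and url have been extracted (as in the web server example that follows), filtering and aggregating on them is direct, whereas asking the same question of the raw text means re-parsing @rawstring at query time:

// Query using parsed fields
status >= 500
| groupBy(url)

// Roughly equivalent filter against the raw text, re-parsed at query time
@rawstring = /" 5\d\d /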

For a more detailed example, let's look at the output from an HTTP web server. Each line represents a request/response from the web server and is turned into an event consisting only of the raw string. The raw string for a web server typically follows a well-known format and contains information such as the HTTP status code, HTTP method, response time, URL, and user agent. It is possible to create parsers for a given structure; for a web server the data could look like this:

accesslog
47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] "GET /?p=1 HTTP/2.0" 200 5316 "https://domain1.com/?p=1" "Mozilla/5.0 (Windows NT 6.1)"

The structure of the data contains a lot of information that we can parse and extract:

47.29.201.179                    IP Address
-                                Auth
-                                Username
[28/Feb/2019:13:17:10 +0000]     Timestamp
GET /?p=1 HTTP/2.0               HTTP Request
200                              HTTP Response Code
5316                             Response Size
https://domain1.com/?p=1         Referrer
Mozilla/5.0 (Windows NT 6.1)     Client

Creating a parser for this format adds structure to the event. Without a parser, the event contains only the raw string and the time at which it was ingested:

@rawstring 47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] ...(full text)
@timestamp cluster ingest time (not event time)

With a parser for this format applied, the raw data is parsed into the following fields:

@rawstring 47.29.201.179 - - [28/Feb/2019:13:17:10 +0000] ...
@timestamp 28/Feb/2019:13:17:10 +0000
method GET
version 2.0
status 200
size 5316
url https://domain1.com/?p=1
user-agent Mozilla/5.0 (Windows NT 6.1)

Parsers are written using the LogScale Query Language (LQL). Using the same language as you use for querying means the same functions and constructs are available when parsing as when searching. And because you can always parse and extract information at query time by re-examining the original @rawstring, the same principles apply in both places.

Typically, a parser makes use of the following functions and syntax (a sketch combining them follows this list):

  • Parsing functions for specific data types, such as parseJson(), parseXml(), parseCsv() and parseTimestamp()

  • Regular expressions, using the /regex/ syntax or the regex() function, to identify key information and place it into fields

  • Statements such as case or if for selecting how the data is processed
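
Putting these together, a parser for the access log example above could look roughly like the following sketch; the regular expression and field names are illustrative and slightly simplified rather than a production-ready parser:

// Extract fields from the access log line with a named-group regular expression
/^(?<clientIP>\S+) (?<auth>\S+) (?<username>\S+) \[(?<ts>[^\]]+)\] "(?<method>\S+) (?<url>\S+) HTTP\/(?<version>[^"]+)" (?<status>\d+) (?<size>\S+) "(?<referrer>[^"]*)" "(?<useragent>[^"]*)"$/
// Use the extracted timestamp as the event time instead of the ingest time
| parseTimestamp(format="dd/MMM/yyyy:HH:mm:ss Z", field=ts)

The named capture groups become fields on the event, and parseTimestamp() sets @timestamp from the time recorded in the log line rather than the ingest time.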

With the full suite of LQL tools available, you can also enrich the data. In the example above, the parser could geocode the client IP address, enriching the event with the country, area, and city of the request:

...| ipLocation(clientIP)
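
Once enriched, the derived fields can be used directly in queries. As an illustrative sketch, assuming ipLocation() adds a clientIP.country field and that the data is tagged #type=accesslog:

#type=accesslog
| groupBy(clientIP.country)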

For more information on writing and creating parsers, see Parsing Data.

In the parser configuration it is also possible to specify which fields in the event should be tags. Tags are discussed in Tag Fields and Datasources.

Tag Fields and Datasources

LogScale organizes data into repositories, which are logical stores, and into an underlying physical structure called datasources. During ingest, the ingest token (and sometimes the parser) determines the repository in which the data is stored.

Tags are fields that define the datasources within a repository. Datasources (and their tags) are used to group data which is important for physically organizing and distributing the data and optimizing search performance. Tags are defined during ingest, either by the log shipper, or by the parser.

Tags can be static; for example, a parser could specify that all data parsed with it should have the tag #type=syslog. The parser can also designate fields on the events being parsed as tags. Using network data as an example, events could have a protocol field that serves as a tag, with values such as #protocol=dns or #protocol=http. Searches that filter on the #protocol field are then optimized, because only the matching data needs to be considered.
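
For example, a query that filters on such a tag only needs to consider the datasources whose tag values match. An illustrative sketch using the #protocol tag described above:

// Only datasources tagged with #protocol=dns are scanned for this query
#protocol=dns
| count()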

In the example below we have a set of web server hosts and we are collecting the logs from each host. We use the fields #host and #source as tag fields. A datasource is created for each combination of tag fields and their values, and this determines how segment files are created:

#host=server1 #source=http.log       SegmentA1  SegmentA2  SegmentA3
#host=server1 #source=access.log     SegmentB1  SegmentB2
#host=server2 #source=http.log       SegmentC1  SegmentC2  SegmentC3  SegmentC4  SegmentC5
#host=server3 #source=loadbalance    SegmentD1  SegmentD2  SegmentD3  SegmentD4  SegmentD5  SegmentD6

(Each row is a datasource with its segment files, written left to right over time.)

Tag fields start with the # character. Tags can be used to search a specific set of data, and the query is optimized because only the segments that match the given tags need to be loaded. For example, this search only looks at the datasource for http.log on server1:

accesslog
#host=server1 #source=http.log

The datasource is the combination of these two tags, and it affects how the data is physically stored. When a search uses the tags, only the matching physical datasources are scanned, which can improve the performance of the search. This is described in detail in Metadata.

For more information on datasources and how they are used and managed by LogScale, see Datasources.

Avoid using tags with high cardinality (that is, a large number of unique values), as this increases the number of datasource combinations, the memory requirements, and the segment organization overhead. Consider this and the other recommendations listed in Event Tags when defining tag fields.

Ingest Tokens

LogScale requires clients to provide an Ingest Token. Ingest tokens are created in LogScale and have the following properties:

  • Authorization and authentication; data can only be ingested if a valid Ingest Token has been used.

  • Ingest tokens are unique to the repository where the data will be stored. You cannot use an Ingest Token for repository A to ingest data into repository B.

  • Tokens are associated with a specific parser.

Using an Ingest Token means having a unique string that identifies both where the data will be stored and how it will be processed (through the associated parser). To limit the ingestion of data to specific hosts, create a unique Ingest Token for each host and parser configuration.

Alternatively, the parser can choose how the incoming data is parsed and processed based on a field within the source log file or data.
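
As an illustrative sketch, a parser could switch on a hypothetical logtype field in the incoming data; the field name, its values, and the branch logic are assumptions:

// Route events to different parsing logic based on a field in the data
case {
  logtype = "json" | parseJson();
  logtype = "accesslog" | /^(?<clientIP>\S+) (?<auth>\S+) (?<username>\S+) \[(?<ts>[^\]]+)\]/;
  * | kvParse()
}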