Creating a Parser

A parser consists of a script, plus a few related settings. The parser script is the main part of the parser, as this defines how a single incoming event is transformed before it becomes a searchable event. LogScale has built-in parsers for common log formats like accesslog.

Note

The goal for a parser script is to extract the correct timestamp from the incoming event and set fields that you want to use frequently in your searches.

The following diagram provides an overview of where parsers fit in the configuration flow to ingest data using LogScale.

Flow: Install and Configure LogScale → Create a Repository → Configure Data Ingest → Parse and Filter Data → Enrich Data → Query Data. This guide covers the highlighted Parse and Filter Data step.

Figure 43. Flow


If you have reviewed the available parsers and decided that you want to create your own (or edit an existing one), this guide will help you do so.

Creating a New Parser

This section describes how to create a parser from scratch.

Parser Overview

Figure 44. Parser Overview


  1. Go to the Repositories and views page and select the repository where you want to create a parser.

  2. Click Parsers to reach the parser overview, and then click + New Parser, see Figure 44, “Parser Overview”.

  3. In the New parser dialog box, enter a name for your parser: only alphanumeric characters, underscores, and hyphens are allowed, and the name must be unique within the repository.

  4. Select how to create the parser:

    • Empty Parser – Select Empty parser and click Create.

    • Clone Existing – Select Duplicate existing, select a parser from the Duplicate Template list and click Create.

    • From Template – Select From template, browse for or drag and drop a parser and click Create.

    • From Package – Select From package and click Create.

      Clicking Create will open a code editor where you can write a script for the parser.

Writing a Parser

Once you have created your parser, you will be presented with an editor.

Parser Editor

Figure 45. Parser Editor


On the left side is the script, and on the right are the test cases you can run the script on. The script is written in the LogScale Query Language, the same language you use for searching. The main difference between writing a parser and writing a search query is that you cannot use aggregate functions like groupBy().

Validation checks are applied to the parser code. For example, arrays must be contiguous and have a 0th index, and fields prefixed with # must be configured as tagged fields (to avoid falsely tagged fields).
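
As a rough sketch of the array rule (the field names here are hypothetical), the first pattern below passes validation because the array indices start at 0 and are contiguous, while the commented-out pattern would be flagged:

logscale
// Passes validation: array indices start at 0 and are contiguous
ips[0] := "10.0.0.1"
| ips[1] := "10.0.0.2"
// Would be flagged: index 0 is missing and the indices are not contiguous
// ips[1] := "10.0.0.1"
// ips[3] := "10.0.0.2"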

Input for the Parser

The parser is run once per ingested event. The parser script refers to data on the event in much the same way a search does, since the ingested data is available as fields on the event.

The main text of the ingested event is present in the field @rawstring. Many functions used for parsing default to @rawstring if no field is specified, so a parser can often process the incoming data without ever referring to @rawstring explicitly in the script.
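
For example, the two calls below behave the same way, because parseJson() falls back to @rawstring when no field is named (a minimal sketch, not a complete parser):

logscale
// Implicit: parseJson() reads @rawstring by default
parseJson()
// Explicit: the same call, naming the field
// parseJson(field=@rawstring)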

Other fields may also be present, depending on how logs are sent to LogScale. For example, Falcon Log Collector adds a few fields, such as @collect.timestamp, which are available to the parser. In other words, an input event for a parser is really a collection of key-value pairs.

The main key is @rawstring, but others can be present from the beginning as well, and the parser can use those as it would do with any other fields.

The contents of @rawstring can also be any kind of text value. It's common to see e.g. JSON objects or single log lines, but @rawstring doesn't require any specific format, and you can send whatever data you like.

Starting with Sample Test Data

Writing a good parser starts with knowing what your logs look like, so begin by gathering some sample log events. Samples can be taken from log files, for example, or, if you are already ingesting data, from the contents of the @rawstring fields. Because @rawstring can be changed during parsing, and different methods of sending logs may produce different data formats, verify that your samples are representative as soon as you can start sending real data.

Given some samples, you can add them as tests for your parser. Each test case represents an ingested event, with the contents of the test case being available in the @rawstring field during parsing. This means that any fields other than @rawstring are not available for testing, and will have to be verified on live data.

Test Case for a Parser

Figure 46. Test Case for a Parser


Once you have added some tests to your parser, click the Run tests button to see the output that your parser produced for each of them. See Figure 47, “Test Case Output after Parsing” for an example.

Test Case Output after Parsing

Figure 47. Test Case Output after Parsing


Assertions

In addition to seeing the concrete output the parser produced, you can also add Assertions on the output. This means that you can specify that you expect certain fields to contain certain values, and the tests will fail if that does not hold.

This is especially useful if you need to apply regular expressions, as they can be tricky to get right the first time, but also tricky to change over time without introducing errors. See Figure 48, “Assertions” for an example.

Assertions

Figure 48. Assertions


Writing a Parser Script

With sufficient samples in hand, you can start working on the parser script. A good script should:

  • Extract the correct timestamp from the event

  • Set the fields you want to use in your searches

You can find examples of parser scripts further down.

Setting the correct timestamp is important, as LogScale relies on that to find the right results when you search in a given time interval. You do this by assigning the timestamp to the @timestamp field, in the form of a UNIX timestamp. Using parseTimestamp() will help you a lot here.

On the other hand, setting the fields you want to search for can be considered optional, though we recommend doing it. Fields can also be extracted at search time, so the parser does not need to meticulously set every field you might want to use. However, searching on fields that the parser has already set is generally both easier to write and faster to run.
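
To illustrate the difference (using a hypothetical id field), a field the parser already set can be searched directly, while a field it did not set has to be extracted in the query itself:

logscale
// If the parser set the field, the search is simply:
//   id = 123
// Without it, the search must first extract the field on the fly:
@rawstring = /id=(?<id>\d+)/
| id = 123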

Additionally, it's important to consider which fields you want your parser to set, and whether you want to normalize them.

Normalization

Every incoming event generally has a notion of fields in its data, and extracting those fields into actual LogScale fields can range from being easy to complicated. The fields might be explicitly named (as JSON for example), or unnamed but still structured (CSV data for example), in which case extracting them is fairly easy. Or the fields might not look like fields at all, and have little structure or naming in the first place. The latter can happen for events which are written for human consumption. For example, consider a log message like User jane logged in. Here a parser author has to decide whether the user name should be a field or not, and if it should, then the parser has to be very precise about how it extracts that name.
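
For the log line above, a parser that decides the user name should be a field might extract it with a regular expression like this sketch (the user_name field name is only an example):

logscale
// Sketch: extract the user name from lines like "User jane logged in"
/^User (?<user_name>\S+) logged in$/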

This means that parsing different types of incoming events can mean those events look very different from each other when stored in LogScale. That is, even if three different log sources contain similar information, like a source IP address, they might represent it differently. Each type might name the field containing the address differently (sourceIp, sip, source_ip, etc.), and one type may append the port number to the IP for example. All of this makes it hard to search across different types of logs, because you have to know the names and peculiarities of each type to search correctly.

Solving this problem requires data normalization, which LogScale can apply at different levels. If you wish to normalize your data, we recommend one of two approaches: doing so in the parser, or using Field Aliasing to rename fields after ingestion.

An important aspect of normalization is to choose what to normalize to. In this case, we recommend the CrowdStrike Parsing Standard (CPS) 1.0 as the standard to adhere to. This is also the standard which our parsers in the LogScale Package Marketplace use, so you can look to them for good examples.
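
If you normalize in the parser, the script can copy a vendor-specific field into the normalized name. The following is only a sketch with example field names, not a full CPS mapping:

logscale
// Sketch: rename a vendor-specific field to a normalized name in the parser
rename(field=sourceIp, as="source.ip")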

To make it easier to write parsers that need to modify a lot of fields, the editor for the parser script has completions for field names. The suggested field names are taken from the test cases of the parser, so any fields output for a test case are available for autocompletion:

Autocompletion in Parser Script

Figure 49. Autocompletion in Parser Script


Example: Parsing Log Lines

Assume we have a system producing logs like the following two lines:

ini
2018-10-15T12:51:40+00:00 [INFO] This is an example log entry. id=123 fruit=banana
2018-10-15T12:52:42+01:30 [ERROR] Here is an error log entry. class=c.o.StringUtil fruit=pineapple

We want the parser to produce two events (one per line), use the timestamp of each line as the time at which the event occurred (that is, assign it to the field @timestamp), and then extract the "fields" which exist in the logs into actual LogScale fields.

To do this, we will write a parser, and we'll start by setting the correct timestamp. To extract the timestamp, we need to write a regular expression like the following:

logscale
@rawstring = /^(?<temp_timestamp>\S+)/ 
| parseTimestamp("yyyy-MM-dd'T'HH:mm:ss[.SSS]XXX", field=temp_timestamp)
| drop(temp_timestamp)

This creates a field named temp_timestamp using a "named group" in the regular expression, which contains every character from the original event up until the first space, i.e. the original timestamp. The regular expression reads from the @rawstring field, but it doesn't modify it; it only copies information out.

With the timestamp extracted into a field of its own, we can call parseTimestamp() on it, specifying the format of the original timestamp, and it will convert that to a UNIX timestamp and assign it to @timestamp for us. With @timestamp now set up, we can drop temp_timestamp again, as we have no further need for it.

In addition to the timestamp, the logs contain more information. Looking at the message

ini
2018-10-15T12:51:40+00:00 [INFO] This is an example log entry. id=123 fruit=banana

We can see:

  • The log level INFO

  • The message This is an example log entry

  • The id 123

  • The fruit banana

To extract all of this, we can expand our regular expression to something like:

logscale
/^(?<temp_timestamp>\S+) \[(?<logLevel>\w+)\] (?<message>.*?)\. (?<temp_kvPairs>.*)/

The events will now have additional fields called logLevel (with value INFO) and message (with value This is an example log entry), which we can use as is. The event also has a temp_kvPairs field, containing the additional fields present after the message, i.e. id=123 fruit=banana. We still need to extract these fields from temp_kvPairs; we can use the kvParse() function for that, and drop temp_kvPairs once we are finished.

As a result, our final parser will look like this:

logscale
@rawstring = /^(?<temp_timestamp>\S+) \[(?<logLevel>\w+)\] (?<message>.*?)\. (?<temp_kvPairs>.*)/
| parseTimestamp("yyyy-MM-dd'T'HH:mm:ss[.SSS]XXX", field=temp_timestamp)
| drop(temp_timestamp)
| kvParse(temp_kvPairs)
| drop(temp_kvPairs)

Example: Parsing JSON

We've seen how to create a parser for unstructured log lines. Now let's create a parser for JSON logs based on the following example input:

javascript
{
  "ts": 1539602562000,
  "message": "An error occurred.",
  "host": "webserver-1"
}
{
  "ts": 1539602572100,
  "message": "User logged in.",
  "username": "sleepy",
  "host": "webserver-1"
}

Each object is a separate event and will be parsed separately, as with unstructured logs.

The JSON is accessible as a string in the field @rawstring. We can extract fields from the JSON by using the parseJson() function. It takes a field containing a JSON string (in this case @rawstring) and extracts fields automatically, like this:

logscale
parseJson(field=@rawstring) 
| @timestamp := ts

This will result in events with a field for each property in the input JSON, like username and host, and will use the value of ts as the timestamp. As ts already has a timestamp in the UNIX format, we don't need to call parseTimestamp() on it.
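
If the timestamp had instead been a formatted string, say an ISO 8601 value, the parser would need a parseTimestamp() call as in the earlier example; a sketch of that variant (keeping the ts field name) might look like this:

logscale
// Variation sketch: if ts held an ISO 8601 string instead of epoch milliseconds
parseJson(field=@rawstring)
| parseTimestamp("yyyy-MM-dd'T'HH:mm:ss[.SSS]XXX", field=ts)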

Next Steps

Once you have created your parser script, you can start using it by assigning it to Ingest Tokens.

You can also learn how parsers can help speed up queries with Event Tags.