Best Practice: Optimizing string and regular expression (regex) search performance

LogScale supports free term searching (FTS) of events, as well as regular-expression-based searching. Both are supported against the original imported event (as @rawstring) and against specific fields. LogScale primarily searches the fields that data was parsed into during import, rather than relying on querying the original event strings.

This differs from other log event systems in that events can be loaded that consist only of field/value pairs, with no raw event log line. The effect is to improve search performance by enabling searches to occur only within specific fields. However, there is a performance cost when a search is performed against the original raw data, and this cost is felt when performing string or regex searches using the language syntax, or when performing a regex() function search.

Consider the following very simple searches and their corresponding "Work" values, which correlate with the time taken for each search. The data searched was Falcon Telemetry over a period of 1 year, against a total database size of 2.8 TB uncompressed. Falcon Telemetry data is sent in JSON format, which means that just about every element of the raw data is extracted into individual fields in LogScale. This "field density" has substantial implications for performance when using various search constructs:

Search                    Work
mimikatz                  4.1k
*mimikatz*                4.1k
@rawstring=*mimikatz*     3.4k
/mimikatz/                43.7k
@rawstring=/mimikatz/     8.0k

The different searches operate as follows:

  • The first search is the only true full-text search, as there are no syntactical elements to declare a specific field to be searched. It checks all fields for the term mimikatz and matches substrings (that is, the search has an implied glob operation, so it matches wherever the string occurs). All field values are searched, including those derived from parsing operations.

  • The second search is equivalent to the first, but introduces language elements in the form of the wildcard "glob" characters (*). It makes the implied glob of the first search explicit, and likewise includes matches from all fields.

  • The third search limits the search to the raw event field (@rawstring), and therefore excludes matches in any additional (derived or sent) metadata fields. For this third search, the glob characters are required, and are not implied as in the first search. If you omit them, the search matches only when the field value is exactly mimikatz, and does not match mimikatz as a substring of a longer value. The Work value is about 17% lower (3.4k compared with 4.1k) because not all fields are being searched. For most users, this 17% improvement is not worth giving up the ability to quickly type simple term searches into the search bar, and indeed "true FTS" will have minimal detrimental effect for most use cases.

  • In the fourth search, the slash (/) characters introduce additional language syntax in the form of regex delimiters. The fourth search is similar in spirit to the first one, and is hence referred to as a "Free Regex" search. Though more flexible than simple glob patterns, regex processing without a field declaration searches all parsed fields including the original @rawstring, and is a far more complex operation for any search engine to perform; the extra overhead required to search all fields can be considerable. This can be seen in the Work value of the fourth search (43.7k), which is roughly 5.5 times that of the fifth, @rawstring-specific search (8.0k). Therefore, for "field-rich" sources such as Falcon Telemetry, it is essential to consider which fields you wish to include in the search when using regex, or better yet to revert to glob patterns if those suit your needs.

  • In the fifth search, the regex query is performed only against the @rawstring field, and not against any of the parsed or declared fields. This is slower than the equivalent glob search on @rawstring (8.0k compared with 3.4k), but much faster than a regex search of every field.
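
The percentages quoted above follow directly from the Work values in the table. As a quick check, the arithmetic can be sketched in Python; the names work, saving, and ratio are illustrative only and are not part of LogScale, and the sketch assumes, as stated above, that Work correlates with search time.

```python
# Work values copied from the table above, in thousands of work units.
work = {
    "mimikatz": 4.1,
    "*mimikatz*": 4.1,
    "@rawstring=*mimikatz*": 3.4,
    "/mimikatz/": 43.7,
    "@rawstring=/mimikatz/": 8.0,
}


def saving(baseline, candidate):
    """Fractional reduction in Work when moving from baseline to candidate."""
    return 1 - work[candidate] / work[baseline]


def ratio(slow, fast):
    """How many times more Work the slow search needs than the fast one."""
    return work[slow] / work[fast]


# Restricting the glob search to @rawstring (third search vs. second).
glob_saving = saving("*mimikatz*", "@rawstring=*mimikatz*")

# Free Regex compared with the @rawstring-restricted regex (fourth vs. fifth).
regex_ratio = ratio("/mimikatz/", "@rawstring=/mimikatz/")

# Free Regex compared with the plain free term search (fourth vs. first).
regex_vs_term = ratio("/mimikatz/", "mimikatz")

print(f"glob restricted to @rawstring saves {glob_saving:.0%}")
print(f"free regex costs {regex_ratio:.1f}x the @rawstring regex")
print(f"free regex costs {regex_vs_term:.1f}x the free term search")
```

The last figure is not quoted in the text, but it is derived from the same table: on this dataset, the Free Regex search needs more than ten times the Work of the equivalent free term search.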

From this, there are some best practices for Free Term and Free Regex searches:

  • Use Free Term search only for simple full strings or substrings. Performance impact will be minimal, even over long timeframes.

  • Always use glob (wildcard) patterns rather than regex if that will meet your needs. There is far less overhead, even compared with simple regex patterns such as those in the searches above. In other words, the fourth and fifth regex-based searches are not examples of good search hygiene, because the search parameter is just a plain string that a glob pattern could match equally well.

  • If using an FTS in the collection portion of a larger, more complex, or saved search, consider using field restrictions even with simple terms (the third and fifth examples above).

  • "Free Regex" searches are in general not recommended (unless the dataset you are searching against does not contain @rawstring). Running these against field-rich datasets, such as Falcon Telemetry, will be noticeably slower than field-restricted searches.
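
To make the matching semantics behind these recommendations concrete, the following Python sketch imitates them on a single toy event. It is an illustration under stated assumptions, not LogScale's implementation: Python's fnmatch and re modules stand in for glob and regex matching, the event and its field names are invented, and the helpers glob_fields and regex_fields are hypothetical. It says nothing about performance; it only shows which fields each style of search would match.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical event: a raw line plus two fields parsed out of it.
event = {
    "@rawstring": "user=alice cmd=run-mimikatz-now",
    "cmd": "run-mimikatz-now",
    "user": "alice",
}


def glob_fields(event, pattern, field=None):
    """Names of fields whose whole value matches the glob pattern.

    With field=None every field is checked (like a free term search);
    otherwise only the named field is checked.
    """
    names = [field] if field else list(event)
    return [n for n in names if fnmatchcase(event[n], pattern)]


def regex_fields(event, pattern, field=None):
    """Names of fields in which the regex matches anywhere in the value."""
    names = [field] if field else list(event)
    return [n for n in names if re.search(pattern, event[n])]


# A plain-string regex finds the same fields as the equivalent glob.
all_glob = glob_fields(event, "*mimikatz*")
all_regex = regex_fields(event, "mimikatz")

# Field-restricted searches only ever report the named field.
raw_glob = glob_fields(event, "*mimikatz*", field="@rawstring")

# Without the wildcards, a field-restricted glob is an exact match.
raw_exact = glob_fields(event, "mimikatz", field="@rawstring")

print(all_glob, all_regex, raw_glob, raw_exact)
```

In this sketch the regex adds nothing over the glob when the pattern is a plain string, which is the situation the second recommendation above describes.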