How-To: Using Tag Grouping

LogScale stores data in physical partitions called Data Sources. Parsers can be configured to assign events to a particular data source based on specific fields; this is called tagging. Tag fields in events start with the # character and can improve search performance.

When tags are used to store and organize the information, LogScale creates different segments according to each tag value. Because LogScale creates a new datasource for each combination of tag values, and the number of data sources is limited, a tag field with a high number of unique values (that is, a high-cardinality field) can create a large number of datasources. This may reduce search performance, since LogScale needs to store and process multiple files.
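The relationship between tag values and datasources can be sketched in a few lines of Python. This is only an illustration of the counting rule described above (one datasource per distinct combination of tag values); the events and the `#host` tag are hypothetical and do not come from a real repository.

```python
# Hypothetical events: each dict holds only the tag fields of one event.
events = [
    {"#host": "web-1", "#responsecode": "200"},
    {"#host": "web-1", "#responsecode": "404"},
    {"#host": "web-2", "#responsecode": "200"},
    {"#host": "web-2", "#responsecode": "200"},  # same combination as the previous event
    {"#host": "web-2", "#responsecode": "503"},
]


def count_datasources(events):
    """Count distinct combinations of tag values (one datasource per combination)."""
    combinations = {tuple(sorted(event.items())) for event in events}
    return len(combinations)


print(count_datasources(events))  # 4
```

Adding another tag field with many unique values multiplies the number of possible combinations, which is why high-cardinality tags are costly.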

For example, when processing web logs, the HTTP response code could be used as a tag value, but the number of unique values for these codes is over 60.

Even if only 8 of these values occur, 8 different datasources would be used:

graph LR;
  TV[#responsecode]
  DS1[Datasource 1]
  DS2[Datasource 2]
  DS3[Datasource 3]
  DS4[Datasource 4]
  DS5[Datasource 5]
  DS6[Datasource 6]
  DS7[Datasource 7]
  DS8[Datasource 8]
  TV --200--> DS1
  TV --201--> DS2
  TV --301--> DS3
  TV --404--> DS4
  TV --405--> DS5
  TV --407--> DS6
  TV --500--> DS7
  TV --503--> DS8

Each datasource here holds the events for a single tag value, and some may therefore contain only a few events, because some codes (404, for example) may occur less often than others.

To optimize performance and storage of individual segments the number of tag values can be adjusted by creating a tag group. This applies a consistent hashing algorithm to the tag value so that the number of data sources created is controlled.

The tag group defines the number of allowed unique values across the tag values. It is not possible to define the groups and the values assigned to each; instead, the tag group defines the number of unique values allowed and LogScale generates the hash for each tag value. The hashing algorithm is consistent, so the same tag value is always mapped to the same hashed value, and therefore data with the same tag value is stored within the same data source.

By defining a tag group and limiting the number of datasources to three, LogScale will automatically organize the data, consistently, into three datasources:

graph LR;
  TV[#responsecode]
  TG[Tag Group Hashing]
  DS1[Datasource 1]
  DS2[Datasource 2]
  DS3[Datasource 3]
  TV --200--> TG
  TV --201--> TG
  TV --301--> TG
  TV --404--> TG
  TV --405--> TG
  TV --407--> TG
  TV --500--> TG
  TV --503--> TG
  TG --> DS1
  TG --> DS2
  TG --> DS3
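The grouping in the diagram can be approximated with a short Python sketch. The hash used here (`zlib.crc32`) and the helper name `tag_group` are assumptions made for illustration; the hash function LogScale uses internally is not stated here, so the actual assignment of values to datasources will differ.

```python
import zlib


def tag_group(value: str, modulus: int) -> int:
    """Map a tag value to a group number in range(modulus).

    zlib.crc32 is a stand-in hash, chosen because it gives the same result
    on every run; it is not claimed to be the hash LogScale uses.
    """
    return zlib.crc32(value.encode("utf-8")) % modulus


response_codes = ["200", "201", "301", "404", "405", "407", "500", "503"]

# Collect the tag values that land in each of the three groups.
groups = {}
for code in response_codes:
    groups.setdefault(tag_group(code, 3), []).append(code)

for group, codes in sorted(groups.items()):
    print(f"group {group}: {codes}")
```

Because the hash is deterministic, a given response code always lands in the same group, which is the property that lets data with the same tag value stay in the same datasource.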

When searching, the value being searched for (#tag="value") is processed using the same consistent hash algorithm, and the search is performed against the corresponding hashed value. This search optimization happens automatically in the background; the user only needs to search using the corresponding tag field with the #tagname notation.

Fewer datasources can improve the performance of ingestion, because fewer mini-segments and segment files are created, and also improve search performance by reducing the number of segments to be searched.

Tag grouping can be particularly effective if you have a field with a fixed range of distinct values that would be too large to make effective use of tagging alone. For example, suppose you have a field named EventCode that you would like to use as a tag, but it has more than 20,000 distinct values. Simply making it a tag would create many datasources, probably too many to yield an effective performance improvement. Instead, enabling tag grouping on the field with a modulus of 15 would ensure that only 15 datasources are created, each covering the subset of the 20,000 values that hash into it.
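To get a feel for the numbers in this example, the following Python sketch spreads 20,000 hypothetical EventCode values over 15 groups. The CRC32 hash and the value range "0" to "19999" are assumptions for illustration only, so the per-group counts will not match a real LogScale deployment.

```python
import zlib
from collections import Counter

MODULUS = 15  # number of groups, as in the example above


def tag_group(value: str, modulus: int = MODULUS) -> int:
    # Stand-in hash (CRC32); not LogScale's actual algorithm.
    return zlib.crc32(value.encode("utf-8")) % modulus


# 20,000 hypothetical EventCode values: "0" .. "19999".
event_codes = [str(n) for n in range(20_000)]
values_per_group = Counter(tag_group(code) for code in event_codes)

print(len(values_per_group))           # number of groups in use, never more than 15
print(sum(values_per_group.values()))  # 20000: every value belongs to exactly one group
```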

When tag grouping is enabled and you search for:

logscale
#eventcode="117"

LogScale will read the segment files that satisfy all tag predicates, including those that include the value "117" for the tag according to the calculated modulus for the hashed tag group. Hashed tag predicates work the same as tags, but use the hashed value rather than the input from the user directly. That means the segment files can have values other than the exact one from the input predicate, as multiple values hash to the same integer within the calculated modulus.
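The two-step behaviour described above (select by hashed value, then filter by the real value) can be sketched as follows. The storage layout, the function names and the CRC32 hash are all hypothetical simplifications and do not describe how LogScale stores segments.

```python
import zlib

MODULUS = 15


def tag_group(value: str) -> int:
    # Stand-in hash (CRC32); not LogScale's actual algorithm.
    return zlib.crc32(value.encode("utf-8")) % MODULUS


def store(events):
    """Place each event in the datasource for its hashed eventcode value."""
    datasources = {}
    for event in events:
        datasources.setdefault(tag_group(event["eventcode"]), []).append(event)
    return datasources


def search_exact(datasources, value):
    """Read only the datasource the value hashes to, then filter its events."""
    candidates = datasources.get(tag_group(value), [])
    return [event for event in candidates if event["eventcode"] == value]


events = [{"eventcode": str(n), "message": f"event {n}"} for n in range(1000)]
datasources = store(events)

print(search_exact(datasources, "117"))
# [{'eventcode': '117', 'message': 'event 117'}]
```

The datasource that "117" hashes to also holds events with other event codes, which is why the final filter on the exact value is still needed.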

Using tag groups will not speed up the execution of queries where a wildcard is used. For example, the query:

logscale
#eventcode="117*"

will load all segments, because any of them may contain a matching value.

For more information on setting tag grouping, see Setup Grouping of Tags.

How Tag Groups Affect Query Performance

The use of tags, and tag groups, affects the performance of running different queries because the tags and groups enable LogScale to make informed decisions about which segment files need to be used to return the selected data.

In general, the query performance is affected through a combination of the filter type and the existence of tags, or tag groups.

When performing an exact filter match, LogScale can usually identify the bucket or segment required to return the data. For example, each of the following lines will be efficient:

logscale Syntax
#field = "foobar"
#field = "foo" OR #field = "bar"

These are executed as follows:

  • Without tag grouping

    The exact datasource can be identified and the corresponding events returned.

  • With tag grouping

    Tag grouping allows for the segments or buckets to be identified, but each event within the bucket will need to be checked to determine whether the event applies. This is because tag grouping combines multiple values of fields into a single bucket.

Other options for efficient matching include functions that rely on a fixed value, such as in().

When using a regular expression, match() or negation, the exact match cannot be used, so every bucket or segment (according to the query's time span) must be checked individually. This is because it is impossible to determine whether the supplied regular expression resolves to an exact match, so each field in each event must be evaluated to determine if it applies.

Query | Syntax | Without Tag Groups | With Tag Groups
------|--------|--------------------|----------------
Exact match filter | #field = "value" | Efficient datasource identification | Efficient datasource identification, but must check each event
Regex | #field = /value/ | Searches every bucket/event for given time range | Searches every bucket/event for given time range
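The difference between the two rows can be sketched as a small planning function that decides which hashed groups have to be read for a tag predicate. The predicate representation, the function names and the CRC32 hash are assumptions for illustration; this is not LogScale's query planner.

```python
import zlib

MODULUS = 3
ALL_GROUPS = set(range(MODULUS))


def tag_group(value: str) -> int:
    # Stand-in hash (CRC32); not LogScale's actual algorithm.
    return zlib.crc32(value.encode("utf-8")) % MODULUS


def groups_to_scan(predicate):
    """Return the hashed groups that must be read for a tag predicate.

    predicate is a hypothetical (kind, argument) pair:
      ("exact", "404")          like  #field = "404"
      ("in", ["404", "500"])    like  in() with fixed values
      ("regex", "^4")           like  #field = /^4/
    """
    kind, argument = predicate
    if kind == "exact":
        return {tag_group(argument)}
    if kind == "in":
        return {tag_group(value) for value in argument}
    # Regex, wildcard and negation cannot be hashed, so nothing can be skipped.
    return set(ALL_GROUPS)


print(len(groups_to_scan(("exact", "404"))))       # 1
print(groups_to_scan(("regex", "^4")) == ALL_GROUPS)  # True
```

In both the exact and the in() case, the events inside the selected groups still have to be checked against the real value, as described for tag grouping above.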

Tag grouping and Auto-sharding

Tag grouping and auto-sharding are different forms of manipulating the number of shards for a given datasource:

  • Autoshards increase the number of datasources for a given combination of tag values by artificially adding a tag field with a seemingly random number.

  • Tag grouping reduces the number of datasources by not storing the values in the tag field, instead storing the data in fields and using a hash for the actual tag field. Searches are automatically rewritten to handle this properly, and are efficient only for exact matches on the value.

The two systems have been designed so that they interact in a way that is efficient. Automatically generated shards are not grouped, but tag groups can be automatically sharded if this improves performance.

For more information on automatic sharding, see Configure Sticky Auto-Sharding for High-Volume Data Sources.