How-To: Using Tag Grouping

LogScale stores data in physical partitions called Data Sources. Parsers can be configured to assign events to a particular data source based on specific fields; this is called tagging. Tag fields in events start with the # character and can improve search performance.

When tags are used to store and organize the information, LogScale creates different segments according to each tag value. LogScale creates a new data source for each combination of tag values, and the number of data sources is limited. A tag field with a high number of unique values (that is, a high-cardinality field) can therefore create a large number of data sources, which may reduce search performance because LogScale needs to store and process more files.
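The relationship between tag values and data sources can be sketched in a few lines of Python. This is only an illustration of the counting rule described above (one data source per distinct combination of tag values); the `count_data_sources` helper, the `#host` tag, and the sample events are hypothetical and not part of LogScale.

```python
def count_data_sources(events, tag_fields):
    """Count distinct combinations of tag values (one data source per combination)."""
    combinations = {tuple(event.get(field) for field in tag_fields) for event in events}
    return len(combinations)


# Hypothetical events carrying two tag fields.
events = [
    {"#host": "web1", "#responsecode": "200"},
    {"#host": "web1", "#responsecode": "404"},
    {"#host": "web2", "#responsecode": "200"},
    {"#host": "web2", "#responsecode": "200"},  # same combination as the previous event
]

print(count_data_sources(events, ["#host", "#responsecode"]))  # prints 3
```

Every additional distinct value in any tag field multiplies the number of possible combinations, which is why a single high-cardinality tag can quickly exhaust the available data sources.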

For example, when processing web logs, using the HTTP response code as a tag could create over 60 data sources, because there are more than 60 unique values for these codes.

To keep the illustration small, with only 8 unique HTTP response codes in use, 8 different data sources would be created:

graph LR;
  TV[#responsecode]
  DS1[Datasource 1]
  DS2[Datasource 2]
  DS3[Datasource 3]
  DS4[Datasource 4]
  DS5[Datasource 5]
  DS6[Datasource 6]
  DS7[Datasource 7]
  DS8[Datasource 8]
  TV --200--> DS1
  TV --201--> DS2
  TV --301--> DS3
  TV --404--> DS4
  TV --405--> DS5
  TV --407--> DS6
  TV --500--> DS7
  TV --503--> DS8

Each data source here holds events for only a single tag value and may therefore contain only a few events, because some codes (404, for example) may occur less often than others.

To optimize performance and storage of individual segments, the number of tag values can be adjusted by creating a tag group. This applies a consistent hashing algorithm to the tag value so that the number of data sources created is controlled.

The tag group defines the number of unique values allowed across the tag values. It is not possible to define the groups and which values belong to them; instead, the tag group defines only the number of unique values allowed, and LogScale generates the hash for each tag value. The hashing algorithm is consistent, so the same tag value always maps to the same hashed value, and data with the same tag value is therefore stored within the same data source.

By defining a tag group and limiting the number of datasources to three, LogScale will automatically organize the data, consistently, into three datasources:

graph LR;
  TV[#responsecode]
  TG[Tag Group Hashing]
  DS1[Datasource 1]
  DS2[Datasource 2]
  DS3[Datasource 3]
  TV --200--> TG
  TV --201--> TG
  TV --301--> TG
  TV --404--> TG
  TV --405--> TG
  TV --407--> TG
  TV --500--> TG
  TV --503--> TG
  TG --> DS1
  TG --> DS2
  TG --> DS3
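As a rough sketch of the idea, the grouping can be modeled as a hash of the tag value reduced modulo the number of groups. The hash function LogScale uses internally is not stated here, so `zlib.crc32` is used purely as a stand-in, and the `tag_group_bucket` helper is a hypothetical name chosen for illustration.

```python
import zlib


def tag_group_bucket(tag_value: str, group_count: int) -> int:
    """Map a tag value to one of group_count buckets.

    zlib.crc32 is an assumed stand-in for LogScale's internal hash.
    """
    return zlib.crc32(tag_value.encode("utf-8")) % group_count


codes = ["200", "201", "301", "404", "405", "407", "500", "503"]
assignment = {code: tag_group_bucket(code, 3) for code in codes}

for code, bucket in assignment.items():
    print(f"#responsecode={code} -> data source {bucket + 1}")
```

Because the function is deterministic, a given response code always lands in the same bucket, and no more than three buckets can ever be produced regardless of how many distinct codes arrive.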

When searching, the value being searched for (#tag="value") is processed using the same consistent hash algorithm, and LogScale searches for the corresponding hashed value. This search optimization happens automatically in the background; the user only needs to search using the corresponding tag field #tagname notation.

Fewer datasources can improve the performance of ingestion, because fewer mini-segments and segment files are created, and also improve search performance by reducing the number of segments to be searched.

Tag grouping can be particularly effective if you have a field with a fixed range of distinct values that would be too large to make effective use of tagging alone. For example, suppose you have a field named EventCode that you would like to use as a tag, but it has more than 20,000 distinct values. Simply making it a tag would create many data sources, probably too many to yield an effective performance improvement. Instead, enabling tag grouping on the field with a modulus of 15 would ensure that only 15 data sources are created, each covering the subset of the 20,000 values that hash into it.
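The effect of the modulus on a large value set can be checked with a short simulation. This is a sketch under stated assumptions: the 20,000 EventCode values are generated as consecutive integers, and `zlib.crc32` stands in for whatever hash LogScale applies internally, so the exact spread across data sources will differ in practice.

```python
import zlib
from collections import Counter

MODULUS = 15  # number of tag groups, matching the example above


def bucket_for(event_code: str, modulus: int = MODULUS) -> int:
    # crc32 is an illustrative stand-in for LogScale's internal hash.
    return zlib.crc32(event_code.encode("utf-8")) % modulus


# Hypothetical EventCode values: 20,000 distinct strings.
event_codes = [str(n) for n in range(20_000)]
sizes = Counter(bucket_for(code) for code in event_codes)

print(f"data sources used: {len(sizes)}")
print(f"distinct values per data source: min={min(sizes.values())}, max={max(sizes.values())}")
```

The number of data sources is bounded by the modulus, while each data source ends up holding many distinct EventCode values.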

When tag grouping is enabled, consider a search for:

logscale
#eventcode="117"

For this search, LogScale will read the segment files that satisfy all tag predicates, including those that may contain the value "117" for the tag according to the calculated modulus for the hashed tag group. Hashed tag predicates work the same as tags, but use the hashed value rather than the input from the user directly. That means the segment files can contain values other than the exact one from the input predicate, because multiple values hash to the same integer within the calculated modulus.

Using tag groups will not speed up the execution of queries where a wildcard is used. For example, the query:

logscale
#eventcode="117*"

will load all segments in order to find potentially matching values.
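The difference between exact-match and wildcard searches can be illustrated with a small model. This is a simplified sketch, not LogScale's implementation: `data_sources`, `bucket_for`, and `search` are hypothetical names, `zlib.crc32` is an assumed stand-in hash, and shell-style matching from `fnmatch` approximates the wildcard.

```python
import zlib
from collections import defaultdict
from fnmatch import fnmatchcase

MODULUS = 15


def bucket_for(value: str) -> int:
    # Assumed stand-in hash; LogScale's real hash is not modeled here.
    return zlib.crc32(value.encode("utf-8")) % MODULUS


# Store hypothetical events in the bucket their eventcode hashes to.
data_sources = defaultdict(list)
for code in (str(n) for n in range(2000)):
    data_sources[bucket_for(code)].append({"eventcode": code})


def search(pattern: str):
    """Return (matching events, number of data sources read)."""
    if "*" in pattern:
        candidates = list(data_sources)     # wildcard: the hash cannot be computed, read everything
    else:
        candidates = [bucket_for(pattern)]  # exact match: only one bucket can hold the value
    hits = [
        event
        for bucket in candidates
        for event in data_sources.get(bucket, [])
        if fnmatchcase(event["eventcode"], pattern)  # buckets also hold other values, so filter
    ]
    return hits, len(candidates)


exact_hits, exact_read = search("117")
wild_hits, wild_read = search("117*")
print(f"exact:    {len(exact_hits)} hit(s) from {exact_read} data source(s)")
print(f"wildcard: {len(wild_hits)} hit(s) from {wild_read} data source(s)")
```

In this model the exact search reads a single data source and then discards the other values that happen to share its bucket, whereas the wildcard search has to read every data source.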

For more information on setting tag grouping, see Setup Grouping of Tags.

Tag grouping and Auto-sharding

Tag grouping and auto-sharding are two different ways of manipulating the number of shards for a given data source:

  • Auto-sharding increases the number of data sources for a given combination of tag values by artificially adding a tag field with a seemingly random number.

  • Tag grouping reduces the number of data sources by not storing values in the tag field, instead storing the data in fields and using a hash for the actual tag field. Searches are automatically rewritten to handle this properly, and are efficient only for exact matches on the value.

The two systems have been designed so that they interact in a way that is efficient. Automatically generated shards are not grouped, but tag groups can be automatically sharded if this improves performance.

For more information on automatic sharding, see Configure Sticky Auto-Sharding for High-Volume Data Sources.