How-To: Using Tag Grouping

LogScale stores data in physical partitions called Data Sources. Parsers can be configured to assign events to a particular data source based on specific fields; this is called tagging. Tag fields in events start with the # character and can improve search performance.

Tag grouping hashes the value of a tag field and then reduces the hash with a modulus to a fixed number of integer values, which are stored with the event. When a search uses an exact match on the tag (#tag="value"), the search engine uses this mapping and searches for the corresponding integer instead. This optimization happens automatically in the background; the user only needs to search using the usual #tagname notation.

Tag grouping can be particularly effective for a field whose set of distinct values is bounded but too large for tagging alone to work well. For example, suppose you would like to use a field named EventCode as a tag, but it has more than 20,000 distinct values. Simply making it a tag would create many datasources, probably too many to give an effective performance improvement. Instead, enabling tag grouping on the field with a modulus of 15 ensures that only 15 datasources are created, each covering the subset of the 20,000 values that hash into it.
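The hash-and-modulus step can be sketched in a few lines of Python. This is only an illustration: zlib.crc32 is a stand-in hash chosen for convenience (the hash LogScale uses internally is not described here), and the function name tag_group and the generated EventCode values are hypothetical.

```python
import zlib
from collections import Counter

MODULUS = 15  # number of tag groups, matching the example in the text


def tag_group(value: str, modulus: int = MODULUS) -> int:
    """Map a tag value to one of `modulus` integer groups.

    zlib.crc32 is only a stand-in hash for illustration; it is an
    assumption, not the hash LogScale actually uses.
    """
    return zlib.crc32(value.encode("utf-8")) % modulus


# Hypothetical data: 20,000 distinct EventCode values.
event_codes = [str(code) for code in range(20_000)]

# Count how many distinct values fall into each group.
groups = Counter(tag_group(code) for code in event_codes)

print(f"distinct values: {len(event_codes)}")
print(f"distinct groups: {len(groups)}")  # never more than MODULUS
```

However many distinct values the field has, the number of groups is bounded by the modulus, which is what keeps the number of datasources small.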

When tag grouping is enabled and you search for:

logscale
#eventcode="117"

LogScale will read the segment files that satisfy all tag predicates, including the files whose tag group contains the value "117", as determined by the hash of the value reduced by the modulus. Hashed tag predicates work the same way as ordinary tag predicates, but compare the hashed value rather than the user's input directly. This means the selected segment files can contain values other than the one in the predicate, because multiple values hash to the same integer under the modulus.
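The two-step behavior described above (select by hashed group, then compare the real value) might look roughly like the following. The in-memory datasources dictionary, the search_exact helper, and the CRC32 hash are all assumptions made for the sketch; they do not reflect how LogScale stores or reads segment files.

```python
import zlib
from collections import defaultdict

MODULUS = 15


def tag_group(value: str, modulus: int = MODULUS) -> int:
    # Stand-in hash (CRC32); an assumption, not LogScale's actual hash.
    return zlib.crc32(value.encode("utf-8")) % modulus


# Hypothetical "datasources": one list of events per tag group.
datasources = defaultdict(list)
for code in range(1000):
    event = {"#eventcode": str(code), "message": f"event {code}"}
    datasources[tag_group(event["#eventcode"])].append(event)


def search_exact(wanted: str) -> list:
    # Step 1: hash the predicate value to pick the single group to read.
    candidates = datasources[tag_group(wanted)]
    # Step 2: the group also holds other values with the same hash
    # remainder, so the exact comparison is still applied per event.
    return [e for e in candidates if e["#eventcode"] == wanted]


hits = search_exact("117")
scanned = len(datasources[tag_group("117")])
print(f"scanned {scanned} of 1000 events, matched {len(hits)}")
```

Only one group is read, but every event in it still has to be checked, since other event codes share the same group.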

Using tag groups will not speed up the execution of queries where a wildcard is used. For example, the query:

logscale
#eventcode="117*"

will load all segments in order to match the potential values.
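One way to see why a wildcard cannot use the grouping: a hash generally does not preserve prefixes, so values that share the prefix 117 can land in unrelated groups, and the pattern alone does not say which groups to read. The sketch below uses CRC32 as a stand-in hash and a hypothetical list of stored values; both are assumptions for illustration.

```python
import zlib

MODULUS = 15


def tag_group(value: str, modulus: int = MODULUS) -> int:
    # Stand-in hash (CRC32); an assumption, not LogScale's actual hash.
    return zlib.crc32(value.encode("utf-8")) % modulus


# Hypothetical stored values that the pattern "117*" would match.
matching_values = ["117", "1170", "1171", "1175", "11799"]

# Each matching value is hashed independently of the shared prefix.
groups_needed = {value: tag_group(value) for value in matching_values}
for value, group in groups_needed.items():
    print(f"{value!r} -> group {group}")
```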

Getting and Setting Tag Groups

Tag groups are configured per repository through the REST API. Rules for multiple tag fields can be set in the same request.

Listing Existing Tag Groups

To obtain a list of the current tag groups, send a GET request to the /api/v1/repositories/REPO/taggrouping endpoint. For example, the following retrieves the tag groups for the accesslog repository:

shell
$ curl -X GET $YOUR_LOGSCALE_URL/api/v1/repositories/accesslog/taggrouping

This returns a JSON document with the current tag group configuration (example formatted for clarity):

json
{
  "current": 2,
  "sets": [
    {
      "id": 2,
      "rules": [
        {
          "field": "@source",
          "modulus": 16
        }
      ]
    }
  ]
}
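A script consuming this response might pick out the rules that are in effect as follows. The sketch assumes that the "current" value names the "id" of the active entry in "sets"; that reading is inferred from the example response, not from a documented schema.

```python
import json

# The example response from above, as a string.
response_body = """
{
  "current": 2,
  "sets": [
    {"id": 2, "rules": [{"field": "@source", "modulus": 16}]}
  ]
}
"""

config = json.loads(response_body)

# Assumption: "current" names the "id" of the set that is in effect.
active_sets = [s for s in config["sets"] if s["id"] == config["current"]]
active_rules = active_sets[0]["rules"] if active_sets else []

for rule in active_rules:
    print(f'{rule["field"]}: modulus {rule["modulus"]}')
```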

Updating Tag Groups

To set the tag grouping for a repository, send a POST request to the same endpoint with a JSON array containing the new tag grouping rules. For example:

shell
$ curl $YOUR_LOGSCALE_URL/api/v1/repositories/accesslog/taggrouping \
 -X POST \
 -H 'Content-Type: application/json' \
 -d '[ {"field":"@source","modulus": 16}]'

This will return the new tag group configuration for the repository (example formatted for clarity):

json
{
  "id": 1,
  "rules": [
    {
      "field": "@source",
      "modulus": 16
    }
  ]
}
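The same POST request can be prepared from a script. The following is a minimal sketch using only the Python standard library; the helper name build_taggrouping_request, the placeholder host, and the optional bearer-token header are assumptions (the curl example above shows no authentication header). The request is built but not sent.

```python
import json
import urllib.request


def build_taggrouping_request(base_url, repository, rules, token=None):
    """Build (but do not send) the POST request for the tag grouping endpoint."""
    url = f"{base_url.rstrip('/')}/api/v1/repositories/{repository}/taggrouping"
    headers = {"Content-Type": "application/json"}
    if token is not None:
        # Assumption: bearer-token auth; the curl example shows no auth header.
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps(rules).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_taggrouping_request(
    "https://logscale.example.com",  # placeholder for $YOUR_LOGSCALE_URL
    "accesslog",
    [{"field": "@source", "modulus": 16}],
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload before changing a repository's configuration.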