Create Sample Groups Using Hash

Create consistent sample groups of events using the hash() function

Query

Query flow: Events -> Function (hash) -> Aggregate (groupBy) -> Result Set

hash(ip_address, limit=10)
| groupBy(_hash, function=count())

Introduction

The hash() function can be used to create consistent hash values from field contents, enabling deterministic sampling and grouping of events.

In this example, the hash() function is used to create sample groups from web server access logs based on IP addresses. This allows for consistent grouping of events from the same IP address while limiting the total number of groups.

Example incoming data might look like this:

bytes_sent | ip_address     | request_path | status_code | @timestamp
1532       | 192.168.1.100  | /home        | 200         | 2023-06-15T10:00:00Z
892        | 192.168.1.201  | /notfound    | 404         | 2023-06-15T10:00:01Z
2341       | 192.168.10.100 | /about       | 200         | 2023-06-15T10:00:02Z
721        | 192.168.15.102 | /error       | 500         | 2023-06-15T10:00:03Z
1267       | 192.168.1.101  | /contact     | 200         | 2023-06-15T10:00:04Z
1843       | 192.168.1.103  | /products    | 200         | 2023-06-15T10:00:05Z
1654       | 192.168.15.100 | /cart        | 200         | 2023-06-15T10:00:06Z

Step-by-Step

  1. Starting with the source repository events.

  2. hash(ip_address, limit=10)

    Creates a hash value from the ip_address field and returns the result in a new field, named _hash by default. The mapping is consistent: the same IP address always generates the same hash value.

    The limit parameter is set to 10, which distributes the hash values across 10 buckets (0-9). All events with the same ip_address value end up in the same bucket.

  3. | groupBy(_hash, function=count())

    Groups the events by the _hash field. For each group, it counts the number of events and returns the result in a new field named _count. This aggregation reduces the data to show how many events fall into each hash bucket.

  4. Event Result set.
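
The two steps above can be sketched outside LogScale to show the mechanics: hash each field value, reduce it to one of 10 buckets, then count events per bucket. This is only an illustration in Python. The bucket() helper and its use of CRC32 are assumptions chosen for the sketch, not the algorithm LogScale uses, so the bucket numbers it produces will not match the _hash values in LogScale's output.

```python
import zlib
from collections import Counter


def bucket(value: str, limit: int = 10) -> int:
    """Map a string to a bucket in [0, limit).

    CRC32 is a stand-in hash (assumption); LogScale's hash() may use
    a different algorithm and return different bucket numbers.
    """
    return zlib.crc32(value.encode("utf-8")) % limit


# ip_address values taken from the example incoming data.
ip_addresses = [
    "192.168.1.100",
    "192.168.1.201",
    "192.168.10.100",
    "192.168.15.102",
    "192.168.1.101",
    "192.168.1.103",
    "192.168.15.100",
]

# Step 2: attach a bucket number to each event (analogous to the _hash field).
hashed = [(ip, bucket(ip)) for ip in ip_addresses]

# Step 3: count events per bucket (analogous to groupBy(_hash, function=count())).
counts = Counter(b for _, b in hashed)

for b in sorted(counts):
    print(b, counts[b])
```

Python's built-in hash() is deliberately avoided here because its output for strings varies between interpreter runs by default, which would break the consistency the example relies on.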

Summary and Results

The query is used to create consistent sample groups from large datasets by hashing a field value into a limited number of buckets.

This query is useful, for example, to analyze patterns in web traffic by sampling IP addresses into manageable groups while maintaining consistency: the same IP address always hashes to the same group. This can help identify behavioral patterns or anomalies in subsets of your traffic.

Sample output from the incoming example data:

_hash | _count
2     | 1
3     | 1
6     | 1
8     | 3
9     | 1

Note that the hash values remain consistent for the same input, enabling reliable sampling across time periods.
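
A minimal Python sketch of why this consistency matters for sampling: if a sample is defined as "all events whose bucket is in a chosen set", then an IP address is either always in the sample or never in it, whichever time window it appears in. The two windows below are a hypothetical split that reuses IP addresses from the example data. The bucket() helper, the CRC32 stand-in hash, and the chosen SAMPLE_BUCKETS are assumptions for illustration and do not reproduce LogScale's own bucket numbers.

```python
import zlib


def bucket(value: str, limit: int = 10) -> int:
    """Stand-in for a deterministic hash with an upper bound (assumption: CRC32)."""
    return zlib.crc32(value.encode("utf-8")) % limit


# Hypothetical choice: keep roughly half of the 10 buckets as the sample.
SAMPLE_BUCKETS = frozenset({0, 1, 2, 3, 4})


def in_sample(ip: str) -> bool:
    """True if the IP address falls into one of the sampled buckets."""
    return bucket(ip) in SAMPLE_BUCKETS


# Two hypothetical time windows containing overlapping IP addresses.
window_a = ["192.168.1.100", "192.168.1.201", "192.168.10.100"]
window_b = ["192.168.10.100", "192.168.1.100", "192.168.15.102"]

sample_a = {ip for ip in window_a if in_sample(ip)}
sample_b = {ip for ip in window_b if in_sample(ip)}

# An IP address seen in both windows gets the same sampling decision in each.
for ip in sorted(set(window_a) & set(window_b)):
    print(ip, ip in sample_a, ip in sample_b)
```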