Calculates a "structure hash" which is equal for similarly structured input.
Omitted Argument Names

The argument name for field can be omitted; the following forms of this function are equivalent:

```logscale
tokenHash("value")
```

and:

```logscale
tokenHash(field="value")
```
These examples show basic structure only.
tokenHash() Syntax Examples
The tokenHash() function tokenizes the incoming string (splitting on spaces), creates a hash for each token, and adds the hashes together. The resulting hash is therefore consistent, provided the tokens in the input are identical, irrespective of their order. For example, the following two log lines contain the same information even though the order of the words differs:
| valueString |
|---|
| abc def ghi |
| def ghi abc |
Executing tokenHash() on each will generate the same hash value:

```logscale
tokenHash(field=valueString)
```
This generates the same hash value for both rows, even though the order of each word is different:
| _tokenHash |
|---|
| 84edeb8f |
| 84edeb8f |
This can be useful to compare, filter, or deduplicate log lines during parsing or querying, even though the order of the individual values within a set of key/value pairs might differ.
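As a minimal sketch of the deduplication idea (not taken from this reference), the token hash could be computed into a field and one representative event kept per hash; the field name structure and the use of selectLast() inside groupBy() are illustrative choices, not part of the tokenHash() documentation:

```logscale
// Illustrative sketch: keep one representative event per token hash.
// 'structure' is an arbitrary field name chosen for this example.
structure := tokenHash(@rawstring)
| groupBy(structure, limit=max, function=selectLast([@rawstring]))
```

Because the hash ignores token order, the two rows shown above would collapse into a single group.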
tokenHash() Examples
Group Similar Log Lines Using TokenHash
Find patterns in log messages by grouping similar structures using
the tokenHash()
function
Query
```logscale
h := tokenHash(@rawstring)
| groupBy(h, limit=max, function=[ count(), collect(@rawstring, limit=3) ])
```
Introduction
In this example, the tokenHash()
function is used
to group log messages that share the same structure but contain
different values. This helps identify common log patterns in your data.
Note that the purpose of tokenHash()
is for
grouping related log lines, not for cryptographic use.
Example incoming data might look like this:
| @timestamp | @rawstring |
|---|---|
| 2023-06-06T10:00:00Z | User john.doe logged in from 192.168.1.100 |
| 2023-06-06T10:01:00Z | User jane.smith logged in from 192.168.1.101 |
| 2023-06-06T10:02:00Z | User admin logged in from 192.168.1.102 |
| 2023-06-06T10:03:00Z | Failed login attempt from 10.0.0.1 |
| 2023-06-06T10:04:00Z | Failed login attempt from 10.0.0.2 |
| 2023-06-06T10:05:00Z | Database connection error: timeout after 30 seconds |
| 2023-06-06T10:06:00Z | Database connection error: timeout after 45 seconds |
Step-by-Step
Starting with the source repository events.
```logscale
h := tokenHash(@rawstring)
```

Creates a hash value based on the structure of the log message in the @rawstring field and returns the token hash in a new field named h. The tokenHash() function identifies words, numbers, and special characters while ignoring their specific values.
```logscale
groupBy(h, limit=max, function=[ count(), collect(@rawstring, limit=3) ])
```

Groups the events by the token hash in the field h. For each group, it:

- Counts the number of events using count().
- Collects up to three example log messages using collect() on the @rawstring field.

The limit=max parameter ensures all groups are returned.

Event Result set.
Summary and Results
The query is used to identify common log message patterns by grouping similar log lines together, regardless of their specific values.
This query is useful, for example, to discover the most common types of log messages in your data, to identify unusual or rare log patterns that might indicate problems, and to create log message templates for parsing or monitoring.
Sample output from the incoming example data:
| h | _count | @rawstring |
|---|---|---|
| 1111b796 | 3 | User admin logged in from 192.168.1.102 User jane.smith logged in from 192.168.1.101 User john.doe logged in from 192.168.1.100 |
| 356fb767 | 2 | Failed login attempt from 10.0.0.2 Failed login attempt from 10.0.0.1 |
| 90fadc1e | 2 | Database connection error: timeout after 45 seconds Database connection error: timeout after 30 seconds |
Note that logs with the same structure but different values are grouped together, making it easy to identify common patterns in your log data.
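To surface the most common structures mentioned in the summary, one illustrative extension of the documented query (the sort step and the limit of 10 are assumptions, not part of the example) could rank the groups by their count:

```logscale
// Illustrative sketch: rank log-line structures by how often they occur.
h := tokenHash(@rawstring)
| groupBy(h, limit=max, function=[ count(), collect(@rawstring, limit=3) ])
| sort(_count, order=desc, limit=10)
```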
Mask Sensitive SSN Data
Consistently hash social security numbers for privacy using the
tokenHash()
function
Query
```logscale
tokenHash(ssn)
```
Introduction
In this example, the tokenHash()
function is used
to hash social security numbers, replacing the original values with
consistent hash values that can still be used for analysis and
correlation.
Example incoming data might look like this:
| @timestamp | ssn | transaction_type | amount |
|---|---|---|---|
| 2023-08-06T10:00:00Z | 123-45-6789 | deposit | 1000.00 |
| 2023-08-06T10:01:00Z | 987-65-4321 | withdrawal | 500.00 |
| 2023-08-06T10:02:00Z | 123-45-6789 | withdrawal | 200.00 |
| 2023-08-06T10:03:00Z | 456-78-9012 | deposit | 1500.00 |
| 2023-08-06T10:04:00Z | 987-65-4321 | deposit | 750.00 |
| 2023-08-06T10:05:00Z | 123-45-6789 | check | 300.00 |
| 2023-08-06T10:06:00Z | 456-78-9012 | withdrawal | 400.00 |
| 2023-08-06T10:07:00Z | 234-56-7890 | deposit | 2000.00 |
| 2023-08-06T10:08:00Z | 987-65-4321 | withdrawal | 100.00 |
| 2023-08-06T10:09:00Z | 123-45-6789 | deposit | 500.00 |
Step-by-Step
Starting with the source repository events.
```logscale
tokenHash(ssn)
```

Creates a consistent hash value for each unique social security number in the ssn field.
The hash value replaces the original SSN while maintaining uniqueness, allowing for analysis of patterns and relationships without exposing the original values. The function returns the result in the same field; as noted above, tokenHash() is not a cryptographic hash, so treat this as masking for analysis rather than strong anonymization.
The hash values are deterministic, meaning the same input will always produce the same hash value within the same repository, enabling consistent analysis across multiple queries.
Event Result set.
Summary and Results
The query is used to protect sensitive social security numbers while maintaining the ability to analyze patterns and relationships in the data.
This query is useful, for example, to comply with data privacy regulations while still being able to track user behavior, identify patterns, or investigate suspicious activities across multiple transactions.
Sample output from the incoming example data:
| @timestamp | ssn | transaction_type | amount |
|---|---|---|---|
| 2023-08-06T10:00:00Z | a1b2c3d4e5f6g7h8i9 | deposit | 1000.00 |
| 2023-08-06T10:01:00Z | j9k8l7m6n5o4p3q2r1 | withdrawal | 500.00 |
| 2023-08-06T10:02:00Z | a1b2c3d4e5f6g7h8i9 | withdrawal | 200.00 |
| 2023-08-06T10:03:00Z | s2t3u4v5w6x7y8z9a1 | deposit | 1500.00 |
| 2023-08-06T10:04:00Z | j9k8l7m6n5o4p3q2r1 | deposit | 750.00 |
| 2023-08-06T10:05:00Z | a1b2c3d4e5f6g7h8i9 | check | 300.00 |
Note that the same SSN values are consistently hashed to the same token values, maintaining the relationships in the data while protecting the original sensitive information.
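As a hedged sketch of how this consistency could be used (the aggregation choices are illustrative, not part of the documented example), transactions could be summarized per masked SSN after hashing:

```logscale
// Illustrative sketch: summarize activity per masked SSN.
tokenHash(ssn)
| groupBy(ssn, function=[count(as=transactions), sum(amount, as=total_amount)])
```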
The hashed data can be used in various dashboard widgets such as tables to show transaction patterns by hashed SSN, or sankey diagrams to visualize transaction flows between accounts. For security monitoring, consider creating alerts based on unusual patterns of activity for specific hashed SSNs.
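For the alerting idea, one possible sketch (the withdrawal filter and the threshold of 5 are arbitrary assumptions for illustration) is to count withdrawals per masked SSN and keep only the unusually active ones:

```logscale
// Illustrative sketch: flag masked SSNs with an unusually high number of withdrawals.
transaction_type = "withdrawal"
| tokenHash(ssn)
| groupBy(ssn, function=count(as=withdrawals))
| withdrawals > 5  // threshold chosen arbitrarily for illustration
```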