How-To: Deduplicating Compound Fields

Compound fields contain multiple pieces of information to report and/or search on, contained within a single field. Alternatively, they may be arrays parsed into an array field within events that then must be summarized.

For example, User Agent data in logs contains browser identifiers separated with spaces that define browser and toolkits used to support them:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36

The following solutions use a variety of methods to extract and aggregate the information.

Deduplicating Compound Field Data

Query
flowchart LR; %%{init: {"flowchart": {"defaultRenderer": "elk"}} }%% repo{{Events}} 0[\Add Field/] 1["Expression"] 2{{Aggregate}} 3[\Update Field Data\] 4[/Drop Field\] 5{{Aggregate}} result{{Result Set}} repo --> 0 0 --> 1 1 --> 2 2 --> 3 3 --> 4 4 --> 5 5 --> result style 3 fill:#ffbf00; style 4 fill:#2ac76d; click 4 #kb-dedupe-compound-fields-kb-dedupe-compound-fields-dedupe-arrays-1-4
logscale
splitString(field=userAgent,by=" ",as=agents)
|array:filter(array="agents[]", function={bname=/\//}, var="bname")
|array:union(array=agents,as=browsers)
|transpose()
|drop(column)
|groupBy(row[1])
Introduction

Deduplicating fields of information where there are multiple occurrences of a value in a single field, maybe separated by a single character can be achieved in a variety of ways. This solution makes use of array:union(), transpose() and groupBy() to collate and aggregate the information into the final list of values.

For example, when examining the humio and looking for the browsers or user agents that have used your instance, the UserAgent data will contain the browser and toolkits used to support them, for example:

syslog
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36

The actual names are the Name/Version pairs showing compatibility with different browser standards. Resolving this into a simplified list requires splitting up the list, simplifying (to remove duplicates), filtering, and then summarizing the final list.

The process we need to follow is first extract the information into an array of values that we can then simplify and aggregate. It's possible the information in your raw data is already stored in an array of information that needs to be summarised.

Step-by-Step
  1. Starting with the source repository events

  2. flowchart LR; %%{init: {"flowchart": {"defaultRenderer": "elk"}} }%% repo{{Events}} 0[\Add Field/] 1["Expression"] 2{{Aggregate}} 3[\Update Field Data\] 4[/Drop Field\] 5{{Aggregate}} result{{Result Set}} repo --> 0 0 --> 1 1 --> 2 2 --> 3 3 --> 4 4 --> 5 5 --> result style 3 fill:#ffbf00; style 4 fill:#2ac76d; click 4 #kb-dedupe-compound-fields-kb-dedupe-compound-fields-dedupe-arrays-1-4 style 0 fill:#ff0000,stroke-width:4px,stroke:#000;

    First we split up the userAgent field using a call to splitString() and place the output into the array field agents

    This will create individual array entries into the agents array for each event:

    agents[0]="Mozilla/5.0"
    agents[1]="(Macintosh;"
    agents[2]="Intel"
    agents[3]="Mac"
    agents[4]="OS"
    agents[5]="X"
    agents[6]="10_15_7)"
    agents[7]="AppleWebKit/537.36"
    agents[8]="(KHTML,"
    agents[9]="like"
    agents[10]="Gecko)"
    agents[11]="Chrome/116.0.0.0"
    agents[12]="Safari/537.36"

    logscale
    splitString(field=userAgent,by=" ",as=agents)
  3. flowchart LR; %%{init: {"flowchart": {"defaultRenderer": "elk"}} }%% repo{{Events}} 0[\Add Field/] 1["Expression"] 2{{Aggregate}} 3[\Update Field Data\] 4[/Drop Field\] 5{{Aggregate}} result{{Result Set}} repo --> 0 0 --> 1 1 --> 2 2 --> 3 3 --> 4 4 --> 5 5 --> result style 3 fill:#ffbf00; style 4 fill:#2ac76d; click 4 #kb-dedupe-compound-fields-kb-dedupe-compound-fields-dedupe-arrays-1-4 style 1 fill:#ff0000,stroke-width:4px,stroke:#000;

    logscale
    |array:filter(array="agents[]", function={bname=/\//}, var="bname")
  4. flowchart LR; %%{init: {"flowchart": {"defaultRenderer": "elk"}} }%% repo{{Events}} 0[\Add Field/] 1["Expression"] 2{{Aggregate}} 3[\Update Field Data\] 4[/Drop Field\] 5{{Aggregate}} result{{Result Set}} repo --> 0 0 --> 1 1 --> 2 2 --> 3 3 --> 4 4 --> 5 5 --> result style 3 fill:#ffbf00; style 4 fill:#2ac76d; click 4 #kb-dedupe-compound-fields-kb-dedupe-compound-fields-dedupe-arrays-1-4 style 2 fill:#ff0000,stroke-width:4px,stroke:#000;

    Using array:union() we aggregate the list of user agents across all the events to create a list of unique entries. This will eliminate duplicates where the value of the user agent is the same value.

    logscale
    |array:union(array=agents,as=browsers)
  5. flowchart LR; %%{init: {"flowchart": {"defaultRenderer": "elk"}} }%% repo{{Events}} 0[\Add Field/] 1["Expression"] 2{{Aggregate}} 3[\Update Field Data\] 4[/Drop Field\] 5{{Aggregate}} result{{Result Set}} repo --> 0 0 --> 1 1 --> 2 2 --> 3 3 --> 4 4 --> 5 5 --> result style 3 fill:#ffbf00; style 4 fill:#2ac76d; click 4 #kb-dedupe-compound-fields-kb-dedupe-compound-fields-dedupe-arrays-1-4 style 3 fill:#ff0000,stroke-width:4px,stroke:#000;

    Using the transpose(), we transpose the rows and columns for the list of matching field field names and values, turning:

    browsers[0]browsers[1].browsers[2]
    Gecko/20100101Safari/537.36AppleWebKit/605.1.15

    into:

    columnrow[1]
    browsers[0]Gecko/20100101
    browsers[1]Safari/537.36
    browsers[2]AppleWebKit/605.1.15

    logscale
    |transpose()
  6. flowchart LR; %%{init: {"flowchart": {"defaultRenderer": "elk"}} }%% repo{{Events}} 0[\Add Field/] 1["Expression"] 2{{Aggregate}} 3[\Update Field Data\] 4[/Drop Field\] 5{{Aggregate}} result{{Result Set}} repo --> 0 0 --> 1 1 --> 2 2 --> 3 3 --> 4 4 --> 5 5 --> result style 3 fill:#ffbf00; style 4 fill:#2ac76d; click 4 #kb-dedupe-compound-fields-kb-dedupe-compound-fields-dedupe-arrays-1-4 style 4 fill:#ff0000,stroke-width:4px,stroke:#000;

    We dont need the column information, just the unique list of browsers in the row[1] column, so we drop the field from the event list.

    logscale
    |drop(column)
  7. flowchart LR; %%{init: {"flowchart": {"defaultRenderer": "elk"}} }%% repo{{Events}} 0[\Add Field/] 1["Expression"] 2{{Aggregate}} 3[\Update Field Data\] 4[/Drop Field\] 5{{Aggregate}} result{{Result Set}} repo --> 0 0 --> 1 1 --> 2 2 --> 3 3 --> 4 4 --> 5 5 --> result style 3 fill:#ffbf00; style 4 fill:#2ac76d; click 4 #kb-dedupe-compound-fields-kb-dedupe-compound-fields-dedupe-arrays-1-4 style 5 fill:#ff0000,stroke-width:4px,stroke:#000;

    Now we can aggregate the list, by the remaining field row[1] to provide the unique list of potential values. This is not a count of the times each has occurred, but a unique list of all the different possible values. The resulting list looks like this:

    logscale
    |groupBy(row[1])
  8. Event Result set

Summary and Results

The resulting output from the query is a summarized list of the unique possible values from the original source fields, even though the source information was originally contained within a single field in the source events.

row[1]_count
AppleWebKit/537.361
AppleWebKit/605.1.151
CFNetwork/1410.0.31
Chrome/116.0.0.01
Darwin/22.6.01
Firefox/116.01
Gecko/201001011
Mobile/15E1481
Mozilla/5.01
Safari/18615.3.12.11.21
Safari/537.361
Safari/604.11
Safari/605.1.151
Version/16.41
Version/16.61

Deduplicating Compound Field Data with array:union() and split()

Query
flowchart LR; %%{init: {"flowchart": {"defaultRenderer": "elk"}} }%% repo{{Events}} 0[\Add Field/] 1["Expression"] 2{{Aggregate}} 3[\Update Field Data\] result{{Result Set}} repo --> 0 0 --> 1 1 --> 2 2 --> 3 3 --> result style 3 fill:#ffbf00;
logscale
splitString(field=userAgent,by=" ",as=agents)
|array:filter(array="agents[]", function={bname=/\//}, var="bname")
|array:union(array=agents,as=browsers)
split(browsers)
Introduction

Deduplicating fields of information where there are multiple occurences of a value in a single field, maybe seaprated by a single character can be achieved in a variety of ways. This solution uses array:union() and split create a unique array and then split the content out to a unique list.

For example, when examining the humio and looking for the browsers or user agents that have used your instance, the UserAgent data will contain the browser and toolkits used to support them, for example:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36

The actual names are the Name/Version pairs showing compatibility with different browser standards. Resolving this into a simplified list requires splitting up the list, simplifying (to remove duplicates), filtering, and then summarizing the final list.

Step-by-Step
  1. Starting with the source repository events

  2. flowchart LR; %%{init: {"flowchart": {"defaultRenderer": "elk"}} }%% repo{{Events}} 0[\Add Field/] 1["Expression"] 2{{Aggregate}} 3[\Update Field Data\] result{{Result Set}} repo --> 0 0 --> 1 1 --> 2 2 --> 3 3 --> result style 3 fill:#ffbf00; style 0 fill:#ff0000,stroke-width:4px,stroke:#000;

    First we split up the userAgent field using a call to splitString() and place the output into the array field agents

    This will create individual array entries into the agents array for each event:

    agents[0]="Mozilla/5.0"
    agents[1]="(Macintosh;"
    agents[2]="Intel"
    agents[3]="Mac"
    agents[4]="OS"
    agents[5]="X"
    agents[6]="10_15_7)"
    agents[7]="AppleWebKit/537.36"
    agents[8]="(KHTML,"
    agents[9]="like"
    agents[10]="Gecko)"
    agents[11]="Chrome/116.0.0.0"
    agents[12]="Safari/537.36"

    logscale
    splitString(field=userAgent,by=" ",as=agents)
  3. flowchart LR; %%{init: {"flowchart": {"defaultRenderer": "elk"}} }%% repo{{Events}} 0[\Add Field/] 1["Expression"] 2{{Aggregate}} 3[\Update Field Data\] result{{Result Set}} repo --> 0 0 --> 1 1 --> 2 2 --> 3 3 --> result style 3 fill:#ffbf00; style 1 fill:#ff0000,stroke-width:4px,stroke:#000;

    logscale
    |array:filter(array="agents[]", function={bname=/\//}, var="bname")
  4. flowchart LR; %%{init: {"flowchart": {"defaultRenderer": "elk"}} }%% repo{{Events}} 0[\Add Field/] 1["Expression"] 2{{Aggregate}} 3[\Update Field Data\] result{{Result Set}} repo --> 0 0 --> 1 1 --> 2 2 --> 3 3 --> result style 3 fill:#ffbf00; style 2 fill:#ff0000,stroke-width:4px,stroke:#000;

    Using array:union() we aggregate the list of user agents across all the events to create a list of unique entries. This will eliminate duplicates where the value of the user agent is the same value.

    The event data now looks like this:

    browsers[0]browsers[1]browsers[2]
    Gecko/20100101Safari/537.36AppleWebKit/605.1.15

    An array of the individual values.

    logscale
    |array:union(array=agents,as=browsers)
  5. flowchart LR; %%{init: {"flowchart": {"defaultRenderer": "elk"}} }%% repo{{Events}} 0[\Add Field/] 1["Expression"] 2{{Aggregate}} 3[\Update Field Data\] result{{Result Set}} repo --> 0 0 --> 1 1 --> 2 2 --> 3 3 --> result style 3 fill:#ffbf00; style 3 fill:#ff0000,stroke-width:4px,stroke:#000;

    Using the split() will split the array into individual events, turning:

    browsers[0]browsers[1]browsers[2]
    Gecko/20100101Safari/537.36AppleWebKit/605.1.15

    into:

    _indexrow[1]
    <no value>Gecko/20100101
    1Safari/537.36
    2AppleWebKit/605.1.15

    logscale
    split(browsers)
  6. Event Result set

Summary and Results

The resulting output from the query is a list of events with each event containing a matching _index and browser. This can be useful if you want to perform further processing on a list of events rather than an array of values.