Extract URL Page Names and Find Most Common Pages

Extract page names from URLs and count their frequency using the regex() function with top()

Query

Query flow: Events --> Filter --> Aggregate --> Result Set

regex(regex="/.*/(?<url_page>\S+\.page)", field=url)
| top(url_page, limit=12, rest=others)

Introduction

The regex() function can be used to extract specific parts of strings using regular expressions with named capture groups. The extracted values are stored in new fields named after the capture groups.
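
As a rough illustration of how named capture groups become fields, the following Python sketch applies the same pattern to a single event. It is not LogScale code: Python writes named groups as (?P<name>...) rather than (?<name>...), and the extract_fields helper and the event dictionary are hypothetical names used only for this example.

```python
import re
from typing import Optional

# Same pattern as the query, rewritten with Python's (?P<name>...) group syntax.
PATTERN = re.compile(r"/.*/(?P<url_page>\S+\.page)")


def extract_fields(event: dict, field: str = "url") -> Optional[dict]:
    """Return a copy of the event with one new field per named group,
    or None when the pattern does not match the chosen field."""
    match = PATTERN.search(event.get(field, ""))
    if match is None:
        return None
    # groupdict() maps each capture-group name to the text it matched.
    return {**event, **match.groupdict()}


event = {"url": "https://example.com/products/item1.page", "status_code": "200"}
enriched = extract_fields(event)
print(enriched["url_page"])  # item1.page
```

The original fields are kept and url_page is added next to them, which mirrors how the extracted value is described above as being stored in a new field named after the capture group.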

In this example, the regex() function is used to extract page names from URLs, and then top() is used to identify the most frequently accessed pages.

Example incoming data might look like this:

| @timestamp | url | status_code | user_agent |
|---|---|---|---|
| 2023-08-06T10:00:00Z | https://example.com/products/item1.page | 200 | Mozilla/5.0 |
| 2023-08-06T10:01:00Z | https://example.com/about/company.page | 200 | Chrome/90.0 |
| 2023-08-06T10:02:00Z | https://example.com/products/item2.page | 404 | Safari/14.0 |
| 2023-08-06T10:03:00Z | https://example.com/products/item1.page | 200 | Firefox/89.0 |
| 2023-08-06T10:04:00Z | https://example.com/contact/support.page | 200 | Chrome/90.0 |
| 2023-08-06T10:05:00Z | https://example.com/about/company.page | 200 | Safari/14.0 |
| 2023-08-06T10:06:00Z | https://example.com/products/item3.page | 200 | Mozilla/5.0 |
| 2023-08-06T10:07:00Z | https://example.com/products/item1.page | 200 | Chrome/90.0 |
| 2023-08-06T10:08:00Z | https://example.com/about/company.page | 200 | Firefox/89.0 |
| 2023-08-06T10:09:00Z | https://example.com/products/item2.page | 404 | Safari/14.0 |

Step-by-Step

  1. Starting with the source repository events.

  2. Filter:
    regex(regex="/.*/(?<url_page>\S+\.page)", field=url)

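    Extracts the page name, including the .page extension, from the url field using a regular expression with the named capture group url_page. The greedy .* between the two forward slashes consumes everything up to the last forward slash before the page name, and the capture group then takes the non-whitespace characters (\S+) that end in .page. By default, events whose url field does not match the pattern are filtered out.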

  3. Aggregate:
    | top(url_page, limit=12, rest=others)

    Groups the events by the extracted url_page field and counts the occurrences of each value. The limit parameter restricts the output to the 12 most frequent values, and the rest parameter adds a row named others that holds the combined count of all remaining values.

  4. Event Result set.
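
The steps above can be sketched outside LogScale as follows. This is a minimal Python emulation under stated assumptions, not how LogScale executes the query: the top_pages function and SAMPLE_URLS list are hypothetical names, the URLs are copied from the example data, and the pattern uses Python's (?P<name>...) group syntax. The others row is left out of this sketch.

```python
import re
from collections import Counter

# Same pattern as the query, rewritten with Python's named-group syntax.
PATTERN = re.compile(r"/.*/(?P<url_page>\S+\.page)")

# The url column of the example data, in timestamp order.
SAMPLE_URLS = [
    "https://example.com/products/item1.page",
    "https://example.com/about/company.page",
    "https://example.com/products/item2.page",
    "https://example.com/products/item1.page",
    "https://example.com/contact/support.page",
    "https://example.com/about/company.page",
    "https://example.com/products/item3.page",
    "https://example.com/products/item1.page",
    "https://example.com/about/company.page",
    "https://example.com/products/item2.page",
]


def top_pages(urls, limit=12):
    """Emulate the filter step and the aggregate step of the query."""
    # Filter: keep only URLs that match, and pull out the captured page name.
    matches = (PATTERN.search(url) for url in urls)
    pages = [m.group("url_page") for m in matches if m is not None]
    # Aggregate: count each page and keep at most `limit` of the most frequent.
    return Counter(pages).most_common(limit)


rows = top_pages(SAMPLE_URLS)
for page, count in rows:
    print(page, count)
```

With the example data this yields five distinct pages, so the limit of 12 does not cut anything off.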

Summary and Results

The query is used to analyze the most frequently accessed pages on a website by extracting page names from URLs and counting their occurrences.

This query is useful, for example, to identify popular content, monitor user behavior patterns, or detect potential issues with specific pages that receive high traffic.

Sample output from the incoming example data:

| url_page | _count |
|---|---|
| item1.page | 3 |
| company.page | 3 |
| item2.page | 2 |
| support.page | 1 |
| item3.page | 1 |

Note that the results are automatically sorted in descending order by count, showing the most frequently accessed pages first.
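
The sample output contains no others row because the data has only five distinct pages, fewer than the limit of 12. The Python sketch below illustrates how a lower limit would fold the remaining values into one row. It assumes the rest row simply holds the sum of the counts that were cut off; top_with_rest is a hypothetical helper, and the counts are taken from the sample output.

```python
from collections import Counter

# Counts copied from the sample output table.
counts = Counter({
    "item1.page": 3,
    "company.page": 3,
    "item2.page": 2,
    "support.page": 1,
    "item3.page": 1,
})


def top_with_rest(counts, limit, rest="others"):
    """Keep the `limit` most frequent values and sum everything else
    into a single extra row named by `rest` (assumed behavior)."""
    rows = counts.most_common(limit)  # sorted by count, descending
    remainder = sum(counts.values()) - sum(n for _, n in rows)
    if remainder > 0:
        rows.append((rest, remainder))
    return rows


full = top_with_rest(counts, limit=12)   # all five pages, no extra row
trimmed = top_with_rest(counts, limit=2)
print(trimmed)  # [('item1.page', 3), ('company.page', 3), ('others', 4)]
```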