Best Practice: Comparing Repos and Views

Last Updated: 2022-03-11

Humio organizes data into repositories. Views are a layer of abstraction that sits on top of a repository or a combination of repositories. The following sections describe some best practices related to repositories and views.

Repositories

The decision to create a repository is influenced by several factors including:

  • User access control

    The primary method of controlling user access to data is at the repository level. Organizations can give users access to repositories via groups (whether created in Humio or inherited from their organization's SSO provider). Groups have roles associated with them that govern how users can interact with the data in the repository.

  • Data retention

    Data retention is set at the repository level, so all data sources stored in a single repository have the same retention.

  • Content management

    Repositories hold collections of saved queries, alerts, scheduled searches, actions, parsers, and files. Any user with access to the repository will have access to the content in that repository.

The easiest approach in terms of repository management is to create a single repository for all of an organization's data sources. This approach works when the following conditions exist:

  • All data sources have the same retention requirements;

  • The total number of data sources in the repository is fewer than 10,000 (see Data Sources for more information).

Access to the data within the single repository can be controlled universally at the repository level or through the use of views (see Views for more information).

As covered above, the two most common reasons for an organization to create additional repositories are:

  • Implementing different levels of retention;

  • A requirement to create more than 10,000 total data sources.

While it isn't possible to apply different levels of retention in a single repository, it is possible in most cases to design data sources so that their number stays within what a single repository allows (see Data Sources for more information).
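The two conditions above can be expressed as a simple planning check. The following is a minimal sketch in Python, not a Humio API: the function name, the `(name, retention_days)` tuples, and the sample sources are all hypothetical, and only the 10,000 figure comes from this document.

```python
from collections import defaultdict

# Per-repository data source limit stated in this document.
MAX_DATA_SOURCES_PER_REPO = 10_000


def plan_repositories(sources):
    """Group hypothetical data sources by retention and check each group's size.

    `sources` is a list of (name, retention_days) tuples. Both fields are
    illustrative assumptions and do not correspond to any Humio API.
    """
    by_retention = defaultdict(list)
    for name, retention_days in sources:
        by_retention[retention_days].append(name)

    plan = {}
    for retention_days, names in by_retention.items():
        plan[retention_days] = {
            "data_sources": len(names),
            # A single repository works only below the data source limit.
            "fits_in_one_repo": len(names) < MAX_DATA_SOURCES_PER_REPO,
        }
    return plan


# Hypothetical inventory: two retention levels imply at least two repositories.
sources = [("vpcflow", 30), ("cloudtrail", 365), ("dns", 30)]
plan = plan_repositories(sources)
print(plan)
```

Each distinct retention value in the result corresponds to at least one repository. A group that does not fit in one repository would need either further splitting or a redesign of its tags.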

Data Sources

A data source in Humio is identified by its unique combination of tags (Event Tags), that is, fields denoted by the pound, or hash, character (#). In the screenshot below the selected event has two tags:

#repo = aws_vpcflow and #type = vpcflow_raw.

Figure 295. Event fields with tags


The #repo tag for each event in a given repository will be the same, i.e. the name of the repository. In this case the repository is named aws_vpcflow, so the tag value is aws_vpcflow. The #type tag for each event is set to the name of the parser used to parse the event. In the example above the event was parsed with the vpcflow_raw parser.

You can see the data sources in a repository in Settings --> Data Sources as illustrated in the following screenshot:

Figure 296. Data source list


Notice that in the listing of data sources Humio does not show the #repo tag, as it will be the same for every data source.

Data sources are extremely important to Humio because they determine how data is physically stored within the platform. Humio represents each data source as a unique directory within the repository's directory, creating a separate storage location on disk for each new data source. Restricting searches to specific data sources by using the appropriate tags as search filters (e.g. #type = "vpcflow_raw") can significantly increase search performance, as this minimizes the need for Humio to traverse all of the data sources associated with a repository.
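To illustrate why a tag filter narrows the work a search has to do, the following Python sketch models data sources as groups keyed by their tag combination. It is a simplified model under stated assumptions, not Humio's storage engine or API: the function names are hypothetical, and the second `#type` value (`other_parser`) is invented for the example.

```python
from collections import defaultdict


def data_source_key(tags):
    """A data source is identified by its unique combination of tags."""
    return tuple(sorted(tags.items()))


def group_by_data_source(events):
    """Place each event in the group (directory, in this model) for its tags."""
    groups = defaultdict(list)
    for event in events:
        groups[data_source_key(event["tags"])].append(event)
    return dict(groups)


def matching_data_sources(groups, tag_filter):
    """Keep only the data sources whose tags include every filter entry."""
    wanted = set(tag_filter.items())
    return {key: evs for key, evs in groups.items() if wanted <= set(key)}


events = [
    {"tags": {"#repo": "aws_vpcflow", "#type": "vpcflow_raw"}, "msg": "a"},
    {"tags": {"#repo": "aws_vpcflow", "#type": "vpcflow_raw"}, "msg": "b"},
    # Hypothetical second parser, used only to create a second data source.
    {"tags": {"#repo": "aws_vpcflow", "#type": "other_parser"}, "msg": "c"},
]

groups = group_by_data_source(events)
selected = matching_data_sources(groups, {"#type": "vpcflow_raw"})
print(len(selected), "of", len(groups), "data sources need to be read")
```

In this model the filter is resolved against the data source keys before any event is examined, which is the reason a tag filter is cheaper than a filter on an ordinary field.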

There are a couple of important considerations to take into account when thinking about tags and data sources (see the following blog post for more details):

  • At about 10,000 events per second per data source, Humio can no longer process incoming events sequentially. Before this happens, Humio implements a process called auto sharding that adds tags to a data source (e.g. #humioAutoShard=0, #humioAutoShard=1, #humioAutoShard=2, etc.) to split the data source into manageable chunks. Each new tag creates a new data source. This means that if the initial data source has the tags #repo = aws_vpcflow and #type = vpcflow_raw, and Humio needs to create three auto shards to manage the data velocity, that one data source is now actually four data sources, as illustrated in the table below:

    Data Source   Tags
    1             #repo = aws_vpcflow, #type = vpcflow_raw
    2             #repo = aws_vpcflow, #type = vpcflow_raw, #humioAutoShard = 0
    3             #repo = aws_vpcflow, #type = vpcflow_raw, #humioAutoShard = 1
    4             #repo = aws_vpcflow, #type = vpcflow_raw, #humioAutoShard = 2

  • Humio has a programmatic limit of 10,000 data sources per repository. This limit is designed to prevent issues related to having too many WIP (Work in Progress) buffers (system memory limitations) and too many directories (host operating system limitations).
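
The arithmetic behind the table above, and its effect on the per-repository limit, can be sketched as follows. This is an illustrative Python model only: the function names are hypothetical, and the counting simply follows the table in this document (the original tag set plus one tag set per auto shard).

```python
# Per-repository data source limit stated in this document.
MAX_DATA_SOURCES_PER_REPO = 10_000


def sharded_data_sources(base_tags, shard_count):
    """List the tag sets from the table: the original plus one per auto shard."""
    result = [dict(base_tags)]
    for shard in range(shard_count):
        result.append({**base_tags, "#humioAutoShard": str(shard)})
    return result


def remaining_capacity(current_count):
    """How many more data sources fit before the repository limit is reached."""
    return MAX_DATA_SOURCES_PER_REPO - current_count


base = {"#repo": "aws_vpcflow", "#type": "vpcflow_raw"}
data_sources = sharded_data_sources(base, shard_count=3)

for number, tags in enumerate(data_sources, start=1):
    print(number, tags)
print("remaining capacity:", remaining_capacity(len(data_sources)))
```

The point of the model is that high-volume data sources count several times against the limit, so an estimate of the data source total should account for auto shards as well as for distinct tag combinations.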

If you have additional questions about data sources and how they affect the way that your organization uses repositories, please contact your Humio Sales Engineer or Humio Technical Support.

Views

In Humio a View is a type of repository that contains no data of its own. A view is created by connecting one or more repositories as illustrated in the screenshot below:

Figure 297. View Configuration


Views offer the following benefits:

  • Views allow you to connect multiple repositories to enable searching across them as if they were a single repository;

  • Views allow you to provide users with access to data in repositories customized to their specific needs. For example, in a scenario where an organization has one repository for all data sources, users can be given access exclusively to their own data sources by using a view's Event Filter feature (e.g. #type = "vpcflow_raw"). This use of views also helps to keep the content (events, queries, alerts, dashboards, files, etc.) associated with specific data sources (user groups) separate from that of other data sources, since anyone with access to a repository or view has access to all of its content.
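
Both benefits can be pictured with a small model in which a view is a list of repository connections, each with its own event filter. The Python sketch below is a simplification under stated assumptions: it is not how Humio implements views, the function and repository names are hypothetical, and the filter is reduced to exact tag matches rather than a real Humio query.

```python
def search_view(connections, repositories):
    """Return the events visible through a view.

    `connections` is a list of (repository_name, tag_filter) pairs; an empty
    filter exposes every event in that repository. The view holds no data of
    its own and only reads from the connected repositories.
    """
    visible = []
    for repo_name, tag_filter in connections:
        for event in repositories[repo_name]:
            if all(event.get(tag) == value for tag, value in tag_filter.items()):
                visible.append(event)
    return visible


# Hypothetical single repository holding all of an organization's data.
repositories = {
    "all_logs": [
        {"#type": "vpcflow_raw", "message": "flow record"},
        {"#type": "syslog", "message": "system message"},
    ],
}

# A view for one team, restricted to that team's data source.
vpcflow_view = [("all_logs", {"#type": "vpcflow_raw"})]
visible = search_view(vpcflow_view, repositories)
print(visible)
```

Adding a second pair to `connections` models a view that searches across several repositories as if they were one.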

See Creating a View for more details on how to implement views.