Best Practice: Comparing Repos and Views

LogScale organizes data into Repositories. A View is a layer of abstraction that sits on top of a repository or a combination of repositories. The following sections describe some best practices related to repositories and views.

Repositories

The decision to create a repository is influenced by several factors including:

User access control
The primary method of controlling user access to data is at the repository level. Organizations can give users access to repositories via groups, whether created in LogScale or inherited from their organization's SSO provider. Groups have roles associated with them that govern how users can interact with the data in the repository.
Data retention
Data retention is set at the repository level, so all data sources stored in a single repository have the same retention.
Content management
Repositories hold collections of saved queries, alerts, scheduled searches, actions, parsers, and files. Any user with access to the repository will have access to the content in that repository.

The easiest approach in terms of repository management is to create a single repository for all of an organization's data sources. This approach works when the following conditions exist:

All data sources have the same retention requirements;
The total number of data sources (Datasources) in the repository is fewer than 10,000.

Access to the data within the single repository can be controlled universally at the repository level or through the use of views (see Creating a Repository or View).

As covered above, the two most common reasons for an organization to create additional repositories are:

Implementing different levels of retention;
A requirement to create more than 10,000 total data sources.

While it isn't possible to apply different levels of retention in a single repository, in most cases it is possible to design data sources so that their number stays within the limit for a single repository.
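The two conditions above can be sketched as a small planning check. This is only an illustration, not a LogScale API: `SourcePlan`, `plan_repositories`, and the example source names and numbers are hypothetical, and the 10,000 threshold is the figure quoted in this document.

```python
from dataclasses import dataclass

# Limit quoted in this document; treated here as a planning threshold.
MAX_DATASOURCES_PER_REPO = 10_000


@dataclass(frozen=True)
class SourcePlan:
    """Hypothetical description of one log source during planning."""
    name: str               # illustrative label, not a LogScale identifier
    retention_days: int     # retention the source requires
    datasource_count: int   # expected number of distinct tag combinations


def plan_repositories(sources):
    """Group sources by retention and flag groups too large for one repository."""
    groups = {}
    for source in sources:
        groups.setdefault(source.retention_days, []).append(source)

    plan = {}
    for retention, members in groups.items():
        total = sum(m.datasource_count for m in members)
        plan[retention] = {
            "sources": [m.name for m in members],
            "datasources": total,
            # "fewer than 10,000" is the condition given for a single repository
            "fits_single_repo": total < MAX_DATASOURCES_PER_REPO,
        }
    return plan


# Made-up example inputs: two sources share retention, one needs longer retention.
example = plan_repositories([
    SourcePlan("vpcflow", 30, 4),
    SourcePlan("firewall", 30, 10),
    SourcePlan("audit", 365, 12_000),
])
```

In this sketch, each distinct retention value maps to at least one repository, and a group whose `fits_single_repo` is `False` would need either a tag redesign or a further split.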

Data Sources

A data source in LogScale is identified by its unique combination of tags (Event Tags), which are fields denoted by the pound, or hash, character (#). In the screenshot below, the selected event has two tags:


#repo = aws_vpcflow and #type = vpcflow_raw.

Figure 2. Event fields with tags

The #repo tag for each event in a given repository will be the same, i.e. the name of the repository. In this case the repository is named aws_vpcflow, so the tag value is aws_vpcflow. The #type tag for each event is set to the name of the parser used to parse it. In the example above, the event was parsed with the vpcflow_raw parser.

You can see the data sources in a repository in Settings → Data Sources as illustrated in the following screenshot:

Figure 3. Data source list

Notice that in the listing of data sources LogScale does not show the #repo tag, as it will be the same for every data source.

Data sources are extremely important to LogScale because they determine how data is physically stored within the platform. LogScale represents each data source as a unique directory within the repository's directory, creating a separate storage location on disk for each new data source. Restricting searches to specific data sources by using the appropriate tags as search filters (e.g. #type = "vpcflow_raw") can significantly increase search performance, as this minimizes the need for LogScale to traverse all of the data sources associated with a repository.
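The effect of a tag filter can be illustrated with a toy model in which events are grouped by tag combination, loosely mirroring one directory per data source. This is a simplified sketch and not how LogScale is implemented internally; `build_store`, `search`, the `cloudtrail` parser name, and the sample messages are all hypothetical.

```python
from collections import defaultdict


def build_store(events):
    """Group messages by tag combination (one bucket per data source)."""
    store = defaultdict(list)
    for tags, message in events:
        store[frozenset(tags.items())].append(message)
    return store


def search(store, tag_filter, text):
    """Scan only the buckets whose tags include every pair in tag_filter."""
    wanted = set(tag_filter.items())
    hits, scanned = [], 0
    for key, messages in store.items():
        if not wanted <= key:
            continue  # the whole data source is skipped without reading events
        for message in messages:
            scanned += 1
            if text in message:
                hits.append(message)
    return hits, scanned


# Made-up events: two data sources in one repository.
store = build_store([
    ({"#repo": "aws_vpcflow", "#type": "vpcflow_raw"}, "ACCEPT 10.0.0.1"),
    ({"#repo": "aws_vpcflow", "#type": "vpcflow_raw"}, "REJECT 10.0.0.2"),
    ({"#repo": "aws_vpcflow", "#type": "cloudtrail"}, "REJECT login"),
])

filtered_hits, filtered_scanned = search(store, {"#type": "vpcflow_raw"}, "REJECT")
all_hits, all_scanned = search(store, {}, "REJECT")
```

With the tag filter, only the two events in the matching bucket are scanned; without it, every bucket is read.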

There are a couple of important considerations to take into account when thinking about tags and data sources (see the following blog post for more details):

At about 10,000 events per second per data source, LogScale can no longer sequentially process incoming events. Before this happens, LogScale implements a process called auto sharding that adds tags to a data source (e.g. #humioAutoShard=0, #humioAutoShard=1, #humioAutoShard=2, etc.) to split the data source into manageable chunks. Each new tag value creates a new data source. This means that if the initial data source has the tags #repo=aws_vpcflow and #type=vpcflow_raw, and LogScale needs to create three auto shards to manage the data velocity, that one data source is now actually four data sources, as illustrated in the table below:

Data Source	Tags
1	#repo=aws_vpcflow, #type=vpcflow_raw
2	#repo=aws_vpcflow, #type=vpcflow_raw, #humioAutoShard=0
3	#repo=aws_vpcflow, #type=vpcflow_raw, #humioAutoShard=1
4	#repo=aws_vpcflow, #type=vpcflow_raw, #humioAutoShard=2

LogScale has a programmatic limit of 10,000 data sources per repository. This limit is designed to prevent issues related to having too many WIP (Work in Progress) buffers (system memory limitations) and too many directories (host operating system limitations).
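The way auto sharding multiplies data sources, and how the total compares with the per-repository limit, can be sketched by counting unique tag combinations. This is an illustrative model only: `datasource_key`, `count_datasources`, and `within_limit` are hypothetical helpers, not LogScale functions.

```python
# Limit quoted in this document.
MAX_DATASOURCES_PER_REPO = 10_000


def datasource_key(tags):
    """A data source is identified by its unique combination of tags."""
    return frozenset(tags.items())


def count_datasources(tag_sets):
    """Count distinct tag combinations among the given tag dictionaries."""
    return len({datasource_key(tags) for tags in tag_sets})


def within_limit(tag_sets, limit=MAX_DATASOURCES_PER_REPO):
    """Check the distinct data source count against the per-repository limit."""
    return count_datasources(tag_sets) <= limit


base = {"#repo": "aws_vpcflow", "#type": "vpcflow_raw"}
shards = [{**base, "#humioAutoShard": str(i)} for i in range(3)]

# The repeated base tag set at the end does not create a new data source.
observed = [base, *shards, dict(base)]

print(count_datasources(observed))  # 4
```

The count matches the table above: the original tag combination plus three auto shards gives four data sources.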

If you have additional questions about data sources and how they affect the way that your organization uses repositories, please contact your LogScale Sales Engineer or LogScale Technical Support (<logscalesupport@crowdstrike.com>).

Views

In LogScale a View is a type of repository that contains no data of its own. A view is created by connecting one or more repositories as illustrated in the screenshot below:

Figure 4. View Configuration

Views offer the following benefits:

Views allow you to connect multiple repositories to enable searching across them as if they were a single repository;
Views allow you to provide users with access to data in repositories customized to their specific needs. For example, where an organization has one repository for all data sources, users can be given access only to their own data sources by using the view's Event Filter feature (e.g. #type="vpcflow_raw"). This use of views also helps to keep the content (events, queries, alerts, dashboards, files, etc.) associated with specific data sources separate from that of other data sources (user groups), since anyone with access to a repository or view has access to all of its content.
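Both benefits can be sketched with a toy model of a view as a set of repository connections, each with an optional tag filter. This is not the LogScale view API or its filter syntax: `view_events`, the `connections` mapping, and the `cloudtrail` and `app_logs` names are hypothetical and only illustrate the idea.

```python
def view_events(repositories, connections):
    """Return events visible through a view.

    repositories: mapping of repository name -> list of event dictionaries
    connections:  mapping of repository name -> required tag values
                  (an empty dictionary means no filter on that repository)
    """
    visible = []
    for repo_name, required_tags in connections.items():
        for event in repositories[repo_name]:
            if all(event.get(tag) == value for tag, value in required_tags.items()):
                visible.append(event)
    return visible


# Made-up repositories and events.
repositories = {
    "aws_vpcflow": [
        {"#type": "vpcflow_raw", "msg": "ACCEPT"},
        {"#type": "cloudtrail", "msg": "login"},
    ],
    "app_logs": [
        {"#type": "json", "msg": "started"},
    ],
}

# A view restricted to one data source in one repository.
network_view = view_events(repositories, {"aws_vpcflow": {"#type": "vpcflow_raw"}})

# A view that joins two repositories without filtering.
combined_view = view_events(repositories, {"aws_vpcflow": {}, "app_logs": {}})
```

Here `network_view` exposes only the vpcflow_raw events, while `combined_view` searches across both repositories as if they were one.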

See Creating a Repository or View for more details on how to implement views.
