Use Case: Data Anonymization Using LogScale

When using LogScale, there may be data ingested that you don't want stored in a LogScale repository. For example, data may contain PII (Personally Identifiable Information), it may be confidential information that should only be available to a limited set of users, or users may have requested their data not be saved. With LogScale you can apply different measures during the ingestion phase to anonymize parts of the ingested messages.

Data Anonymization Approaches

There are five main approaches to anonymizing data when using LogScale:

  • Employ role-based access to provide limits,

  • Use a parser during the data ingest process to prevent the data from being collected,

  • Configure your log shipper to filter out sensitive data,

  • Redact data manually or by using the Redact Events API to delete text contained in an event entry in a repository, or

  • Anonymize data during search and then make the data available in dashboards.

How you approach this issue depends on the data itself — you may want to prevent the data from being sent at all, have LogScale discard the data or not store it, or use role-based access and permissions to limit access to your repositories and views. It's also possible to anonymize data during search and then make the data available in dashboards.

Note

There is currently no way to prevent users from accessing the underlying query and removing the parts that anonymize the data.

All options mentioned above can be managed by GraphQL API calls or via the UI. More information about GraphQL can be found at GraphQL API.

Data Anonymization Challenges

LogScale can anonymize or pseudonymize data during the data ingestion phase. This can limit what the data can tell you. Depending on what is anonymized, it may no longer be possible to see who performed certain actions, what they did, or where they did it from. For example, if all user names are anonymized and an attacker obtains the credentials of one of these accounts, it may not be possible to find out which account has been compromised.

LogScale can't automatically identify what should or should not be anonymized. The task of identifying such data lies with the data owner.

Data Anonymization and Role-based Access

It's generally accepted good practice to have a robust and thoughtful approach to role-based access for data. Configuring LogScale is no exception, and when anonymizing data, it arguably should be your first approach. There are a few options for this approach:

  • Use the copyEvent() function during the ingest phase

  • Use hashRewrite() or replace() at the Group level, preserving the original log and making only certain data available to certain Groups, or

  • Use Query Prefixes to act as a search filter and apply role-based access based on that filter.

Users with greater access can set up a parser that uses the copyEvent() function during the ingest phase to copy events to another repository and then anonymizes the copies. Users with lesser access would have access to only the anonymized data, while those with greater access would retain the non-anonymized events for reference.

Note

You will need to be an Administrator to access the Group/Role settings globally, and other feature access depends on what your Role permissions are.

Query Prefixes

Query prefixes can be applied to data to aid with constraining searches and in this case can be used to limit available data to users with fewer permissions. You can find the Query Prefix option at the Group level, under assigned Roles, by clicking the person icon (top right) → Organizational Settings → Groups → Choose a Group → select a Role assigned to that Group.

Using replace() and Role-Based Access

There are two main methods for using the replace() function: applying it to the entire log, or applying it to a specific field and masking part or all of that field. Assume we have the following log line, and we want to mask some or all of the email:

logscale
2023-05-16 09:45:39 -0500 DEBUG auth=user@test.com remote_ip=19.24.410.103 session=19zefhqy9Gh6Y

The first method would be to apply replace() to the entire log:

logscale
| replace("auth=(?<email>.+@[^\s]+)\s+", with="####MASKED_EMAIL####")

The second method would be to apply replace() to only a specific field:

logscale
| replace(regex="auth=(?<username>.+)@",with="####MASKED_USERNAME####")
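
To see what each pattern matches before putting it in a parser, the two replace() calls above can be approximated outside LogScale. The following is a minimal Python sketch, not LogScale code: it assumes Python's re module treats these patterns the same way LogScale's regex engine does, it uses Python's (?P<name>...) spelling for named groups, and the function names mask_email and mask_username are hypothetical.

```python
import re

# Sample log line from the example above, written as a single line.
LOG_LINE = (
    "2023-05-16 09:45:39 -0500 DEBUG auth=user@test.com "
    "remote_ip=19.24.410.103 session=19zefhqy9Gh6Y"
)


def mask_email(line: str) -> str:
    """Replace the auth=<email> pair and the whitespace after it with a mask."""
    return re.sub(r"auth=(?P<email>.+@[^\s]+)\s+", "####MASKED_EMAIL####", line)


def mask_username(line: str) -> str:
    """Replace auth=<username>@ with a mask, leaving the mail domain visible."""
    return re.sub(r"auth=(?P<username>.+)@", "####MASKED_USERNAME####", line)


if __name__ == "__main__":
    print(mask_email(LOG_LINE))
    print(mask_username(LOG_LINE))
```

In this sketch both patterns consume the auth= key together with the value, and the first pattern also consumes the whitespace after the address, so the mask runs directly into the next field. Checking the output on a few test lines is a quick way to confirm that the mask covers exactly what you intend.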

Using a Parser for Data Anonymization

As part of the data ingestion process, data typically passes through a parser relevant to the type of data being ingested (see Parsing Data for more information).

When using a parser to anonymize data, there are three options:

  • Drop the entire event or specific field that contains sensitive data

  • Redact the data containing confidential or personal information from events, or

  • Change or pseudonymize the data with confidential or personal information using hashing.

During data ingestion, raw messages are assigned to the @rawstring field. Then the assigned parser extracts fields and associated values.

For example, look at this event:

logscale
2022-05-24 10.0.0.1 user=Jane action="accessed file xyz"

After the message has been parsed, the following fields may exist:

  • @rawstring

  • @timestamp

  • ipaddress

  • user

  • action

Note

Different events will have different fields. These fields are common examples, but this list is not exhaustive.

Dropping events that contain sensitive data using dropEvent()

dropEvent() can be used both in queries and in the parser pipeline. When it is used in a parser, a dropped event is removed entirely and is not stored.

Let's look at an example that uses the function dropEvent() to remove an event:

logscale
|…
| case { (user="Jane" or ipaddress="10.0.0.1") 
| dropEvent(); * }

For more information, see our documentation on dropEvent(). Now let's drop a field instead of an entire event using the drop() function.

Dropping fields that contain sensitive data using drop()

The drop() function drops a field from your data set, but the original data is likely still available in the @rawstring field, so you also need to rewrite the content of that field.

Here's an example:

logscale
| case { (user="Jane" or ipaddress="10.0.0.1")
// Rebuild @rawstring from the non-sensitive fields only, then drop the sensitive fields
| format(format="%s action=\"%s\"", field=[@timestamp, action], as=@rawstring)
| drop([user, ipaddress]); * }

And this is the event after parsing:

logscale
2022-05-24 action="accessed file xyz"

This is why the example rewrites @rawstring with format() before calling drop(): dropping the fields alone would leave the original values readable in the raw event text.
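
The interaction between dropped fields and @rawstring can be sketched outside LogScale. The following Python fragment is only an illustration under stated assumptions: the event is modeled as a plain dictionary, @timestamp is kept as the date string from the example rather than LogScale's internal timestamp, and the function name drop_sensitive_fields is hypothetical.

```python
def drop_sensitive_fields(event: dict, sensitive: set) -> dict:
    """Return a copy of the event without the sensitive fields and with a rebuilt @rawstring."""
    kept = {name: value for name, value in event.items() if name not in sensitive}
    # Rebuild the raw text from non-sensitive fields only; otherwise the
    # dropped values would still be readable in @rawstring.
    kept["@rawstring"] = f'{kept["@timestamp"]} action="{kept["action"]}"'
    return kept


# The parsed example event from this section, modeled as a dictionary.
EVENT = {
    "@rawstring": '2022-05-24 10.0.0.1 user=Jane action="accessed file xyz"',
    "@timestamp": "2022-05-24",
    "ipaddress": "10.0.0.1",
    "user": "Jane",
    "action": "accessed file xyz",
}

if __name__ == "__main__":
    print(drop_sensitive_fields(EVENT, {"user", "ipaddress"}))
```

Removing the two keys without rebuilding @rawstring would leave "Jane" and "10.0.0.1" in the stored text, which is the situation the parser example avoids.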

Redacting Data Using eval()

To change the value within a field rather than dropping the event or the field entirely, use eval() in its shorthand form, :=. The field before := is assigned the value of whatever comes after it, which can be a string, a function result, another field, and so on.

Let's take a look at an example:

logscale
…   
| case { password="*" 
| password:="secret" 
| format(format="%s %s %s %s", field=[@timestamp, ipaddress, password, action], as=@rawstring); * }

After the event is parsed, the result will look like this:

logscale
2022-05-24 10.0.0.1 password=secret action="accessed file xyz"

Again, the original data would otherwise still be available via the @rawstring field, which is why the example also rewrites the content of that field.

Changing or Pseudonymizing Data Using hashMatch() or hashRewrite()

The hashRewrite() function uses the salt parameter to calculate a secure hash of a field for storing in any given event. Because the salt value is the same for all ingested data, anyone who knows this value can easily match hashed values against the original values.

If the salt value is known, using the hashMatch() function allows you to find events where a field value has been hashed with hashRewrite(). The advantage is that you can find events matching a particular value even after those values have been hashed. The disadvantage is that the hashed values are not particularly safe.

Here is an example of how to apply hashRewrite() to an example log line for a specific field:

logscale
// apply the hash to a specific field
| hashRewrite(field="email", salt="salt1")

Here's an example with hashMatch():

logscale
hashMatch(input=?userid, field=user, salt="salt1")

A query like this creates an input field where users can enter a value that matches the hashed value.
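
The trade-off described above, searchable but not particularly safe, follows from how salted hashing works in general. The following Python sketch illustrates the idea only: the source does not state which algorithm hashRewrite() uses, so SHA-256 and the way the salt is combined with the value are assumptions, and the function names salted_hash and matches are hypothetical.

```python
import hashlib


def salted_hash(value: str, salt: str) -> str:
    """Hash a value together with a salt (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def matches(candidate: str, stored_hash: str, salt: str) -> bool:
    """Check a plain-text candidate against a stored hash, as hashMatch() does conceptually."""
    return salted_hash(candidate, salt) == stored_hash


# What would be stored in the event instead of the original value.
STORED = salted_hash("jane@example.com", "salt1")

if __name__ == "__main__":
    print(matches("jane@example.com", STORED, "salt1"))  # same value, same salt
    print(matches("jane@example.com", STORED, "salt2"))  # same value, different salt
```

Because the same salt is applied to every value, anyone who knows the salt can hash a list of candidate values and compare them with the stored hashes. That is what makes matching possible, and it is also why the salt should be treated as a secret.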

Redacting Data Manually or with Redact Events API

There may be certain data that you don't want stored in a LogScale repository: perhaps not whole events, but specific text contained in events. For example, someone's password might have been inadvertently logged and stored in plain text in a repository. Another example could be that someone has requested, under the European GDPR, that no information about them be saved.

The best practice regarding these situations is either not to send the data to LogScale, or to have LogScale not store the data. For the first preventive measure, you might configure your log shipper to filter out passwords and other sensitive data.

For the second measure, you could configure the parser you assign to a datasource so as not to record specific data. You might configure a parser like so:

logscale
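parseJson()
| case {
    // Drop events flagged as sensitive
    data=sensitive | dropEvent();
    // Mask the whole password value
    password=* | replace(regex=".+", field=password, with="XXXXXX");
    // Keep all other events unchanged
    *;
  }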
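
The decision logic of this parser can be sketched in Python to make the three possible outcomes explicit. This is an illustration rather than LogScale code: events are modeled as dictionaries, the function name apply_parser_rules is hypothetical, and the pass-through branch assumes that events matching neither rule should be stored unchanged, which corresponds to the * default clause used in the earlier case examples.

```python
from typing import Optional


def apply_parser_rules(event: dict) -> Optional[dict]:
    """Return the event to store, or None when the event should be dropped."""
    if event.get("data") == "sensitive":
        # Whole event is sensitive: store nothing.
        return None
    if "password" in event:
        # Keep the event but mask the password value.
        return {**event, "password": "XXXXXX"}
    # Assumed default: events matching neither rule are stored unchanged.
    return event


if __name__ == "__main__":
    print(apply_parser_rules({"data": "sensitive", "user": "Jane"}))
    print(apply_parser_rules({"user": "Jane", "password": "hunter2"}))
    print(apply_parser_rules({"user": "Jane", "action": "login"}))
```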

These measures should help greatly to reduce the amount of sensitive data that is recorded. However, there may still be data that makes it through and is stored in a repository. For those, you'll have to redact the specific text.

You can't use the LogScale User Interface to delete text contained in an event entry in a repository. Instead, you'll have to do this from the command line, using the Redact Events API.

Below is an example of how to do this using the curl command:

logscale
$ curl -v "https://$YOUR_LOGSCALE_URL/api/v1/repositories/$REPO_NAME/deleteevents" \
  -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"queryString": "password=*", "startTime": 1551074900671, "endTime": 1612219536721}'

In this example, you would replace $YOUR_LOGSCALE_URL with either the URL of your own server or the URL of the LogScale Cloud environment you're using. For more information, see LogScale URLs & Endpoints.

You will also need to replace $REPO_NAME in the example above with the name of the repository from which you want to delete data.

The last variable you would replace is $TOKEN. Replace it with the default API token for the repository. To find this token, go to the Settings tab in the LogScale User Interface. Click on API Tokens to see a list of your tokens (see Figure 9 here). You can copy the default one from the panel there and paste it into the example above, or create an environment variable by which you would access it.

As for the rest of the example, you will need to adjust the last line, which begins with curl's data option (-d) and carries the JSON request body. Change password=* to the query string for which it should search. Notice the wildcard (*): it makes the query match the password field with any value. This will result in both the key and value being deleted. Be sure to change the start and end times to the range of time on which to search the repository for that query string.
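
Working out the start and end times by hand is error-prone, so the request body can also be assembled programmatically. The following Python sketch builds the same request as the curl example without sending it. It assumes that startTime and endTime are Unix epoch milliseconds, as the values in the curl example suggest. The host name, repository name, and token in the usage lines are placeholders, and the function names to_epoch_millis and build_redact_request are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import Request

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to whole milliseconds since the Unix epoch."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


def build_redact_request(base_url: str, repo: str, token: str,
                         query: str, start: datetime, end: datetime) -> Request:
    """Build (but do not send) a POST request shaped like the curl example."""
    body = {
        "queryString": query,
        "startTime": to_epoch_millis(start),  # assumed to be epoch milliseconds
        "endTime": to_epoch_millis(end),
    }
    return Request(
        url=f"{base_url}/api/v1/repositories/{repo}/deleteevents",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; replace them with your own URL, repository, and token.
    request = build_redact_request(
        "https://logscale.example.com", "my-repo", "my-token", "password=*",
        datetime(2019, 2, 25, tzinfo=timezone.utc),
        datetime(2021, 2, 1, tzinfo=timezone.utc),
    )
    print(request.full_url)
    print(request.data.decode("utf-8"))
```

Printing the URL and body before sending anything makes it easier to review exactly which query and time range the deletion will cover. Passing the request to urllib.request.urlopen() would send it.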

Best Practices

Some best practices to consider following when completing data anonymization of any kind include:

  • Administrator access is necessary to access Group/Role settings globally, and other feature access depends on what your Role permissions are.

  • Query Prefix options are at the Group level, under assigned Roles, and can be found by clicking the 'person' icon at the top right of the UI, then Organizational Settings → Groups → Choose a Group → Select a Role assigned to that Group.

  • You should test hashRewrite() and replace() on data in the Search UI before applying them in a parser, query prefix, or event filter settings.

  • You can create test log lines using the createEvents() function, then run the query as a search and uncomment different lines to see how each one behaves.