Use Case: Hashing, Masking, and Anonymizing Sensitive Data
When using LogScale, there may be data ingested that you don't want stored in a LogScale repository. For example, data may contain PII (Personally Identifiable Information), it may be confidential information that should only be available to a limited set of users, or users may have requested their data not be saved. With LogScale you can apply different measures during the ingestion phase to anonymize parts of the ingested messages.
Data Anonymization Approaches
There are five main approaches to anonymizing data when using LogScale:
Employ role-based access to provide limits,
Use a parser during the data ingest process to prevent the data from being stored,
Configure your log shipper to filter out sensitive data,
Redact data manually or by using the Redact Events API to delete text contained in an event entry in a repository, or
Anonymize data during search and then make the data available in dashboards.
How you approach this issue depends on the data itself — you may want to prevent the data from being sent at all, have LogScale discard the data or not store it, or use role-based access and permissions to limit access to your repositories and views. It's also possible to anonymize data during search and then make the data available in dashboards.
Note
There is currently no way to prevent users from accessing the underlying query and removing the parts that anonymize the data.
All options mentioned above can be managed by GraphQL API calls or via the UI. More information about GraphQL can be found at GraphQL API.
Data Anonymization Challenges
LogScale can anonymize or pseudonymize data during the data ingestion phase. This can have consequences with regard to what the data can tell you. Depending on what is anonymized, it may no longer be possible to see who performed certain actions, what they did, or from where they did it. If, for example, all user names are anonymized and a hacker gains access to the credentials of one of these accounts, it may not be possible to find out which account has been compromised.
LogScale can't automatically identify what should or should not be anonymized. The task of identifying such data lies with the data owner.
Data Anonymization and Role-based Access
It's generally accepted good practice to have a robust and thoughtful approach to role-based access for data. Configuring LogScale is no exception, and when anonymizing data, it arguably should be your first approach. There are a few options for this approach:
Use the copyEvent() function during the ingest phase,
Use hashRewrite() or replace() at the Group level, preserving the original log and making only certain data available to certain Groups, or
Use Query Prefixes to act as a search filter and apply role-based access based on that filter.
Users with greater access can use a parser to copy events to another repository using the copyEvent() function during the ingest phase, then anonymize them. Users with lesser access would have access to only the anonymized data, while those with greater access would maintain non-anonymized events for reference.
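A minimal sketch of this pattern in a parser (the type name, field, and salt values here are placeholders; actually routing the copy to a second repository would additionally rely on event forwarding, which is not shown):
copyEvent("anonymized")
| case { #type="anonymized" | hashRewrite(field="user", salt="salt1"); * }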
Note
You will need to be an Administrator to access the Group/Role settings globally, and other feature access depends on what your Role permissions are.
Query Prefixes
Query prefixes can be applied to data to aid with constraining searches and in this case can be used to limit available data to users with fewer permissions. You can find the Query Prefix option at the Group level, under assigned Roles, by clicking the person icon (top right) → Organizational Settings → Groups → Choose a Group → select a Role assigned to that Group.
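For instance, a restricted group's query prefix could be as simple as a filter query (the field name here is hypothetical):
data != sensitive
Every search run by members of that group is then implicitly prefixed with this filter.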
Using replace() and Role-Based Access
There are two main methods for using the replace() function: applying it to the entire log, or applying it to a specific field and masking part or all of that field. Assume we have the following log line, and we want to mask some or all of the email:
2023-05-16 09:45:39 -0500 DEBUG auth=user@test.com remote_ip=19.24.410.103 session=19zefhqy9Gh6Y
The first method would be to apply replace() to the entire log:
| replace("auth=(?<email>.+@[^\s]+)\s+", with="####MASKED_EMAIL####")
The second method would be to apply replace() to only a specific field:
| replace(regex="auth=(?<username>.+)@",with="####MASKED_USERNAME####")
Using a Parser for Data Anonymization
As part of the data ingestion process, data typically passes through a parser relevant to the type of data being ingested (see Parsing Data for more information).
When using a parser to anonymize data, there are three options:
Drop the entire event or specific field that contains sensitive data
Redact the data containing confidential or personal information from events, or
Change or pseudonymize the data with confidential or personal information using hashing.
During data ingestion, raw messages are assigned to the @rawstring field. Then the assigned parser extracts fields and associated values.
For example, look at this event:
2022-05-24 10.0.0.1 user=Jane action="accessed file xyz"
After the message has been parsed, the following fields may exist:
@rawstring
@timestamp
ipaddress
user
action
Note
Different events will have different fields. These fields are common examples, but this list is not exhaustive.
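As an illustrative sketch, a parser along the following lines could produce those fields from the sample event (the regex, timestamp format, and timezone are assumptions for this example):
// extract the timestamp and IP address with named capture groups
/^(?<ts>\S+)\s+(?<ipaddress>\S+)/
| parseTimestamp("yyyy-MM-dd", field=ts, timezone="UTC")
// pick up the key=value pairs (user, action)
| kvParse()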
Dropping events that contain sensitive data using dropEvent()
dropEvent() can be used both during queries and within the parser pipeline. When the event is dropped, it's removed entirely.
Let's look at an example that uses the dropEvent() function to remove an event:
|…
| case { (user="Jane" or ipaddress="10.0.0.1")
| dropEvent(); * }
For more information, see our documentation on dropEvent(). Now let's drop a field instead of an entire event using the drop() function.
Dropping fields that contain sensitive data using drop()
The drop() function drops a field from your data set, but the original data is likely still available in the @rawstring field, which means you must also rewrite the content of that field.
Here's an example:
| case { (user="Jane" or ipaddress="10.0.0.1")
| format(format="%s %s", field=[@timestamp, action], as=@rawstring)
| drop([user, ipaddress]); * }
And this is the event after parsing:
2022-05-24 action="accessed file xyz"
This is why the example rewrites @rawstring with format() before dropping the fields: dropping a field alone would leave the sensitive values visible in the raw message.
Redacting Data Using eval()
To change the value within a field rather than dropping the event or the field itself entirely, we would use the power of eval() in its shorthand form, :=. The field before := will be assigned the value of whatever comes after it, including strings, functions, other fields, etc.
Let's take a look at an example:
…
| case { password=*
| password:="secret"
| format(format="%s %s %s %s", field=[@timestamp, ipaddress, password, action], as=@rawstring); * }
After the event is parsed, the result will look like this:
2022-05-24 10.0.0.1 password=secret action="accessed file xyz"
Again, the original data would otherwise still be available via the @rawstring field, which is why the example also rewrites the content of that field.
Changing or Pseudonymizing Data Using hashMatch() or hashRewrite()
The hashRewrite() function calculates a secure, salted hash of a field for storing in any given event. As the salt value is the same for all ingested data, knowledge of this value means that hashed values can easily be matched against the original value.
If the salt value is known, using the hashMatch() function allows you to find events where a field value has been hashed with hashRewrite(). The advantage is that you can find events matching a particular value even after those values have been hashed; the drawback is that the hashed values are not particularly safe.
Here is an example of how to apply hashRewrite() to an example log line for a specific field:
//hash apply to a specific field
| hashRewrite(field="email", salt="salt1")
Here's an example with hashMatch():
hashMatch(input=?userid, field=user, salt="salt1")
A query like this creates an input field where users can enter a value that matches the hashed value.
Redacting Data Manually or with Redact Events API
There may be certain data that you don't want stored in a LogScale repository, maybe not whole events, but specific text contained in events. For example, someone's password might have been inadvertently logged and stored in plain text in a repository. Another example could be that someone under the European GDPR has requested all information on them not be saved.
The best practice regarding these situations is either not to send the data to LogScale, or to have LogScale not store the data. For the first preventive measure, you might configure your log shipper to filter out passwords and other sensitive data.
For the second measure, you could configure the parser you assign to a datasource so as not to record specific data. You might configure a parser like so:
parseJson()
| case {
data=sensitive | dropEvent();
password=* | replace(regex=".*", field=password, with="XXXXXX");
* }
These measures should help greatly to reduce the amount of sensitive data that is recorded. However, there may still be data that makes it through and is stored in a repository. For those, you'll have to redact the specific text.
You can't use the LogScale User Interface to delete text contained in an event entry in a repository. Instead, you'll have to do this from the command-line, using the Redact Events API.
Below is an example of how to do this using the curl command:
$ curl -v "https://$YOUR_LOGSCALE_URL/api/v1/repositories/$REPO_NAME/deleteevents" \
-X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"queryString": "password=*", "startTime": 1551074900671, "endTime": 1612219536721}'
In this example, you would replace $YOUR_LOGSCALE_URL with either the URL of your own server or the URL of the LogScale Cloud environment you're using. For more information, see LogScale URLs & Endpoints.
You will also need to replace $REPO_NAME in the example above with the name of the repository from which you want to delete data.
The last variable you would replace is $TOKEN. Replace it with the default API token for the repository. To find this token, go to the Settings tab in the LogScale User Interface and click API Tokens to see a list of your tokens. You can copy the default one from the panel there and paste it into the example above, or create an environment variable by which you would access it.
As for the rest of the example, you will need to adjust the last line, preceded by the data option (i.e., -d). Change password=* to the text for which to search. Notice the wildcard (i.e., *): it matches any value, so both the key and the value will be deleted. Be sure to change the start and end times to the range of time over which to search the repository for that query string.
Anonymizing Data During Search
LogScale supports anonymization during search using hashRewrite() and replace().
These can be applied in three places:
Append by Group and Role using Query Prefixes (recommended)
Append to all queries in a view via Event Filters
When creating a Parser — this method will permanently change the raw event text
Query Prefixes
Query prefixes are effectively search filters, and can be applied to any search. For example, the Redact Events API uses query prefixes to complete deletion filter queries. Query prefixes will act as a filter when any member of a specified group searches a repository, and allows partitioning of data at search time. It's also possible to define a default query prefix if a default role has been selected.
Note
Default query prefixes are applied to all searches in all repositories unless an exception is defined.
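For example, a query prefix assigned to a less-privileged group could reuse the masking expression shown earlier, so that every search the group runs returns only masked usernames (a sketch, not a definitive configuration):
replace(regex="auth=(?<username>.+)@", with="####MASKED_USERNAME####")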
It's also important to remember that Field Aliasing can't be used in conjunction with query prefixes, especially with the Redact Events API: aliased fields are disabled for those queries, and the API operates only on parsed fields. If the same query is run on the Search page where Field Aliasing is set up, for example to check which events will be deleted before running the API, the search will produce different results. To avoid this discrepancy, disable the field aliasing configuration when running the query on Search, or ensure you are not using aliased fields in the filter query executed via the API.
Event Filters
Event filters are a component of LogScale's Views and are a powerful way of reducing the data displayed on a dashboard, allowing you to look at a subset of your data. To learn more about event filters, see our documentation on creating a repository or view: Creating a Repository or View.
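As a sketch, an event filter on a view could hash a field for everyone using that view (the field name and salt value are placeholders):
hashRewrite(field="user", salt="salt1")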
Parsers
A parser consists of a script, plus a few related settings. The parser script is the main part of the parser, as this defines how a single incoming event is transformed before it becomes a searchable event. For more information, see Parsing Data.
Examples of hashRewrite() and replace()
Using hashRewrite() might look something like this:
hashRewrite(field="target.username", salt="hide")
Using replace() might look something like this:
//sample data
2023-05-16 09:45:39 -0500 DEBUG auth=user@test.com remote_ip=19.24.410.103 session=19zefhqy9Gh6Y
//method 1 - apply to entire log
| replace("auth=(? <email>.+@[^\s]+)\s+", with="####MASKED_EMAIL####")
// applies to entire raw message
// method 2 - apply to a specific field and mask part of that field
| replace(regex="auth=(?<username>.+)@",with="####MASKED_USERNAME####")
Best Practices
Some best practices to consider following when completing data anonymization of any kind include:
Administrator access is necessary to access Group/Role settings globally, and other feature access depends on what your Role permissions are.
Query Prefix options are at the Group level, under assigned Roles, and can be found by clicking the 'person' icon at the top right of the UI, then Organizational Settings → Groups → Choose a Group → Select a Role assigned to that Group.
You should test hashRewrite() and replace() on data in the Search UI before applying them in parser, query prefix, or event filter settings.
You can create test log lines using the createEvents() function, then run this as a search, uncommenting different lines to see how each works.
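A minimal sketch of such a test query, reusing the sample log line from earlier (uncomment one line at a time to compare results):
createEvents(["2023-05-16 09:45:39 -0500 DEBUG auth=user@test.com remote_ip=19.24.410.103 session=19zefhqy9Gh6Y"])
| kvParse()
// | replace("auth=(?<email>.+@[^\s]+)\s+", with="####MASKED_EMAIL####")
// | hashRewrite(field="auth", salt="salt1")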