How-To: Anonymizing Data

Where data contains PII (Personally Identifiable Information) or other confidential information that should only be available to some users, you can usually address the issue with role-based access control, applying the relevant permissions to LogScale repositories or views. However, some organizations have a requirement to make only part of a message unavailable.

How LogScale can help

With LogScale you can apply different measures during the ingestion phase to anonymize parts of the ingested messages.

It is possible to anonymize data during search and then make this data available in dashboards. However, there is currently no way to prevent users from accessing the underlying query and removing the parts that anonymize the data.

One potential option not covered here would be to have a parser copy events (using the copyEvent() function) during the ingestion phase to another repository prior to anonymizing them. Using role-based access, some users could then have access to the non-anonymized events, while other users would only have access to the anonymized data.

Data anonymization challenges

LogScale can anonymize or pseudonymize data during the data ingestion phase. This can have consequences with regard to what the data can tell you. Depending on what is anonymized, it may no longer be possible to see who performed certain actions, what they did, or from where they did it. If, for example, all user names are anonymized and a hacker gains access to the credentials of one of these accounts, it may not be possible to determine which account has been compromised.

LogScale can't automatically identify what should or should not be anonymized. The task of identifying such data lies with the data owner.

Data anonymization options

Below are some options showing, at a high level, how a parser can anonymize or pseudonymize data during the LogScale data ingestion phase.

As part of the data ingestion process, data typically passes through a parser relevant to the type of data being ingested (see Parsing Data for more information). During the parsing process the following options exist when data should be anonymized:

  • Drop the whole event containing data with confidential or personal information

  • Remove the data containing confidential or personal information from events

  • Change or pseudonymize the data with confidential or personal information
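The three options can be illustrated outside LogScale as well. The following Python sketch (illustrative only, not LogScale parser code) applies each option to a parsed event represented as a dictionary; the field names mirror the sample event used below, and the `SENSITIVE` set and `"secret"` placeholder are assumptions for the example.

```python
# Illustrative sketch of the three anonymization options, applied to a
# parsed event represented as a dict. Not LogScale code.

SENSITIVE = {"user", "ipaddress"}  # assumed set of confidential fields

def drop_event(event):
    """Option 1: discard the whole event."""
    return None

def remove_fields(event):
    """Option 2: remove the fields containing confidential information."""
    return {k: v for k, v in event.items() if k not in SENSITIVE}

def change_fields(event, placeholder="secret"):
    """Option 3: overwrite confidential values with a fixed placeholder."""
    return {k: (placeholder if k in SENSITIVE else v) for k, v in event.items()}

event = {"@timestamp": "2022-05-24", "ipaddress": "10.0.0.1",
         "user": "Jane", "action": "accessed file xyz"}

print(drop_event(event))     # None
print(remove_fields(event))  # event without user and ipaddress
print(change_fields(event))  # user and ipaddress replaced by "secret"
```

Each function returns a new value rather than mutating the input, which keeps the three options easy to compare side by side.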

During data ingestion, the raw message is assigned to a field named @rawstring. The assigned parser then typically extracts fields and associated values from the message. Consider this sample event:

2022-05-24 10.0.0.1 user=Jane action="accessed file xyz"

After the message has been parsed, the following fields may exist:

  • @timestamp — 2022-05-24

  • ipaddress — 10.0.0.1

  • user — Jane

  • action — accessed file xyz

  • @rawstring — the original, unmodified message

Dropping events

Partial parser content

logscale
…
| case { (user="Jane" or ipaddress="10.0.0.1") 
| dropEvent(); * }
Dropping fields

Partial parser content

logscale
…
| case { (user="Jane" or ipaddress="10.0.0.1")
// Rewrite @rawstring from the non-sensitive fields, then drop the sensitive fields
| format(format="%s action=\"%s\"", field=[@timestamp, action], as=@rawstring)
| drop([user, ipaddress]); * }

Sample event after parsing

2022-05-24 action="accessed file xyz"

Although the drop() function drops a particular field, the original data is likely to still be available in the @rawstring field, hence the requirement to rewrite the content of that field.
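The reason the rewrite must come first can be sketched in a few lines of Python (illustrative only; the event shape is assumed from the sample above): dropping the fields alone leaves the confidential values behind in the raw message, so the raw message is rebuilt from the non-sensitive fields before the fields are removed.

```python
# Sketch: dropping a field is not enough if the raw message still contains
# the value, so the @rawstring equivalent must be rewritten first.

event = {
    "@rawstring": '2022-05-24 10.0.0.1 user=Jane action="accessed file xyz"',
    "@timestamp": "2022-05-24",
    "ipaddress": "10.0.0.1",
    "user": "Jane",
    "action": "accessed file xyz",
}

# Rewrite the raw message from the non-sensitive fields first...
event["@rawstring"] = '{} action="{}"'.format(event["@timestamp"], event["action"])

# ...then drop the sensitive fields.
for field in ("user", "ipaddress"):
    del event[field]

print(event["@rawstring"])  # 2022-05-24 action="accessed file xyz"
```

Reversing the two steps would fail: once `user` and `ipaddress` are deleted, they are no longer available, yet their values would still sit in the original raw message.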

Change data

Partial parser content

logscale
…      
| case { user="*"
| user:="secret"
// Rebuild @rawstring so it reflects the changed user value
| format(format="%s %s user=%s action=\"%s\"", field=[@timestamp, ipaddress, user, action], as=@rawstring); * }

Sample event after parsing

2022-05-24 10.0.0.1 user=secret action="accessed file xyz"

Although the value of a particular field is changed, the original data is likely to still be available in the @rawstring field, hence the requirement to rewrite the content of that field.

Pseudonymize data

Partial parser content

logscale
…
| case { user="*" 
| hashRewrite(user, salt="salt1"); * }

Sample event after parsing

2022-05-24 10.0.0.1 user=ZXETIQViONIUP6Fg4dtIwbTMavVylrl9CRKNDIijh6o action="accessed file xyz"

As the salt value is the same for all ingested data, anyone with knowledge of the salt can easily match the hashed values against the original values; the hashMatch() query function can do this for you. The advantage is that you can find events matching a particular value even after those values have been hashed. The disadvantage is that the hashed values are not particularly secure: given the salt, an attacker can simply hash candidate values and compare the results.
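The trade-off can be demonstrated with a short Python sketch. This is illustrative only: LogScale's actual hashing scheme in hashRewrite() is not specified here, and the `pseudonymize()` and `matches()` helpers (salted SHA-256, base64-encoded) are assumptions for the example. The point is that a fixed salt makes hashes deterministic, so known salt plus candidate value lets anyone reproduce and match the hash.

```python
import base64
import hashlib

def pseudonymize(value: str, salt: str) -> str:
    """Hash a field value with a fixed salt (illustrative stand-in for hashRewrite())."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

def matches(candidate: str, hashed: str, salt: str) -> bool:
    """Check a plaintext candidate against a stored hash (stand-in for hashMatch())."""
    return pseudonymize(candidate, salt) == hashed

stored = pseudonymize("Jane", "salt1")

# With the salt known, a candidate value can be confirmed by rehashing:
print(matches("Jane", stored, "salt1"))   # True
print(matches("John", stored, "salt1"))   # False

# A different salt produces an unrelated hash, so matching fails:
print(pseudonymize("Jane", "salt2") == stored)  # False
```

This determinism is exactly what makes searching pseudonymized data possible, and exactly what makes the scheme weak against anyone who obtains the salt.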

Matching pseudonymized data

If the salt value is known, using the hashMatch() query function allows finding events where a field value has been hashed with hashRewrite().

A query such as:

logscale
hashMatch(input=?userid, field=user, salt="salt1")

will create an input field (the ?userid query parameter) where users can enter a value to match against the hashed values in the user field.