Mask Sensitive SSN Data

Consistently hash social security numbers for privacy using the tokenHash() function

Query

Flow: Events -> Filter -> Result Set
tokenHash(ssn)

Introduction

The tokenHash() function can be used to consistently hash sensitive data while maintaining referential integrity. This allows for data analysis while protecting personally identifiable information (PII).

In this example, the tokenHash() function is used to hash social security numbers, replacing the original values with consistent hash values that can still be used for analysis and correlation.

Example incoming data might look like this:

@timestamp            ssn          transaction_type  amount
2023-08-06T10:00:00Z  123-45-6789  deposit           1000.00
2023-08-06T10:01:00Z  987-65-4321  withdrawal        500.00
2023-08-06T10:02:00Z  123-45-6789  withdrawal        200.00
2023-08-06T10:03:00Z  456-78-9012  deposit           1500.00
2023-08-06T10:04:00Z  987-65-4321  deposit           750.00
2023-08-06T10:05:00Z  123-45-6789  check             300.00
2023-08-06T10:06:00Z  456-78-9012  withdrawal        400.00
2023-08-06T10:07:00Z  234-56-7890  deposit           2000.00
2023-08-06T10:08:00Z  987-65-4321  withdrawal        100.00
2023-08-06T10:09:00Z  123-45-6789  deposit           500.00
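
As a rough illustration of what consistent hashing means for this data, the following Python sketch masks the ssn field of the first three sample events. It is not how tokenHash() is implemented: the helper name mask_ssn, the choice of SHA-256, and the 16-character truncation are all assumptions made for the example.

```python
import hashlib


def mask_ssn(ssn: str) -> str:
    """Return a deterministic token for an SSN.

    Hypothetical helper for illustration only; this is not tokenHash().
    """
    digest = hashlib.sha256(ssn.encode("utf-8")).hexdigest()
    return digest[:16]  # truncated for readability; an illustrative choice


# First three events from the sample data above.
events = [
    {"@timestamp": "2023-08-06T10:00:00Z", "ssn": "123-45-6789",
     "transaction_type": "deposit", "amount": 1000.00},
    {"@timestamp": "2023-08-06T10:01:00Z", "ssn": "987-65-4321",
     "transaction_type": "withdrawal", "amount": 500.00},
    {"@timestamp": "2023-08-06T10:02:00Z", "ssn": "123-45-6789",
     "transaction_type": "withdrawal", "amount": 200.00},
]

# Replace the ssn value in place, leaving every other field untouched.
masked = [{**event, "ssn": mask_ssn(event["ssn"])} for event in events]

for event in masked:
    print(event["@timestamp"], event["ssn"],
          event["transaction_type"], event["amount"])
```

The first and third events carry the same SSN and therefore receive the same token, so they can still be correlated after masking.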

Step-by-Step

  1. Start with the events in the source repository.

  2. Filter step (Events -> Filter -> Result Set):
    tokenHash(ssn)

    Creates a consistent hash value for each unique social security number in the ssn field.

    The hash value replaces the original SSN while maintaining uniqueness, allowing for analysis of patterns and relationships without exposing sensitive data. The function uses a secure hashing algorithm and returns the result in the same field.

    The hash values are deterministic, meaning the same input will always produce the same hash value within the same repository, enabling consistent analysis across multiple queries.

  3. The event result set is returned.
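
The determinism described in step 2 can be sketched outside LogScale with a keyed hash. This assumes, purely for illustration, that the scope of consistency behaves like a secret key. The source does not say how tokenHash() achieves consistency within a repository, and the key values and the keyed_token helper below are hypothetical.

```python
import hashlib
import hmac


def keyed_token(ssn: str, key: bytes) -> str:
    """Deterministic keyed hash (HMAC-SHA256) of an SSN.

    Hypothetical helper; the keying scheme is an assumption, not a
    description of tokenHash().
    """
    return hmac.new(key, ssn.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical per-scope keys, invented for this example.
repo_a_key = b"hypothetical-key-for-repository-a"
repo_b_key = b"hypothetical-key-for-repository-b"

first = keyed_token("123-45-6789", repo_a_key)
second = keyed_token("123-45-6789", repo_a_key)
other = keyed_token("123-45-6789", repo_b_key)

# Same input and same key: identical tokens on every run.
same_within_scope = first == second
# Same input but a different key: the tokens no longer match.
differs_across_scopes = first != other

print(same_within_scope, differs_across_scopes)
```

Under this assumption, tokens stay comparable across queries that share a scope, while tokens from different scopes cannot be joined against each other.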

Summary and Results

The query is used to protect sensitive social security numbers while maintaining the ability to analyze patterns and relationships in the data.

This query is useful, for example, to comply with data privacy regulations while still being able to track user behavior, identify patterns, or investigate suspicious activities across multiple transactions.

Sample output from the incoming example data (first six events shown):

@timestamp            ssn                 transaction_type  amount
2023-08-06T10:00:00Z  a1b2c3d4e5f6g7h8i9  deposit           1000.00
2023-08-06T10:01:00Z  j9k8l7m6n5o4p3q2r1  withdrawal        500.00
2023-08-06T10:02:00Z  a1b2c3d4e5f6g7h8i9  withdrawal        200.00
2023-08-06T10:03:00Z  s2t3u4v5w6x7y8z9a1  deposit           1500.00
2023-08-06T10:04:00Z  j9k8l7m6n5o4p3q2r1  deposit           750.00
2023-08-06T10:05:00Z  a1b2c3d4e5f6g7h8i9  check             300.00

Note that the same SSN values are consistently hashed to the same token values, maintaining the relationships in the data while protecting the original sensitive information.

The hashed data can be used in dashboard widgets, such as tables that show transaction patterns by hashed SSN or Sankey diagrams that visualize transaction flows between accounts. For security monitoring, consider creating alerts that trigger on unusual patterns of activity for a specific hashed SSN.
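
Because each token stands in for exactly one SSN, per-person aggregation still works on the masked data. The sketch below, written in Python rather than LogScale, counts and totals the sample output rows per token and flags tokens at or above a threshold. The names summarize and ALERT_THRESHOLD, and the threshold value itself, are invented for this example.

```python
from collections import defaultdict

# Rows from the sample output: (timestamp, hashed ssn, type, amount).
rows = [
    ("2023-08-06T10:00:00Z", "a1b2c3d4e5f6g7h8i9", "deposit", 1000.00),
    ("2023-08-06T10:01:00Z", "j9k8l7m6n5o4p3q2r1", "withdrawal", 500.00),
    ("2023-08-06T10:02:00Z", "a1b2c3d4e5f6g7h8i9", "withdrawal", 200.00),
    ("2023-08-06T10:03:00Z", "s2t3u4v5w6x7y8z9a1", "deposit", 1500.00),
    ("2023-08-06T10:04:00Z", "j9k8l7m6n5o4p3q2r1", "deposit", 750.00),
    ("2023-08-06T10:05:00Z", "a1b2c3d4e5f6g7h8i9", "check", 300.00),
]


def summarize(rows):
    """Return {token: (transaction_count, total_amount)}."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for _timestamp, token, _kind, amount in rows:
        counts[token] += 1
        totals[token] += amount
    return {token: (counts[token], totals[token]) for token in counts}


ALERT_THRESHOLD = 3  # hypothetical: flag tokens with this many events or more

summary = summarize(rows)
flagged = sorted(
    token for token, (count, _total) in summary.items()
    if count >= ALERT_THRESHOLD
)

for token, (count, total) in summary.items():
    print(token, count, total)
print("flagged:", flagged)
```

With the six sample rows, only the token a1b2c3d4e5f6g7h8i9 reaches the threshold, and the analysis never touches the original SSN.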