Personally Identifiable Information and Google Analytics

Development Strategy
Phil Price
08/07/18

Best Practices Related to Personally Identifiable Information

In preparation for the General Data Protection Regulation (GDPR), which took effect in May 2018, we brushed up on best practices for handling Personally Identifiable Information (PII). Keeping PII from leaking is a vast topic that can be discussed from many different angles, but this article focuses specifically on its relationship to Google Analytics and on how Google Analytics can assist with PII discovery and suppression. Remember, while Google Analytics can be a helpful tool, you should always aim to address problems in your code so that PII never leaks out in the first place.

What is considered PII?

Any data that can be used to identify an individual is considered PII, including:

  • Obvious things like names, phone numbers, addresses, social security numbers, and email addresses
  • GPS coordinates that can pinpoint an area more specific than one square mile
  • User IDs and any other IDs that can be traced to an individual

How can I use Google Analytics to detect and handle PII in real time?

There is an excellent guide from Brian Clifton that walks through the process of using Google Analytics to detect PII.

Essentially, you can configure Google Tag Manager to run custom JavaScript that intercepts data bound for Google Analytics, detects PII with regular expressions, redacts any PII it finds with string replacement, and finally passes the sanitized data on to Google Analytics for storage and tracking.
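The detect-and-redact step can be sketched as a plain function. This is a minimal illustration, not the code from the guide: the `PII_PATTERNS` list, the `redactPii` name, and both regular expressions are assumptions made for this example, and wiring the function into Google Tag Manager is not shown.

```javascript
// Hypothetical pattern list. These regular expressions are illustrative
// only and will need tuning for the data your own site produces.
var PII_PATTERNS = [
  {
    marker: "[REDACTED_EMAIL]",
    regex: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g
  },
  {
    // Matches US-style numbers such as 555-123-4567 or 555.123.4567.
    marker: "[REDACTED_PHONE_NUMBER]",
    regex: /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/g
  }
];

// Replace every match of every pattern with its marker and return
// the sanitized string. The input is never modified in place.
function redactPii(value) {
  var text = String(value);
  for (var i = 0; i < PII_PATTERNS.length; i++) {
    text = text.replace(PII_PATTERNS[i].regex, PII_PATTERNS[i].marker);
  }
  return text;
}
```

Note that these patterns only catch values that appear unencoded. An email address sent as `jane%40example.com`, for instance, would slip through unless the string is decoded first or a pattern is added for the encoded form.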

It is important to note that the PII detection is only as effective as you’ve programmed it to be. The example code mentioned in the article is an excellent start, but for maximum effectiveness, you should regularly audit your website for PII that may slip through. If you find any additional PII, enhance the JavaScript code with cases to handle your particular data patterns.

For example, if you sometimes add a Member ID to the URL (like https://www.site.com/page?member_id=1234), you will want to add a case to the JavaScript to detect and redact any “member_id” GET parameters.
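A case like this might look something like the sketch below. The `redactQueryParam` helper and the `[REDACTED_MEMBER_ID]` marker are hypothetical names chosen for this example; they do not come from the guide.

```javascript
// Escape characters that have special meaning in a regular expression,
// so the parameter name is always matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace the value of one GET parameter with a marker, leaving the
// parameter name and the rest of the URL untouched.
function redactQueryParam(url, paramName, marker) {
  var pattern = new RegExp("([?&]" + escapeRegExp(paramName) + "=)[^&#]*", "gi");
  return String(url).replace(pattern, function (match, prefix) {
    return prefix + marker;
  });
}

var cleaned = redactQueryParam(
  "https://www.site.com/page?member_id=1234",
  "member_id",
  "[REDACTED_MEMBER_ID]"
);
// cleaned is "https://www.site.com/page?member_id=[REDACTED_MEMBER_ID]"
```

Keeping the parameter name in place and replacing only its value means the sanitized URL still shows which parameter was leaking.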

Can I hash/encrypt my PII instead of redacting it?

Google allows you to hash PII and send it, so long as that data is hashed with a minimum of SHA256 and uses a salt with a minimum of 8 characters. Regardless of how you hash or salt the data, you may not send Google Analytics encrypted Protected Health Information (as defined under HIPAA).
(Source: https://support.google.com/analytics/answer/6366371?hl=en)
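As a rough sketch of what salted SHA256 hashing can look like, the example below uses the `crypto` module built into Node.js. It assumes the hashing happens on your server before the value ever reaches the page; the `hashIdentifier` name and the choice to prepend the salt to the value are assumptions made for this example, so check Google's policy for its exact requirements.

```javascript
var crypto = require("crypto");

// Hash a value with SHA256 after prepending a salt. The salt must be
// kept secret and stay the same across calls so that the same input
// always produces the same hash.
function hashIdentifier(value, salt) {
  if (typeof salt !== "string" || salt.length < 8) {
    throw new Error("Salt must be a string of at least 8 characters");
  }
  return crypto
    .createHash("sha256")
    .update(salt + String(value), "utf8")
    .digest("hex");
}
```

The result is a 64-character hexadecimal string. Hashing is one-way, which is different from encryption: there is no key that turns the hash back into the original value.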

Redaction vs. removal in your Google Tag Manager JavaScript

If you were to completely remove any trace of the PII data from the query string, it would make it a lot tougher to search Google Analytics for potential leaks. However, if you redact PII to leave behind markers like [REDACTED_PHONE_NUMBER] in your Analytics data, you can search through that data periodically looking for these signifiers, pinpoint problem scripts that are leaking PII, and take steps to prevent the data from being exposed in the first place.

How can I search my Google Analytics history for leaked PII or redacted PII markers?

If you’ve made updates to your site to stop exposing PII (like implementing the Google Tag Manager JavaScript injection code mentioned above), you’ll still want to keep an eye on your data to make sure you’re not inadvertently exposing other PII. You can quickly search for those redacted PII markers in Google Analytics using the Behavior > Overview screen, but if you want to dig a little deeper, there is a nice guide at Cardinal Path that teaches the reader how to detect PII in their Google Analytics data using regular expressions. It has the reader open Google Analytics, create a new “segment” for PII reporting, and add regular expression “conditions” to the segment to detect entries that contain PII. As with the PII sanitization, the PII detection is only as effective as you’ve programmed it to be. The regular expressions in the article provide a basic start, but you will want to add regular expression conditions for any and all types of PII that may come from your web server.
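If you export a list of page paths from Google Analytics, a short script can tally the redaction markers for you. The sketch below is one way to do that, under a few assumptions: the `countRedactionMarkers` name is hypothetical, the markers follow the [REDACTED_PHONE_NUMBER] naming style, and they appear unencoded in the exported data.

```javascript
// Count how often each [REDACTED_...] marker appears across a list of
// page paths. Returns an object keyed by marker text.
function countRedactionMarkers(pagePaths) {
  var markerPattern = /\[REDACTED_[A-Z_]+\]/g;
  var counts = {};
  for (var i = 0; i < pagePaths.length; i++) {
    var matches = String(pagePaths[i]).match(markerPattern) || [];
    for (var j = 0; j < matches.length; j++) {
      counts[matches[j]] = (counts[matches[j]] || 0) + 1;
    }
  }
  return counts;
}
```

A marker that keeps showing up for the same page is a good hint about which script or form is leaking PII.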

What happens if PII does make it to Google Analytics?

If Google discovers any PII in your analytics data, they may delete your entire data set. However, word through the grapevine (blog posts) is that if you discover the PII first, carefully quarantine it, and approach Google proactively to ask whether they can simply trim away the quarantined data, they may oblige.

Conclusion

It is important to prevent PII from leaking to Google Analytics, but it is even more important to prevent PII from leaking at all. Google Analytics is just one of many tools (e.g. server logs, mail filters, data migration tools) that you can use to detect PII leaks to address in your source code.

I hope this information helps you in your journey. As web developers, we shoulder a huge amount of trust and responsibility for handling PII. Let’s stay vigilant and keep it safe and secure.