The General Data Protection Regulation (GDPR) replaces its predecessor (Directive 95/46/EC) and harmonizes privacy and data protection policy across all EU member states. It also extends beyond EU territory: it affects any entity that stores personal data belonging to an EU resident.
Personal data is defined as any information relating to a natural person (the data subject) that can be used to identify them, directly or indirectly.
With all the data leaks and security breaches we’ve observed lately, the GDPR comes at a convenient time to impose a more active and transparent layer between the individual and the entity that stores or processes the individual’s data.
GDPR stipulates that, in the event of a data breach, individuals must be notified within 72 hours. Failure to comply could result in a heavy fine. Users should also be informed of where and by whom their data is being processed and be able to request to see or erase all of their personal data that the entity holds.
In a time where different parties have an interest in collecting and processing personal data (I’m looking at you Cambridge Analytica), GDPR gives power back to the EU (as in End User – pun intended).
That’s all fine and dandy. We all love privacy and data protection regulations but what exactly does this mean for a software company or a client knocking on our door with the best idea since sliced bread?
First, let’s set the record straight
- GDPR is not a bad thing.
- Canada has been granted partial adequacy status, which means that the EU trusts Canada (and their businesses) enough for international data transfers.
- Not everyone needs to be in compliance with the GDPR.
- A lot of the preparation for the GDPR can make you understand your business data needs better.
- Respecting your user data is respecting your user.
- Minimizing the amount of data you are processing means reducing algorithm complexity, processing and storage costs.
So, what would be the first step?
The first thing we need to define is whether or not we need to be in compliance with GDPR. To get this answer we need to ask ourselves a couple of simple questions:
- Is the system going to store or process data from EU residents/citizens?
- Is there an absolute need to work with data that identifies an individual (non-anonymized data)?
If the answer to either of those questions is a no, then it’s all good. You don’t need to lose any sleep over GDPR.
I work with EU data and can’t anonymize, now what?
Ok, if that’s the case, the most important thing is to understand and map the data you are collecting and processing. This can be done through an easy exercise. First, identify all the personal data provided by your end user that is stored by the system. Let’s call this User Data. Next, map out all the business actions your user can perform on your platform that generate extra data, e.g. making a purchase or managing listings. Let’s call this Business Data. Finally, verify where your system collects metadata: access logs, permission checks, any place that generates data that can be related indirectly to a user action. Let’s call this System Data.
Great, we’ve categorized all of your data into three categories: User, Business and System data. All of these data types are related to your user, but at different levels and on separate layers.
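One lightweight way to make this exercise actionable is to keep a single data inventory that tags every stored field with its category, so that later retrieval and deletion logic can be driven from one place. The sketch below is a minimal illustration; the field names and layout are assumptions, not a real schema:

```python
# Hypothetical data inventory: each stored field is tagged with its
# category (User, Business, or System). Field names are illustrative.
DATA_INVENTORY = {
    "users.email":           "User",
    "users.full_name":       "User",
    "users.shipping_addr":   "User",
    "orders.items":          "Business",
    "orders.total":          "Business",
    "access_log.ip":         "System",
    "access_log.user_agent": "System",
}

def fields_in_category(category):
    """Return every tracked field belonging to one category."""
    return sorted(f for f, c in DATA_INVENTORY.items() if c == category)
```

With an inventory like this, the "gather everything about user X" and "delete everything about user X" features described below become a walk over the relevant fields rather than a scavenger hunt through the codebase.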
If you are wondering why this exercise is important, the answer is quite simple. It will help you determine how to manage this data.
One of GDPR’s strongest points is that it empowers an individual to see or delete any personal data that is being stored.
Let’s take a look at how this affects each of our data categories.
No surprises here. User data is one of the most important parts of any system. This data should be easy to track, retrieve and delete/anonymize. Even in a complex data schema, this information is too important to lose track of, so the actions required here are quite simple. To comply with GDPR, we would need to implement:
- Functionality to gather all User Data (The system will probably have this from the get-go)
- Functionality to delete all User Data
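As a rough illustration of those two features, here is a minimal sketch assuming a simple in-memory store keyed by user id. A real system would walk every table or collection that holds User Data:

```python
# Minimal sketch of the two required operations, assuming an
# in-memory store keyed by user id (an assumption for illustration).
users = {
    42: {"email": "jane@example.com", "name": "Jane Doe"},
}

def gather_user_data(user_id):
    """Return a copy of everything held about this user (access request)."""
    return dict(users.get(user_id, {}))

def delete_user_data(user_id):
    """Erase the user's record entirely; True if something was deleted."""
    return users.pop(user_id, None) is not None
```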
Wait! Hold on! Let’s think about system implementation here. If you delete a user (probably the main actor of your system), all of their aggregated data (Business and System layers) will be deleted along with the user data. Is this what we want? To lose track of all the information this user generated? Think about this at a large scale: a good portion of your users ask you to delete their data. No big deal, right? Except now all your business reports that rely on that dataset are compromised.
If the idea of a soft deletion came to mind, I don’t blame you (you think like a programmer :D). The issue with soft deletion is that we can still track all the data for that user, so it isn’t really GDPR compliant. The aggregation of data (including an individual’s purchase orders and location) may compromise the individual’s identity.
Let’s use the profile of Mitchell Ganton, TTT’s finest stand-up guy, as an example. He is a rare comic book fanatic from Orillia, Ontario. If I bump into a rare comic book order shipping to Orillia, I’d say there is a 90% chance the order belongs to Mitch (100% if it is an issue from the “Invincible” series).
The solution to this problem is simple in concept but tricky to implement.
We need to mask this data so it can no longer be related to the individual that made the request.
In order to implement this, we need to verify which data is important to the business and check whether we can still preserve its value after masking it. Keep in mind that this isn’t always possible, and we will sometimes need to make choices about which data to keep.
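Here is one possible masking sketch: identity fields are replaced with a truncated salted hash, while the fields the business still needs (totals, categories) are kept as-is. The field list and salt handling are assumptions for illustration; note that a salted hash is pseudonymization rather than full anonymization, so for a genuine erasure you would also discard the salt or any mapping back to the person:

```python
import hashlib

# Assumed identity fields; a real system would drive this list
# from its data inventory.
IDENTITY_FIELDS = {"email", "name", "shipping_address"}

def mask_record(record, salt="change-me"):
    """Replace identity fields with an irreversible token, keep the rest."""
    masked = {}
    for field, value in record.items():
        if field in IDENTITY_FIELDS:
            # One-way salted hash: stable for joins, not reversible.
            digest = hashlib.sha256((salt + str(value)).encode()).hexdigest()
            masked[field] = digest[:12]
        else:
            masked[field] = value
    return masked
```

The business value (e.g. the order total) survives, while the record can no longer be tied back to the individual who made the request.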
Business data is generated by user actions. One user can perform multiple actions, multiple times. This seems a little more complicated, but if we consider that business data only stores a reference to the user, then once we mask the user data we should be fine, right?
Yes, I mean, partially.
One of the things most businesses do is interact with other businesses. Interaction means sharing data. Under GDPR you are still liable for any third parties that process data for your system. If you share data with another system, you have to make sure you either send anonymized data, so there is no chance the third-party application can trace it back to your user, or that your data processors are themselves GDPR compliant.
Another concern is the mapping of any collected analytical data. For the sake of sanity, I’d suggest that all analytical data be anonymized on collection and stored in a data lake used exclusively for running analytical operations.
For this layer, you’d need to implement:
- Functionality to gather a particular individual’s business data.
- Functionality to gather non-anonymized data that was sent to a third party data processor
- An integration with any third party data processor to manage an individual’s data, that is, being able to send it a delete request
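That third-party integration might look something like the following sketch. The endpoint path, payload shape, and authentication scheme are all assumptions; every real processor exposes its own deletion API:

```python
import json
import urllib.request

# Hypothetical integration: ask a third-party processor to erase one
# data subject. Endpoint, payload, and auth scheme are assumptions.
def build_erasure_request(base_url, subject_id, api_key):
    """Build (but do not send) a deletion request for one data subject."""
    return urllib.request.Request(
        f"{base_url}/gdpr/erasure",
        data=json.dumps({"subject_id": subject_id}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def request_erasure(base_url, subject_id, api_key):
    """Send the request; treat HTTP 202 as 'accepted for deletion'."""
    req = build_erasure_request(base_url, subject_id, api_key)
    with urllib.request.urlopen(req) as resp:
        return resp.status == 202
```

Separating request construction from sending keeps the integration testable without network access, and gives you one place to log exactly what was sent to each processor.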
One rule we can establish by now is that the more data you have, the harder it is to manage. For this reason, the system data layer is going to be the most complex one to handle from a computational point of view. We are talking about tracking and auditing every single request from a user in your system. In a more complex system, that would mean tracking every redirect, every permission validation, every database operation and so on.
Like it or not, all that data is related to an individual and needs to be taken into consideration. In order to reduce complexity, let’s separate it into two groups: persistent and volatile data. If we need to log everything, let’s differentiate between data we keep as an audit trail (persistent data) and data we store for system operations, health checks and debugging (volatile data).
For persistent data, since the system may need to scale, we should consider a centralized log module. That way we can manage user data more easily in a multi-service architecture. For volatile data, we should make sure all user data is masked: operation logs should be sanitized of any personal data and refer only to request identifiers. A log rotation policy should also be applied.
So here’s the list of features we should implement:
- A centralized persistent audit service that aggregates system operation logs in one place (exposing an API for managing that log)
- Log sanitization and rotation
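For the sanitization piece, a minimal sketch could scrub personal data from each log line before it is written and attach only a request identifier. The email pattern below is a bare-bones example, not an exhaustive PII scrubber:

```python
import re

# Assumed pattern: redact email addresses. A real scrubber would also
# cover names, IP addresses, phone numbers, and so on.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def sanitize(line, request_id):
    """Strip personal data from a log line, keeping a request identifier."""
    scrubbed = EMAIL_RE.sub("[redacted-email]", line)
    return f"req={request_id} {scrubbed}"
```

Because the sanitized line carries only a request identifier, you can still correlate it with the audit trail for debugging without the volatile logs themselves ever containing personal data.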
Another important step that is beyond the scope of this post, but very important to mention, is educating your sysadmins, DBAs, and DevOps team and putting the proper structure in place for them. Your team should understand that any generated data needs to comply with GDPR rules; that means backups, server environments, databases, and so on. Take some time to get the infrastructure team together to define a strategy.
As I mentioned earlier, understanding the data your system will process from your users is the most important job you have when pursuing GDPR compliance. As long as you approach this step by step, there’s no need to lose sleep over it. I hope this post sheds some light on how to approach GDPR.
If you need further assistance please don’t hesitate to contact us. We have been developing an extensive list of server libraries with GDPR in mind, and would be happy to support you in becoming compliant.
Microsoft and KPMG GDPR Virtual Conference, March 28, 2018.