In the Age of Data Privacy
With data privacy laws cropping up worldwide, some data teams have become paralyzed by uncertainty about how to navigate this new world.
This guidebook is meant to help data teams negotiate tricky waters, assisting everyone - from the data analyst to data scientist and IT professional - in developing insights that move their organization forward in getting value out of data.
Jeremy is the product manager for the data privacy compliance features launched in Dataiku 5.1 and was responsible for putting GDPR compliance in place for Dataiku as a wider company. He is passionate about collaboration and knowledge sharing in the data space, especially in the context of today's regulatory environment.
Lynn follows and writes about technological and regulatory trends and developments in the world of data. She is passionate about making data science, machine learning, and other complex topics accessible to everyone, including those in traditionally non-technical roles and those without a deep background in computer science.
With some industry experts naming 2019 as the year of increased data regulation, the use of data across roles and industries will only become more restricted. But that doesn’t have to mean a pause or paralysis in data use.
Companies that are organized for these changes will be able to continue moving forward in their machine learning and AI efforts with minimal disruption. This guidebook walks through the myths & realities of the following topics and suggests data team processes for compliance with each:
It also explores more generally how data science platforms can help, including a walkthrough of specific data privacy regulation features in Dataiku.
It is true that GDPR has specific provisions for anonymized personal data and that anonymization, if done correctly, can provide more flexibility in working with data because it places that data outside the scope of GDPR or other data regulations.
Anonymizing personal data is a good way to allow lines of business as well as data teams (from analysts to data scientists) to work freely with data. However, true anonymization is extremely difficult to achieve, and companies looking to use anonymization as a solution should be aware of this and verify that data is actually completely anonymous before allowing it to be freely used across the business.
It is so difficult to completely anonymize data that even big companies with tons of resources (like Netflix) make mistakes. When they first introduced the Netflix Prize in 2006 - the competition to design the best recommendation engine - the company released 100 million “anonymized” ratings with a unique subscriber ID, movie title, year of release, and date of rating. However, several researchers at the University of Texas were able to identify some of these users just a few weeks later.
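One way a data team might sanity-check a supposedly anonymized dataset before releasing it is to count how many records are unique on a combination of seemingly harmless columns. The sketch below is an illustration only, not the method used in the Netflix research and not a Dataiku feature. It borrows the idea of k-anonymity (every record should share its quasi-identifier values with at least k-1 others); the function names, column names, and sample rows are all hypothetical.

```python
from collections import Counter


def _group_counts(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[c] for c in quasi_identifiers) for r in records)


def smallest_group_size(records, quasi_identifiers):
    """Return the size of the smallest group (the 'k' in k-anonymity).

    A result of 1 means at least one record is unique on these columns
    and could be singled out by anyone who knows those values.
    Returns 0 for an empty dataset.
    """
    counts = _group_counts(records, quasi_identifiers)
    return min(counts.values()) if counts else 0


def unique_records(records, quasi_identifiers):
    """Return the records that are the only one in their group."""
    counts = _group_counts(records, quasi_identifiers)
    return [
        r for r in records
        if counts[tuple(r[c] for c in quasi_identifiers)] == 1
    ]


# Hypothetical sample data: the subscriber ID is a pseudonym, but the
# movie and rating date together may still single someone out.
ratings = [
    {"subscriber_id": 1, "movie": "A", "rating_date": "2005-03-01"},
    {"subscriber_id": 2, "movie": "A", "rating_date": "2005-03-01"},
    {"subscriber_id": 3, "movie": "B", "rating_date": "2005-03-02"},
]

k = smallest_group_size(ratings, ["movie", "rating_date"])
at_risk = unique_records(ratings, ["movie", "rating_date"])
```

Here the third record is the only one with its movie and date combination, so replacing names with IDs did not make it anonymous. A check like this can only flag obvious problems; passing it does not prove that data is anonymous, since an attacker may combine the dataset with outside information.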
There are several techniques for anonymization, but note that not all of them work for all cases and that the optimal solution should be considered and executed on a case-by-case basis depending on the type and source of data.