Practical Privacy: How to Increase Data Privacy and Grow Machine Learning

Concur Labs

Data privacy is a huge topic right now for any company that uses personal data, and recent legislative activity, including the possibility of a new federal privacy law, has brought it to the forefront. Consumer concern is also growing: IBM reports that 78 percent of consumers believe a company's ability to keep their data private is very important.

 

At the same time, machine learning improves products, delivering user benefits such as improved personalization, tailored experiences, and less time spent manually filling in forms. But machine learning requires data to train on - without data, it can't function. So, businesses say they face a conundrum: how can they increase user privacy while still building products powered by machine learning?

 

As the IT decision makers for their organizations, CIOs must embrace the idea that privacy is not just an on/off switch where they either collect and use all of the data or none of it. New methods allow increased user privacy while still preserving the accuracy of machine learning systems. Here are three practical options IT leaders can introduce to increase user privacy.

 

Limit the personal data you collect

One of the simplest ways to increase user privacy is to limit the amount of personal data collected in the first place. My team and I built an internal prototype based on the principle that privacy should be a sliding scale, not just an on/off switch.

 

Our idea is an adjustable software feature – a privacy dial – that lets users or their companies control how much information is gathered by removing successive levels of personally identifiable information. Developers can give users a control for choosing how much privacy they want, accompanied by an explanation of the benefits of each option. By understanding how each level of data sharing affects their experience, users gain greater knowledge and control.

The Privacy Dial gives a range of options for increasing user privacy.

 

 

At the lower dial settings, personal data that can directly identify a person is removed. As the setting increases, data that cannot directly identify a single person on its own, but still reveals additional information about an individual, is removed as well. In most cases, personally identifiable information is not useful for a model's predictions, so removing it does not affect the accuracy of the final model.
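
To make the idea concrete, here is a minimal sketch of how such a dial might work in code. The dial levels and field names (name, email, expense_location, and so on) are hypothetical and chosen only for illustration; an actual implementation would map levels to whatever fields a product collects.

```python
# Hypothetical mapping from dial setting to the fields stripped at that level.
# Lower levels remove direct identifiers; higher levels remove indirect ones.
DIAL_LEVELS = {
    1: {"name", "email", "employee_id"},       # direct identifiers
    2: {"phone", "home_address"},              # more direct identifiers
    3: {"job_title", "department"},            # indirect identifiers
    4: {"expense_location", "merchant_name"},  # contextual details
}

def apply_privacy_dial(record: dict, level: int) -> dict:
    """Return a copy of the record with every field removed that is
    covered by dial settings up to and including `level`."""
    removed = set()
    for setting in range(1, level + 1):
        removed |= DIAL_LEVELS.get(setting, set())
    return {k: v for k, v in record.items() if k not in removed}

record = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "employee_id": "E-1042",
    "department": "Sales",
    "expense_location": "Seattle, WA",
    "amount": 54.20,
    "expense_type": "meal",
}

# At level 3, direct and most indirect identifiers are gone, but the fields
# a model is likely to need (amount, expense_type) are untouched.
print(apply_privacy_dial(record, level=3))
```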

 

Federated learning is an excellent, but more complex, option for limiting the amount of data collected from users: a model is trained on each user's device, and only the trained model - not the raw data - is sent back to a central server. The raw data never leaves a user's personal device, yet the approach still allows for high accuracy.
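
As a rough illustration, the sketch below shows a bare-bones version of federated averaging for a simple linear model on synthetic data. It is not any particular product's implementation; real systems add secure aggregation, client sampling, and frameworks such as TensorFlow Federated. It only demonstrates the core idea that trained weights, never raw records, travel to the server.

```python
# Bare-bones federated averaging (FedAvg) sketch on synthetic data.
import numpy as np

rng = np.random.default_rng(0)

def local_train(weights, X, y, lr=0.1, epochs=5):
    """Runs on one client's device: a few gradient steps on local data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # mean-squared-error gradient
        w -= lr * grad
    return w

# Synthetic "on-device" datasets for three clients; these never leave the client.
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

# The server holds only the global weights, never the raw data.
global_w = np.zeros(2)
for _ in range(10):
    client_weights = [local_train(global_w, X, y) for X, y in clients]
    global_w = np.mean(client_weights, axis=0)  # aggregate the trained models

print("learned weights:", global_w)  # should approach [2.0, -1.0]
```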

 

Only use a subset of the data

It’s also possible to increase user privacy at the stage where data is selected to train a machine learning model. One way to do this is to use k-anonymity to make users indistinguishable from others.

 

 

Making one person indistinguishable from 5 others: k-anonymity with k=6.

 

 

K-anonymity is achieved by aggregating or removing data that could indirectly re-identify a person (for example, the location of a business expense) until a certain number of entries are identical. "K" refers to the number of indistinguishable entries in the dataset: if k=3, then every combination of indirectly identifying attributes is shared by at least three entries. However, this method can cause a large drop in the accuracy of a machine learning model and does not provide a strong guarantee of privacy.
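
A small example, assuming a pandas DataFrame with hypothetical expense columns, shows the two usual moves: generalizing a quasi-identifier (city to region) and suppressing rows whose quasi-identifier combination still appears fewer than k times. Both moves discard detail, which is exactly why model accuracy can suffer.

```python
# Illustrative k-anonymity sketch; column names and values are made up.
import pandas as pd

expenses = pd.DataFrame({
    "city":     ["Seattle", "Bellevue", "Tacoma", "Portland", "Eugene", "Salem"],
    "job_band": ["senior",  "senior",   "senior", "junior",   "junior", "senior"],
    "amount":   [54.20,      31.00,      18.75,    22.40,      64.10,    12.30],
})

# Step 1: generalize the quasi-identifier so more rows become identical.
region_of = {"Seattle": "WA", "Bellevue": "WA", "Tacoma": "WA",
             "Portland": "OR", "Eugene": "OR", "Salem": "OR"}
expenses["region"] = expenses["city"].map(region_of)

# Step 2: suppress rows whose quasi-identifier combination appears < k times.
k = 3
quasi_identifiers = ["region", "job_band"]
group_sizes = expenses.groupby(quasi_identifiers)["amount"].transform("size")
anonymized = expenses.loc[group_sizes >= k, quasi_identifiers + ["amount"]]

print(anonymized)
# Only the (WA, senior) group survives: its quasi-identifier combination is
# shared by 3 entries, so no one in it can be singled out.
```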

 

Prevent data leaks in the model’s predictions

Machine learning models can expose rare examples from their training data in their predictions, causing a possible loss of privacy for users. Differential privacy can prevent this. Differential privacy is a mathematical guarantee that, for a given computation over the data, the probability of any specific result being returned is nearly the same whether or not an individual's data is in the dataset. So, a differentially private machine learning model makes virtually the same predictions whether a person's data is included or not - it learns about the population, not the individual.
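
The simplest way to see the definition in action is the Laplace mechanism on a count query. The sketch below is illustrative only - training an actual model with differential privacy would typically use a DP-SGD implementation such as TensorFlow Privacy or Opacus - and the epsilon values are arbitrary.

```python
# Laplace mechanism sketch: the most basic building block of differential privacy.
import numpy as np

rng = np.random.default_rng(0)

def private_count(records, predicate, epsilon):
    """Differentially private count of records matching `predicate`.
    A count query has sensitivity 1 (adding or removing one person changes
    the true answer by at most 1), so Laplace noise with scale 1/epsilon
    gives epsilon-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

expenses = [{"amount": a} for a in (54.2, 31.0, 18.7, 220.0, 64.1, 12.3)]

# Smaller epsilon = more noise = stronger privacy; larger epsilon = more accuracy.
for epsilon in (0.1, 1.0, 10.0):
    answer = private_count(expenses, lambda r: r["amount"] > 50, epsilon)
    print(f"epsilon={epsilon}: noisy count of large expenses = {answer:.2f}")
```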