We have all more or less accepted that we are living in some kind of dime-store George Orwell novel where our every movement is tracked and recorded in some way.
Everything we do today, especially if there’s any kind of gadget or electronics involved, generates data that is of interest to someone. That data is constantly being gathered and stored, used by someone to build up a picture of the world around us.
The average person today is much more aware of the importance of their own data security. We all understand that the wrong data in the wrong hands can be used to wreak havoc on both individuals and society as a whole.
Now that there is a much greater general awareness of the importance of data privacy, it is much more difficult for malicious actors to unscrupulously gather sensitive data from us, as most people know not to hand it over.
Data Protection Laws
In most jurisdictions, there are laws and regulations in place that govern how personal data can be collected, stored, shared, and accessed.
While these laws are severely lacking in a number of areas, the trend in recent years has been to increasingly protect individuals from corporate negligence and excess, which has been welcomed by most consumers.
Probably the best-known data protection law is the GDPR, or General Data Protection Regulation, which came into force in 2018. Though in theory it has power only within the EU, in practice the law applies to every company that deals with EU citizens.
Its strict privacy requirements have made many businesses reconsider how they handle data, threatening misbehavers with fines of up to 4% of the company’s annual turnover, which for the largest companies can climb into billions of euros.
Unlike the EU, the US has no single regulation on the federal level to protect the data of its citizens. Acknowledging that, some states have released their own privacy laws.
Probably the most extensive of them to date is the CCPA, or California Consumer Privacy Act.
The act will come into force at the beginning of 2020 and grant the citizens of California many of the same rights that EU citizens have come to enjoy.
It will allow Californians to find out what data is collected about them and where it is used, to say no to the sale of their data, and to request its deletion.
One common theme that has emerged in the regulations from different jurisdictions is the notion of anonymized data. As the name implies, this is data that cannot be tied to a specific individual.
A set of anonymized data might be presented as belonging to a particular individual, but the identity of the subject is not revealed in the data.
Data anonymization presents an attractive common ground between the rights of consumers and those that want to make use of their personal data.
After all, information about who we are and what we do has long been the driving force behind many of today’s largest companies, including Google, Facebook, and Amazon.
But private corporations are not the only beneficiaries of our data. By removing any personally identifiable information from a dataset and anonymizing it, researchers are able to work with large and detailed datasets that contain a wealth of information without having to compromise any individual’s privacy.
By anonymizing data, we are also able to encourage people to share data that they would otherwise hold on to. Businesses and governments can access and trade vast amounts of data without infringing anyone’s privacy, thanks to anonymization.
Meanwhile, users don’t have to worry about data they generate being recorded and revealing information about them personally.
Data Anonymization Techniques
There are many ways to anonymize data, varying in cost and difficulty.
Perhaps the easiest technique is simply to remove some of the user’s direct identifiers, the pieces of personal information that point straight at one individual. For instance, an insurance company could delete a customer’s name and date of birth and call the data as good as anonymized.
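As a rough illustration of this first technique, here is a minimal Python sketch that drops direct identifiers from a record. The field names and the customer record are made up for the example and do not come from any real insurer’s schema.

```python
# Hypothetical field names; a real schema will differ.
DIRECT_IDENTIFIERS = {"name", "date_of_birth"}


def strip_direct_identifiers(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the listed identifier fields."""
    return {key: value for key, value in record.items() if key not in identifiers}


# Invented example record.
customer = {
    "name": "Jane Doe",
    "date_of_birth": "1984-03-12",
    "postcode": "SW1A 1AA",
    "policy_type": "home",
}

stripped = strip_direct_identifiers(customer)
print(stripped)
```

Notice that the postcode is still there. Fields like this one are not direct identifiers, but they are exactly what makes cross-referencing possible later on.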
Another method is to generalize the data of multiple users to reduce their precision. For instance, you could remove the last digits of a postcode or present a person’s age in a range rather than the exact number.
Generalization is one of the methods Google uses to achieve k-anonymity. This elaborate term simply means that every combination of identifying properties, such as ZIP code, should be shared by at least a certain number of people (defined by the letter k).
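To make generalization and k-anonymity a little more concrete, here is a small Python sketch. The records, the field names, and the choice to keep three postcode characters and ten-year age bands are all assumptions for illustration, not a description of how Google or anyone else does it.

```python
from collections import Counter


def generalize(record):
    """Coarsen the postcode to its first three characters and the age to a decade."""
    decade = (record["age"] // 10) * 10
    return {
        "postcode": record["postcode"][:3] + "*",
        "age_range": f"{decade}-{decade + 9}",
    }


def smallest_group(records, keys):
    """Size of the smallest group of records that share the same values for keys.

    A table is k-anonymous on those keys when this number is at least k.
    """
    groups = Counter(tuple(r[k] for k in keys) for r in records)
    return min(groups.values())


# Invented sample data.
people = [
    {"postcode": "90210", "age": 34},
    {"postcode": "90211", "age": 37},
    {"postcode": "90214", "age": 31},
    {"postcode": "10001", "age": 52},
    {"postcode": "10003", "age": 58},
]

generalized = [generalize(p) for p in people]

print(smallest_group(people, ["postcode", "age"]))             # raw data
print(smallest_group(generalized, ["postcode", "age_range"]))  # generalized data
```

In the raw data every person is alone in their group, so the smallest group has size 1. After generalization the smallest group has two members, which makes this toy table 2-anonymous on those two fields.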
One more way is to introduce noise into the dataset. By noise I mean swapping the values of certain properties between individuals or groups.
For example, this method could switch your car ownership details with another person’s. Your profile would change, but the whole dataset would remain intact for statistical analysis.
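Swapping can be sketched in a few lines of Python. This is only one simple way to do it (shuffling a single column across all records), and the `owns_car` field and sample people are invented for the example.

```python
import random
from collections import Counter


def swap_attribute(records, field, seed=0):
    """Shuffle one field's values across records, leaving other fields untouched.

    A shuffle can leave some values where they started, so this is a sketch
    of the idea rather than a guarantee that every record changes.
    """
    rng = random.Random(seed)
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]


# Invented sample data.
people = [
    {"id": 1, "city": "Leeds", "owns_car": True},
    {"id": 2, "city": "Leeds", "owns_car": False},
    {"id": 3, "city": "York", "owns_car": True},
    {"id": 4, "city": "York", "owns_car": False},
    {"id": 5, "city": "Hull", "owns_car": True},
]

swapped = swap_attribute(people, "owns_car")

# The overall car ownership counts are the same before and after.
print(Counter(p["owns_car"] for p in people))
print(Counter(p["owns_car"] for p in swapped))
```

Individual rows may no longer tell the truth about their owners, but the totals that a statistician cares about are unchanged.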
Finally, you can further protect the anonymized data you need to share by sampling it – that is, releasing the dataset in small batches. In theory, sampling helps to reduce the risk of re-identification.
Even if the data is enough to identify you as an individual, statistically there should be at least several other people with the same characteristics as you. Without having the whole dataset, there is no way to tell which person it really is.
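Sampling is the simplest of these techniques to sketch. The function name, the 10% fraction, and the generated records below are assumptions made for the example.

```python
import random


def release_sample(records, fraction, seed=0):
    """Return a random subset containing roughly the given fraction of records."""
    rng = random.Random(seed)
    size = max(1, round(len(records) * fraction))
    return rng.sample(records, size)


# Invented dataset of 1,000 records.
records = [{"id": i} for i in range(1000)]

sample = release_sample(records, 0.1)
print(len(sample))
```

Someone who finds a matching record in the released batch cannot tell from the batch alone whether it belongs to you or to a look-alike who was left out.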
Other data anonymization techniques exist, but these are some of the main ones.
So, anonymization makes everyone a winner, right? Well, not quite.
Anyone who has worked extensively with data can testify as to just how little information is needed to identify a specific individual out of a database of many thousands.
One of the consequences of the massive volumes of data that now exist on all of us is that different data sources can be cross-referenced to identify common elements.
In some cases, this cross-referencing can instantly deanonymize entire data sets, depending on how exactly they have been anonymized.
Researchers were able to recover surnames of US males from a database of genetic information by simply making use of publicly available internet resources.
A publicly available dataset of London’s bike-sharing service could be used not only to track trips but also to work out who actually made them.
Anonymized Netflix movie ratings were mapped to individuals by cross-referencing them with IMDB data, thus revealing some very private facts about users. These are only a few of the many similar examples.
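The cross-referencing behind attacks like these can be surprisingly simple. The Python sketch below is not the method used in any of the cases above. It is a toy linkage between two invented tables, and the field names, people, and diagnoses are all made up.

```python
from collections import defaultdict

# Hypothetical quasi-identifiers shared by both tables.
QUASI_IDENTIFIERS = ("zip_code", "birth_date", "gender")


def link(anonymized, public, keys=QUASI_IDENTIFIERS):
    """Attach a name to each anonymized row that matches exactly one public row."""
    index = defaultdict(list)
    for person in public:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = []
    for row in anonymized:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # an unambiguous match re-identifies the row
            matches.append((names[0], row))
    return matches


# "Anonymized" records: names removed, everything else kept.
health_records = [
    {"zip_code": "02138", "birth_date": "1961-07-04", "gender": "F", "diagnosis": "asthma"},
    {"zip_code": "02139", "birth_date": "1975-11-23", "gender": "M", "diagnosis": "diabetes"},
    {"zip_code": "02140", "birth_date": "1990-02-15", "gender": "F", "diagnosis": "migraine"},
]

# A separate, publicly available list that does include names.
public_register = [
    {"name": "Alice Example", "zip_code": "02138", "birth_date": "1961-07-04", "gender": "F"},
    {"name": "Bob Example", "zip_code": "02139", "birth_date": "1975-11-23", "gender": "M"},
    {"name": "Carol Example", "zip_code": "02140", "birth_date": "1990-02-15", "gender": "F"},
    {"name": "Dana Example", "zip_code": "02140", "birth_date": "1990-02-15", "gender": "F"},
]

matches = link(health_records, public_register)
for name, row in matches:
    print(name, "->", row["diagnosis"])
```

Two of the three records are tied back to a name. The third escapes only because two people in the public list happen to share the same ZIP code, birth date, and gender.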
Since the introduction of the GDPR, a number of businesses have been looking for ways of continuing to handle large volumes of customer data without falling afoul of the new regulations.
Many organizations have come to view anonymized datasets as a means of potentially circumventing the regulations. After all, if data isn’t tied to specific individuals, it can’t infringe on their privacy.
No Such Thing as Anonymous
According to new research by a team from Imperial College London and their counterparts at Belgium’s Université Catholique de Louvain, it is incredibly hard to anonymize data properly.
In order for data to be completely anonymous, it needs to be presented in isolation. You can use a VPN or change your IP address, for example (you can find more information about proxy servers on Proxyway).
If enough anonymized data is given about an individual, all it takes is a simple cross-reference with other databases to ascertain who the data concerns.
Using their own prediction model, the researchers made a startling discovery: it would take only 15 pieces of demographic information to re-identify 99.98% of Americans.
What is more, only four base attributes (ZIP code, date of birth, gender, and the number of children) would be needed to confidently identify 79.4% of the population of Massachusetts. According to the study, releasing data in small samples is not enough to protect an individual from detection.
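You can get a feel for why each extra attribute matters with a much cruder exercise than the researchers’ prediction model: simply count how many records are one of a kind as more fields are taken into account. The Python sketch below does that on six invented records. It uses a birth year instead of a full date of birth to keep the table small, and it says nothing about the real figures in the study.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Fraction of records whose combination of values for keys appears only once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Invented sample data.
records = [
    {"zip_code": "02138", "birth_year": 1980, "gender": "F", "children": 2},
    {"zip_code": "02138", "birth_year": 1980, "gender": "F", "children": 0},
    {"zip_code": "02138", "birth_year": 1975, "gender": "M", "children": 1},
    {"zip_code": "02139", "birth_year": 1980, "gender": "F", "children": 2},
    {"zip_code": "02139", "birth_year": 1980, "gender": "M", "children": 1},
    {"zip_code": "02139", "birth_year": 1980, "gender": "M", "children": 1},
]

attributes = ["zip_code", "birth_year", "gender", "children"]
for n in range(1, len(attributes) + 1):
    keys = attributes[:n]
    print(keys, unique_fraction(records, keys))
```

With ZIP code alone nobody in this toy table stands out. With all four fields, four of the six people are unique.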
Bear in mind that, while researchers can deanonymize the records of an entire state, data brokers like Experian are selling anonymized datasets that contain hundreds of data points for each individual.
According to the researchers’ work, this data is anonymized in name only, and anyone with the capacity to handle large datasets also has the resources to deanonymize them easily.
It doesn’t matter what methods are used to anonymize data. Even the more advanced techniques like k-anonymity might not be sufficient – not to mention that they are expensive.
In most cases, all that happens is that only immediately identifiable data like names and addresses are removed. This is far from enough.
The researchers’ findings urge us not to fall into a false sense of security. They also challenge the methods companies use to anonymize data in light of the strict regulatory requirements set forth by the GDPR and the forthcoming CCPA.
The long battle to get the average internet user to care about their data and privacy has been a tiring one. Anyone who has worked in cybersecurity over the last couple of decades can testify as to how much things have improved, but there is still a long way to go.
The notion that people’s data can be anonymized and rendered harmless is both incorrect and dangerous. It is important that people properly understand the implications of handing their data over. Don’t give up your data under the false impression that it can’t be tied to you.
Mokhtar is the founder of LikeGeeks.com. He has worked as a Linux system administrator since 2010. He is responsible for maintaining, securing, and troubleshooting Linux servers for multiple clients around the world. He loves writing shell and Python scripts to automate his work.