Several studies have demonstrated the ease with which de-identified datasets, like medical and financial records, can be re-identified. It has all seemed sort of theoretical. Like sure, some researchers at a university can do this. That’s concerning, but is it really likely to happen in real life?
Apparently, it is. A team of researchers from Imperial College London and the University of Louvain has developed an algorithm to estimate the probability that your anonymized data can be re-identified (linked back to you) by, as they say, your employer or your neighbor, using only three simple data points.
With only date of birth, zip code, and sex, data can be re-identified 83% of the time on average, and “99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes.”
Now that makes it all a bit more tangible, doesn’t it? And the best part is, they developed an online machine learning tool you can use to see how easily your data can be re-identified.
And this is not just an American problem. As the researchers go on to assert, “Datasets such as the NIGMS and NIH genetic data, the Washington State Health Data, the NYC Taxicab dataset, the Transport For London bike sharing dataset, and the Australian de-identified Medicare Benefits Schedule (MBS) and Pharmaceutical Benefits Schedule (PBS) datasets have been shown to be easily re-identifiable.”
According to the online tool, the likelihood I could be identified with those three data points was below average. That is, my records are less likely to be identified, based solely on my birth date, zip code, and sex. But add a few more data points (marital status, number of vehicles, employment, home ownership), and the likelihood I could be correctly identified from a record containing that information shot to 100%.
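To see why a few extra attributes make such a difference, here is a minimal sketch that counts how many records in a small table are unique on a given set of attributes. It is not the researchers' method (their paper uses a generative model to estimate uniqueness across a whole population); it is plain counting on a sample. The records, the field names, and the `uniqueness` helper are all made up for illustration.

```python
from collections import Counter

def uniqueness(records, attributes):
    """Return the fraction of records whose combination of values
    on `attributes` appears exactly once in `records`."""
    if not records:
        return 0.0
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)

# Hypothetical, invented records; no real people or data.
records = [
    {"birth_year": 1980, "zip": "02139", "sex": "F", "marital": "married", "vehicles": 1},
    {"birth_year": 1980, "zip": "02139", "sex": "F", "marital": "single",  "vehicles": 1},
    {"birth_year": 1980, "zip": "02139", "sex": "M", "marital": "married", "vehicles": 2},
    {"birth_year": 1980, "zip": "02139", "sex": "M", "marital": "married", "vehicles": 0},
    {"birth_year": 1975, "zip": "02139", "sex": "F", "marital": "single",  "vehicles": 1},
]

base = ["birth_year", "zip", "sex"]
for attrs in (base, base + ["marital"], base + ["marital", "vehicles"]):
    print(len(attrs), "attributes:", uniqueness(records, attrs))
```

In this toy table, one record in five is unique on the three basic attributes, three in five once marital status is added, and all five once the number of vehicles is included. Real datasets are far larger, but the direction is the same: each added attribute splits people into smaller groups until most groups contain a single person.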
It’s interesting that the researchers frame the concern in terms of what your employer can find out about you through re-identification. With the uptick in employers’ monitoring of employees physically and digitally, I suppose re-identification of data employees thought were anonymous is simply another monitoring tool.
While badges that record employees’ conversations and track their location seem incredibly intrusive, re-identifying anonymized data goes beyond intrusive. Given the researchers’ framing of the issue in the employment realm, I suppose it’s a concern that goes well beyond what you said on your “anonymous” employee survey.
What You Can Do
Obviously, current standards for de-identifying data are insufficient. Unless you opt to live off the grid, this data will exist in various repositories not under your control. So what can you do?
- Know the data use practices of providers and institutions you entrust with your data, and opt for ones that align with your privacy preferences. Read the privacy policies you receive for online accounts, credit cards, and medical care to see how and with whom your information is being shared. In some cases, your financial information is shared with third parties for marketing purposes. And, under the HIPAA Privacy Rule, your medical records can in certain circumstances be used for research without your express permission. None of these things are inherently threatening, but you should be aware of how your data is used before you provide it.
- Limit the information you provide to potential bad actors. Don’t overshare online, and hand over required data only when it is actually needed, for example at the end of an application process (not the beginning). More than once, the employee opening an account for me has written my Social Security number on a piece of scrap paper during the process. I didn’t think any of those people were necessarily bad actors, but in each case there was no indication of where that paper was going at the end of the transaction. You bet I politely retrieved that piece of paper in every case.
- Close accounts you don’t use to limit your digital footprint and decrease the amount of information vulnerable to breach. An account you neglect can become a real liability. For more information on why and how to deal with zombie accounts, read How Zombie Accounts Are Killing Your Cybersecurity.
Minimizing the data you share and being careful about whom you share it with is about all that is under your control. But take that control and assert your privacy and security in any way you can.
Rocher L, Hendrickx JM, de Montjoye Y-A. Estimating the success of re-identifications in incomplete datasets using generative models. Nature Communications. 2019;10(1):3069.