Estimating the success of re-identifications in incomplete datasets using generative models
File(s)
s41467-019-10933-3.pdf (7.01 MB)
Published version
Author(s)
Rocher, Luc
Hendrickx, Julien
de Montjoye, Yves-Alexandre
Type
Journal Article
Abstract
While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.
Date Issued
2019-07-23
Date Acceptance
2019-06-11
Citation
Nature Communications, 2019, 10 (7)
URI
http://hdl.handle.net/10044/1/74787
DOI
https://www.dx.doi.org/10.1038/s41467-019-10933-3
ISSN
2041-1723
Publisher
Nature Research (part of Springer Nature)
Journal / Book Title
Nature Communications
Volume
10
Issue
7
Copyright Statement
© The Author(s) 2019. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
License URL
http://creativecommons.org/licenses/by/4.0/
Subjects
Science & Technology
Multidisciplinary Sciences
Science & Technology - Other Topics
BIG DATA
DISCLOSURE
PRIVACY
HEALTH
DISTRIBUTIONS
ANONYMITY
FAILURE
RISK
Data Analysis
Data Anonymization
Datasets as Topic
Likelihood Functions
Normal Distribution
Personally Identifiable Information
Publication Status
Published
Article Number
ARTN 3069