Biomedical data, navigating between public health advancements and privacy challenges

By
Mathias Humbert
July 19, 2017

A new era for biomedical research

Thanks to the fast progress in sequencing technologies, the cost of molecular profiling has plummeted over the last decade, making a colossal amount of biological data available for research.

Since the first sequencing of the human genome in 2001, more than one million individuals have had their genomes sequenced. This pace is expected to accelerate, with close to one billion genomes sequenced by 2025 [1]. These newly available biomedical data have dramatically changed medical research and promise to revolutionize the practice of medicine in the near future. Big biomedical data provide the means for a more personalized, predictive and precise medicine.

However, data-driven medicine has a major downside: by transforming medicine’s trust model, in place since Hippocrates, it creates unprecedented privacy risks that urgently need to be addressed.

Re-identification risks

Privacy risks in biomedical data stem from the correlations that exist along the various dimensions of these data. The first dimension relates to the different biomedical data types (often referred to as “-omic” data). Using a computer-network analogy, we can model the biological system as a stack of layers: from the genomic layer (which contains our DNA sequence) to the phenomic layer (which contains, for instance, our physical traits), via the epigenomic and transcriptomic layers. Various relationships exist between these layers.

For instance, we have shown that one can re-identify individuals’ genomes by matching them to the phenotypic traits of individuals in another database, such as an online social network [2].

[Figure: Re-identification attack via phenotypic traits]

The second dimension relates to individuals and their relationships. Family members’ data are inherently correlated due to the laws of inheritance, which gives an attacker with some background knowledge a means to re-identify personal genomes. Researchers demonstrated that surnames could be inferred from genomic data by querying recreational genealogy databases with short tandem repeats on the Y chromosome [3]. In addition, probabilistic dependencies between relatives’ autosomal (i.e., non-sex) chromosomes can be used to reconstruct someone’s genomic data by observing the genetic information of their relatives [4].

The third dimension is the correlation between different positions or regions within a single genome (or another type of biomedical data), which must be carefully considered when assessing potential privacy risks. Indeed, with access to even a small subset of biomedical data, the missing parts can be inferred thanks to these correlations (this is known as imputation in genomics).
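The familial dependencies described above can be illustrated with a minimal sketch of Mendelian inheritance at a single biallelic position. It is not the inference method of the cited work: the function names, the 0/1/2 genotype encoding (number of minor alleles) and the Hardy-Weinberg population prior are all assumptions made for illustration.

```python
from fractions import Fraction


def transmit_prob(parent_genotype: int) -> Fraction:
    """Probability that a parent passes on the minor allele.

    Genotypes are encoded as the number of minor alleles (0, 1 or 2);
    this encoding is an assumption of the sketch.
    """
    return Fraction(parent_genotype, 2)


def child_distribution(mother: int, father: int) -> dict[int, Fraction]:
    """Distribution of the child's genotype given both parents (Mendel's law)."""
    pm, pf = transmit_prob(mother), transmit_prob(father)
    return {
        0: (1 - pm) * (1 - pf),
        1: pm * (1 - pf) + (1 - pm) * pf,
        2: pm * pf,
    }


def hardy_weinberg_prior(maf: Fraction) -> dict[int, Fraction]:
    """Population prior over genotypes for a given minor allele frequency."""
    return {0: (1 - maf) ** 2, 1: 2 * maf * (1 - maf), 2: maf ** 2}


def posterior_parent_given_child(child: int, maf: Fraction) -> dict[int, Fraction]:
    """Posterior over one parent's genotype after observing the child's.

    The other parent is unknown and is marginalised out under the prior.
    """
    prior = hardy_weinberg_prior(maf)
    joint = {}
    for parent, p_parent in prior.items():
        likelihood = sum(
            p_other * child_distribution(parent, other)[child]
            for other, p_other in prior.items()
        )
        joint[parent] = p_parent * likelihood
    total = sum(joint.values())
    return {genotype: p / total for genotype, p in joint.items()}
```

With a minor allele frequency of 10%, the prior probability that a given parent carries the minor allele is 19%. After observing a child with genotype 2, `posterior_parent_given_child(2, Fraction(1, 10))` puts that probability at 100%, even though the parent never shared any data.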

Finally, one may think that the genome is the only type of biomedical data vulnerable to re-identification attacks, but unfortunately it is not. We have shown that, again due to correlations between different layers of the biological stack, one can match DNA methylation profiles (one of the most important epigenetic elements) to individual genomes with a success rate of 97.5% to 100% for databases of thousands of participants [5]. In addition to epigenomic data, researchers have demonstrated that transcriptomic profiles (gene expression data) could also be re-identified, with a matching success rate of 97.1% against databases of 300 million genomes [6].
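The idea behind such matching attacks can be sketched as a maximum-likelihood assignment. The code below is a toy illustration, not the method of the cited studies: it assumes a hypothetical model that has already turned each genome into a predicted probability of observing state 1 at each measured site, and that each profile is a vector of binary states. All names and values are made up.

```python
import math


def log_likelihood(observed: list[int], predicted_probs: list[float]) -> float:
    """Log-likelihood of a binary profile under per-site predicted probabilities."""
    eps = 1e-9  # keeps log() finite when a prediction is exactly 0 or 1
    total = 0.0
    for state, prob in zip(observed, predicted_probs):
        prob = min(max(prob, eps), 1 - eps)
        total += math.log(prob if state == 1 else 1 - prob)
    return total


def match_profiles(
    profiles: dict[str, list[int]],
    genome_predictions: dict[str, list[float]],
) -> dict[str, str]:
    """Link each anonymous profile to the genome that explains it best."""
    matches = {}
    for profile_id, observed in profiles.items():
        matches[profile_id] = max(
            genome_predictions,
            key=lambda genome_id: log_likelihood(
                observed, genome_predictions[genome_id]
            ),
        )
    return matches


# Toy data (entirely made up): three genomes, two anonymous profiles.
genome_predictions = {
    "genome_A": [0.9, 0.1, 0.8, 0.2],
    "genome_B": [0.2, 0.9, 0.1, 0.7],
    "genome_C": [0.5, 0.5, 0.9, 0.9],
}
profiles = {
    "profile_1": [0, 1, 0, 1],
    "profile_2": [1, 0, 1, 0],
}
matches = match_profiles(profiles, genome_predictions)
```

Each profile is matched independently here, so two profiles could be linked to the same genome. A real attack would more likely solve a one-to-one assignment and would need many more sites to separate thousands of candidates.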

Protection mechanisms

After having thoroughly assessed the privacy risks stemming from biomedical data, we can develop appropriate protection mechanisms by relying on cryptographic techniques (as in ) or on differential privacy methods [7]. We will elaborate on these mechanisms in a subsequent article, notably discussing the trade-off between privacy and utility in the biomedical context.
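To give a flavour of the differential privacy approach, here is a minimal sketch of the standard Laplace mechanism applied to a simple counting query (how many participants carry a minor allele at one position). It is offered only as an illustration and is not the mechanism of any work cited here; the function names and the 0/1/2 genotype encoding are assumptions.

```python
import random
from typing import Optional


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_carrier_count(
    genotypes: list[int],
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Release a noisy count of carriers of the minor allele.

    Adding or removing one participant changes the count by at most 1,
    so the sensitivity of the query is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for genotype in genotypes if genotype > 0)
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)
```

A smaller `epsilon` gives stronger privacy but noisier counts, which is the privacy-utility trade-off in its simplest form.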

We strongly believe that ensuring data contributors’ privacy is a key step in fostering the data sharing necessary for research progress.

Biomedical data are intrinsically privacy-sensitive, and the risks of sharing them must be well understood and controlled without destroying the benefits of data-driven medicine, such as the scientific breakthroughs on the links between cancer and genetics (see, e.g., the BRCA1/BRCA2 genes and breast cancer).

Bibliography

  1. Z. Stephens et al., Big Data: Astronomical or Genomical?, PLOS Biology, 2015
  2. M. Humbert et al., De-anonymizing Genomic Databases with Phenotypic Traits, PoPETS, 2015
  3. M. Gymrek et al., Identifying Personal Genomes by Surname Inference, Science, 2013
  4. M. Humbert et al., Quantifying Interdependent Risks in Genomic Privacy, ACM TOPS, 2017
  5. M. Backes et al., Identifying Personal DNA Methylation Profiles by Genotype Inference, IEEE S&P, 2017
  6. E. Schadt et al., Bayesian method to predict individual SNP genotypes from gene expression data, Nature Genetics, 2012
  7. F. Tramer et al., Differential Privacy with Bounded Priors: Reconciling Utility and Privacy in Genome-Wide Association Studies, ACM CCS, 2015

About the author

Mathias Humbert
Sr. Data Scientist

Mathias received his Ph.D. in computer and communication sciences from EPFL in 2015. He then spent two years as a post-doctoral researcher in the Center for IT-Security, Privacy, and Accountability (CISPA) at Saarland University, Germany, where he worked on genomic privacy and privacy in social networks. His current research interests lie at the intersection of privacy and machine learning, with a special application focus on biomedical data. He is currently the lead scientist for the SDSC of the PHRT project “DPPH: Data Protection in Personalized Health”. He is also co-principal investigator of a project funded by the Leenaards Foundation on evaluating and preventing privacy risks in biomedical databases.
