Biomedical data, navigating between public health advancements and privacy challenges

Data-driven medicine has a major downside: by transforming medicine’s trust model, in place since Hippocrates, it creates unprecedented privacy risks that need to be urgently addressed.
By Mathias Humbert, July 19, 2017

A new era for biomedical research

Thanks to rapid progress in sequencing technologies, the cost of molecular profiling has plummeted over the last decade, making a colossal amount of biological data available for research.

Since the first sequencing of the human genome in 2001, more than one million individuals have had their genomes sequenced. This pace is expected to accelerate, with close to one billion genomes sequenced by 2025 [1]. These newly available biomedical data have dramatically changed medical research and promise to revolutionize the practice of medicine in the near future. Big biomedical data provide the means for a more personalized, predictive and precise medicine.

However, data-driven medicine has a major downside: by transforming medicine’s trust model, in place since Hippocrates, it creates unprecedented privacy risks that need to be urgently addressed.

Re-identification risks

Privacy risks in biomedical data stem from correlations along several dimensions of these data. The first dimension relates to the different biomedical data types (often referred to as “-omic” data). Using a computer-network analogy, we can model the biological system as a stack of layers: from the genomic layer (which contains our DNA sequence) to the phenomic layer (which contains, for instance, our physical traits), via the epigenomic and transcriptomic layers. Various relationships exist between these layers.

For instance, we have shown that one can re-identify individuals’ genomes by matching them to the phenotypic traits of individuals in another database, such as an online social network [2].

Figure: Re-identification attack via phenotypic traits
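
To give a feel for how such a matching could work, here is a minimal sketch in Python. It is not the method of the cited paper: the trait model, its probabilities, and all identifiers (`TRAIT_MODEL`, `genome_A`, `profile_1`, and so on) are made up for illustration. The sketch assumes that each genome belongs to exactly one profile, scores every genome/profile pair by the likelihood of the observed traits given the genotype, and keeps the one-to-one assignment with the highest total score.

```python
from itertools import permutations
from math import log

# Hypothetical model: P(trait value | genotype at an associated SNP).
# Genotypes are coded as the number of trait-associated alleles (0, 1 or 2).
# All probabilities below are invented for illustration only.
TRAIT_MODEL = {
    "eye_color": {0: {"brown": 0.85, "blue": 0.15},
                  1: {"brown": 0.60, "blue": 0.40},
                  2: {"brown": 0.05, "blue": 0.95}},
    "hair_color": {0: {"dark": 0.80, "blond": 0.20},
                   1: {"dark": 0.50, "blond": 0.50},
                   2: {"dark": 0.10, "blond": 0.90}},
}


def log_likelihood(genotype, phenotype):
    """Log-probability of the observed traits given one person's genotype."""
    return sum(log(TRAIT_MODEL[trait][genotype[trait]][value])
               for trait, value in phenotype.items())


def match_genomes_to_profiles(genotypes, phenotypes):
    """Return the one-to-one assignment with the highest total log-likelihood.

    Brute force over all permutations: fine for a toy example, but real
    databases need a polynomial-time assignment algorithm instead.
    """
    genome_ids = list(genotypes)
    best_score, best_assignment = float("-inf"), None
    for candidate in permutations(phenotypes):
        score = sum(log_likelihood(genotypes[g], phenotypes[p])
                    for g, p in zip(genome_ids, candidate))
        if score > best_score:
            best_score = score
            best_assignment = dict(zip(genome_ids, candidate))
    return best_assignment


# "Anonymous" genomes on one side, public profiles with visible traits on the other.
genotypes = {
    "genome_A": {"eye_color": 2, "hair_color": 2},
    "genome_B": {"eye_color": 0, "hair_color": 0},
    "genome_C": {"eye_color": 2, "hair_color": 0},
}
phenotypes = {
    "profile_1": {"eye_color": "brown", "hair_color": "dark"},
    "profile_2": {"eye_color": "blue", "hair_color": "blond"},
    "profile_3": {"eye_color": "blue", "hair_color": "dark"},
}

print(match_genomes_to_profiles(genotypes, phenotypes))
```

Even with only two weakly informative traits, the toy genomes can be told apart. With more traits per profile, the same scoring idea separates far more individuals, which is what makes this kind of linkage a real concern.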

The second dimension relates to individuals and their relationships. Family members’ data are inherently correlated due to the laws of inheritance, which gives an attacker with some background knowledge another means of re-identifying personal genomes. Researchers demonstrated that surnames can be inferred from genomic data by querying recreational genealogy databases with short tandem repeats on the Y chromosome [3]. Beyond that, probabilistic dependencies also exist between relatives’ autosomal (i.e., non-sex) chromosomes, and these can be used to reconstruct someone’s genomic data by observing the genetic information of her relatives [4].

The third dimension of correlations is between different positions or regions within a single genome (or another type of biomedical data), which must be carefully considered when assessing potential privacy risks. Indeed, even with access to only a small subset of biomedical data, the missing parts can be inferred because of these correlations (this is known as imputation in genomics).
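
The dependency between relatives mentioned above can be illustrated with a few lines of Python. This is a simplified sketch, not the inference framework of the cited work: it looks at a single autosomal position, assumes Mendelian inheritance, and assumes that unobserved genotypes follow Hardy-Weinberg proportions for a given minor allele frequency strictly between 0 and 1. The function names are our own. Given only a child’s genotype, it computes what an attacker could believe about one parent’s genotype.

```python
def hardy_weinberg_prior(maf):
    """Prior over genotypes (0, 1 or 2 minor alleles) for minor allele frequency maf."""
    return {0: (1 - maf) ** 2, 1: 2 * maf * (1 - maf), 2: maf ** 2}


def child_given_parents(mother, father):
    """Mendelian distribution of the child's genotype given both parents' genotypes."""
    pm, pf = mother / 2, father / 2  # chance each parent passes on the minor allele
    return {0: (1 - pm) * (1 - pf),
            1: pm * (1 - pf) + (1 - pm) * pf,
            2: pm * pf}


def parent_posterior(child, maf):
    """Posterior over one parent's genotype after observing only the child's genotype.

    The other parent is unobserved and averaged out using the population prior.
    """
    prior = hardy_weinberg_prior(maf)
    unnormalized = {}
    for parent in (0, 1, 2):
        likelihood = sum(prior[other] * child_given_parents(parent, other)[child]
                         for other in (0, 1, 2))
        unnormalized[parent] = prior[parent] * likelihood
    total = sum(unnormalized.values())
    return {genotype: weight / total for genotype, weight in unnormalized.items()}


# A child carrying two copies of a rare allele (frequency 10%) rules out
# genotype 0 for the parent, although 81% of the population has genotype 0.
print(hardy_weinberg_prior(0.1))
print(parent_posterior(child=2, maf=0.1))
```

Comparing the two printed distributions shows how much a single relative’s data shifts the attacker’s belief away from the population baseline, even though the parent never shared anything.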

Finally, one may think the genome is the only type of biomedical data vulnerable to re-identification attacks, but unfortunately it is not. We have shown that, again due to correlations between different layers of the biological stack, one can match DNA methylation profiles (one of the most important epigenetic elements) to individual genomes with a success rate of 97.5% to 100% for databases of thousands of participants [5]. In addition to epigenomic data, researchers have demonstrated that transcriptomic profiles (gene expression data) can also be re-identified, with a success rate of 97.1% when matched against databases of 300 million genomes [6].

Protection mechanisms

After having thoroughly assessed the privacy risks stemming from biomedical data, we can develop appropriate protection mechanisms by relying on cryptographic techniques or on differential privacy methods [7]. We will elaborate on these mechanisms in subsequent articles, notably discussing the trade-off between privacy and utility in the biomedical context.
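
As a small preview of the differential privacy side, the sketch below shows the textbook Laplace mechanism applied to a count, for instance the number of study participants carrying a given variant. It is a generic illustration rather than the mechanism of the cited paper. It assumes that adding or removing one participant changes the count by at most 1, and the function names and parameter values are hypothetical. It is also not hardened against known floating-point attacks, so it should not be used as is on real data.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution with the given scale."""
    # The difference of two independent Exp(1) samples follows Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(true_count, epsilon, rng=None):
    """Release a count under epsilon-differential privacy (Laplace mechanism).

    Assumes sensitivity 1: one participant changes the count by at most 1.
    Smaller epsilon means stronger privacy and a noisier answer.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical query: 120 carriers of a variant in a cohort.
rng = random.Random(2017)
for epsilon in (0.1, 1.0, 10.0):
    print(epsilon, round(private_count(120, epsilon, rng), 1))
```

Running the loop with different values of epsilon makes the privacy/utility trade-off tangible: the released count stays close to the true value for large epsilon and becomes increasingly noisy as epsilon shrinks.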

We strongly believe that ensuring data contributors’ privacy is a key step in fostering the data sharing necessary for research progress.

Biomedical data are intrinsically privacy-sensitive, and the risks of sharing them must be well understood and controlled without destroying the benefits of data-driven medicine, such as the scientific breakthroughs on the links between cancer and genetics (see, e.g., the BRCA1/BRCA2 genes and breast cancer).

Bibliography

  1. Z. Stephens et al., Big Data: Astronomical or Genomical?, PLOS Biology, 2015
  2. M. Humbert et al., De-anonymizing Genomic Databases Using Phenotypic Traits, PoPETs, 2015
  3. M. Gymrek et al., Identifying Personal Genomes by Surname Inference, Science, 2013
  4. M. Humbert et al., Quantifying Interdependent Risks in Genomic Privacy, ACM TOPS, 2017
  5. M. Backes et al., Identifying Personal DNA Methylation Profiles by Genotype Inference, IEEE S&P, 2017
  6. E. Schadt et al., Bayesian method to predict individual SNP genotypes from gene expression data, Nature Genetics, 2012
  7. F. Tramer et al., Differential Privacy with Bounded Priors: Reconciling Utility and Privacy in Genome-Wide Association Studies, ACM CCS, 2015

About the author

Mathias Humbert
Sr. Data Scientist

Mathias received his Ph.D. in computer and communication sciences from EPFL in 2015. He then spent two years as a post-doctoral researcher in the Center for IT-Security, Privacy, and Accountability (CISPA) at Saarland University, Germany, where he worked on genomic privacy and privacy in social networks. His current research interests lie at the intersection of privacy and machine learning, with a special application focus on biomedical data. He is currently the lead scientist for the SDSC of the PHRT project “DPPH: Data Protection in Personalized Health”. He is also co-principal investigator of a project funded by the Leenaards Foundation on evaluating and preventing privacy risks in biomedical databases.
