A new era for biomedical research
Thanks to rapid progress in sequencing technologies, the cost of molecular profiling has plummeted over the last decade, making a colossal amount of biological data available for research.
Since the first sequencing of the human genome in 2001, more than one million individuals have had their genome sequenced. The pace is expected to accelerate further, approaching one billion sequenced genomes by 2025. These newly available biomedical data have dramatically changed medical research and promise to revolutionize the practice of medicine in the near future. Big biomedical data provide the means for a more personalized, predictive and precise medicine.
However, data-driven medicine has a major downside: by transforming medicine’s trust model, in place since Hippocrates, it creates unprecedented privacy risks that need to be urgently addressed.
Privacy risks in biomedical data stem from the correlations that exist along the various dimensions of these data. The first dimension relates to the different biomedical data types (often referred to as “-omic” data). Using a computer-network analogy, we can model the biological system as a stack of layers: from the genomic layer (which contains our DNA sequence) to the phenomic layer (which contains, for instance, our physical traits), via the epigenomic and transcriptomic layers. Various relationships exist between these layers.
For instance, we have shown that one can re-identify individuals’ genomes by matching them to the phenotypic traits of individuals in another database, such as an online social network.
[Figure: Re-identification attack via phenotypic traits]
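To make the matching idea concrete, here is a minimal Python sketch of a likelihood-based re-identification. It is a toy illustration, not the method of the cited work: the trait names, SNP identifiers and probabilities in `MODEL` are hypothetical, and each genome is simply assigned the profile that maximizes the likelihood of the observed traits.

```python
import math

# Hypothetical model: for each visible trait, the associated SNP and
# P(trait is present | number of risk alleles at that SNP).
# All names and probabilities are made up for illustration.
MODEL = {
    "blue_eyes": ("snp_a", {0: 0.05, 1: 0.30, 2: 0.90}),
    "red_hair": ("snp_b", {0: 0.02, 1: 0.10, 2: 0.80}),
    "freckles": ("snp_c", {0: 0.20, 1: 0.60, 2: 0.95}),
}


def log_likelihood(genome, traits):
    """Log-probability of observing `traits` given `genome`,
    assuming traits are independent given the genotype."""
    total = 0.0
    for trait, (snp, probs) in MODEL.items():
        p = probs[genome[snp]]
        total += math.log(p if traits[trait] else 1.0 - p)
    return total


def best_match(genome, profiles):
    """Return the name of the profile that best explains the genome."""
    return max(profiles, key=lambda name: log_likelihood(genome, profiles[name]))


# Public profiles, e.g. traits visible on a social network (hypothetical).
profiles = {
    "alice": {"blue_eyes": True, "red_hair": False, "freckles": True},
    "bob": {"blue_eyes": False, "red_hair": False, "freckles": False},
    "carol": {"blue_eyes": True, "red_hair": True, "freckles": False},
}

# An "anonymous" genome from a research database (hypothetical).
anonymous_genome = {"snp_a": 2, "snp_b": 0, "snp_c": 2}
match = best_match(anonymous_genome, profiles)
```

With only three traits the match is weak evidence, but the attack sharpens as more genotype-phenotype associations become known, which is why removing names alone does not anonymize a genome.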
The second dimension relates to individuals and their relationships. Family members’ data are inherently correlated because of the laws of inheritance, which gives an attacker with some background knowledge another means of re-identifying personal genomes. Researchers demonstrated that surnames could be inferred from genomic data by querying recreational genealogy databases with short tandem repeats on the Y chromosome. Probabilistic dependencies also exist between relatives’ autosomal (i.e., non-sex) chromosomes, and these can be used to reconstruct someone’s genomic data by observing the genetic information of her relatives.
The third dimension of correlations is between different positions or regions in a single genome (or in another type of biomedical data), and it must be carefully considered when assessing potential privacy risks. Indeed, with access to even a small subset of someone’s biomedical data, an attacker can infer the missing parts because of these correlations (this is known as imputation in genomics).
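The familial dependencies mentioned above follow from Mendelian inheritance and can be sketched in a few lines of Python. This is a simplified illustration for a single biallelic position, assuming each parent transmits one of their two alleles with equal probability and, when the father is unobserved, that his genotype follows Hardy-Weinberg proportions; the function names are ours and not taken from the cited work.

```python
from fractions import Fraction


def transmit(genotype):
    """Probability that a parent carrying `genotype` minor alleles
    (0, 1 or 2) passes the minor allele to a child."""
    return Fraction(genotype, 2)


def child_distribution(mother, father):
    """Distribution of the child's genotype given both parents' genotypes."""
    pm, pf = transmit(mother), transmit(father)
    return {
        0: (1 - pm) * (1 - pf),
        1: pm * (1 - pf) + (1 - pm) * pf,
        2: pm * pf,
    }


def child_given_mother(mother, maf):
    """Distribution of the child's genotype when only the mother is observed.
    The father's genotype is drawn from a Hardy-Weinberg prior with
    minor allele frequency `maf` (an assumption of this sketch)."""
    prior = {0: (1 - maf) ** 2, 1: 2 * maf * (1 - maf), 2: maf ** 2}
    posterior = {0: Fraction(0), 1: Fraction(0), 2: Fraction(0)}
    for father, p_father in prior.items():
        for child, p_child in child_distribution(mother, father).items():
            posterior[child] += p_father * p_child
    return posterior


# Two heterozygous parents: the classic 1/4, 1/2, 1/4 split.
both_het = child_distribution(1, 1)

# Mother homozygous for a rare allele (frequency 1/10), father unknown.
from_mother = child_given_mother(2, Fraction(1, 10))
```

In the second example the child is certain to carry at least one copy of the rare allele, even though the child never shared any data: observing one relative is enough to shift the attacker’s belief far from the population prior.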
Finally, one may think the genome is the only type of biomedical data vulnerable to re-identification attacks, but (unfortunately) it is not. We have shown that, again because of correlations between different layers of the biological stack, one can match DNA methylation profiles, one of the most important epigenetic elements, to individual genomes with a success rate of 97.5% to 100% for databases of thousands of participants. Beyond epigenomic data, researchers have demonstrated that transcriptomic profiles (gene expression data) can also be re-identified, with a success rate of 97.1% when matched against databases of 300 million genomes.
Having thoroughly assessed the privacy risks stemming from biomedical data, we can develop appropriate protection mechanisms by relying on cryptographic techniques or on differential privacy methods. We will elaborate on these mechanisms in a subsequent article, notably discussing the trade-off between privacy and utility in the biomedical context.
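As a small preview of the differential privacy approach, the sketch below applies the standard Laplace mechanism to a simple query: how many participants in a study carry a given variant. It is a generic textbook construction, not the specific method of the cited work; the function names and the example cohort are hypothetical, and the sensitivity is 1 because adding or removing one participant changes the count by at most one.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two
    independent exponential variables with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_carrier_count(genotypes, epsilon, rng):
    """Release the number of carriers (genotype > 0) with epsilon-differential
    privacy, using the Laplace mechanism with sensitivity 1."""
    true_count = sum(1 for g in genotypes if g > 0)
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical cohort: number of minor alleles per participant at one position.
cohort = [0, 1, 0, 2, 1, 0, 0, 1, 0, 0]
rng = random.Random(0)  # fixed seed only to make the sketch reproducible

noisy_strict = private_carrier_count(cohort, epsilon=0.1, rng=rng)  # more noise
noisy_loose = private_carrier_count(cohort, epsilon=1.0, rng=rng)   # less noise
```

A smaller `epsilon` gives stronger privacy but a noisier answer, which is exactly the privacy-utility trade-off: on a cohort of ten people the noise can swamp the signal, whereas on large cohorts the same noise barely affects the statistic.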
We strongly believe that ensuring data contributors’ privacy is a key step in fostering the data sharing necessary for research progress.
Biomedical data are intrinsically privacy-sensitive, and the risks of sharing them must be well understood and controlled without destroying the benefits of data-driven medicine, such as the scientific breakthroughs on the links between cancer and genetics (see, e.g., the BRCA1/BRCA2 genes and breast cancer).
Mathias Humbert, Sr Data Scientist, Swiss Data Science Center
Z. Stephens et al., Big Data: Astronomical or Genomical?, PLOS Biology, 2015
M. Humbert et al., De-anonymizing Genomic Databases with Phenotypic Traits, PoPETS, 2015
M. Gymrek et al., Identifying Personal Genomes by Surname Inference, Science, 2013
M. Humbert et al., Quantifying Interdependent Risks in Genomic Privacy, ACM TOPS, 2017
M. Backes et al., Identifying Personal DNA Methylation by Genotype Inference, IEEE S&P, 2017
E. Schadt et al., Bayesian method to predict individual SNP genotypes from gene expression data, Nature Genetics, 2012
F. Tramèr et al., Differential Privacy with Bounded Priors: Reconciling Utility and Privacy in Genome-Wide Association Studies, ACM CCS, 2015