MSEI

Molecular structure elucidation by integrating different data mining strategies

Started
January 4, 2019
Status
Completed
Abstract

The overall goal of this project is to develop and implement advanced data-driven programming tools that give deeper insight into ultra-high performance liquid chromatography coupled to high-resolution mass spectrometry (UHPLC-HRMS) data. While HPLC has been used as the first level of analyte separation since the 1960s, HRMS is a relatively new and powerful analytical technique used to discover molecular species based on their exact mass-to-charge ratio (m/z). The instrumentation applied can distinguish mass fragments at the fourth or fifth decimal place. This additional precision narrows down the possible chemical formulas of a molecule and thus allows an unprecedentedly unambiguous qualitative and quantitative assessment of the composition of various types of samples. Not surprisingly, HRMS has found applications across a broad spectrum of scientific fields.
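As a rough illustration of why resolving the fourth or fifth decimal place narrows down candidate formulas, the sketch below compares three species that share the nominal mass 28. It is not part of the project's tooling: the function names, the ppm tolerances, and the measured value are hypothetical, and ionization (the shift from neutral mass to m/z) is ignored for simplicity.

```python
# Monoisotopic masses (u) of the most abundant isotopes (standard values).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}


def monoisotopic_mass(formula):
    """Sum the isotope masses for a formula given as {element: count}."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


def candidates_within(measured, formulas, tol_ppm):
    """Return the names of formulas whose mass lies within tol_ppm of `measured`."""
    hits = []
    for name, formula in formulas.items():
        mass = monoisotopic_mass(formula)
        if abs(mass - measured) / mass * 1e6 <= tol_ppm:
            hits.append(name)
    return hits


# Three species with the same nominal mass (28) but different exact masses.
ISOBARS = {
    "CO": {"C": 1, "O": 1},
    "N2": {"N": 2},
    "C2H4": {"C": 2, "H": 4},
}

measured = 28.0313  # hypothetical measured neutral mass, for illustration only

low_res = candidates_within(measured, ISOBARS, tol_ppm=5000)  # coarse instrument
high_res = candidates_within(measured, ISOBARS, tol_ppm=5)    # high-resolution instrument
print(low_res, high_res)
```

With the coarse tolerance all three formulas remain candidates, while the tight tolerance leaves only C2H4. Real assignments involve far larger candidate sets and additional constraints such as isotope patterns.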

Although we can routinely discern hundreds to thousands of molecular ‘features’ in complex samples such as blood, aerosols, soil, or biofuels, the complexity of the resulting data stream increases proportionally, producing millions of data points per second in multidimensional space. Post-processing and data reduction methods, followed by data mining and innovative visualization techniques, are therefore required to yield meaningful information from HRMS. The project develops semi-automatic methods to confidently pinpoint each unknown molecular structure. It is a unique opportunity to expand the applicability of both HRMS and the Kendrick Mass Defect (KMD) approach beyond their current state-of-the-art applications, as well as beyond the capabilities of other analytical methods such as NMR and X-ray crystallography, which typically require pure samples in relatively large amounts.
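For readers unfamiliar with the Kendrick Mass Defect, the following minimal sketch shows the standard CH2-based calculation: masses are rescaled so that CH2 weighs exactly 14, and the defect is the difference between the rounded and the rescaled mass. The sign convention (nominal minus exact) is one of two in common use, and the helper names and example compounds are chosen here for illustration only.

```python
# Monoisotopic masses (u) of the most abundant isotopes (standard values).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}

CH2_EXACT = MONOISOTOPIC["C"] + 2 * MONOISOTOPIC["H"]  # about 14.01565
CH2_NOMINAL = 14.0


def monoisotopic_mass(formula):
    """Sum the isotope masses for a formula given as {element: count}."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


def kendrick_mass(mass):
    """Rescale a mass so that one CH2 unit weighs exactly 14."""
    return mass * CH2_NOMINAL / CH2_EXACT


def kendrick_mass_defect(mass):
    """KMD as nominal Kendrick mass minus Kendrick mass (one common convention)."""
    km = kendrick_mass(mass)
    return round(km) - km


# A homologous series (each member adds one CH2) and one unrelated compound.
COMPOUNDS = {
    "methanol": {"C": 1, "H": 4, "O": 1},
    "ethanol": {"C": 2, "H": 6, "O": 1},
    "propanol": {"C": 3, "H": 8, "O": 1},
    "phenol": {"C": 6, "H": 6, "O": 1},
}

kmd = {name: kendrick_mass_defect(monoisotopic_mass(f)) for name, f in COMPOUNDS.items()}
for name, value in kmd.items():
    print(f"{name:9s} KMD = {value:.5f}")
```

The three alcohols share the same KMD because they differ only by CH2 units, whereas phenol falls on a different value. This grouping of homologous series is what makes KMD useful for organizing complex spectra.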

People

Collaborators

SDSC Team:
Eliza Harris
Lilian Gasser
Michele Volpi
Tanja Käser
Fernando Perez-Cruz
Guillaume Obozinski

PI | Partners:

PSI, Catalytic Process Engineering Research Group:

  • Dr. Saša Bjelić

More info

Motivation

Non-targeted screening of organic compounds in complex mixtures typically relies on ultra-high performance liquid chromatography coupled with high-resolution tandem mass spectrometry (UHPLC-HRMS/MS). Despite recent instrumental advances that have improved data quality and quantity, current analytical methods can identify structures for only a small percentage of the compounds in typical mixtures, leaving a significant gap in our ability to fully characterize complex samples.

Proposed Approach / Solution

We developed a novel data analysis pipeline that leverages data science methodologies to enhance structural identification from tandem mass spectrometry data. The pipeline calculates feature vectors directly from mass spectra, substantially reducing computational costs, and employs an optimized fingerprint comparison methodology that accounts for uncertainty. The system builds upon initial compound identifications using targeted training and tailored molecular fingerprints, predicting a custom 75-digit molecular fingerprint through random forests. Kendrick mass defects and lost fragments proved valuable for fingerprint prediction, with potential matches filtered using a machine learning-based retention time prediction method.
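To make the idea of a fingerprint comparison that accounts for uncertainty more concrete, here is a small sketch of one possible scoring scheme. It is an assumption for illustration, not the published method: each predicted bit is treated as an independent probability (for example, the fraction of trees in a random forest voting for that bit), and each candidate structure's binary fingerprint is scored by its log-likelihood. The function names, the 6-bit fingerprints, and the candidate labels are all hypothetical.

```python
import math


def fingerprint_log_likelihood(predicted_probs, candidate_bits, eps=1e-3):
    """Log-likelihood of a candidate's bits under independent per-bit probabilities.

    Probabilities are clipped to [eps, 1 - eps] so that a single overconfident
    prediction cannot send the score to minus infinity.
    """
    if len(predicted_probs) != len(candidate_bits):
        raise ValueError("fingerprints must have the same length")
    total = 0.0
    for prob, bit in zip(predicted_probs, candidate_bits):
        prob = min(max(prob, eps), 1.0 - eps)
        total += math.log(prob if bit else 1.0 - prob)
    return total


def rank_candidates(predicted_probs, candidates):
    """Return candidate names ordered from best to worst score."""
    scored = [
        (fingerprint_log_likelihood(predicted_probs, bits), name)
        for name, bits in candidates.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical predicted bit probabilities; the fourth bit is fully uncertain.
predicted = [0.95, 0.90, 0.10, 0.50, 0.05, 0.80]

# Hypothetical candidate structures with known binary fingerprints.
candidates = {
    "A": [1, 1, 0, 1, 0, 1],  # agrees on all confident bits
    "B": [1, 1, 0, 0, 0, 1],  # differs from A only on the uncertain bit
    "C": [0, 1, 1, 0, 0, 1],  # contradicts two confident predictions
}

scores = {name: fingerprint_log_likelihood(predicted, bits) for name, bits in candidates.items()}
ranking = rank_candidates(predicted, candidates)
print(scores, ranking)
```

In this toy case, candidates A and B receive the same score because they differ only on a bit the model cannot decide, while C is penalized for disagreeing where the prediction is confident. A plain bit-by-bit match count would not make that distinction.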

Impact

The developed models are a major step forward in addressing the analytical challenges of non-targeted screening, potentially expanding our ability to identify unknown compounds in complex environmental, biological, and chemical samples.

Annexe

Additional resources

Bibliography

  1. Wu et al. (2021) Valence Photoionization and Energetics of Vanillin, a Sustainable Feedstock Candidate, The Journal of Physical Chemistry A, doi: 10.1021/acs.jpca.1c00876
  2. Dührkop et al. (2020) Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra, Nature Biotechnology, doi: 10.1038/s41587-020-0740-8
  3. Arturi et al. (2019) Molecular footprint of co-solvents in hydrothermal liquefaction (HTL) of Fallopia Japonica, Journal of Supercritical Fluids, doi: 10.1016/j.supflu.2018.08.010
  4. Roach et al. (2011) Higher-Order Mass Defect Analysis for Mass Spectra of Complex Organic Mixtures, Analytical Chemistry, doi: 10.1021/ac200654j

Publications

Harris, E.; Gasser, L.; Volpi, M.; Perez-Cruz, F.; Bjelić, S.; Obozinski, G. "Harnessing data science to improve molecular structure elucidation from tandem mass spectrometry", Structural Chemistry 34(5), 1935-1950 (2023)
