MSEI

Molecular structure elucidation by integrating different data mining strategies

Started: January 4, 2019
Status: Completed

Abstract

The overall goal of this project is to develop and implement advanced data-driven programming tools that give deeper insight into data from ultra-high-performance liquid chromatography coupled to high-resolution mass spectrometry (UHPLC-HRMS). While HPLC has been used as the first level of analyte separation since the 1960s, HRMS is a relatively new and powerful analytical technique for discovering molecular species based on their exact mass-to-charge ratio (m/z). The instrumentation can distinguish fragment masses at the fourth or fifth decimal place. This additional precision narrows down the possible chemical formulas of a molecule and thus allows an unprecedentedly unambiguous qualitative and quantitative assessment of the composition of various types of samples. Not surprisingly, HRMS has found applications across a broad spectrum of scientific fields.
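The effect of mass accuracy on the number of candidate formulas can be illustrated with a small enumeration. The following is a minimal sketch, not the project's method: the function name `candidate_formulas`, the element limits, the two tolerances, and the choice of vanillin (C8H8O3) as target are all assumptions made for illustration, and no chemical validity rules are applied.

```python
from itertools import product

# Monoisotopic masses (u) of the most abundant isotope of each element.
MASS = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.99491461956}

def candidate_formulas(target_mass, tol_ppm, max_counts=(12, 26, 4, 6)):
    """Brute-force all C/H/N/O compositions within tol_ppm of target_mass.

    Hypothetical helper: it ignores valence and other chemical plausibility
    rules, so it overestimates the number of realistic candidates.
    """
    tol = target_mass * tol_ppm * 1e-6
    hits = []
    for c, h, n, o in product(*(range(m + 1) for m in max_counts)):
        mass = c * MASS["C"] + h * MASS["H"] + n * MASS["N"] + o * MASS["O"]
        if abs(mass - target_mass) <= tol:
            hits.append((f"C{c}H{h}N{n}O{o}", mass))
    return hits

# Neutral monoisotopic mass of vanillin, C8H8O3 (illustrative target).
target = 152.04734
coarse = candidate_formulas(target, tol_ppm=500)  # loose mass tolerance
fine = candidate_formulas(target, tol_ppm=5)      # tight mass tolerance

print(len(coarse), "candidates at 500 ppm")
print(len(fine), "candidates at 5 ppm:", [f for f, _ in fine])
```

Tightening the tolerance shrinks the candidate list, which is the sense in which the extra decimal places narrow down the possible formulas.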

Although we can routinely discern hundreds to thousands of molecular ‘features’ in complex samples such as blood, aerosols, soil, or biofuels, the complexity of the resulting data stream increases proportionally, producing millions of data points per second in multidimensional space. Thus, post-processing and data reduction methods, followed by data mining and innovative visualization techniques, are required to yield meaningful information from HRMS. This project develops semi-automatic methods to confidently pinpoint each unknown molecular structure. It is a unique opportunity to expand the applicability of both HRMS and the Kendrick Mass Defect (KMD) approach beyond their current state-of-the-art applications, as well as beyond the capabilities of other analytical methods such as NMR and X-ray crystallography, which typically require pure samples in relatively large amounts.
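For readers unfamiliar with the KMD approach, the sketch below shows the standard CH2-based Kendrick rescaling. It assumes the common convention KMD = nominal Kendrick mass minus Kendrick mass, with the nominal value obtained by rounding. The helper names and the example compounds are illustrative choices, not taken from the project.

```python
# Monoisotopic masses (u) of the most abundant isotopes.
MASS_C, MASS_H, MASS_O = 12.0, 1.00782503207, 15.99491461956

CH2_EXACT = MASS_C + 2 * MASS_H  # exact mass of the CH2 repeat unit
CH2_NOMINAL = 14.0               # CH2 is defined as exactly 14 on the Kendrick scale

def neutral_mass(c, h, o):
    """Monoisotopic mass of a neutral C/H/O molecule (illustrative helper)."""
    return c * MASS_C + h * MASS_H + o * MASS_O

def kendrick_mass(mass):
    """Rescale a mass so that each CH2 unit contributes exactly 14."""
    return mass * CH2_NOMINAL / CH2_EXACT

def kendrick_mass_defect(mass):
    """KMD under the assumed convention: nominal Kendrick mass minus Kendrick mass."""
    km = kendrick_mass(mass)
    return round(km) - km

# Homologous series differing only by CH2: acetic, propanoic, butanoic acid.
acids = [neutral_mass(2, 4, 2), neutral_mass(3, 6, 2), neutral_mass(4, 8, 2)]
acid_kmds = [kendrick_mass_defect(m) for m in acids]

# A compound from a different class for comparison: ethanol, C2H6O.
ethanol_kmd = kendrick_mass_defect(neutral_mass(2, 6, 1))

print([f"{k:.5f}" for k in acid_kmds], f"{ethanol_kmd:.5f}")
```

Members of a CH2 homologous series share the same KMD, while a compound from another class falls at a different value, which is what makes the defect useful for grouping features.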

People

Collaborators

SDSC Team:
Eliza Harris
Lilian Gasser
Michele Volpi
Fernando Perez-Cruz
Guillaume Obozinski

PI | Partners:

PSI, Catalytic Process Engineering Research Group:

  • Dr. Saša Bjelić

More info


  • Molecular clustering based on UHPLC-HRMS/MS data reflecting chemical “families” based on the presence of similar functional groups.
  • Within-cluster prediction of functional groups and molecular structure for unknown compounds.
  • Predictive modelling of molecular fragmentation patterns, retention time, and other features.
Figure 1: Fragmentation spectra for two dicarboxylic acids illustrating clear differences in fragment patterns and intensities despite similar structures.
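One common way to quantify how similar two fragmentation spectra are, as a possible input to clustering, is a cosine score on binned peak intensities. The sketch below is an assumption about how such a comparison could look, not the project's pipeline. The function names and bin width are hypothetical, and the three peak lists are made-up toy values, not measured spectra.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=0.01):
    """Sum the intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz / bin_width)] += intensity
    return binned

def cosine_similarity(peaks_a, peaks_b, bin_width=0.01):
    """Cosine of the angle between two binned intensity vectors (0 to 1)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy spectra with invented values, for illustration only.
toy_a = [(59.01, 100.0), (73.03, 40.0), (99.01, 15.0)]
toy_b = [(59.01, 30.0), (73.03, 100.0), (115.00, 20.0)]  # shares two fragments with toy_a
toy_c = [(41.04, 100.0), (57.07, 60.0)]                   # shares no fragments with toy_a

sim_ab = cosine_similarity(toy_a, toy_b)
sim_ac = cosine_similarity(toy_a, toy_c)
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

Because the score depends on intensities as well as fragment positions, two spectra with the same fragments but different intensity patterns score below 1. A pairwise similarity matrix built this way could then be passed to any standard clustering algorithm.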


Publications

  • Harris, E., Gasser, L., Volpi, M., Pérez-Cruz, F., Bjelić, S., Obozinski, G. (2023) Harnessing data science to improve molecular structure elucidation from tandem mass spectrometry. Struct Chem 34, 1935–1950.

