SPEEDMIND

Improving species biodiversity analyses and citizen science feedback through machine learning

Started

December 1, 2017

Status

In Progress

The Speedmind project addresses some of the most important challenges in large-scale biodiversity monitoring. These include the preferential (or opportunistic) sampling of presence-only data in the absence of full surveys (inventories), and the fact that species distribution maps (SDMs) are often constructed for one species at a time, without joint modeling of multiple species. The preferential sampling challenge arises because the plant species sightings provided by InfoFlora, the national data and information center of the Swiss flora, come from citizen science/crowdsourcing. To address these challenges, we explore two approaches: one borrowing from recommender systems, and one based on spatial point processes. Both approaches leverage the presence data of multiple species taken together and yield usable SDMs, including realistic predictions at unmonitored locations.

People

Collaborators

SDSC Team:

Fernando Perez-Cruz

Former Deputy Executive Director & Chief Data Scientist

Fernando Perez-Cruz received a PhD in Electrical Engineering from the Technical University of Madrid. He is Titular Professor in the Computer Science Department at ETH Zurich and Head of Machine Learning Research and AI at Spiden. He has been a member of the technical staff at Bell Labs and a Machine Learning Research Scientist at Amazon. Fernando has been a visiting professor at Princeton University under a Marie Curie Fellowship and an associate professor at University Carlos III in Madrid. He held positions at the Gatsby Unit (London), Max Planck Institute for Biological Cybernetics (Tuebingen), and BioWulf Technologies (New York). He served as Chief Data Scientist at the SDSC from 2018 to 2023, and as Deputy Executive Director of the SDSC from 2022 to 2023.

Izabela Moise

Sr. Data Scientist

Izabela holds a PhD in Computer Science from the University of Rennes 1, France, and the French National Institute for Research in Computer Science and Automation (INRIA), France. Before joining the SDSC, she was a postdoctoral researcher at the Chair of Computational Social Science at ETH Zurich and a lecturer for the “Data Science in Techno-Socio-Economic Systems” course at ETH Zurich. Her main research focus is on big data analytics, tools and platforms, machine learning and data mining, and large-scale network analysis, particularly in the setting of social data mining.

William Aeberhard

Sr. Data Scientist

William obtained a PhD in Statistics in 2015 jointly from the University of Geneva and the University of Sydney. He then worked as a post-doctoral research fellow at Dalhousie University as part of a Canadian Statistical Sciences Institute collaborative research team. He was an Assistant Professor of Statistics at Stevens Institute of Technology in Hoboken, New Jersey, before joining the SDSC in September 2020. His research interests include robust statistics, non-parametric methods, and spatio-temporal modeling. His recent cross-disciplinary collaborations involve applications in marine biology, volcanology, and fisheries science.

PI | Partners:

Dynamic Macroecology Group:

Prof. Niklaus Zimmermann

Dr. Patrice Descombes

Dr. Philipp Brun

Dr. Damaris Zurell

Biodiversity and Conservation Biology Group:

Dr. Dirk N. Karger

Description

Goals & Challenges:

Although applicable to any species group worldwide, Speedmind focuses on flora data from Switzerland as a pilot system, building plant species distribution maps (SDMs) from an extensive amount of data. These data are pooled from various sources in ways that are novel for the domain science, strengthening the link between ecological sciences and citizen science. Notably, Speedmind has a strong citizen science component, as it relies on plant species sightings from InfoFlora, the national data and information center of the Swiss flora.

The goals include:

integrating all data sources (environmental, traits, sightings) into a single data platform;
developing SDMs for 3500+ species in Switzerland;
allowing for automated quality checks in citizen science data;
providing real-time guidance of observer efforts in citizen science-based data collection.

Constructing SDMs in Switzerland from presence-only data poses some important challenges:

species sampling is highly biased (preferential sampling);
modeling should include all species together (joint modeling);
all available data should be integrated into a single framework.

The preferential sampling challenge arises from how plant species sightings are recorded: sightings across Switzerland are reported by thousands of citizen scientists in the field using the freely available InfoFlora mobile app (FloreApp). However, most citizen scientists are not botany experts and generally only collect observations of species from easily accessible areas in the landscape (e.g. close to roads and paths). More information can be found in the blog post about the project.

Impact:

The development of SDMs jointly for 3500+ plant species over Switzerland enables the following:

better monitoring of potentially invasive plant species;

improved study of rare species and their habitat;

revised biodiversity management at the national scale, with possible implications for land use.

Approaches:

Prior to developing new models for plant SDMs, we integrated large amounts of heterogeneous data (environmental data streams, maps, trait data, phylogenies, species occurrence data) into a standardized warehouse. This is an important contribution in itself, as it brings disparate ecological, spatial and thematic information together in one place. In particular, Speedmind developed new types of predictor sets at a very high spatial resolution (93 m), which are considerably more precise and enable a better description of the species' ecological niche.

First, we generated computationally demanding maps of climate (temperature and precipitation) by downscaling CHELSA climate layers (Karger et al. 2017) from 1 km to 93 m spatial resolution in Switzerland. This pipeline will be made available as an online tool to generate global climate maps at 93 m resolution, which represents a major step for high-resolution biodiversity studies in the domain science.

Second, by combining a massive amount of plant occurrence data with expert-based ecological indicators (classifications of species' ecological preferences for several climate and soil parameters), we used machine learning models (random forests) to generate eight ecologically meaningful predictors for plants (e.g. soil acidity, soil moisture). The resulting predictors outperformed traditional predictors used in ecology and increased our ability to predict the distribution of plant species in Switzerland, with an average improvement of 7.7% in model performance (delta TSS). In particular, species growing in moist habitats (e.g. Gentiana pneumonanthe) and species growing along gradients of soil acidity (e.g. Oxytropis jacquinii, Ophrys holosericea) strongly benefited from this new set of predictors (Figure 2). While obtaining more informative predictors of species distribution is a first step, the machine learning algorithms developed within Speedmind will further improve the modelling process and our understanding of the species' ecological niche in ways that are new and challenging for the domain science.
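For readers unfamiliar with the performance metric, TSS is the true skill statistic, defined as sensitivity plus specificity minus one on binary presence/absence predictions. The sketch below shows how it can be computed. The function name and the toy labels are illustrative assumptions, not part of the Speedmind code, and reading "delta TSS" as the difference in TSS between two predictor sets is our interpretation of the text above.

```python
def true_skill_statistic(y_true, y_pred):
    """Return TSS = sensitivity + specificity - 1 for binary labels.

    y_true, y_pred: sequences of 0/1 (absence/presence) of equal length.
    Hypothetical helper for illustration only.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true presences
    tn = sum(1 for t, p in pairs if not t and not p)  # true absences
    fp = sum(1 for t, p in pairs if not t and p)      # false presences
    fn = sum(1 for t, p in pairs if t and not p)      # missed presences
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one presence and one absence")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


# Toy comparison of two hypothetical models on the same test labels.
observed = [1, 1, 1, 0, 0, 0]
baseline = [1, 0, 0, 0, 0, 1]   # predictions with traditional predictors
improved = [1, 1, 0, 0, 0, 0]   # predictions with the new predictors
delta_tss = (true_skill_statistic(observed, improved)
             - true_skill_statistic(observed, baseline))
```

TSS ranges from -1 to 1, with 0 corresponding to a model that performs no better than chance, so a positive `delta_tss` indicates that the second model discriminates presences from absences better than the first.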

Modelling rare plant species is a major challenge for traditional SDMs because of low data availability. We aim to overcome it by modelling rare species jointly with more widespread ones and by integrating information on species' ecological and morphological similarities. More precisely, we use two separate approaches to build joint species distribution models. The first is hierarchical Poisson factorization, a form of recommender system in which the most likely location-species pairs are identified and distinct latent weights represent the preferences of locations and the prevalence of species. The second is a log-Gaussian Cox process in which environmental information enters as smooth non-linear effects. This point process is further enhanced by including predicted intensity fields from other species, which achieves joint modeling.
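To make the recommender-system idea more concrete, the sketch below fits a plain Poisson factorization to a small location-by-species count matrix. It is a deliberately simplified stand-in, not the project's model: it uses maximum-likelihood multiplicative updates rather than the Gamma priors and Bayesian inference of a hierarchical Poisson factorization, and the function name, the number of factors and the toy counts are all assumptions made for illustration.

```python
import numpy as np


def poisson_factorization(counts, n_factors=2, n_iter=200, seed=0, eps=1e-10):
    """Fit counts[i, j] ~ Poisson(theta[i] @ beta[j]) by maximum likelihood.

    counts: (n_locations, n_species) array of non-negative sighting counts.
    Returns theta (n_locations, n_factors) and beta (n_species, n_factors).
    Simplified, non-hierarchical sketch using multiplicative updates.
    """
    counts = np.asarray(counts, dtype=float)
    rng = np.random.default_rng(seed)
    n_loc, n_sp = counts.shape
    # Positive random initialisation of the latent weights.
    theta = rng.gamma(1.0, 1.0, size=(n_loc, n_factors)) + eps
    beta = rng.gamma(1.0, 1.0, size=(n_sp, n_factors)) + eps
    for _ in range(n_iter):
        # Update location weights given species weights.
        rate = theta @ beta.T + eps
        theta *= ((counts / rate) @ beta) / (beta.sum(axis=0) + eps)
        # Update species weights given the new location weights.
        rate = theta @ beta.T + eps
        beta *= ((counts / rate).T @ theta) / (theta.sum(axis=0) + eps)
    return theta, beta


# Toy data: 4 locations x 3 species, zeros are "not reported" (presence-only).
toy_counts = np.array([[5, 0, 1],
                       [4, 1, 0],
                       [0, 6, 3],
                       [0, 5, 4]])
theta, beta = poisson_factorization(toy_counts)
# Expected sighting rate for every location-species pair, including
# pairs with zero recorded sightings.
expected_rate = theta @ beta.T
```

Ranking the entries of `expected_rate` that correspond to zero counts gives the "most likely location-species pairs" in the recommender sense: the model borrows strength across species that tend to co-occur, which is what allows sparse records of rare species to benefit from more widespread ones.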


