Summary
The regulation of chemicals traditionally involves animal testing, which, for ecotoxicological hazard assessment, is mainly performed on fish and crustaceans. Algae are often used as model organisms for herbicides and have the potential to serve as animal alternatives. Applying machine learning to ecotoxicology could help reduce the number of animal tests, costs, and animals sacrificed while preserving the accuracy of the in vivo tests. Accurate model comparison, however, requires a comprehensive, well-described dataset. We therefore introduce ADORE, a dataset on acute mortality of fish, crustaceans, and algae, which we have equipped with additional information on chemicals and species. It is intended as a benchmark dataset that research groups can use to compare model performances in a standardized manner.
The need for applied ML research in ecotoxicology
The regulation of chemicals aims to protect both human health and the environment. Ecotoxicology is the scientific field that studies the harmful effects of chemicals on organisms and the environment. Toxicity has traditionally been determined through in vivo tests on species from the taxonomic groups of fish, crustaceans, and algae; however, there are high hopes that the development and application of machine learning (ML) methods to predict (eco)toxicity [1, 2] will reduce the need for animal testing. In this use case, the focus lies on the application of ML models, using the performance of a model as evidence for a scientific claim, in contrast to research focused on the refinement of ML methods. Applied ML research aims to find the most suitable model for a use case.
In ecotoxicological hazard assessments (i.e., determining the toxicity of a chemical to the environment by testing on specific species, which act as surrogates for the environment), quantitative structure-activity relationship (QSAR) models have a long history of use. QSARs predict biological activity, such as toxicity, from chemical properties and structures [3]. However, by definition, QSARs are limited to chemical features that describe a molecule and its structure. They are also typically relatively simple and explainable models, such as linear regression on one to a few independent variables. In contrast, when applied to ecotoxicological questions, machine learning is not limited to chemical features alone. It can integrate many other data types, although the higher complexity of ML models comes with the caveat of decreased explainability.
Learn from other fields: benchmark data
The performance of models trained on ecotoxicological data is only comparable when obtained from well-understood datasets with comparable chemical space and species scope. Ideally, models should be trained on the same data, allowing for straightforward model comparison.
For this reason, benchmark datasets have been introduced in other fields to facilitate model comparisons, such as the well-known examples of CIFAR and ImageNet in computer vision. In hydrology, Catchment Attributes and Meteorology for Large-sample Studies (CAMELS) was introduced to enable progress in studying catchment similarity, model parameter estimation, and model benchmarking on geophysical datasets [5]. Hence, the successful application of machine learning models depends on the availability of relevant, well-described datasets that provide a common ground from which to train, benchmark, and compare models.
A benchmark dataset for ecotoxicology: ADORE
To serve as a benchmark dataset for machine learning in ecotoxicology and to foster progress in ecotoxicology, we have developed ADORE, a dataset on acute mortality in aquatic species from the taxonomic groups of fish, crustaceans, and algae [4]. Acute mortality is determined from in vivo experiments and measured as the concentration at which half of the population dies, known as the effective concentration 50 (EC50) or lethal concentration 50 (LC50). The generation and compilation of the dataset from different sources is shown in Figure 1.
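What an EC50 or LC50 value encodes can be illustrated with a toy Hill-type concentration-response curve. The function and parameter values below are purely illustrative and not part of ADORE:

```python
def mortality_fraction(conc: float, ec50: float, slope: float = 2.0) -> float:
    """Hill-type concentration-response curve: fraction of the test
    population that dies at a given concentration (same units as ec50)."""
    return 1.0 / (1.0 + (ec50 / conc) ** slope)

# By construction, mortality is exactly 50% at the EC50,
# lower below it, and higher above it.
at_ec50 = mortality_fraction(2.0, ec50=2.0)   # 0.5
below = mortality_fraction(1.0, ec50=2.0)     # < 0.5
above = mortality_fraction(4.0, ec50=2.0)     # > 0.5
```

The steepness parameter (`slope`) controls how sharply mortality rises around the EC50; in real experiments both parameters are estimated from the observed concentration-response data.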
Challenges: several species and taxonomic groups
Besides the necessity for similarity between datasets to properly evaluate model performances, the complexity of the data also poses limits. For example, higher accuracies and smaller errors are easier to achieve on a dataset focused on a single species (provided the number of data points is sufficiently high) than on a dataset with dozens of species. The presence of several species introduces additional variability to the dataset, since species differ in their sensitivity to different chemicals.
The ADORE dataset contains challenges covering three levels of complexity to assist in answering research questions of varying complexity (Figure 2). The most complex challenges are based on the full dataset, including all three taxonomic groups (fish, crustaceans, and algae), which allows learning from algae and invertebrates as surrogates for toxicity in fish. At an intermediate level of complexity, challenges focus on a single taxonomic group. The least complex challenges, in contrast, are restricted to single, well-represented test species such as rainbow trout (Oncorhynchus mykiss), fathead minnow (Pimephales promelas), and water flea (Daphnia magna), all of which are already used in regulatory testing.
Representing chemicals and species
When training models for ecotoxicological use cases, the input data, both chemical and taxonomic, must be translated so that a model can understand it. Common approaches to representing chemical information are either basic properties (molecular weight, water solubility, measures of lipophilicity, etc.) or molecular representations, defined as descriptions of a molecule that can be understood by an algorithm [6]. ADORE provides six molecular representations: four common fingerprints (MACCS, PubChem, Morgan, ToxPrints), the molecular embedding mol2vec, and the molecular descriptor Mordred. Providing six molecular representations enables researchers to investigate how a molecule can best be represented to achieve good model performance.
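The fingerprints above require a cheminformatics toolkit to compute, but the underlying idea they share — mapping a molecule to a fixed-length bit vector — can be sketched in plain Python. The character n-gram hashing below is a toy stand-in, not one of the representations shipped with ADORE:

```python
import hashlib

def smiles_ngram_fingerprint(smiles: str, n: int = 2, n_bits: int = 64) -> list[int]:
    """Toy fixed-length bit vector built from character n-grams of a SMILES
    string. Real fingerprints (MACCS, Morgan, ...) hash molecular
    substructures, not raw text; this only illustrates the bit-vector idea."""
    bits = [0] * n_bits
    for i in range(max(len(smiles) - n + 1, 1)):
        # Hash each n-gram to a bit position; collisions are expected.
        pos = int(hashlib.md5(smiles[i:i + n].encode()).hexdigest(), 16) % n_bits
        bits[pos] = 1
    return bits

# Ethanol ("CCO") and methanol ("CO") share the n-gram "CO",
# so their bit vectors overlap in at least one position.
fp_ethanol = smiles_ngram_fingerprint("CCO")
fp_methanol = smiles_ngram_fingerprint("CO")
shared_bits = sum(a & b for a, b in zip(fp_ethanol, fp_methanol))
```

Real fingerprints encode chemically meaningful substructures (rings, functional groups, circular atom environments), which is what makes them useful features for toxicity models.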
Meanwhile, adequately representing a species, capturing both what distinguishes it from other species and how it might react to a chemical, is not trivial, and to complicate matters further, we are restricted by data availability for many species. In ADORE, we firstly include information on ecology, life history, and pseudo-data used for dynamic energy budget (DEB) modeling to describe habitat, feeding and migratory behavior, anatomy, and life expectancy. Secondly, we added phylogenetic distances describing how closely two species are related, based on the time since their lineages diverged from their last common ancestor (Figure 3). We include phylogenetic information on the assumption that more closely related species have more similar sensitivity profiles than less closely related species.
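The notion of phylogenetic distance can be sketched in a few lines. The lineages and divergence times below are rough, illustrative values only, not the distances shipped with ADORE:

```python
# Ancestor chains (most recent first) with approximate divergence times in
# million years ago (Mya); illustrative values, not ADORE's actual data.
LINEAGES = {
    "O. mykiss":   [("Salmonidae", 50), ("Teleostei", 250), ("Bilateria", 600)],
    "P. promelas": [("Cyprinidae", 60), ("Teleostei", 250), ("Bilateria", 600)],
    "D. magna":    [("Daphniidae", 150), ("Bilateria", 600)],
}

def phylo_distance(sp1: str, sp2: str) -> float:
    """Distance = twice the age of the most recent common ancestor,
    since both lineages have evolved independently for that long."""
    ages2 = dict(LINEAGES[sp2])
    for ancestor, age in LINEAGES[sp1]:
        if ancestor in ages2:
            return 2 * age
    raise ValueError(f"no common ancestor listed for {sp1} and {sp2}")

# The two fish are closer to each other than either is to the crustacean.
fish_fish = phylo_distance("O. mykiss", "P. promelas")
fish_flea = phylo_distance("O. mykiss", "D. magna")
```

Under the assumption stated above, a model can use such distances to borrow sensitivity information from close relatives of a poorly tested species.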
The importance of appropriate data splittings
Crucially, model performance depends on splitting the data into a training set, from which the model learns, and a test set, used to evaluate the model's ability to generalize to data it has not accessed during training. Several approaches exist to create a train-test split; however, their usefulness varies depending on the particular use case. For many applications, randomly distributing data points between training and test sets is sufficient. The ADORE dataset, however, contains many repeated experiments, i.e., data points overlapping in chemical, species, and experimental conditions. Since these in vivo experiments are biological, they have inherent variability. Consequently, a repeated experiment, although performed on the same species and the same chemical, shows variability in its outcome (Figure 4).
When all data points are randomly divided between training and test set, data points from repeated experiments end up in both. Training a model on such a split leads to data leakage, i.e., the measured performance does not reflect the model's ability to generalize to unseen examples but rather its ability to memorize patterns in the training data that are also present in the test data. Data leakage is a common problem in applied ML research [7]; consequently, we discuss different approaches to obtaining train-test splits, such as splits that ensure each chemical ends up in either the training or the test set, and have provided fixed splits for further research.
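A leakage-free split can be sketched by assigning whole chemicals, rather than individual data points, to either side of the split. The snippet below is a simplified stand-in for the fixed splits shipped with ADORE, assuming records are dicts with a "chemical" key (the example records and values are made up; scikit-learn's GroupShuffleSplit implements the same idea):

```python
import random

def split_by_chemical(records, test_frac=0.2, seed=0):
    """Put every record of a given chemical entirely in train or in test,
    so repeated experiments never straddle the split."""
    chemicals = sorted({r["chemical"] for r in records})
    random.Random(seed).shuffle(chemicals)
    n_test = max(1, round(test_frac * len(chemicals)))
    test_chems = set(chemicals[:n_test])
    train = [r for r in records if r["chemical"] not in test_chems]
    test = [r for r in records if r["chemical"] in test_chems]
    return train, test

# Illustrative records: the two atrazine experiments are repeats and
# must land on the same side of the split.
records = [
    {"chemical": "atrazine", "species": "D. magna", "ec50": 1.2},
    {"chemical": "atrazine", "species": "D. magna", "ec50": 1.5},
    {"chemical": "copper",   "species": "O. mykiss", "ec50": 0.1},
    {"chemical": "phenol",   "species": "P. promelas", "ec50": 30.0},
]
train, test = split_by_chemical(records, test_frac=0.34)
```

With this grouping, test-set performance reflects generalization to chemicals the model has never seen, which is the question regulators actually care about.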
The way forward to better reproducibility and comparability
We envision that the ADORE benchmark dataset will help introduce ecotoxicology to machine learning experts, bringing the field closer to established methods for better reproducibility, comparability, and explainability. We have equipped the dataset with chemical background information such as functional uses and ClassyFire categories [8], which are not intended as modeling features but to make models more explainable.
Currently, we are finalizing a modeling paper on the fish challenge. We are also inviting other researchers to work with us to find the best models for different challenges and to employ ADORE for their ecotoxicological research questions.
The generation and compilation of ADORE is described in our Scientific Data descriptor [4].
If you have any questions, please reach out to us; we will be delighted to get in touch.
The MLTox project
The ADORE dataset has been compiled as part of the MLTox project.
A first version of this blog post has been published on Nature Communities.
Co-authors
- Christoph Schür, Eawag (Swiss Federal Institute of Aquatic Science and Technology)
- Marco Baity-Jesi, Eawag
- Kristin Schirmer, Eawag, ETH Zürich
- Fernando Perez Cruz, Swiss Data Science Center, ETH Zürich
References
- [1] Hartung, Thomas. (2023) “ToxAIcology – The Evolving Role of Artificial Intelligence in Advancing Toxicology and Modernizing Regulatory Science”, ALTEX - Alternatives to animal experimentation, 40(4), pp. 559–570.
- [2] Hartung, Thomas, and Aristides M. Tsatsakis. (2021) “The State of the Scientific Revolution in Toxicology”, ALTEX - Alternatives to animal experimentation, 38(3), pp. 379–386.
- [3] Muratov, Eugene N., Jürgen Bajorath, Robert P. Sheridan, Igor V. Tetko, Dmitry Filimonov, Vladimir Poroikov, Tudor I. Oprea, et al. (2020) “QSAR without Borders”, Chemical Society Reviews 49 (11): 3525–64.
- [4] Schür, Christoph, Lilian Gasser, Fernando Perez Cruz, Kristin Schirmer, and Marco Baity-Jesi. (2023) “A Benchmark Dataset for Machine Learning in Ecotoxicology”, Scientific Data 10 (1): 718.
- [5] Addor, Nans, Andrew J. Newman, Naoki Mizukami, and Martyn P. Clark. (2017) “The CAMELS Data Set: Catchment Attributes and Meteorology for Large-Sample Studies”, Hydrology and Earth System Sciences 21 (10): 5293–5313.
- [6] Cartwright, Hugh M., ed. (2020) “Machine Learning in Chemistry: The Impact of Artificial Intelligence”, Theoretical and Computational Chemistry Series. Cambridge: Royal Society of Chemistry.
- [7] Kapoor, Sayash, and Arvind Narayanan. (2023) “Leakage and the Reproducibility Crisis in Machine-Learning-Based Science”, Patterns 4 (9): 100804.
- [8] Djoumbou Feunang, Yannick, Roman Eisner, Craig Knox, Leonid Chepelev, Janna Hastings, Gareth Owen, Eoin Fahy, Christoph Steinbeck, Shankar Subramanian, Evan Bolton, Russel Greiner, and David S. Wishart. (2016) “ClassyFire: Automated Chemical Classification with a Comprehensive, Computable Taxonomy”, Journal of Cheminformatics 8: 61.