Lake ecosystems exhibit rich spatial and temporal dynamics that both influence and are influenced by many ecologically relevant phenomena. Current practice in environmental science relies on manual analysis and can therefore benefit greatly from big data and an accurate depiction of lake heterogeneity. The Datalakes project provides a user-friendly online platform for spatial and temporal analysis of lakes through hydrological and ecological data, comprising both in-situ recordings and simulations. Furthermore, the project led to the development of a novel Bayesian 3D hydrological model that has been deployed on one of the largest lakes in Europe.
The lakes of Switzerland are essential for numerous reasons: they supply drinking water, serve as both heat source and heat sink for energy, and support fishing and recreational activities. At the same time, lakes are threatened by changes in fish yield, micro-pollutants, harmful algal blooms, and greenhouse gas emissions. Despite their rich spatial and temporal dynamics, lakes are still often represented as maps of uniform properties. Consequently, significant resources are required for manual analysis in order to stay ahead of catastrophes [Müller’14, Raymond’13, Bonvin’11]. Developments in data science can help minimize these risks and allow for a more efficient use of resources.
Datalakes provides an open-access online data platform that holds scientific data for Swiss lakes. The data ranges from local observational (in-situ) and remote satellite measurements of Swiss lakes to Bayesian simulations and machine-learning-based hydrological lake features. During the project, a novel 3D hydrological model that also provides uncertainty estimates faithful to the underlying physics was developed and evaluated on Lake Geneva. The project is led by the Swiss Federal Institute of Aquatic Science and Technology (EAWAG), the Swiss Data Science Center (SDSC), the Swiss Federal Institute of Technology Lausanne (EPFL), the University of Lausanne (UNIL), the University of Geneva (UNIGE), and the Alpine Center for research on trophic networks and limnic ecosystems at INRA (CARRTEL).
A look into the data portal and its dissemination
While the list of publicly available data is vast (see here), these datasets can be filtered into groups based on their scientific nature (physical, biological, chemical), means of acquisition (in-situ, simulated, remote sensing), responsible source (i.e., the entity that produced the data), geographical location, time interval of interest, or even a feature of interest (e.g., water pressure). Naturally, some measurements exist only at a single geospatial location while others are spread across a region (e.g., the surface of a lake). A subset of the sensory data used in the Datalakes project is shown in Figure 1.
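As a small illustration of this faceted filtering, the sketch below mimics the portal's filter facets on a few hypothetical metadata records (the record fields and dataset names are invented for the example, not the portal's actual schema):

```python
# Hypothetical metadata records mimicking the portal's filter facets.
datasets = [
    {"name": "Thermistor chain", "nature": "physical", "acquisition": "in-situ"},
    {"name": "Chlorophyll-a profiles", "nature": "biological", "acquisition": "in-situ"},
    {"name": "Surface temperature", "nature": "physical", "acquisition": "remote sensing"},
]

def filter_datasets(records, **facets):
    """Keep only the records matching every requested facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in facets.items())]

# Combine facets, e.g. physical datasets acquired in-situ:
physical_insitu = filter_datasets(datasets, nature="physical", acquisition="in-situ")
```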
Beyond providing a platform to interact with various sources of data, a major deliverable of Datalakes is the creation of a scalable hydrological model that can produce 3D numerical simulations of lakes [Safin’21, Šukys’21]. We developed a Bayesian hydrological model that can exploit observations of different forms, both spatial and temporal, spread across the surface and depths of Lake Geneva (see Figure 2).
Figure 2: The hydrodynamic model makes use of a multitude of measurements spread across Lake Geneva.
Rather than reporting only a sparse set of sensor readings around a lake, the developed Bayesian model can infer the entire hydrological state of the lake as well as forecast its future. This is achieved by physically modeling the entire lake, followed by calibration based on meteorological features and a handful of in-situ stations on the lake.
The hyper-parameters of the developed model are calibrated with an ensemble affine-invariant sampler (EMCEE) and particle filtering (PF) algorithms [Andrieu’10], using observations from 2019. These computations are run with the Scalable Uncertainty Quantification (SPUX) framework [Šukys’21], developed at EAWAG. The model has a 60 s temporal resolution over a 1000 m horizontal grid and 50 vertical layers, for simulations reaching down to 50 m below the lake surface.
The idea behind calibration can be explained in simple terms:
- The EMCEE sampler draws N sets of hydrological model parameters based on the current lake state, i.e., the priors.
- For each parameter set, M lake simulations (i.e., particles) are initialized and run forward until an observation becomes available. At that point, the posterior for each particle is computed from its likelihood given the observation(s).
- Based on their posteriors, particles are resampled, effectively multiplying likely particles while discarding unlikely ones.
- These resampled particles can then be used as observations, in a Bayesian inference sense, to compute the posterior (i.e., an update) for the hydrological model parameters.
Simply put, the hydrological model simultaneously simulates a multitude of probable states for the hydrological features of the lake. Given a sparse set of sensory observations, the probabilities of the different simulations are adjusted, as shown in Figure 3.
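The reweight-and-resample step at the heart of this loop can be sketched in a few lines of pure Python. This is a toy illustration, not the SPUX implementation: the scalar "lake state" and the Gaussian likelihood are assumptions made for the example, whereas the actual particles are full 3D lake simulations.

```python
import math
import random

def resample(particles, weights, rng):
    """Multinomial resampling: likely particles are duplicated,
    unlikely ones are dropped."""
    return rng.choices(particles, weights=weights, k=len(particles))

def calibration_step(particles, likelihood, rng):
    """One assimilation step: weight each particle by its likelihood
    given the new observation, then resample."""
    weights = [likelihood(p) for p in particles]
    total = sum(weights)
    return resample(particles, [w / total for w in weights], rng)

# Toy usage: particles are scalar "lake states"; the observation says
# the true state is near 12.0, with Gaussian noise (sigma = 1.0).
rng = random.Random(42)
particles = [rng.uniform(0.0, 20.0) for _ in range(1000)]
obs = 12.0
gaussian_lik = lambda p: math.exp(-0.5 * (p - obs) ** 2)
particles = calibration_step(particles, gaussian_lik, rng)
mean = sum(particles) / len(particles)  # concentrates near the observation
```

After one step the resampled ensemble clusters around the observed value, which is exactly the behavior that lets a sparse set of sensors steer an ensemble of physically simulated lakes.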
A great advantage of this method is that the hydrological model is driven entirely by the physics of hydrodynamics. The sensor observations are used to reweight the likelihoods of the particles, but they are never used to update the model states directly.
Figure 3: Calibration of the developed hydrological model.
Machine learning to overcome bottlenecks
One main observation source for the calibration of the hydrological model is the lake surface water temperature estimated from infrared satellite imagery covering the complete lake surface. Unfortunately, making direct use of this data comes with challenges. Lake surface temperature readings from satellite imagery and in-situ measurements (at about 1 cm depth) are known to differ by more than 2°C over the course of a regular day. This divergence stems from an undetermined heat-transfer function between the many layers of water and the atmosphere. Furthermore, the hydrological model computes lake dynamics between roughly 1 m and 50 m depth; estimating the lake temperature closer to the water surface would require extremely high simulation resolution, which is computationally prohibitive.
To overcome this bottleneck, we use machine learning to find a data-driven mapping between temperature simulations at 1 m depth (bulk temperature) and lake surface temperature. Since the disparity between surface and bulk temperature arises from turbulence and heat transfer near the water surface, we hypothesize that a mapping from bulk to surface temperature is a function of additional meteorological components. For this purpose, we rely on auxiliary features such as wind velocity, air temperature, humidity, and solar irradiance, which can be obtained either at specific in-situ meteo-stations or over a whole lake through MeteoSwiss probabilistic estimates. Accordingly, we optimize a deep learning model with a long short-term memory (LSTM) architecture that not only utilizes this set of features but also exploits temporal patterns to learn an accurate mapping to the lake surface temperature [Safin’21] (see Figure 4). Our results on held-out test evaluations show that, despite extreme sparsity in the sensory data, our model significantly improves lake surface temperature estimates compared to relying on bulk temperature alone.
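To make the recurrence concrete, here is a single LSTM cell written in pure Python. It is a didactic sketch only: the feature ordering, hidden size, random weights, and untrained readout are assumptions for the example and not the trained model of [Safin’21].

```python
import math
import random

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyLSTMCell:
    """A single LSTM cell in pure Python, for illustration only."""
    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        n = n_in + n_hidden
        # One weight matrix and bias vector per gate: input, forget, cell, output.
        self.W = {g: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                      for _ in range(n_hidden)] for g in "ifco"}
        self.b = {g: [0.0] * n_hidden for g in "ifco"}

    def step(self, x, h, c):
        z = list(x) + list(h)  # concatenate features with previous hidden state
        def gate(name, act):
            return [act(sum(w * v for w, v in zip(row, z)) + bj)
                    for row, bj in zip(self.W[name], self.b[name])]
        i = gate("i", _sigmoid)   # input gate
        f = gate("f", _sigmoid)   # forget gate
        o = gate("o", _sigmoid)   # output gate
        g = gate("c", math.tanh)  # candidate cell state
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new

# Usage: each time step feeds [bulk_temp, wind, air_temp, humidity, irradiance];
# a readout of the final hidden state would predict surface temperature.
cell = TinyLSTMCell(n_in=5, n_hidden=8)
h, c = [0.0] * 8, [0.0] * 8
for features in [[14.2, 3.1, 18.0, 0.70, 620.0],
                 [14.5, 2.8, 18.4, 0.68, 640.0]]:
    h, c = cell.step(features, h, c)
surface_temp = 14.0 + sum(h)  # untrained readout, purely illustrative
```

The cell state carries information across time steps, which is what lets the model exploit temporal patterns in the meteorological forcing rather than treating each day in isolation.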
An important aspect of modeling this mapping is to retain the Bayesian nature of the predicted lake surface temperatures, that is, to provide an estimate of uncertainty. We combine multiple methodologies from the literature [Kendall’17, Tagasovska’19, Tagasovska’21] to quantify prediction uncertainties from different sources. Consequently, we can adjust the influence of a given satellite image on the likelihoods of the hydrological model's simulation particles.
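One ingredient behind this idea, the heteroscedastic Gaussian negative log-likelihood from [Kendall’17], can be sketched as follows. The `observation_weight` rule is a simplified illustration of how a predicted uncertainty could temper an observation's influence, not the exact weighting used in the project.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2). When a network
    predicts both mu and sigma, minimizing this loss lets it learn an
    input-dependent (aleatoric) uncertainty [Kendall'17]."""
    return (0.5 * math.log(2.0 * math.pi * sigma ** 2)
            + (y - mu) ** 2 / (2.0 * sigma ** 2))

def observation_weight(y_obs, mu_pred, sigma_pred):
    """Illustrative rule: down-weight a satellite observation when the
    surface-temperature model is uncertain about its own correction."""
    return math.exp(-gaussian_nll(y_obs, mu_pred, sigma_pred))

# Same prediction error, different predicted uncertainty: the confident
# prediction influences particle likelihoods more than the uncertain one.
w_confident = observation_weight(12.0, 12.3, sigma_pred=0.5)
w_uncertain = observation_weight(12.0, 12.3, sigma_pred=2.0)
```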
Figure 4: A schematic of the developed deep learning model.
Open access and reproducible research
The Datalakes project provides three different means of accessing, visualizing, and downloading its geospatial data. To understand the content and its relevance for a particular audience, one should first look at the map viewer. The next step is either exploring and downloading a dataset through the data portal or using the API for automated pipelines.
The map viewer interface allows environmental and citizen scientists to easily inspect the data available on the Datalakes platform without needing to download and parse it (see Figure 5). This is an extremely useful feature for gauging the scope of a dataset, saving significant effort in understanding it.
The data portal provides a human-friendly display of further details of a dataset, such as the features it contains and its spatial and temporal coverage. It allows one to trivially plot data over a desired time interval, download it, and even launch a Docker environment on Renkulab containing the scripts for reproducing the parsed data representations. Furthermore, one can check additional details such as the contact information of the person responsible for a dataset and its license.
Figure 6: A snapshot of the Datalakes data portal for an example dataset, showing the ease of reproducibility: a) data and scripts can be downloaded locally, b) data and scripts can be run in an interactive Renkulab environment, c) the preprocessing steps for the dataset are provided.
The Datalakes API allows fetching data from the Datalakes portal to be automated, removing the burden of downloading it manually. This is particularly important for online systems that rely on regularly downloading recent datasets, such as frameworks built on the simulations from the hydrological model developed within the Datalakes project.
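A scheduled pipeline could poll such an API with nothing but the standard library, as sketched below. The base URL, endpoint path, and query parameters here are assumptions for illustration; consult the actual Datalakes API documentation for the real routes.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://api.datalakes-eawag.ch"  # hypothetical base URL

def dataset_url(dataset_id, **params):
    """Build a query URL for a dataset (endpoint path is an assumption)."""
    query = f"?{urlencode(params)}" if params else ""
    return f"{BASE}/datasets/{dataset_id}{query}"

def fetch_dataset(dataset_id, **params):
    """Download and decode a JSON dataset (performs a network call)."""
    with urlopen(dataset_url(dataset_id, **params)) as resp:
        return json.load(resp)

# Offline example: the URL a daily cron job would request.
url = dataset_url(42, start="2019-01-01", end="2019-12-31")
```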
Throughout the Datalakes project, we developed an open platform to facilitate the use of data mining in the environmental sciences. On the data science front, we developed a novel Bayesian data assimilation model with sophisticated particle filtering algorithms for 3D hydrodynamic simulations. This approach generates physically realistic trajectories and is the first of its kind shown to be successful at large scale, on Lake Geneva. In addition, we optimized a data-driven model to map bulk temperature to lake surface temperature, which was used during the calibration of the hydrological model. Naturally, the lake surface temperature predictions of this model will also be made available on the Datalakes platform.
* Affiliations throughout the Datalakes project
Andrieu, Christophe, Arnaud Doucet, and Roman Holenstein. “Particle markov chain monte carlo methods.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72.3 (2010): 269-342.
Kendall, Alex, and Yarin Gal. “What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?” NeurIPS 2017.
Müller, Beat, René Gächter, and Alfred Wüest. “Accelerated water quality improvement during oligotrophication in peri-alpine lakes.” Environmental science & technology 48.12 (2014): 6671-6677.
Raymond, Peter A., et al. “Global carbon dioxide emissions from inland waters.” Nature 503.7476 (2013): 355-359.
Safin, Artur, Damien Bouffard, Cintia L Ramon, Firat Ozdemir, James Runnalls, Fotis Georgatos, Camille Minaudo, and Jonas Šukys. “Calibration of 3D hydrodynamic model of Lake Geneva using a Bayesian data assimilation framework”, submitted to Geoscientific Model Development, 2021.
Šukys, Jonas, and Marco Bacci. “SPUX Framework: a Scalable Package for Bayesian Uncertainty Quantification and Propagation.” arXiv:2105.05969 (2021).
Tagasovska, Natasa, Damien Ackerer, and Thibault Vatter. “Copulas as high-dimensional generative models: Vine copula autoencoders.” NeurIPS 2019.
Tagasovska, Natasa, Axel Brand, and Firat Ozdemir. “Reclaiming uncertainties of deterministic deep models with Vine Copulas.” RobustML workshop, ICLR 2021.