Open and reproducible environmental science: from theory to equations and algorithms

By
Stan Schymanski
June 29, 2017

Computer models for generation and use of scientific understanding

Mathematical and numerical models are increasingly important for our understanding and prediction of complex interactive processes. Consider, for example, our climate system. Current understanding of the physical processes underlying air movement, heat exchange, and evaporation-condensation of water is not sufficient to predict possible effects of elevated atmospheric CO2 concentrations on wind, temperature, humidity and precipitation patterns around the globe. We need complex models that accurately represent the feedbacks between different processes and compartments to inform us how a perturbation in one component may affect other components of the coupled climate-earth surface system that are relevant to us.

Process understanding encoded in mathematical equations and algorithms

How can our process understanding be transferred into such models? More fundamentally, what is our quantitative process understanding and where does it come from? Our understanding is ultimately gained from a growing body of observations, including experiments and environmental monitoring, and from logical reasoning. The induction of general laws from data (inductive reasoning) usually leads to the formulation of mathematical equations, e.g. Newton’s laws of motion. The body of established, general laws can then be used to deduce additional equations that help predict a process of interest (deductive reasoning). These equations are then translated into algorithms that represent various processes in a model and enable predictions about quantities of interest, e.g. stream-flow trends over the next 50 years. See Figure 1 for a graphical illustration of the process. In this blog post, we will focus on the steps of deducing equations from an existing body of knowledge and translating them into algorithms (see magnifying glasses in Figure 1).

Figure 1: Scientific method for hydrologic model development. Magnifying glasses indicate steps discussed here. (Modified from Clark et al., 2016)

Do not let understanding get lost in translation!

The seemingly straightforward steps of building a system of equations on prior knowledge and transferring these equations into models are often neither transparent nor easy to reproduce. Many mathematical derivations in papers contain the famous “it follows that” statement somewhere, introducing an equation whose origin remains entirely mysterious to the readers, even after multiple re-readings of the preceding sections. Obviously, since this has passed peer review, it MUST be correct, so, trusting the collective mental capacity of the authors, editors and reviewers, the readers proceed to assess the utility of the equations by studying the model output, which presumably looks very reasonable. Next, the readers would like to test the utility of the equations in their own model on their own data. To do this, it is important to understand the context in which the equations were actually used in the original model. Here comes the next problem: even if the model code is available, the equations in the code are usually not recognisable to the general reader. If the readers are lucky, the code documentation connects specific lines of code to the original equations in the paper.

The tragedy of missing details

So far so good. For the equations or the code to be re-usable, the readers must be able to substitute their own parameter values and compute results for their own problems. This is the most tragic part of the workflow, as the meaning of the model parameters and especially their units of measurement are often not readily accessible, due to omission or implicit use of discipline-specific conventions, which may change over time. The readers rely on their own intuition about the meaning of the parameters and the units in which they need to be entered, and if the results look plausible, they trust the model and their assumptions. However, wrong assumptions about units are often not immediately obvious and have led to epic failures in the past, such as the loss of the Mars Climate Orbiter (http://edition.cnn.com/TECH/space/9909/30/mars.metric.02/). Furthermore, explicit consideration of units of measurement in published equations sometimes reveals mismatches that may indicate a fundamental problem in the equations and/or variable definitions (see the example described below).

Enabling transparent and traceable conversion of knowledge

At the Swiss Data Science Center, we have developed the Environmental Science using Symbolic Math (ESSM) package, which allows transparent propagation of metadata about variables and equations from papers to the final code. When using this package, the user can easily access important information such as variable definitions, descriptions and units of measurement at any time. The package is built for the free software SageMath and makes use of the intuitive programming language Python. It is available on PyPI, and the source code can also be accessed on Zenodo and GitHub.

Among various methods for dealing with variable and equation metadata, the ESSM package also provides a built-in algorithm that checks the consistency of units when equations are formulated. This check, rightly considered self-evident in any derivation of physical equations, is often omitted in the literature, as in the famous paper by Priestley and Taylor (1972), which we use below to illustrate the utility of the framework along the line of thought presented above.

Example: Derivation of the Priestley-Taylor equation (Priestley and Taylor, 1972)

A key step in the derivation of the Priestley-Taylor equation is Equation 3, shown in a screen shot of the paper (Fig. 2).

Figure 2: Extract from Priestley and Taylor (1972).

Since the units of the variables were not specified in the paper, we make informed guesses based on the description in the text and widespread literature conventions.
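The sketch below shows what such variable definitions could look like in a notebook cell using the current essm API; the variable names, LaTeX symbols, descriptions and, in particular, the units are our own assumptions for illustration, not a verbatim copy of the original notebook.

```python
# Hypothetical reconstruction of the variable definitions (vardef.ipynb).
# All units below are assumptions based on common literature conventions.
from essm.variables import Variable
from essm.variables.units import joule, kelvin, kilogram, meter, second

class E_l(Variable):
    """Latent heat flux from the surface (assumed W m-2 = J s-1 m-2)."""
    unit = joule/second/meter**2
    latex_name = 'E'

class H(Variable):
    """Sensible heat flux from the surface (assumed W m-2)."""
    unit = joule/second/meter**2

class L_v(Variable):
    """Latent heat of vaporisation of water (assumed J kg-1)."""
    unit = joule/kilogram
    latex_name = 'L'

class s_slope(Variable):
    """Slope of saturation vapour density vs. temperature (assumed kg m-3 K-1)."""
    unit = kilogram/meter**3/kelvin
    latex_name = 's'

class c_pa(Variable):
    """Specific heat of air at constant pressure (assumed J kg-1 K-1)."""
    unit = joule/kilogram/kelvin
    latex_name = 'c_{pa}'
```

Each Variable subclass keeps its description and unit as metadata, so the definitions can be listed in a table at any point in the notebook.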


Our variable definitions in a table:

Figure 3: Snippet from Jupyter notebook using ESSM.

Using the variables defined above, we can write Equation 3 in Priestley and Taylor (1972) as a symbolic expression and verify visually that it is consistent with the formulation shown in the screen shot above:

Figure 4: Snippet from Jupyter notebook using ESSM.
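Since the exact notation of Equation 3 cannot be reproduced here, the following cell uses an assumed reading in which the ratio of latent to sensible heat flux is expressed in terms of L, s and cpa; treat it as a placeholder for the expression shown in Figure 4, not as the authors’ original formulation.

```python
# Hypothetical reading of Equation 3, written as a symbolic expression
# using the variables defined above (the exact form is our assumption).
from sympy import Eq

eq3_expr = Eq(E_l/H, L_v*s_slope/c_pa)
eq3_expr  # rendered by the notebook for visual comparison with the paper
```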

Now, we will try to use the above expression to define a physical equation representing Equation 3 in the paper:

Figure 5: Snippet from Jupyter notebook using ESSM.
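The cell behind Figure 5 could look roughly like this, again based on our assumed reading of the equation; essm checks the dimensions of both sides when the Equation subclass is defined, and the exact exception type and wording will differ from the paraphrase below.

```python
# Hypothetical attempt to register Equation 3 as a physical equation.
# essm verifies dimensional consistency when the class is created.
from sympy import Eq
from essm.equations import Equation

class eq_priestley_taylor_3(Equation):
    """Equation 3 of Priestley and Taylor (1972), in our assumed reading."""
    expr = Eq(E_l/H, L_v*s_slope/c_pa)

# With the units assumed above, this definition is rejected with a
# dimensional-consistency error, as described in the text below.
```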

The package returns an error informing us that the left-hand side of the equation is non-dimensional, while the right-hand side has units of kg m-3. Clearly, the units of Equation 3 do not match if we use our assumptions about the units of L, s and cpa. Either the equation is missing a division by a density term (units of kg m-3) on the right-hand side, or one of our assumptions about the units involved differed from what the authors had in mind. In any case, if we were not aware of the problem and simply substituted values for the symbols in the equation to estimate latent or sensible heat flux, we would likely get a result that has no physical meaning. It is left to the reader to investigate how the Priestley-Taylor equation was interpreted and used in the literature (over 3,000 citations!). An automated extraction and analysis of equations and variable definitions from such a large number of papers is a separate problem that could be tackled with data science methods, but this is outside the scope of this article.

Become part of the new movement for open and re-usable science!

The scientific community is becoming more and more aware of the advantages of open and re-usable science. (Just search the web for “open science” to get an impression). Whereas many initiatives focus on open data, the initiative presented here focuses on open and re-usable encodings of theory. The general workflow of (re-)producing algebraic derivations in a traceable way and injecting the resulting equations into quantitative computer models has already been used in scientific publications (e.g. Schymanski and Or, 2017; Schymanski, Breitenstein and Or, 2017), which are freely available online, along with the underlying data and code (https://doi.org/10.5281/zenodo.241259, https://doi.org/10.5281/zenodo.241217). The ESSM package is designed to greatly facilitate this approach and provide a blueprint for self-consistent and traceable analysis of quantitative problems involving physical variables. Please try it out and give feedback (bug reports, feature requests, questions) at https://github.com/environmentalscience/essm.

Co-authors

With intellectual input by:

Bibliography

  1. Clark, M. P., Schaefli, B., Schymanski, S. J., Samaniego, L., Luce, C. H., Jackson, B. M., Freer, J. E., Arnold, J. R., Moore, R. D., Istanbulluoglu, E. and Ceola, S.: Improving the theoretical underpinnings of process-based hydrologic models, Water Resour. Res., 52(3), 2350–2365, doi:10.1002/2015WR017910, 2016.
  2. Priestley, C. H. B. and Taylor, R. J.: On the Assessment of Surface Heat Flux and Evaporation Using Large-Scale Parameters, Monthly Weather Review, 100(2), 81–92, doi:10.1175/1520-0493(1972)100<0081:OTAOSH>2.3.CO;2, 1972.
  3. Schymanski, S. J., Breitenstein, D. and Or, D.: Technical note: An experimental setup to measure latent and sensible heat fluxes from (artificial) plant leaves, Hydrol. Earth Syst. Sci. Discuss., 2017, 1–40, doi:10.5194/hess-2016-643, 2017.
  4. Schymanski, S. J. and Or, D.: Leaf-scale experiments reveal an important omission in the Penman–Monteith equation, Hydrol. Earth Syst. Sci., 21(2), 685–706, doi:10.5194/hess-21-685-2017, 2017.
