1) Model-Based Prediction and Inference
On the heels of the data revolution, Earth and environmental science is undergoing a model revolution. This revolution brings together disparate researchers to harvest scientific understanding from the data deluge, particularly at the intersection of geospatial data science, environmental science, and statistics/machine learning. The next generation of models will be developed collaboratively by domain experts and data scientists/statisticians, and must scale to massive data, leverage spatiotemporal coherence in heterogeneous data, embed scientific knowledge, and answer science questions.
Long-term monitoring data indicate that the largest wildfires are getting larger on average each year in the United States.
To understand why, an interdisciplinary team of fire ecologists, climate scientists, geomorphologists, and data scientists at Earth Lab and other institutions collaborated to build a predictive model of extreme wildfire events. We built the model around one simple idea: the size of the largest fire in a region depends on 1) the total number of wildfires ignited and 2) the distribution of fire sizes within the region. We found that temperature and dryness relate nonlinearly to the likelihood of an extreme fire via these ignition and size pathways. Further, because these effects are nonlinear, even a small amount of drying in an already dry region can lead to a substantial increase in the chance of an extreme wildfire event. For more, see Joseph et al. (2019).
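The two-pathway idea above can be illustrated with a minimal sketch. It is not the model from Joseph et al. (2019). It assumes, purely for illustration, that fire sizes in a region are independent draws from a lognormal distribution, so that the largest of n fires stays below a threshold only if every fire does. The function names and all parameter values are hypothetical.

```python
import math


def lognormal_cdf(x, mu, sigma):
    """CDF of a lognormal distribution (mu, sigma are on the log scale)."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def prob_max_exceeds(threshold, n_fires, mu, sigma):
    """P(largest of n_fires independent fires exceeds threshold).

    Assumption: sizes are i.i.d. lognormal, so
    P(max <= threshold) = F(threshold) ** n_fires.
    """
    if n_fires == 0:
        return 0.0  # no ignitions, so no fire can exceed the threshold
    return 1.0 - lognormal_cdf(threshold, mu, sigma) ** n_fires


# Hypothetical comparison: the chance of an extreme fire can rise through
# either pathway, more ignitions or a shift in the size distribution.
baseline = prob_max_exceeds(threshold=10_000, n_fires=100, mu=3.0, sigma=1.5)
more_fires = prob_max_exceeds(threshold=10_000, n_fires=150, mu=3.0, sigma=1.5)
larger_sizes = prob_max_exceeds(threshold=10_000, n_fires=100, mu=3.5, sigma=1.5)
print(baseline, more_fires, larger_sizes)
```

Because the size CDF is raised to the power of the fire count, changes in either input feed through to the probability of an extreme event in a nonlinear way.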
Orographic Controls on Subdaily Rainfall Statistics and Flood Frequency in the Colorado Front Range, USA (Matt Rossi)
Models allow us to link observed data back to the hidden rules of nature, and predict what might happen in the future. In the age of big data, we are now ushering in the next phase of scientific research that leverages big models to extract insight from the raw material provided by Earth observation data.
2) Deep Learning in Earth Science
Deep learning has revolutionized computer vision and natural language processing in the past decade. Neural networks that approximate functions are the workhorse of deep learning, and have been tailored to a wide variety of applications. Often, deep learning relies on large amounts of labeled data for training.
Deep learning applications in science are beginning to open up new insights into earth and environmental data, especially in the remote sensing domain. High availability of large amounts of remote sensing data and “off-the-shelf” models and code now makes deep learning practical for remote sensing tasks such as image classification, object detection, scene segmentation, data fusion, transfer learning, and time series analysis.
An example of how a deep learning based image segmentation model can be applied to remote sensing imagery.
Earth Lab has used very high resolution multispectral imagery from DigitalGlobe to map impervious surfaces in urban areas. Such maps have traditionally been generated through a series of supervised image classification tasks. Here, image segmentation models from the deep learning community were instead trained to segment impervious surfaces using any subset of the spectral bands available from DigitalGlobe’s WorldView-2 multispectral imaging system. A collaboration with the Department of Environmental Design at CU Boulder enabled the training of the image segmentation model [BLOG].
A UNet image segmentation model was trained on a DigitalGlobe WorldView-2 image from 2016, using a GIS polygon dataset (top left) as training data, and was also applied to an image acquired in 2015. The resulting segmentation mask covers all surfaces marked as ‘impervious’ in the training dataset (bottom); the continuous prediction surface available before thresholding is shown at top.
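The thresholding step mentioned in the caption can be sketched as follows. This is a toy illustration rather than Earth Lab's pipeline: nested Python lists stand in for a raster of per-pixel probabilities, the 0.5 cutoff is an assumed value, and the function names and pixel values are invented.

```python
def threshold_mask(prob_surface, threshold=0.5):
    """Convert per-pixel probabilities into a binary mask (1 = impervious)."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_surface]


def impervious_fraction(mask):
    """Share of pixels labeled impervious in a binary mask."""
    n_pixels = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / n_pixels


# Hypothetical 3x3 continuous prediction surface (values are made up).
prob_surface = [
    [0.92, 0.81, 0.10],
    [0.67, 0.45, 0.05],
    [0.30, 0.12, 0.02],
]

mask = threshold_mask(prob_surface, threshold=0.5)
print(mask)
print(impervious_fraction(mask))
```

Keeping the continuous surface alongside the mask lets the cutoff be revisited later without rerunning the model.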
Neural networks are increasingly being used in science to infer hidden dynamics of natural systems from noisy observations. Earth Lab is at the forefront of science-based deep learning in environmental science. While typical deep learning applications focus on image classification, sequence prediction, or regression tasks without reference to any particular scientific model or dynamical system, we are particularly interested in ways to embed science knowledge in deep learning models. As an example, we are building neural hierarchical models of ecological populations that combine the function approximation power of deep learning with well-known ecological models for occupancy, capture-recapture, and animal movement data (Joseph 2020).
Earth Lab draws upon the rich set of methods provided by deep learning to gain new insights from spatiotemporal data streams, and new applications in earth and environmental science are emerging at an increasing pace. As these applications develop, we believe that building strong links back to science is critical.
3) Scalable Scientific Computing in the Cloud
Modern science sometimes requires computers with much more power than is available from your average personal computer. At Earth Lab we develop and deploy our analyses in cloud environments to more easily scale our science.
Cloud computing can be intimidating, but by deploying familiar development environments such as RStudio and Jupyter Notebooks in the cloud, we are lowering the barrier to entry for our scientists. By working in familiar environments, we can get our work done more quickly, with less time and effort invested in learning totally new computational workflows. As a side effect of this decision, we ensure that each user has access to their own resources, which means that they have total control over what is installed in their computational environment.
To make it easy to spin up these environments, Earth Lab curates a set of Docker images that have commonly used packages pre-installed. These Docker images are hosted and built on Docker Hub, and are free to use: https://hub.docker.com/u/earthlab/
By using Docker containers, we can run identical development environments and workflows locally and in the cloud with minimal configuration pain.
4) Open Science and Reproducible Research
Earth Lab embraces open science by publishing open data, developing open source software, releasing reproducible workflows, and championing open access publication. Open data enables a broader community to extract insight from earth observations. Open source software provides a mechanism for community-driven tools that make data easier to work with, expediting the research process. By constructing reproducible scientific workflows from open data and open source software, analyses can be chained together to track the provenance of scientific conclusions that are made more readily available through open access publication.
EarthPy was originally designed to support the Earth Analytics Education program at Earth Lab - University of Colorado, Boulder, but is a general purpose open source Python library that simplifies spatial data exploration by building on functions in existing Python libraries including Rasterio and GeoPandas. EarthPy allows the user to streamline common geospatial data operations in a modular way. This reduces the amount of repetitive coding required to open and stack raster datasets, clip the data to a defined area, and, in particular, plot data for investigation.
Example of a hillshade plotted with a digital elevation model, created using the EarthPy Python library.
Earth Lab invests in reproducible tools to automate article generation. We use R Markdown and/or GNU Make to orchestrate data analysis, model fitting, figure generation, and paper compilation. This ensures that we can easily trace the provenance of our results and avoid error-prone manual entry of those results in scientific papers. Furthermore, this facilitates testing of the computational environment used in our analyses, so that we can have confidence that others can reproduce the same analysis and more easily build upon our work. One such project can be found on GitHub (https://github.com/mbjoseph/wildfire-extremes), for the paper “Spatiotemporal prediction of wildfire size extremes with Bayesian finite sample maxima” (Joseph et al. 2019, https://doi.org/10.1002/eap.1898).
We believe that by supporting open reproducible scientific research, scientific inference becomes more reliable, easier to verify, and more readily extensible. Ultimately these features of open reproducible research have the potential to shorten the path from data to discovery.