In almost every field (and especially in any kind of data science), there will be times when we wish to estimate data we don’t have by using the data that we do have. Given the plethora of machine learning algorithms and the similarly-daunting number of implementations of these algorithms (in different languages, packages, etc), it can be difficult to know where to start. This post shares my biggest takeaways from diving into using some of these tools, in the context of estimating air pollution from wildfires over the western US.

**Remotely sensed image of the 2015 Washington Wildfires, showing smoke mixed with cloud cover. **

## Science Motives

Since September 2017 I’ve worked on the Environmental Health Team in Earth Lab, investigating the impacts of exposure to air pollution from wildfires on human respiratory and cardiovascular health in the western US. To study the health impacts of air pollution, we have to know how much air pollution there was at each time and location of interest. (For the health analysis, we can then determine statistical relationships between quantities of air pollution and medical records.) However, a relatively sparse network of regulatory-standard air quality monitors does not offer continuous measurements of air quality across the western US.

**Figure 1: Sparse network of regulatory-standard air quality monitors in the western US. **

To estimate air pollution (specifically PM2.5, fine particulate matter—which is an air pollutant of primary concern for health) across the western US from 2008-2018, my team is employing a collection of of machine learning and spatial modeling techniques (described in the following sections). Read here about the variety of environmental data sets (meteorological variables, satellite imagery, etc) that my team is using, as well as the policy implications of this research.

## Machine Learning with R caret and caretEnsemble packages

In my previous blog post, Wildfire Air Pollution and Health, I introduced the machine learning techniques I was just starting to use for this project, as well as some general machine learning concepts and terminology, such as *cross-validation* and *tuning parameters*. Here, I’ll share some of my experiences from applying these techniques (and others) to our data. My team and I are still finishing processing some environmental variables in our 2008-2018 coverage of the western US. However, I have been experimenting with a subset of the data which covers the 11 western states from early 2008 through the end of 2010, and includes MODIS MAIAC and GASP (satellite sensors), Aerosol Optical Depth (AOD), wind speed vectors, atmospheric pressure, temperature and humidity, elevation, and proximity to major roadways.

I have been working in the programming language R, primarily using the caret and caretEnsemble packages to perform machine learning. Initially, I became familiar with **caret** through the “Machine Learning Toolbox” [in R] module on DataCamp. I also learned lots from Zev Ross’s “Predictive modeling and machine learning in R with the caret package” tutorial.

**Caret** allows me to easily perform the following tasks (with functions bold):

**sample**separate training and testing data (in experimentation thus far, I have used a 90:10 ratio of training to testing data)- build a grid of tuning parameters (iterating through these parameters allows for optimization of the model during training)
**resample**folds of my training data, which are then used in cross validation (this helps avoid overfitting the model on the training data)- build a
**trainControl**object which specifies all the cross validation arguments, including the number of times I want to repeat model fitting **train**the model, specifying the model structure, performance metric (Root Mean Squared Error, Accuracy, etc), preferred machine learning algorithm, and any other parameters relevant for that algorithm. For instance, Ranger (a highly efficient random forest algorithm) accepts arguments for the number of decision trees and the importance rule, which determines how variable importance is calculated.- List variable importance (via the
**varImp**function) **predict**at new locations, given new predictor data at the new locations

The steps above harness one kind of machine learning algorithm at a time. An extension of single-algorithm machine learning is to create a pooled model called an ensemble or stack. In essence, we run a bunch of individual models with different algorithms, and then run another algorithm (which I refer to as the “stacker algorithm”) on the output from the individual models. By blending the results from several machine learning algorithms together, an ensemble is able to capture more variation in the data than any single algorithm could, leading to higher prediction accuracy.

**Figure 2: Diagram of a machine learning stack.**

The **caretEnsemble** package is especially user-friendly because it provides ensemble methods which are compatible with normal **caret** methods, such as using a **trainControl** object. The **caretEnsemble** package allows me to do the following (with functions, again, bold):

- Run a set of machine learning algorithms (typically two to four algorithms) on the data, and store the resulting set of models in a
**caretList**object - Blend models using a generalized linear model (glm) as the stacker algorithm using
**caretEnsemble** - Blend models using any caret algorithm (glm, gbm, random forest, etc) as the stacker algorithm using
**caretStack**

Here is a more detailed Introduction to caretEnsemble. I also recommend this code example for how to build a stack using **caretEnsemble**. This tutorial is for classification (predicting discrete classes), but adapting it for regression (predicting numerical values from a continuous distribution) is quite straightforward.

## Fumbling with Ensembles (Some Takeaways from My Experience)

In machine learning, *classification* refers to sorting data into discrete classes, whereas *regression* refers to estimating values from a continuous distribution. My project necessitates using regression to estimate the exact concentration of PM_{2.5}. Thus, to start my machine learning odyssey, I examined the whole list of **caret** algorithms that support regression (as opposed to classification only). There are 136 in total, which I found using the following code:

Eventually, I selected eleven algorithms, refining the list by trying to get at least one from each well-known category of machine learning regression algorithms, and researching which algorithms within each category performed the best. Here is the list of algorithms I used, along with the name of each algorithm as it is called in the **caret** package:

- Generalized Linear Model = “glm”
- Generalized Additive Model using Splines = “bam”
- Stochastic Gradient Boosting = “gbm”
- Random Forest = “ranger”
- Supervised Principal Component Analysis = “superpc”
- Multi-Layer Perceptron, with multiple layers = “mlpML”
- Gaussian Process with Radial Basis Function Kernel = “gaussprRadial”
- Support Vector Machines with Radial Basis Function Kernel = “svmRadial”
- eXtreme Gradient Boosting = “xgbTree”
- GLM with Elastic Net Regression = “glmnet”
- Bagged MARS (Multivariate Adaptive Regression Splines) = “bagEarth”

Observations from my experimentation with these algorithms on my team’s air quality and environmental predictors data include:

- Regarding individual algorithms:
- Ranger always performs the best, followed by gbm-gaussprRadial-xgbTree (these three are very close), and svmRadial
- Then bagEarth
- Then glmnet and glm—these perform about the same (not well), but they are quick so it wouldn’t hurt much to include them in an ensemble

- Regarding stacker algorithms:
- Glm performs worst (basically achieves the R
^{2}of the best individual algorithm) - Ranger performs best (increases the R
^{2}several percentages above the best individual algorithm), but takes a long time - Reducing the number of repetitions (which often results in the same choice of tuning parameters) makes this a more viable option
- Gbm performs nearly as well as Ranger in the role of the stacker, and takes less time (even with three repetitions)

- Glm performs worst (basically achieves the R
- Regarding tuning:
- Increasing the number of resamples (more than ten) in the
**trainControl**object helps a bit, but takes a lot longer - The tuneLength parameter doesn’t seem to change the results that much (there are only slight improvements in R
^{2}and RMSE)

- Increasing the number of resamples (more than ten) in the

Because I observed that a solo Ranger model always performed nearly as well (and ran much faster) compared to a stack of three or four algorithms, I decided to just use Ranger in subsequent testing. Other modelers with different data sets and computational restrictions may well choose a different strategy.

## Report Cards: How Each Machine Learning Model Performed

To evaluate each model, I look at both the Root Mean Squared Error (RMSE) in predicted versus actual PM_{2.5} concentrations and the R^{2} value (which represents the percentage of variation in the training data that the model is explaining) on both the training and testing data sets. When modeling, we strive for lower RMSE and higher R^{2}, both of which mean that our results are more accurate. Not surprisingly, the training RMSE and R^{2} values were lower and higher respectively (better) than their testing counterparts, which shows that the kind of cross-validation we can easily implement with **caret** is still not as honest as holding out an entirely separate testing set (as I did at the beginning using the **sample** function). Read more here about the practice of using separate training and testing data.

Another way scientists often evaluate such models is by calculating “spatiotemporal R^{2} ”. This metric describes the model performance when we hold out entire geographic regions or time periods from the training data, and then test on these held-out regions. The argument is that this procedure more realistically evaluates the model’s performance when predicting at locations for which we have no past or present monitoring data. I implemented this procedure using caret’s **groupKFold** function. Unsurprisingly, the spatiotemporal R^{2} values are usually lower (worse) than their regular R^{2} counterparts.

Another key way of evaluating model performance, especially when dealing with spatial and/or temporal data, is to visualize the output of the model using a relevant plot. I have not included specific numerical results or plots here due to the in-progress status of this modeling project.

## Slice and Dice: Spatiotemporal Subsetting

Although fitting one huge (“global”) model to the eleven western states from 2008-2018 could be useful, from early on it became clear that smaller (“local”) models focusing on different geographic regions and time periods would yield much higher accuracy. Thus far, I’ve experimented with models on geographic quarters of our study region, the four seasons, and combinations of these spatial and temporal subsets. Due to a high density of both air pollution and air quality monitors, summers in the southwestern quarter of our study region (which ends up being California plus some fractions of the surrounding states) make for the best modeling results. Other regions and seasons prove more challenging due to greater variability and lack of data. However, typically the more data we have, the better the algorithm can learn significant patterns, so training on more data (once we’ve finished processing everything) will likely improve the results in other spatiotemporal subsets.

## Kriging on the Residuals

Another way that I attempted to improve our air pollution predictions and understanding of spatiotemporal trends in the data was to use kriging on the residuals (differences between predicted and true PM_{2.5} concentrations) from our machine learning model. I will attempt here to provide an overview of this method.

Kriging is a spatial interpolation technique which allows us to model a Gaussian process with a previously-determined covariance function. *Covariance* in this setting is a measure of how correlated data values are depending on how close together the observations are in space. Once we select a form for the covariance function (exponential, Matérn, etc), we can estimate the parameters of the covariance function (range, smoothness, nugget effect, etc) by using either an empirical variogram or the method of maximum likelihood (the most common approach).

**Figure 3: Generic example of spatial kriging.**

Performing this maximum likelihood estimation had two theoretical uses. First, by estimating the parameters of a covariance function over time, I could identify temporal changes in patterning of the residuals from the machine learning model. Second, I could use the resulting covariance functions (with optimized parameters) to interpolate the residuals from the machine learning model, and further improve our predictions for air pollution concentrations in locations where we don’t have monitoring data.

To experiment with these spatial modeling techniques, I performed maximum likelihood estimation on the residuals from the Ranger (random forest) model, pooled at daily, biweekly, weekly, bimonthly, monthly, and seasonal intervals. (Note: for the kriging experiment, I only used data from 2009 and 2010 in the southwestern quarter of the study region—centered over California.) Unsurprisingly, there were clear visual patterns in the plots of the maximum likelihood parameters estimated on residuals of the machine learning model pooled at seasonal, monthly, and bi-monthly intervals. The overarching conclusion I drew is that it is much harder to estimate air pollution in this region during the winter than the summer. In other words, the residuals are less predictable during the winter. The plots of the maximum likelihood parameters estimated on residuals of the machine learning model pooled at shorter timespans (weekly, biweekly, daily) did not display clear patterning.

After using these parameters to krig (interpolate) on a held-out set of testing residuals (I split the machine learning test set residuals 60-40 for training and testing the kriging model), I calculated the RMSE and R^{2} for each model. Based on the median RMSE, kriging on seasonal, monthly and bi-monthly subsets of the data did not help (lower the RMSE compared to the random forest model without kriging). Kriging on weekly and biweekly subsets of the data helped some, but I obtained the lowest RMSE values by kriging for each day separately. I observed a similar, but more extreme, trend in the R^{2} values: improving as the temporal timespan decreased. Based on the median R^{2}, only the daily kriging improved the predictions compared to the random forest model, and only increased the R^{2} by 0.01.

## Next Steps

Although I was able to confirm the existence of temporal patterning in the residuals from the random forest model, my spatial parameter estimation and kriging efforts did not do much to improve the model. In future, instead of doing all this extra work, I plan to try incorporating more explicit spatial and temporal variables into the machine learning model. For example, including categorical variables indicating the US state, the month, or others such as the cosine of the day of the year (scaled so that days 1 and 365 are close to each other) might help us to capture spatio-temporal patterns in the data.

Once my team and I have processed all the environmental data, I will be able to analyze data over our entire study region—the eleven western states from 2008-2018. With this final data set, I also intend to employ more stringent tuning of different algorithms’ parameters in the ensemble models (facilitated by **caretEnsemble**), which may help to improve the performance of the ensemble models relative to an individual Ranger (random forest) model.

## The Big Picture

In my experience (and others’, from what I read online), **caret** and **caretEnsemble** are extremely valuable tools for building machine learning models. Whether and how we extend these models—including running different models on different subsets of the data, spatially modeling the residuals, and incorporating extra spatial and temporal variables—depends on the characteristics of the data, the purpose of the model, and the modeler’s discretion. In any case, I’ve found that thinking through (and experimenting with) many variations in model structure and ways of evaluating my models has helped me gain much deeper understanding of these tools.