Coastal marine ecosystems are of vital interest to the U.S. economy. The California Current System (CCS) provides an ideal natural laboratory for integrating satellite remote sensing, subsurface measurements, and numerical modeling to study physical ecosystem dynamics and the subsequent dynamic biogeography of the marine realm.
Satellite data are generally easier and cheaper to acquire than sonar data. Fisheries can use satellite data to estimate biomass (the amount of fish in a given area of the ocean), which helps them plan fishing routes and supply stocks. Satellite data can also support long-term studies of marine populations as indicators of climate.
Satellite and marine acoustic measurements of the water column can be analyzed together to provide insight into what is happening in a vertical slice of the ocean. I explored the potential for predicting marine acoustic (sonar) data that describe the biomass within the vertical ocean structure. This method uses remotely sensed observations and derived products from satellite measurements along the CCS.
Satellite data typically only provide information on the surface of the ocean, which means only the surface manifestations of complex dynamic and biological processes are available for study. Thus, diagnosing the mechanisms responsible for temporal and spatial variability in a coastal ecosystem, including that of primary productivity, cannot be achieved through the analysis of satellite data alone.
The satellite data can be used to analyze a high-resolution atmosphere-ocean model for periods of time when satellite data and subsurface water column sonar data are available together. This method helps to uncover the physical processes (e.g., upwelling, mesoscale eddies, and mixed layer depth variations) that determine the surface expression (i.e., what can be seen through the satellite imagery) and vertical structure (i.e., what can be seen through sonar) of ocean productivity and biomass.
The CCS is located along the western coast of North America, as depicted below:
Fig 1. Location of CCS on world map of NASA OceanColor average SST for July 2013
To determine whether satellite data can serve as a proxy (or predictor) for the subsurface distribution of marine organisms, I leveraged both water-column sonar data collected in the CCS between May and August 2013 by the NOAA Northwest Fisheries Science Center and satellite data measured by the MODIS instrument onboard the Terra satellite. These data are archived and made available by the NOAA National Centers for Environmental Information and by NASA OceanColor, respectively. The CCS location was chosen based on the availability of marine sonar data in that region.
Using this large-scale dataset, I am evaluating the variability of the marine biogeography in the CCS. The integrated returned energy between two depths in the water column (the nautical area scattering coefficient, NASC) enables us to identify patterns of acoustic reflectance of marine organisms, in other words how they appear in sonar images, over horizontal lengths and vertical depths within the ocean.
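As a rough illustration of the integration described above, the sketch below computes NASC for one depth bin from a profile of volume backscattering strength (Sv, in dB re 1 m^-1). It assumes the standard fisheries-acoustics definition, NASC = 4π · 1852² · ∫ s_v dz with s_v = 10^(Sv/10). The function name, the sample profile, the trapezoidal rule, and the base-10 logarithm are illustrative assumptions, not the processing actually applied to the ship data.

```python
import math

NMI_IN_METERS = 1852.0  # one nautical mile, in meters

def nasc_from_sv(depths_m, sv_db, top_m, bottom_m):
    """Integrate backscatter between two depths and return NASC (m^2 nmi^-2).

    depths_m : sample depths in meters, increasing (hypothetical input layout)
    sv_db    : volume backscattering strength Sv at each depth, in dB re 1 m^-1
    """
    # Keep only the samples inside the requested depth bin,
    # converting Sv from dB to linear units (s_v) along the way.
    samples = [(z, 10.0 ** (sv / 10.0))
               for z, sv in zip(depths_m, sv_db)
               if top_m <= z <= bottom_m]
    # Trapezoidal integration of linear s_v over depth.
    integral = sum(0.5 * (s0 + s1) * (z1 - z0)
                   for (z0, s0), (z1, s1) in zip(samples, samples[1:]))
    return 4.0 * math.pi * NMI_IN_METERS ** 2 * integral

# Example: a uniform -70 dB water column sampled every 10 m, 0-250 m bin.
depths = [10.0 * i for i in range(76)]   # 0, 10, ..., 750 m
sv = [-70.0] * len(depths)
nasc = nasc_from_sv(depths, sv, 0.0, 250.0)
log_nasc = math.log10(nasc)              # log base is an assumption
```

Calling the same function with (250.0, 500.0) or (500.0, 750.0) would give the values for the deeper bins.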
An analysis of the year-by-year variability of the distribution of surface chlorophyll concentration from satellite ocean color measurements and variations in the subsurface NASC measurements illustrates a correlation between the parameters for upwelling regions of the CCS. The extent to which surface chlorophyll concentrations measured from satellite can serve as a proxy for the subsurface distribution of zooplankton and/or fish is discussed later. I am also analyzing the influence of other satellite measurements and derived products, such as sea surface temperature and distance from shore.
The Sonar Data
The water column sonar data used for this project were collected by the NOAA National Marine Fisheries Service (NMFS) Northwest Fisheries Science Center in summer 2013. The sonar data are separated into five frequencies (18 kHz, 38 kHz, 70 kHz, 120 kHz, and 200 kHz) and three depth bins (0-250 m, 250-500 m, 500-750 m).
The initial stages of this project involved removing noise from the sonar data so that we could have more accurate values of proxy biomass (the noise adds false values of biomass). A future step will include analyzing how much this noise detracts from the performance of the model.
The NASC values serve as our proxy biomass measurements. The images below show how the NASC measurements calculated from the 18 kHz sonar frequency for two different depths are improved after removing noise:
Fig 2. NASC values calculated for individual frequencies at certain depth bins. The quality-controlled images show significant decreases in NASC values, particularly along Vancouver Island.
The following graphic illustrates the path of the NOAA ship Shimada that collected the sonar data:
Fig 3. The path of the NOAA ship Shimada that retrieved the sonar data between May and August 2013.
The Satellite Data
The satellite data for this project were collected by the MODIS instrument onboard the Terra satellite, and the data are available for public download from NASA OceanColor. Throughout this project, I experimented with the different spatial resolutions and input variables available.
The initial stages of this project used a 9 km grid spatial resolution and focused solely on incorporating variables of latitude, longitude, daytime sea surface temperature (SST), and chlorophyll-a. Once the analytical models were improved, however, I decided to use a 4 km grid spatial resolution to better match the sonar data spatial resolution.
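Pairing each sonar record with a satellite value requires finding the grid cell closest to the ship position. The sketch below shows one simple way to do that on a regular latitude/longitude grid, assuming cell-center coordinate arrays like those in a mapped product (a 4 km global grid corresponds to roughly 1/24 of a degree). The grid extent, array names, ship position, and synthetic SST field are placeholders, not the project's actual files.

```python
import numpy as np

def nearest_cell(lat_centers, lon_centers, lat, lon):
    """Return (row, col) of the grid cell whose center is closest to a point."""
    row = int(np.abs(lat_centers - lat).argmin())
    col = int(np.abs(lon_centers - lon).argmin())
    return row, col

# Hypothetical regular grid covering part of the CCS.
step = 1.0 / 24.0
lat_centers = np.arange(50.0, 30.0, -step) - step / 2     # north to south
lon_centers = np.arange(-135.0, -115.0, step) + step / 2  # west to east

# Synthetic SST field, for illustration only: warmer toward the south.
sst_by_lat = 30.0 - 0.3 * (lat_centers - 30.0)
sst = np.tile(sst_by_lat[:, None], (1, lon_centers.size))

# Look up the satellite value for one hypothetical ship position.
row, col = nearest_cell(lat_centers, lon_centers, lat=44.6, lon=-124.5)
sst_at_ship = sst[row, col]
```

A real workflow would also need to match observations in time and handle cells masked by cloud cover, which this sketch ignores.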
Additionally, I am now looking at 19 different variables available from the MODIS instrument. Using additional variables adds to the complexity of the model, and can help determine whether satellite data improve the characterization of marine biomass. How much each of these variables influences the model is discussed later.
Data Analysis Using a Neural Network
I used several machine learning techniques to analyze potential relationships between our satellite and sonar data sets. The first of these was a neural network. A neural network is a machine learning model that makes a decision by weighing given evidence, and it can be used for both regression and classification. Regression techniques were used for the purposes of this project.
To prepare data for a neural network, the data must be split into training, testing, and validation sets. The network is first trained on one part of the data, called the training set. The testing data are then used to check how well the network identified patterns and relationships within the data, and the validation data are used to verify the results.
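The three-way split can be sketched with scikit-learn's train_test_split applied twice. The 70/15/15 proportions, the synthetic arrays, and the random seed below are assumptions for illustration; the split fractions used in this project are not stated here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))   # stand-in for satellite inputs (e.g. lat, lon, SST, chlorophyll-a)
y = rng.normal(size=1000)        # stand-in for log NASC

# First hold out 30% of the rows, then split that held-out portion in half.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=42)

print(len(X_train), len(X_test), len(X_val))
```

Fixing random_state makes the split repeatable, so every model sees the same rows.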
The following graphic is a visual depiction of a general neural network:
Fig 4. A visualization of a general neural network structure. Input variables are represented by bubbles on the left, and columns of bubbles in the middle represent hidden layers, consisting of hidden nodes. The connecting lines indicate weights (the parameters the network learns). The final output bubble on the right is the value the network predicts from the input variables. A neural network is a supervised machine learning method. Source: https://towardsdatascience.com/applied-deep-learning-part-1-artificial-neural-networks-d7834f67a4f6 .
I experimented with different neural network frameworks for this analysis, including PyTorch and the scikit-learn Multi-layer Perceptron (MLP) Regressor. For our particular data, scikit-learn was easier to work with and performed better on withheld validation data. I set up the neural network with five hidden layers, with the number of nodes decreasing from 50 in the first layer to 10 in the last. This architecture performed the best when analyzing the data.
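A minimal sketch of such a network with scikit-learn's MLPRegressor follows. The text above gives five hidden layers decreasing from 50 to 10 nodes; the intermediate sizes (40, 30, 20), the input scaling step, the iteration limit, and the synthetic data are assumptions made for the example.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 4))   # stand-in for satellite inputs
y = X[:, 0] - 2.0 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)  # stand-in for log NASC

mlp = MLPRegressor(
    hidden_layer_sizes=(50, 40, 30, 20, 10),  # five hidden layers; 40/30/20 are assumed
    random_state=42,                          # fixed seed for reproducibility
    max_iter=500,
)
# Scaling the inputs first is a common choice for MLPs, assumed here.
model = make_pipeline(StandardScaler(), mlp)
model.fit(X, y)
predicted = model.predict(X[:5])
```

In practice the model would be fit on the training set only and scored on the withheld data.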
The plot below shows how well the neural network performed when comparing the values it predicted for biomass (the predicted log NASC values) against the observed measurements for biomass (the observed log NASC measurements):
Fig 5. This plot shows the performance of neural network models using the test data. Each box is a different sonar frequency and depth bin (see sonar data section). The observed proxy biomass measurements (log NASC) are plotted against the predicted proxy biomass values (log NASC) along a one-to-one line. Mean squared error (MSE) is computed for each model.
The models with shorter wavelengths and shallower depth bins tend to perform better than those with longer wavelengths and deeper depth bins. This is evident visually in the plots and also in the mean squared error (MSE) calculations. MSE is the average of the squared differences between the predicted and observed values, so lower values indicate a better fit. The lowest calculated MSE is 0.90, and the highest calculated MSE is 9.93. Some of the longer wavelengths and deeper depth bins contain more sonar noise. Additionally, satellite measurements capture only surface ocean conditions, which can reflect subsurface life. However, conditions at great depths are less likely to significantly influence the ocean surface measurements. Therefore, I do not expect to find high correlations between satellite and sonar measurements at great depths or longer wavelengths.
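The MSE calculation itself is short. The sketch below uses made-up numbers (not values from this study), and the function and variable names are illustrative.

```python
def mean_squared_error(observed, predicted):
    """Average of squared differences between paired observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

# Made-up log NASC values, for illustration only.
observed_log_nasc = [2.0, 3.5, 1.0, 4.0]
predicted_log_nasc = [2.5, 3.0, 1.0, 3.0]
mse = mean_squared_error(observed_log_nasc, predicted_log_nasc)
```

Because the errors are squared, a few large misses raise the MSE much more than many small ones.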
The graph below shows how the different satellite variables rank in importance for the overall results:
Fig 6. For each model of different sonar frequencies and depth bins, the importance of the input variables is ranked. The list of which variables are most important changes based on the frequency and the depth.
Data Analysis Using Random Forest
The second machine learning technique that I used to analyze this data was random forest. A random forest is built from many decision trees. A decision tree is another machine learning technique: it repeatedly splits the data into branches, choosing each split to reduce an error measure such as root mean squared error (RMSE). Random forest averages the predictions of all of its trees, and it tends to be more accurate than a single decision tree on its own.
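The averaging can be checked directly in scikit-learn, where a fitted RandomForestRegressor exposes its individual trees through estimators_. The data, tree count, and seed below are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 3))   # stand-in for satellite inputs
y = np.sin(3.0 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Predict with every tree separately, then average across trees.
per_tree = np.stack([tree.predict(X[:10]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction for the same rows.
forest_prediction = forest.predict(X[:10])
```

The two arrays agree to floating-point precision, which is the sense in which the forest is an average of its trees.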
The following is an example of a random forest setup:
Fig 7. A visualization of how a random forest algorithm operates. Random forest is the average from many different decision trees. It is a supervised machine learning method. Source: https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd .
To provide a point of comparison for the neural network’s performance, I also used random forest to analyze our data. The plot below shows how well the random forest algorithm performed when comparing the values it predicted for biomass (the predicted log NASC values) against the observed measurements for biomass (the observed log NASC measurements):
Fig 8. This plot shows the performance of the random forest models using the test data. Each box is a different sonar frequency and depth bin (see sonar data section). The observed proxy biomass measurements (log NASC) are plotted against the predicted proxy biomass values (log NASC) along a one-to-one line. Mean squared error (MSE) is computed for each model.
As with the neural network, the models with shorter wavelengths and shallower depth bins tend to perform better than those with longer wavelengths and deeper depth bins, which is evident in both the plots and the mean squared error (MSE) calculations. Additionally, the MSE improved for the random forest algorithm. The lowest calculated MSE is 0.70, and the highest calculated MSE is 9.06.
The graph below shows how the different satellite variables rank in importance for the overall results:
Fig 9. For each model of different sonar frequencies and depth bins, the importance of the input variables is ranked. The list of which variables are most important changes based on the frequency and the depth.
Comparison of Random Forest and a Neural Network
Using both the neural network (Multi-layer Perceptron) regression and the random forest regression, I demonstrated that remotely sensed ocean observations can accurately predict subsurface structural properties derived from sonar data, particularly for shallow depths and short sonar frequencies. After comparing the performance of the neural network models against that of the random forest models on our data, I found that the random forest method outperformed the neural network for nearly all models, across sonar frequencies and depth bins.
The plot below outlines how each model performed, facilitating comparison of the neural network and random forest algorithms:
Fig 10. Top Left: Mean squared error (MSE) for all models as calculated by a neural network and by a random forest method. The random forest outperforms the neural network for nearly all model combinations of sonar frequency and depth.
Bottom Right: A focus on the cluster of models seen in the lower left-hand corner of the top graphic. Most of these models include frequencies in the first two depth bins (0-250 m and 250-500 m).
When running these models, the same random state was set for reproducibility of results. For nearly every combination of sonar frequency and depth bin, random forest performs better than the neural network. Given these results, I will use random forest for future analysis of our data.
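A comparison of this kind can be sketched as a loop over every frequency and depth-bin combination, fitting both models with one shared random state and recording the test MSE of each. Everything below (the synthetic data, the make_dataset helper, the intermediate layer sizes, and the other model settings) is a hypothetical stand-in rather than the project's pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

FREQUENCIES_KHZ = [18, 38, 70, 120, 200]
DEPTH_BINS_M = [(0, 250), (250, 500), (500, 750)]
SEED = 42  # one shared random state for the splits and both models

def make_dataset(rng, n=200):
    """Hypothetical stand-in for the matched satellite/sonar table of one model."""
    X = rng.uniform(-1.0, 1.0, size=(n, 4))
    y = X[:, 0] + np.sin(3.0 * X[:, 1]) + rng.normal(scale=0.2, size=n)
    return X, y

rng = np.random.default_rng(SEED)
results = {}
for freq in FREQUENCIES_KHZ:
    for depth_bin in DEPTH_BINS_M:
        X, y = make_dataset(rng)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=SEED)
        nn = MLPRegressor(hidden_layer_sizes=(50, 40, 30, 20, 10),
                          max_iter=200, random_state=SEED).fit(X_tr, y_tr)
        rf = RandomForestRegressor(n_estimators=50,
                                   random_state=SEED).fit(X_tr, y_tr)
        results[(freq, depth_bin)] = {
            "nn_mse": mean_squared_error(y_te, nn.predict(X_te)),
            "rf_mse": mean_squared_error(y_te, rf.predict(X_te)),
        }

# Count the combinations where random forest has the lower test MSE.
rf_wins = sum(r["rf_mse"] < r["nn_mse"] for r in results.values())
```

On synthetic data like this, which model wins says nothing about the real result; the sketch only shows the bookkeeping behind a per-model comparison.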
Because random forest performed better than the neural network, I used it to look at overall variable importance. In Figure 9, the importance of each variable in influencing how biomass is predicted is shown for all frequencies and depths. To get a better idea of which variables are the most important across all models, I calculated the ‘average’ importance for each variable, and this is visualized below:
Fig 11. The variables that contributed the most to our prediction of our proxy biomass (or log NASC values) were latitude, depth_m, longitude, sea surface temperature, time of day, and distance from shore. These are the most important variables across all sonar frequencies and depth bins.
The variables that had the greatest impact on predicting biomass were latitude, depth, longitude, day, sea surface temperature (SST), distance from shore, and fluorescence line height (a satellite measurement of chlorophyll fluorescence, which is related to chlorophyll concentration).
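Averaging importances across models can be sketched as follows, using the feature_importances_ attribute that scikit-learn's random forest exposes after fitting. The variable names, the three synthetic models, and the data are placeholders; only the averaging step mirrors the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

feature_names = ["latitude", "longitude", "sst", "chlor_a"]  # placeholder subset of inputs
rng = np.random.default_rng(0)

importances = []
for model_index in range(3):   # stand-in for the frequency/depth models
    X = rng.uniform(-1.0, 1.0, size=(300, len(feature_names)))
    # Synthetic target that depends mostly on the first column.
    y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=300)
    forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)
    importances.append(forest.feature_importances_)  # one row per model, sums to 1

# Average each variable's importance across models, then rank.
mean_importance = np.mean(importances, axis=0)
ranking = sorted(zip(feature_names, mean_importance),
                 key=lambda item: item[1], reverse=True)
```

These impurity-based importances are relative within each model, so the average is best read as a ranking aid rather than an absolute measure of effect.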
Although I expected SST and chlorophyll to have the largest influence on biomass in the CCS, my preliminary analysis suggests that other satellite variables can better predict where marine biomass is likely to exist. However, SST and chlorophyll are still influential factors that add to the overall value of the models.
Future Work
Adding another year of noisy sonar data to test whether noise needs to be removed
Adding more delineations in time of day and depth for the data
Looking at anomalies for different variables to see how values compare across decades
1 Cooperative Institute for Research in Environmental Sciences, University of Colorado Boulder
2 Department of Atmospheric and Oceanic Sciences, University of Colorado Boulder
3 NOAA National Centers for Environmental Information, Boulder, CO
4 Earth Lab Analytics Hub, University of Colorado Boulder