Machine learning algorithms improve MODIS GPP estimates in United States croplands

Menefee, Dorothy; Lee, Trey O.; Flynn, K. Colton; Chen, Jiquan; Abraha, Michael; Baker, John; Suyker, Andy

doi:10.3389/frsen.2023.1240895

ORIGINAL RESEARCH article

Front. Remote Sens., 02 November 2023

Sec. Image Analysis and Classification

Volume 4 - 2023 | https://doi.org/10.3389/frsen.2023.1240895

This article is part of the Research TopicAdvancements in Agricultural Monitoring with AI Enhanced Remote Sensing TechniquesView all 4 articles

Machine learning algorithms improve MODIS GPP estimates in United States croplands

Dorothy Menefee¹*

Trey O. Lee¹

K. Colton Flynn¹

Jiquan Chen²

Michael Abraha²

John Baker³

Andy Suyker⁴

¹Grassland Soil and Water Research Laboratory, United States Department of Agriculture - Agricultural Research Service, Temple, TX, United States
²Landscape Ecology and Ecosystem Science (LEES), Michigan State University, East Lansing, MI, United States
³Soil and Water Management Research, United States Department of Agriculture - Agricultural Research Service, St. Paul, MN, United States
⁴Institute of Agriculture and Natural Resources, University of Nebraska-Lincoln, Lincoln, NE, United States

Introduction: Machine learning methods combined with satellite imagery have the potential to improve estimates of carbon uptake of terrestrial ecosystems, including croplands. Studying carbon uptake patterns across the U.S. using research networks, like the Long-Term Agroecosystem Research (LTAR) network, can allow for the study of broader trends in crop productivity and sustainability.

Methods: In this study, gross primary productivity (GPP) estimates from the Moderate Resolution Imaging Spectroradiometer (MODIS) for three LTAR cropland sites were integrated for use in a machine learning modeling effort. They are Kellogg Biological Station (KBS, 2 towers and 20 site-years), Upper Mississippi River Basin (UMRB - Rosemount, 1 tower and 12 site-years), and Platte River High Plains Aquifer (PRHPA, 3 towers and 52 site-years). All sites were planted to maize (Zea mays L.) and soybean (Glycine max L.). The MODIS GPP product was initially compared to in-situ measurements from Eddy Covariance (EC) instruments at each site and then to all sites combined. Next, machine learning algorithms were used to create refined GPP estimates using air temperature, precipitation, crop type (maize or soybean), agroecosystem, and the MODIS GPP product as inputs. The AutoML program in the h2o package tested a variety of individual and combined algorithms, including Gradient Boosting Machines (GBM), eXtreme Gradient Boosting Models (XGBoost), and Stacked Ensemble.

Results and discussion: The coefficient of determination (r²) of the raw comparison (MODIS GPP to EC GPP) was 0.38, prior to machine learning model incorporation. The optimal model for simulating GPP across all sites was a Stacked Ensemble type with a validated r² value of 0.87, RMSE of 2.62 units, and MAE of 1.59. The machine learning methodology was able to successfully simulate GPP across three agroecosystems and two crops.

1 Introduction

The use of satellite-derived estimates of ecosystem productivity have become somewhat commonplace in ecosystem and agricultural sciences (Huang et al., 2018; Smith et al., 2019; Ai et al., 2020). Estimating plant growth has utility in a wide variety of ecological and agricultural applications, including carbon uptake estimates, yield forecasting, detection of plant pathologies, and detecting ecosystem changes (Steven, 1993; Kerr and Ostrovsky, 2003; Pettorelli et al., 2017). These estimates generally take advantage of the unique way that photosynthesizing plants reflect near infrared radiation (NIR), which can be easily detected with satellite and aerial sensors (Badgley et al., 2017; Baldocchi et al., 2020). Large networks of sites, such as the Long-Term Agroecosystem Research (LTAR) network, provide unique opportunities to analyze plant productivity across multiple collaborative sites over a long period of time, allowing for a better understanding of large-scale spatio-temporal trends. The LTAR network is a collaboration between 18 long-term agricultural research sites across the United States established by the United States Department of Agriculture (USDA) Agricultural Research Service (ARS) and collaborative land-grant universities. The overarching mission of the LTAR network is to provide sustainable solutions for food and fiber production that are currently facing challenges associated with changing climate and increasing resource demands. The LTAR network has been increasingly turning to technological solutions, including remote sensing, to serve as a large-scale indicator (Spiegal et al., 2018; Browning et al., 2021) and solve its pressing questions regarding agricultural sustainability (Kleinman et al., 2018; Boughton et al., 2021; Goodrich et al., 2021). The LTAR network includes a wide range of cropping systems, management practices, and land use histories. Studying the interactions of cropping system and management with carbon flux can be useful when determining best management practices in a variety of systems.

One commonly used satellite output is the Moderate Resolution Imaging Spectroradiometer (MODIS) Gross Primary Productivity (GPP) product. This output provides a measure of total carbon uptake via photosynthesis (GPP)—a major component of the carbon cycle in terrestrial ecosystems. MODIS is a passive sensing instrument aboard NASA’s Terra and Aqua satellites that collects spectral data among 36 bands with a temporal resolution of 1–2 days. The MODIS GPP estimate is a pre-processed data product available via NASA-based platforms (Maccherone, 2021). The GPP product derived from MODIS data uses a light-use efficiency-based model that is modulated by biome type. MODIS classifies all cropland into a single cropland biome. This method relates GPP to the light-use efficiency of photosynthesizing plants and the availability of light. The method is common for estimating GPP using remote sensing from a wide range of sensors beyond MODIS (Reeves et al., 2005; Running and Zhao, 2015; Huang et al., 2021).

Remote sensing estimates of GPP, such as the MODIS GPP product, have a number of advantages compared to ground-based methods, including lower cost, ease of use, and ability to estimate GPP in regions where ground-based instruments are impractical. However, the MODIS GPP estimate is prone to underestimation due to uncertainties associated with assumptions used in the method, cloud cover, coarse resolution, and others. Including the classification of all cropland as a single biome type is largely due to the limited spatial resolution (Tuner et al., 2006; Sims et al., 2008; Huang et al., 2018). For instance, a vulnerability of the MODIS GPP product is that the default scalars used in the calculation of the maximum light-use efficiency are not well measured and are lacking a distinction between C₃ and C₄ photosynthetic pathways (Tuner et al., 2006; He et al., 2013; Xin et al., 2015; Huang et al., 2021). Moreover, reports of uncertainty are common among the photosynthetically active radiation (PAR) absorption calculations used by MODIS (He et al., 2013; Cheng et al., 2014). Many authors have succeeded in improving the GPP estimates (more in-line with ground truth data) by modifying the efficiency parameters and PAR input parameters (Sims et al., 2008; Gilabert et al., 2015; Huang et al., 2018; Huang et al., 2021). The process of improving satellite estimates of GPP requires reliable ground truth data from in-situ carbon flux measurements.

The most common ground based GPP estimation method is the eddy covariance (EC) method (Novick et al., 2018; Hermes et al., 2019; Baldocchi, 2020). The EC method uses two rapid-response (i.e., 10 Hz) instruments, an infrared gas analyzer (IRGA) that measures the concentration of the gas of interest (in this case, CO₂), and a sonic anemometer that measures the vertical wind speed. The covariance of the simultaneous measurements is gas flux, which in the case of CO₂ is the net ecosystem CO₂ exchange (NEE). GPP is then derived from net ecosystem exchange of CO₂ (NEE) using a variety of flux partitioning methods (Reichstein et al., 2005; Wutzler et al., 2018). This method is widely used to estimate GPP due to its continuous measurement style and accuracy; however, the method has several key drawbacks. The instruments needed for the EC method are expensive, need regular maintenance, and require large flat areas with uniform vegetation for optimal function. Despite these challenges, EC instruments provide a strong control and an in-situ estimation of GPP. EC data is widely available through collaborative research efforts, such as LTAR, and through data-sharing networks, such as AmeriFlux or FLUXNET (Pastorello et al., 2020; Bond-Lamberty, 2018).

The in-situ GPP data bridges the gap between the EC GPP estimates and the MODIS GPP products. There have been numerous successes in bridging this gap using linear regression both to modify parameters used in the MODIS algorithm and to modify GPP outputs (Wang et al., 2012; Fu et al., 2014; Xin et al., 2015). Xin et al. (2015) modified the light use efficiency term using linear regression and in-situ measurements to modify MODIS efficiency terms. Kang et al. (2005) improved MODIS outputs using a cloud correction algorithm. The gap could be more thoroughly overcome through the introduction of more advanced modeling methods. Many models have been developed using MODIS GPP and meteorological data, with machine learning algorithms becoming more common in recent years (Joiner and Yoshida, 2020; Jung et al., 2020; Yu et al., 2021). Machine learning is a method of modeling that uses data to train algorithms that describe the data and allow for the prediction of new data points. This method is increasingly being used for simulating CO₂ and other ecosystem gas fluxes (Yao et al., 2017; Knox et al., 2021; Reed et al., 2021; Shang et al., 2021; Talib et al., 2021). Yang et al. (2007) was able to improve MODIS GPP estimates using support vector machine learning. Similarly, Joiner and Yoshida (2020) estimated global GPP on a yearly time step using MODIS data and Neural Network modeling. Cui et al. (2021) used the support vector machine to improve gap filling and evapotranspiration estimates of EC data.

Here we construct a simple machine learning method by using EC data as a ground-truth (dependent) variable. Whereas the MODIS GPP alongside precipitation, temperature, crop, and agroecosystem serve as the independent variables. Using these variables and the AutoML machine learning function of the h2o package, the objectives of this study are to determine: 1) The feasibility of utilizing combined datasets across multiple LTAR sites to estimate GPP using machine learning algorithms, and 2) establish an estimation of GPP that can be used as part of a carbon balance proxy, to serve as a supporting indicator of the sustainability goals of the LTAR network.

2 Materials and methods

2.1 Site and EC data selection

The LTAR network has 18 sites of which 13 are solely or partly cropland sites (Figure 1). Despite the establishment of EC towers across the network, many are new, thus limiting the amount of data collected to date. Of the 13 LTAR cropland EC agroecoregions, three had enough EC data for use in machine learning algorithms. These three sites include the Kellogg Biological Station (KBS), the Platte River High Plains Aquifer (PRHPA), and the Upper Mississippi River Basin (UMRB) sites (Figure 1; Table 1). All of these sites are part of LTAR’s Common Experiment, where similar methods and management practices are used across multiple sites.

FIGURE 1

FIGURE 1. Spatial locations of the three agroecoregions used in this study, all within the Long-Term Agroecosystem Research (LTAR) network. Created using the LTAR network shapefile, published under CC0-1.0.

TABLE 1

TABLE 1. Information for the 13 LTAR eddy covariance (EC) flux measurements sites. AmeriFlux site ID is provided in parentheses.

The KBS LTAR site is located near Battle Creek, Michigan, United States (42.4376, −85.3287) and is operated as an LTAR site through a partnership between USDA-ARS and Michigan State University (Bean et al., 2021). The average annual temperature is 9.9°C and the average annual precipitation is 1,027 mm. Two EC towers were established in 2009 and remain in operation. Since 2009 had a different crop system compared to the rest of the years, 2009 data was not used in this study. The EC towers are in two different fields; one has been cropland since 1938 and the other was converted from Conservation Reserve Program (CRP) perennial grassland to cropland in 2009. Both sites are managed as no-till with continuous rainfed maize (Abraha et al., 2019; Robertson and Chen, 2021). Each EC tower is equipped with a LI-7500 IRGA (LI-COR Biosciences, Lincoln, NE, United States) and a CSAT3 sonic anemometer (Campbell Scientific, Logan, UT, United States). Air temperature and precipitation were determined with ancillary instruments on the EC tower. Data was processed using the EdiRe system (University of Edinburgh, Edinburgh, Scotland, United Kingdom). This processing included flagging low-quality data, performing corrections for sonic temperature and humidity, planar fit coordinate rotation, and corrections for air density. These are typical corrections used in the processing of Eddy Covariance data (Abraha et al., 2019; Burba, 2022).

The PRHPA site is located near Omaha, Nebraska, United States (41.1651, −96.4766) and is operated under a partnership between USDA-ARS and the University of Nebraska-Lincoln and is part of the Platte River High Plains Aquifer agroecoregion (Bean et al., 2021). The average annual temperature is 10.1°C and the average precipitation is 790 mm. Three EC towers were established in 2001 and remain in operation. The three EC towers are in three no-till fields that are operated with the following production cycles: 1) Irrigated continuous maize; 2) irrigated maize/soybean rotation; and 3) rainfed maize/soybean (Suyker, 2021a; Suyker, 2021b; Suyker, 2021c). Irrigation managements was performed with a center-pivot. LI-7200 IRGA (LI-COR Biosciences) and R3-100 sonic anemometer (Gill Instruments, Hampshire, United Kingdom) were used at the site. On-site raw processing was completed using custom code and included typical corrections for EC data as discussed previously.

The UMRB site is located near Minneapolis-St. Paul, Minnesota, United States (44.7143, −93.0898) and is operated under a partnership between USDA-ARS and the University of Minnesota and is part of the Upper Mississippi River Basin agroecoregion (Bean et al., 2021). The average annual temperature is 6.4°C and the average annual precipitation is 879 mm. Three of the EC towers (UMRB 1, UMRB 2, and UMRB 3) were established in 2003 and were dismantled in 2016 when the site was developed. A new EC tower (UMRB 4) was established in a nearby site in 2017 and is still in operation. All tower sites were managed as rainfed maize/soybean rotation with chisel plow tillage (Baker and Griffis, 2019; Baker and Griffis, 2021). LI-7500 IRGA and CSAT3 sonic anemometer were used at the site. Raw data were processed on-site using custom code prior to data sharing. Data processing involved standard corrections applied to EC data as previously discussed. Air temperature and precipitation were measured at all three sites (KBS, PRHPA and UMRB) with ancillary instruments on the EC towers.

Missing EC (15.7%) data due to power outages, instrument maintenance and failure, and unfavorable weather conditions were gap-filled using an online R-based tool, REddyProc (https://www.bgc-jena.mpg.de/5622399/REddyProc; Version 75, Jena, Germany). REddyProc uses a moving-window-based algorithm to fill gaps in EC data and is one of the widely used methods (Reichstein et al., 2005; Wutzler et al., 2018). An average gap of 15.7% is relatively low compared to other eddy covariance datasets (Falge et al., 2001; Hui et al., 2004; Moffat et al., 2007) NEE fluxes were partitioned into GPP and R_eco in REddyProc using a relationship between nighttime NEE and air temperature to estimate R_eco, assuming nighttime NEE fluxes are equal to R_eco. GPP was then calculated by adding R_eco to NEE (Lloyd and Taylor, 1994; Reichstein et al., 2005). EC GPP will be referred to in this paper as GPP_EC from here outwards.

2.2 MODIS data acquisition and processing

MODIS data were pulled from the MODIS/006/MOD17A2H collection (Running et al., 2015) through Google Earth Engine Code Editor (Gorelick et al., 2017). Quality control (QC) bits 5-7 provided a 5-level confidence quality score where “0” indicated the “very best possible” quality (e.g., absence of the clouds). Images with the value of “0” were used while the remaining scores were all masked. Once the desirable images were selected, they were saved to a list and exported to Google Drive. The imagery data were then subject to geospatial processing methods incorporating averaged zonal statistics representing the area of each individual field using ArcGIS Pro (ESRI, Redlands, CA, United States). The spatial resolution for this product is 1 km². The typical eddy covariance tower has a flux footprint (radius) around 150–200 m², giving it an effective spatial resolution that is similar to that of the MODIS product. Across the studied sites, the average EC flux location consisted of 1.05 MODIS pixels.

2.3 MODIS GPP algorithm

MODIS calculates GPP using a light-use efficiency-based model as follows:

{G P P}_{M O D I S} = A P A R * ε (1)

where APAR is the absorbed photosynthetically active radiation (PAR) and ε is the coefficient of radiation use efficiency (Reeves et al., 2005; Running and Zhao, 2015). ε is calculated using the maximum ε, and terms for water, temperature stress, and other environmental factors that come from the Biome Specific Parameters Look Up Table (BPLUT, https://www.ntsg.umt.edu/files/modis/MOD17UsersGuide2015_v3.pdf). For this dataset, cropland BPLUT was used (Running and Zhao, 2015; Huang et al., 2021). APAR is calculated by modifying incoming PAR using cloud cover, aerosol interference, leaf area, day length, and incident angle (Running and Zhao, 2015). As the MODIS product is an 8-day sum, a daily value was obtained by dividing the value by 8.

2.4 Machine learning model

The h2o package (LeDell et al., 2021) provides the AutoML function, which is an automated, supervised machine learning algorithm. It trains the model by utilizing a variety of other algorithm types such as: Gradient Boosting Machines (GBM), Generalized Learning Models (GLM), and eXtreme Gradient Boosting Models (XGBoost). AutoML uses three pre-specified XGBoost GBM models, a fixed grid of GLMs, a default Random Forest (DRF), five pre-specified H2O GBMs, a near-default Deep Neural Net, and Extremely Randomized Forest (XRT), a random grid of XGBoost GBMs, a random grid of H2O GBMs, and a random grid of Deep Neural Nets (H2O AutoML, 2022). AutoML includes stacked ensembles, which is a type of algorithm that additionally trains with a second-level meta-learner to find the best combination of the base learners (LeDell and Poirier, 2020). The models were trained to predict the daily GPP values based on the combination of MODIS-derived daily GPP (MODIS GPP), Julian day of year (DOY), air temperature, precipitation, agroecoregion, and crop. While MODIS includes biome types, the biomes used by MODIS are very broad (i.e., cropland, grassland, deciduous forest) and all sites in this study fall into the cropland biome, including location that allows for more specific ecoregions to be included. This model was limited to only maize and soybean due to limited data availability for other crop types. Several algorithm types such as Random Forest, k-nearest neighbor, and XGBoost were trained separately and as part of AutoML’s stacked ensemble in R (LeDell and Poirier, 2020). For reproducibility, parameters such as “max_models” and “max_runtime” were set to 500 and unlimited, respectively. This allowed AutoML to generate 500 models for each run with no limitations on time and then select the best-performing ones overall as well as within each algorithm type. The best models were identified by Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). It should be noted that XGBoost is not currently available on Windows, which is why MacOS was used in this step.

Cross-validation (CV) is used to understand how well a model will likely perform in an actual use-case scenario (Friedman et al., 2001). There are several available methods to use for CV, each having its own advantages and disadvantages (Arlot and Celisse, 2010). h2o AutoML uses K Fold CV by default, a method introduced by Geisser (1975), to provide a better estimate of how well the model will perform on new data. K Fold CV removes part of the training data and tests the model on the removed portion. The data is divided into k subsets, and then the training is repeated k times using a different combination of subsets each time. The accuracy metrics are averaged across all versions to produce the CV results. In our case, k = 5, which means that 80% of the data was used to predict on the remaining 20%. While the CV results can give a reasonably accurate view of model performance, we chose to keep some years separate from the training for an independent model validation. Only the validation results from this last step are utilized in this paper to provide a more accurate reflection of the models’ performance instead of the CV results that can sometimes appear inflated. The data were split three ways across entire years to ensure model robustness: 1) Models were trained on older years (2004–2015) and tested on recent years (2016–2019); 2) Models were trained on recent years (2006–2019) and tested on older years (2004–2005); and 3) Models were trained and tested on years selected at random (testing years: 2009, 2011, 2014, 2017). Due to the variability of observations between years, the number of years for each region were chosen to maintain as close to an 80:20 training/testing split as possible for a more realistic view of the model’s performance (see Saeb et al., 2017). A summary of all models run is included in the supplementary materials.

2.5 Statistical analyses

The statistical methods used for comparing modeled GPP to GPP_EC were coefficient of determination (r²; Eq. 2), Root Mean Square Error (RMSE; Eq. 3), and Mean Average Error (MAE; Eq. 4). R-squared measures the amount of variability in the predicted variable that can be explained by the model with values ranging from 0 to 1. RMSE is the sum of the square of prediction error for each observation. MAE is the sum of the absolute value of error. We used r² and RMSE to easily compare across models, while MAE was used for ease of communicability to the general public because it is in a format that is understood by the average farmer (i.e., gC m⁻² day⁻¹ as opposed to a root/squared value or something between 0 and 1 that may not be comparable across fields). These indices were calculated as follows:

r^{2} = 1 - \frac{\sum {(y_{i} - \hat{y})}^{2}}{\sum {(y_{i} - \bar{y})}^{2}} (2)

R M S E = \leq \sqrt{} \frac{1}{n} \sum {(y_{i} - \hat{y})}^{2} (3)

M A E = \frac{1}{n} \sum |y_{i} - \hat{y}| (4)

where n is the number of data points, ŷ is the actual value (GPP_EC), y_i is the predicted value (modeled GPP), and ȳ is the mean value. Linear regression of the GPP_MODIS and GPP_EC was performed using SigmaPlot (Version 14.0, Systat Software, Berkshire, United Kingdom). Differences in model success between management practice and sites were determined by comparing MSE and MAE values.

3 Results and discussion

3.1 Comparison of GPP_MODIS to GPP_EC

The linear regressions between the GPP_MODIS (prior to modeling) and GPP_EC vary greatly among sites (Figure 2). The strongest correlation was found at the PRPHA site with maize; however, this could possibly be an artifact of greater data availability (17 years per EC tower versus 9 and 12 at KBS and UMRB, respectively) for that site/crop combination. Larger datasets, with increased data available for training, are associated with lower error than smaller datasets (e.g., Faber et al., 2016; Schmidt et al., 2017; Zhang and Ling, 2018). Across all sites, GPP_MODIS has a tendency to underestimate GPP during the peak growing season and overestimate GPP in the seedling establishment and senescence phases. Underestimation was most pronounced at the peak of the growing season, often by > 100 gC per 8-day measurement period. Overestimation was most pronounced in the late spring just at the very beginning of the growing season with overestimations of around 50–60 gC per 8-day measurement period being common. Tuner et al. (2006) reported that GPP_MODIS product tended to overestimate during low productivity and underestimate during high productivity (i.e., peak growing season) across multiple biomes including croplands. Other studies also found significant underestimation of GPP_EC by the raw GPP_MODIS product, often with similar magnitudes to our results (Wang et al., 2012; Tang et al., 2015; Huang et al., 2018).

FIGURE 2

FIGURE 2. Relationship between the raw Moderate Resolution Imaging Spectroradiometer (MODIS) gross primary productivity (GPP) and eddy covariance GPP using linear regression at each site. The best fit lines use the same color scheme as the points that the regression is showing the relationship of. All slopes are significant to p < 0.0001. Platte River High Plains Aquifer (PRHPA) had 1,007 and 299, observations for irrigated and rainfed maize, respectively and 285 and 344 for irrigated and rainfed soy, respectively. Kellogg Biological Station (KBS) had 342 and 355 observations for AGR and CRP, respectively. UMRB had 303 and 349 observations for maize and soy, respectively.

A best fit line between GPP_EC and GPP_MODIS was determined for all sites as one dataset using linear regression (Figure 3). The r² of the relationship with all sites pooled is not different from findings of sites being analyzed individually, potentially indicating a universal model for simulating GPP for all sites (i.e., toward a much greater utility than a model that works for one site alone). There are differences in the correlation between GPP_EC and GPP_MODIS between the maize and soybean crops, with a stronger correlation in soybean sites. The efficiency parameter in MODIS is based on C₃ photosynthesis, which has been shown to lead to greater underestimation of GPP in C₄-dominated systems (Wang et al., 2012; Running and Zhao, 2015; Huang et al., 2021). There were also differences between sites in the efficacy of the MODIS GPP at directly estimating crop GPP, which may be related to the coefficients used in the estimation. The coefficients (i.e., light use efficiency (ε), temperature response, vapor pressure response, etc.) used by MODIS were the same for all three sites, regardless of differences in climate (Running and Zhao, 2019).

FIGURE 3

FIGURE 3. Linear regression of the raw Moderate Resolution Imaging Spectroradiometer (MODIS) gross primary productivity (GPP) and eddy covariance GPP at all sites as 8-day sums (A) for maize (red solid) and soybean (blue dashed), and (B) for both maize and soybean in one dataset with the best fit line in red. All slopes are significant to p < 0.0001. The total number of datapoints was 3,030 (2,186 for maize and 844 for soy).

3.2 Model results

All agroecoregion/crop combinations were analyzed first, individually, using the three data training/testing splits (Table 2). To re-iterate, all observations classified as the testing datasets were not included in the training datasets. Supervised machine learning has an unavoidable issue of overfitting, mostly due to the limits of training data or the constraints of algorithms that are too complicated and require an abundance of parameters (Ying, 2019). Training/validation data were kept separate in an effort to minimize overfitting, with only the validation results communicated here. At PRHPA, the maize models had an average validation r² of 0.88, RMSE of 2.82 gC m⁻² day⁻¹, and MAE of 1.71 gC m⁻² day⁻¹; for soybean the average validation r² was 0.78, RMSE was 2.61 gC m⁻² day⁻¹, and MAE was 1.73 gC m⁻² day⁻¹. For both crops combined at PRHPA the model success was similar to single crop models with an average r² of 0.85, RMSE of 2.93 gC m⁻² day⁻¹, and MAE of 1.77 gC m⁻² day⁻¹. At UMRB, the maize only models had an average r² of 0.86, a RMSE of 3.06 gC m⁻² day⁻¹, and a MAE of 2.0 gC m⁻² day⁻¹; with soybean only models the average r² was 0.76, RMSE was 2.06 gC m⁻² day⁻¹, and MAE was 1.38 gC m⁻² day⁻¹. For both crops combined at UMRB the model success was similar to that of single crop models with an average r² of 0.84, an RMSE of 2.6 gC m^-2 day^-1, and an MAE of 1.64 gC m^-2 day^-1. At KBS only maize was grown, and the models had an average validation r² of 0.77, RMSE of 3.14 gC m^-2 day^-1, and MAE of 1.95 gC m^-2 day^-1. All best-fit models were Stacked Ensemble-type models, likely due to their second-level learning approach by utilizing what they learn from the base learners to inform the meta-algorithm or super learner.

TABLE 2

TABLE 2. The average results for machine learning modeling of individual agroecosystem GPP are shown below. Average is across the three temporal validation data splits. Units are gC m^-2 day^-1.

In both agroecosystems with maize and soy, the model success with both crops combined into a single dataset was similar to that of single crop models, indicating that this method can be used with multiple crops with different growth patterns and photosynthetic pathways in the same model. Model validation RMSE ranged from 2.06 to 3.14 gC m^-2 day^-1 depending on location and crop. Given that the typical maximum daily GPP_EC was around 25–30 gC m^-2, the observed error was considerably less than a day’s carbon update and was well within the range of expected values from other studies. Other remote sensing GPP modeling efforts (primarily linear regression) reported daily RMSE values ranging from 2.6 gC m^-2 (Nguy-Robertson et al., 2015), 1.9 to 12.1 gC m^-2 depending on efficiency term (Cheng et al., 2014), 0.5 to 2.0 gC m^-2 depending on ecosystem and methodology (Gilabert et al., 2015), 3.8 gC m^-2 (He et al., 2013), and 0.8 to 7.5 gC m^-2 depending on site and methodology (Huang et al., 2021). The error range found here is well within the range reported by other studies, showing that this method is suitable for simulating cropland GPP.

3.3 Comparison of all site data

The models were then run with all site data combined into a single dataset (Table 3; Figure 4). When all agroecoregions and crops were combined into one dataset, the model validation r² was 0.85, RMSE was 2.77 gC m^-2 day^-1, and MAE was 1.67 gC m^-2 day^-1. As a control, a model set where the only input was MODIS GPP was also created; this model was much weaker (r²: 0.52), showing that the addition of other variables (i.e., climate, location, crop) greatly improved model success. The error shown with the single dataset was within the range seen in individual datasets and similar those seen in other studies (Guo et al., 2023; He et al., 2013; Nguy-Robertson et al., 2015; Reed et el., 2021). Duan et al., 2021 found a similar error using Random Forest when modeling for rice (Oryza sativa), but was more accurate when modeling wheat (Triticum Aestivum). When looking at the regression between modeled and observed GPP, as shown in Figure 4, the slope of the relationship is close to 1.0. The slopes were similar across data splits, 1.01 for early year, 1.05 for late, and 0.96 for random, indicating a near 1:1 relationship between modeled and observed GPP. This 1:1 relationship further indicates that Stacked Ensemble machine learning can reliably estimate GPP across various data splits. However, it is worth noting that there was a greater spread of data points about the slope at higher GPP values (both with machine learning and in the original data comparison), indicating a potential oversaturation of the remote sensing data. While the combined dataset had more training data, it had two crops and three locations to model for, potentially complicating the modeling effort, as a result, it is performance was similar. Maize and soybean have very different growth habits and photosynthetic pathways, likely making best-fit models different for each crop. The MODIS17 product uses a light-use efficiency method but does not correct for differences in C₃ and C₄ photosynthetic pathways. In-situ measurements of light use efficiency have found considerably different efficiency values for maize and soybean due to differences in plant physiology, including canopy structure and photosynthesis biopathways (Gitelson et al., 2015; Xin et al., 2015; Gitelson et al., 2018). However, by including the crop type as a variable, the AutoML algorithms successfully distinguished between the two biopathways.

TABLE 3

TABLE 3. The average results for combined agroecosystem/crop GPP modeling are shown below. Average is across the three temporal validation data splits. Units are gC m^-2 day^-1.

FIGURE 4

FIGURE 4. The relationship between machine learning predicted GPP and eddy covariance GPP is shown above. The top graph shows the model where the earlier years (2004–2005) were used as validation and the later years (2006–2019) were used as training (645 data points). The middle graph shows the model where the later years (2016–2019) were used as validation and the earlier years (2004–2015) were used as training (618 data points). The bottom graph is where random years were used for validation (2009, 2011, 2014, and 2017) with the remainder as training (574 data points).

As with analysis by agroecosystem, all best-fit models for the combined datasets were Stacked Ensemble-type models. As discussed previously, Stacked Ensemble is a machine learning method that combines multiple learning methods (i.e., GBM and XGBoost) by using the output of one model as the input for another (Rajadurai and Gandhi, 2020; Mohebbian et al., 2021). Stacked Ensemble is a robust approach that can work with many data types and uses (Zai and Chen, 2018; Rajadurai and Gandhi, 2020). Stacked Ensemble methods have been found to frequently outperform single models across many data types (Zhai and Chen, 2018; Chowhurdy et al., 2019; Singh et al., 2019; Jangam and Annavarapu, 2021).

3.4 Implications for future research

This study has provided insight into the potential of using machine learning methodology to estimate GPP using readily available inputs (MODIS GPP product, air temperature, precipitation, crop type, and agroecosystem) across the LTAR network and croplands. This framework has the potential to allow for network-wide estimations of carbon uptake across the Common Experiment and other sites, even where EC towers are not present, and to further network goals of understanding cropland carbon dynamics. Combined models (including multiple agroecoregions in the same model) can account for region-specific differences by using agroecosystem region as an input in the training phase. The combined model will allow for more large-scale carbon inventories without compromising on accuracy when compared to site and crop-specific models (combined model r²: 0.85 average site/crop-specific model r²: 0.82), likely owing to the greater pool of training data for larger models.

Croplands cover about 12% of the Earth’s ice-free land area (IPCC, 2019). Sustainable management on these lands can maintain or improve productivity while contributing to climate change mitigation and adaptation goals (Tang et al., 2015; IPCC, 2019; Browning et al., 2021). As a component of the carbon cycle, cropland GPP is an important indicator of productivity and sustainability, and its monitoring can contribute to furthering sustainability goals (Beer et al., 2010; Gilabert et al., 2015; Huang et al., 2018; Browning et al., 2021). GPP monitoring can also be valuable in understanding the short-term effects of extreme weather events on carbon dynamics, including droughts, floods, and intense storms (Ciais et al., 2005; Menefee et al., 2020; Yin et al., 2020). Given that few cropland sites have in-situ GPP monitoring, regional and global GPP estimates rely on remote sensing and modeling to estimate large-scale carbon uptake (Kalfas et al., 2011; Huang et al., 2018; Smith et al., 2019). This method of incorporating machine learning allows for a more flexible model that can apply to broad areas and pick up on trends not seen in process-based modeling (Beer et al., 2010; Jung et al., 2011; Xiao et al., 2014). Expanding this methodology to broader regions and more sites to create LTAR-wide carbon flux estimates is a future goal of this project.

Machine learning methods have already been widely used to successfully estimate global GPP on annual timesteps with various algorithm types (Beer et al., 2010; Jung et al., 2011; Xiao et al., 2014). Machine learning estimated global evapotranspiration, CH₄ emissions, and NEE on global, field, and regional scales (Yao et al., 2017; Knox et al., 2021; Shang et al., 2021; Talib et al., 2021). Applying these methods to agricultural lands can quantify carbon cycle contributions from agriculture and determine best management practices for carbon sequestration. Practically any field of suitable size can leverage the power of machine learning by utilizing methods as described in this paper, which, given the wide accessibility of input data, should make this type of analysis feasible for any large-scale cropping system. The methods employed provide a simple solution that can be followed with minimal experience/knowledge with machine learning. However, further improvement of a given algorithm’s output via adjusting hyperparameters is limited to the range of values disclosed by the h2o package. Currently, the full range required for a sensitivity analysis is undisclosed by the h2o package. Data limitations from EC towers often arise due to the high costs of the equipment; the limited data can possibly result in commonly known issues of overfitting with machine learning. Nevertheless, the extensive reach and robustness of machine learning-based carbon models, as demonstrated here, make it an ideal method for future work in understanding cropland carbon uptake and climate interactions. Future steps aim to apply this methodology to more sites within the LTAR network. The LTAR network has contributed to our collective understanding of cropland biogeochemical cycling across the United States, and it is our hypothesis that the addition of machine learning methods will enhance these network analyses.

4 Conclusion

In this study, we showcased the applicability of machine learning to estimate GPP across LTAR croplands using MODIS satellite imagery, weather, and agroecoregion as input data. The MODIS GPP product, while correlated with in-situ GPP, frequently underestimated GPP during peak growing season and overestimated during seedling establishment and senescence. In simulating GPP at individual tower sites, model performance was best at sites with larger quantities of data available for model training. The machine learning methods also work well with all sites combined into one dataset, particularly for maize. Combined datasets provide more training data for the machine learning algorithm to work with and can thus improve model success over individual site-scale models. The success of machine learning at modeling GPP across three LTAR sites is a first step towards applying this methodology to the network as a whole.

Data availability statement

Author contributions

DM: Writing, results interpretation, and editing. TL: Machine learning work. KF: Remote sensing, results interpretation, and editing. JC: Eddy Covariance data. MA: Eddy Covariance data. JB: Eddy Covariance data. AS: Eddy Covariance data. All authors contributed to the article and approved the submitted version.

Funding

LTAR is supported by the USDA. This work was supported by the USDA-ARS Grassland Soil and Water Research Laboratory, Temple, TX, and supports the USDA-ARS-LTAR network.

Acknowledgments

This research was a contribution to the Long-Term Agroecosystem Research (LTAR) network. The USDA is an equal opportunity employer and provider.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Abraha, M., Chen, J., Hamilton, S. K., and Robertson, G. P. (2019). Long-term evapotranspiration rates for rainfed corn versus perennial bioenergy crops in a mesic landscape. Hydrol. Process. 34, 810–822. doi:10.1002/hyp.13630

Machine learning algorithms improve MODIS GPP estimates in United States croplands

1 Introduction

2 Materials and methods

2.1 Site and EC data selection

2.2 MODIS data acquisition and processing

2.3 MODIS GPP algorithm

2.4 Machine learning model

2.5 Statistical analyses

3 Results and discussion

3.1 Comparison of GPPMODIS to GPPEC

3.2 Model results

3.3 Comparison of all site data

3.4 Implications for future research

4 Conclusion

Data availability statement

Author contributions

Funding

Acknowledgments

Conflict of interest

Publisher’s note

References

3.1 Comparison of GPP_MODIS to GPP_EC