ORIGINAL RESEARCH article

Front. Remote Sens., 21 March 2025

Sec. Data Fusion and Assimilation

Volume 6 - 2025 | https://doi.org/10.3389/frsen.2025.1531097

Choosing blocks for spatial cross-validation: lessons from a marine remote sensing case study

  • 1NIVA Denmark Water Research, Copenhagen, Denmark
  • 2Norwegian Institute for Water Research, Section for Environmental Informatics, Oslo, Norway

Supervised learning allows broad-scale mapping of variables measured at discrete points in space and time, e.g., by combining satellite and in situ data. However, it can fail to make accurate predictions in new locations without training data. Training and testing data must be sufficiently separated to detect such failures and select models that make good predictions across the study region. Spatial block cross-validation, which splits the data into spatial blocks left out for testing one after the other, is a key tool for this purpose. However, it requires choices such as the size and shape of spatial blocks. Here, we ask, how do such choices affect estimates of prediction accuracy? We tested spatial cross-validation strategies differing in block size, shape, number of folds, and assignment of blocks to folds with 1,426 synthetic data sets mimicking a marine remote sensing application (satellite mapping of chlorophyll a in the Baltic Sea). With synthetic data, prediction errors were known across the study region, allowing comparisons of how well spatial cross-validation with different blocks estimated them. The most important methodological choice was the block size. The block shape, number of folds, and assignment to folds had minor effects on the estimated errors. Overall, the best blocking strategy was the one that best reflected the data and application: leaving out whole subbasins of the study region for testing. Correlograms of the predictors helped choose a good block size. While all approaches with sufficiently large blocks worked well, none gave unbiased error estimates in all tests, and large blocks sometimes led to an overestimation of errors. Furthermore, even the best choice of blocks reduced but did not eliminate a bias to select too complex models. These results 1) yield practical lessons for testing spatial predictive models in remote sensing and other applications, 2) highlight the limitations of model testing by splitting a single data set, even when following elaborate and theoretically sound splitting strategies; and 3) help explain contradictions between past studies evaluating cross-validation methods and model transferability in remote sensing and other spatial applications of supervised learning.

1 Introduction

Supervised learning is a critical tool for mapping environmental variables like marine chlorophyll a, land cover types, and species distributions at broad spatial scales (Elith and Leathwick, 2009; Kerr and Ostrovsky, 2003; Tuia et al., 2022). In supervised learning, training a model involves extracting relationships between output (response) and input (predictor) variables from example data. In this way, supervised learning allows the continuous mapping of variables measured at discrete points in space and time. In marine satellite remote sensing, which serves as a case study here, common supervised learning approaches range from simple linear regression (e.g., Darecki et al., 2005; Kratzer et al., 2003; O'Reilly et al., 1998; O'Reilly and Werdell, 2019) to complicated machine learning methods (e.g., Kattenborn et al., 2021; Yuan et al., 2020; Zhang et al., 2023).

These models typically rely on in situ observations of the response variable for training and validation. A sound sampling design is critical when collecting in situ data for this purpose (Rocha et al., 2020). However, collecting data at sea over broad spatial scales and according to a sound sampling design would be extremely expensive. Therefore, to obtain sufficiently large in situ data sets, many broad-scale marine studies rely on databases that compile measurements from individual field campaigns with different objectives and without an overarching sampling strategy. Such data often have substantial spatial biases, i.e., some places are well-covered by data, whereas others have little or no data (Boakes et al., 2010; Bowler et al., 2022; Stock and Subramaniam, 2020). The spatial biases in such databases pose a critical statistical challenge in supervised-learning-based marine remote sensing (Stock, 2022).

A key question about models intended to generate broad-scale maps is how well they make predictions across the whole region of interest, including data-poor subregions (Peterson et al., 2007; Qiao et al., 2019; Stock and Subramaniam, 2020; Yates et al., 2018). Researchers traditionally evaluate and compare models by randomly splitting the available data into a training set for fitting the model and a test (or validation) set for estimating its prediction accuracy (sometimes, an additional development set is used for model selection and fine-tuning). This split can be done once or repeatedly in cross-validation. However, evaluating models based on random splits produces misleading results in many remote sensing and other environmental applications that involve spatial data (Fourcade et al., 2018; Ploton et al., 2020; Roberts et al., 2017). In particular, environmental variables are often spatially autocorrelated (Legendre, 1993), making nearby observations dependent. Dependence between training and testing data violates a core assumption of many statistical methods (Arlot and Celisse, 2010; Nikparvar and Thill, 2021), causes the selection of too complex models that do not generalize well (Gregr et al., 2019), and is a key driver of data leakage, a common cause of wrong results in scientific applications of supervised learning (Kapoor and Narayanan, 2023).

Two factors exacerbate these statistical problems as the popularity of machine learning as a scientific tool is rising, and machine learning is claimed to be superior to simpler statistical approaches (Pichler and Hartig, 2023). First, machine learning models can easily pick up location-specific relationships that fail to transfer to new locations (Beery et al., 2018), yet such failures are missed when training and testing data come from the same locations (Stock et al., 2023). Second, machine learning methods are rarely tailored to the limitations of typical environmental data, such as autocorrelated observations taken near each other. Ideally, models intended to make predictions for data-poor locations or to yield generalizable insights should be tested with independent, out-of-distribution data (Araújo et al., 2005; Geirhos et al., 2020; Gregr et al., 2019), yet such data are rarely available.

When only a single data set is available for model training and testing, cross-validation can mimic tests with independent data and extrapolation to data-poor regions by separating training and testing data spatially, temporally, or in predictor space (Roberts et al., 2017; Wenger and Olden, 2012). However, separating training and testing data does not guarantee sound error estimates for two reasons. First, if some subregions of the study area have no data, error estimates calculated for held-out subregions with data are not necessarily valid for subregions without data (for a method to estimate the area where a cross-validated error estimate applies, see Meyer and Pebesma, 2021). Second, the data being split might contain non-spatial biases and shortcuts. A sound data separation strategy is therefore necessary, but not sufficient, to avoid data leakage and obtain sound estimates of a spatial model’s prediction accuracy (Kapoor and Narayanan, 2023; Stock et al., 2023).

Two main approaches exist for separating training and testing data spatially. First, one can leave out one observation at a time for testing and withhold all data within a spatial buffer around the test observation from training (Le Rest et al., 2013; 2014; Pohjankukka et al., 2017). Second, one can split the data into blocks based on geographical space (block cross-validation; Roberts et al., 2017; Sweet et al., 2023). Spatial cross-validation strategies yield better error estimates under spatial dependence and are hence a key tool in many environmental applications (Bald et al., 2023; Crego et al., 2022; El-Gabbas et al., 2021; Smith et al., 2021; Stock et al., 2018). An R package for spatial cross-validation is available (Valavi et al., 2019). However, spatial cross-validation remains underused in marine remote sensing and requires methodological choices such as the size and shape of spatial blocks.

Here, we explore how such choices affect error estimates with synthetic data that mimic a marine remote sensing application. With this example, we aim to inform the evaluation of predictive models in applications that 1) use supervised learning in satellite remote sensing or to create other broad-scale maps from point data, 2) must split a single data set for training and testing, and 3) rely on point data that were collected without an overarching sampling strategy, e.g., obtained from databases combining measurements from many individual field campaigns. Specifically, we ask: How do block size, shape, the number of cross-validation folds, and assignment of blocks to folds affect prediction error estimates and model selection? Which of these choices is most important? Might such choices explain contradictory results between prior studies comparing spatial cross-validation methods and testing the spatial transferability of models?

2 Materials and methods

2.1 Overview

To answer our research questions, we exploit synthetic data that mimic a remote sensing application in marine biology (Stock, 2022). These data cover the Baltic Sea in northern Europe from 2003 to 2019. They consist of many individual data sets (henceforth, subsets) with geographic points (measurement locations and dates only) extracted from an oceanographic database. Each data point contains a response variable (synthetic chlorophyll a concentration) and satellite-based predictors (remote sensing reflectance in different wavelength bands) for locations and dates where actual in situ chlorophyll measurements existed. With each subset, three models of different complexity were trained and evaluated with various cross-validation strategies. Using a synthetic response variable that was generated with a model instead of values measured in situ allowed for calculating the models’ “true” prediction errors across the study region and period, which were compared to cross-validated estimates computed only from the subsets, i.e., from locations and dates where real in situ data existed (Figure 1). Importantly, “true” error here refers to a model’s prediction error in its intended task (generating daily maps of synthetic chlorophyll a for the whole Baltic Sea), not its skill predicting real-world, in situ chlorophyll a concentration.


Figure 1. Overview of data sources and study design.

2.2 Synthetic data

The synthetic data were developed in four steps outlined below to support the comparison of validation methods in a realistic use case of supervised learning. Additional details are provided in Stock (2022).

First, to create synthetic data with realistic distributions in space and time, we extracted locations and times of in situ chlorophyll a measurements from an oceanographic database (http://ocean.ices.dk/HydChem, accessed 31 August 2020). Such data are typically collected from ships during research cruises over many years. During cruises, researchers choose measurement locations based on the cruise’s scientific objectives instead of an overarching sampling strategy for the database. We excluded in situ measurements taken within 5 km of the coastline, at depths >2 m, or with chlorophyll a concentrations >30 mg m⁻³.

Second, for predictors, each in situ observation was matched with satellite measurements of remote sensing reflectance in five wavelength bands (412 nm, 443 nm, 490 nm, 555 nm, and 670 nm; http://globcolour.info, accessed 4 September 2020). The satellite data came from the GlobColour project, which combines data from several satellite-borne instruments to improve spatiotemporal coverage (Fanton d’Andon et al., 2009; Maritorena et al., 2010). The spatial resolution was 4 km, and the temporal resolution was 1 day. Because clouds often obscure satellite views of the sea surface, many field observations had no matching satellite data. This reduces the number of usable observations and can introduce additional spatiotemporal biases due to uneven cloud cover (Stock et al., 2020). We matched the in situ and the satellite data with a same-calendar-day temporal window and bilinear interpolation from the four surrounding pixels, yielding 2,728 in situ observations with matching satellite data (henceforth, matchups; Figure 2A).
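
For reference, bilinear interpolation from the four surrounding pixels takes the standard form below, where dx and dy are the fractional offsets of the in situ location within the grid cell and v00, v10, v01, v11 are the reflectance values of the four neighboring pixels (this notation is introduced here for illustration only; it is not part of the original data description):

\[ v = (1-d_x)(1-d_y)\,v_{00} + d_x(1-d_y)\,v_{10} + (1-d_x)\,d_y\,v_{01} + d_x\,d_y\,v_{11} \]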


Figure 2. Map and selected statistics of the synthetic data used to evaluate cross-validation methods. (A) Locations of in situ observations of chlorophyll a with matching satellite data (which the subsets were sampled from) and the number of subsets that had observations within 50 km. (B, C) Two example subsets. (D) Histogram of synthetic chlorophyll a concentration used as response variable for the data from which the subsets were sampled. (E) Number of observations in the subsets. (F) Percent of the study area with at least one observation within 50 km across the generated subsets.

Third, to compare how well cross-validated error estimates approximated “true” prediction errors for the whole study region and period, the in situ chlorophyll a concentrations were replaced with synthetic values. These values were the weighted average of two sources with 4 km spatial and 1-day temporal resolution: 1) a biogeochemical simulation model of the Baltic Sea with 60% weight (Baltic Sea Biogeochemical Reanalysis, https://marine.copernicus.eu, accessed 31 August 2020), and 2) existing satellite-based maps of chlorophyll a, also from the GlobColour project, with 40% weight (these maps were previously generated with the same remote sensing reflectance data but another algorithm, and hence reflected some spatial patterns of the predictors). The averaging was necessary because simulated chlorophyll a was less correlated with remote sensing reflectance and with the original in situ measurements than in most real applications, whereas the satellite product could have been too easily reconstructed by flexible machine learning methods with remote sensing reflectance as predictors. The weights were chosen manually to correct for these unrealistically small correlations while keeping the biogeochemical simulation dominant (correlation of log10-transformed in situ chlorophyll with simulated values: Pearson correlation coefficient r = 0.16; with satellite chlorophyll from GlobColour: r = 0.49; with weighted average: r = 0.46). The Spearman rank correlation of the band ratio R (a common predictor of chlorophyll a, see section 2.3) with in situ chlorophyll a was ρ = 0.26, with simulated chlorophyll was ρ = 0.03, and with merged chlorophyll was ρ = 0.25. The moderate but significant (p < 0.001) correlations reflect high concentrations of other optical water constituents that make remote sensing of the Baltic Sea tricky (Darecki and Stramski, 2004; Siegel and Gerth, 2008; Stock, 2015). Furthermore, as is typical in real applications, the merged, synthetic chlorophyll a was roughly log-normally distributed (Figure 2D). Therefore, while chosen manually, the selected weights resulted in a synthetic response variable with statistical properties and relationships similar to the in situ measurements it replaced. Henceforth, “synthetic concentrations” refer to this weighted average.
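
Written out, with Chl_sim the simulated and Chl_sat the satellite-based concentration at a given pixel and day, the synthetic response was the weighted average below (assuming, as the description above suggests, that the weights were applied directly to the concentrations):

\[ \mathrm{Chl}_{\mathrm{syn}} = 0.6\,\mathrm{Chl}_{\mathrm{sim}} + 0.4\,\mathrm{Chl}_{\mathrm{sat}} \]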

Fourth, to create many synthetic yet realistic data sets with different sizes and spatial biases, 2,000 random subsets were sampled from the 2,728 matchups (Figures 2B, C). To mimic oceanographic data collection, whole cruises were sampled (not individual observations). However, the automatic generation of spatial blocks with a common R package (Valavi et al., 2019) included in our test of cross-validation approaches failed for larger blocks in some small subsets (see Section 2.4). These subsets were excluded from the analyses to allow a comparison of all tested cross-validation methods. The remaining 1,426 subsets contained between 200 and 1,500 observations and exhibited different degrees of spatial bias (Figures 2E, F).

2.3 Predictive models

With each subset, we trained and tested three predictive models common in marine remote sensing. The response was always synthetic, log10-transformed chlorophyll a, but the models used different predictors and underlying mathematical structures.

The first model was a simple linear model:

\[ \log_{10}(\mathrm{Chl}\,a) = a_0 + a_1 R \]

\[ R = \log_{10}\left(\frac{\max\left(R_{\mathrm{RS}}(443),\ R_{\mathrm{RS}}(490)\right)}{R_{\mathrm{RS}}(555)}\right) \]

Here, R_RS(λ) is the remote sensing reflectance in the wavelength band centered at λ nm. Such models are called maximum band ratio (MBR) algorithms and are among the longest-established statistical models for mapping chlorophyll a from satellites (O’Reilly et al., 1998).
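
As a minimal sketch of fitting such a model in R (using a toy matchup table with hypothetical column names, not the study's actual data):

    # Toy matchup table; column names are hypothetical.
    set.seed(1)
    df <- data.frame(rrs412 = runif(200, 0.001, 0.01), rrs443 = runif(200, 0.001, 0.01),
                     rrs490 = runif(200, 0.001, 0.01), rrs555 = runif(200, 0.001, 0.01),
                     rrs670 = runif(200, 0.001, 0.01), log_chl = rnorm(200, 0.5, 0.3))

    # Maximum band ratio predictor R and the linear MBR model.
    df$R <- log10(pmax(df$rrs443, df$rrs490) / df$rrs555)
    mbr  <- lm(log_chl ~ R, data = df)
    head(predict(mbr))   # predictions on the log10(Chl a) scale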

The second model was a random forest (RF) using remote sensing reflectances in different wavelength bands and the band ratio R as predictors. Random forests are a basic machine-learning approach. They consist of many regression trees (here: 300) fitted to bootstrap samples of the training data while using only some predictors when fitting each tree (Breiman, 2001). Random forests work well for smaller data sets with correlated predictors and are a common choice in remote sensing applications (Belgiu and Drăgu, 2016).

The third model was a random forest with projected X and Y coordinates as additional predictors (RFXY). These spatial predictors allow the model to harness spatial structures in the data for predictions (Zhang et al., 2023). However, including them risks overfitting the model to these structures and limits its applicability when spatial structures change over time, e.g., because of climate change. Stock (2022) found that including spatial coordinates in a random forest caused large prediction errors that spatial, temporal, and environmental block cross-validation methods underestimated. Hence, the RFXY model is a “worst case” illustrating the limits of estimating prediction errors with spatial block cross-validation.
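
The two random-forest variants can be sketched in the same way, assuming the randomForest package and extending the toy table from the previous sketch with hypothetical projected coordinates:

    library(randomForest)

    # RF: reflectances and the band ratio R as predictors, 300 trees as in the study.
    rf <- randomForest(log_chl ~ rrs412 + rrs443 + rrs490 + rrs555 + rrs670 + R,
                       data = df, ntree = 300)

    # RFXY: the same predictors plus projected X and Y coordinates.
    df$x <- runif(nrow(df), 0, 1000e3)   # hypothetical coordinates in metres
    df$y <- runif(nrow(df), 0, 1500e3)
    rfxy <- randomForest(log_chl ~ rrs412 + rrs443 + rrs490 + rrs555 + rrs670 + R + x + y,
                         data = df, ntree = 300)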

2.4 Spatial blocks

We tested two kinds of spatial blocks: (1) blocks and folds automatically generated with the R package blockCV (Figures 3A–F; Valavi et al., 2019), and (2) blocks manually created for the Baltic Sea (Figures 3G–I).


Figure 3. Examples of spatial blocks used for cross-validation. The blocks were either created automatically with the R package blockCV [examples in (A–F), with plot headings reflecting key parameters described in the text] or created manually for the Baltic Sea: subbasins (G) and latitudinal blocks reflecting environmental gradients in the study region (H, I).

The blockCV package allows the automatic generation of spatial blocks based on user-provided parameters. Here, we varied the following parameters: 1) block size (2 km–300 km), 2) block shape (squares or hexagons), 3) how blocks were assigned to folds (randomly, systematically, or in a checkerboard pattern), and 4) the number of folds (5 or 10 for random and systematic assignment, 2 for checkerboard assignment).
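
The study generated these blocks with blockCV; the base-R sketch below (continuing the toy table and hypothetical coordinates from above) only illustrates the underlying idea of square blocks of a chosen size with systematic assignment to folds, not the package's implementation:

    block_size <- 200e3   # block side length in metres (here 200 km)
    k <- 5                # number of folds

    # Assign each observation to a square block by gridding the projected coordinates.
    block_id <- paste(floor(df$x / block_size), floor(df$y / block_size))

    # Systematic assignment: blocks are ordered and dealt out to folds in turn.
    blocks  <- sort(unique(block_id))
    fold_of <- setNames((seq_along(blocks) - 1) %% k + 1, blocks)
    df$fold <- fold_of[block_id]
    table(df$fold)        # observations per cross-validation fold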

In addition, we manually created three sets of blocks. The first set consisted of the subbasins of the Baltic Sea, defined by HELCOM (the intergovernmental organization governing environmental issues in the Baltic Sea region). The second and third sets reflected the Baltic Sea’s environmental gradients from its connection with the Atlantic Ocean in the southwest to its northernmost bays, with north-south block sizes of 80 km and 200 km. In these manual designs, each block served as a fold. In each subset, folds with fewer than 20 observations were merged with the next-smallest fold until all folds had at least 20 observations.
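
One reasonable reading of that merging rule, continuing the sketch above (the function name and the threshold argument are illustrative):

    # Merge the smallest fold into the next-smallest fold until all folds have >= min_n observations.
    merge_small_folds <- function(fold, min_n = 20) {
      fold <- as.character(fold)
      repeat {
        counts <- sort(table(fold))                      # fold sizes, smallest first
        if (counts[1] >= min_n || length(counts) < 2) break
        fold[fold == names(counts)[1]] <- names(counts)[2]
      }
      fold
    }

    df$fold <- merge_small_folds(df$fold)
    table(df$fold)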

2.5 Spatial autocorrelation

To be considered independent, training and testing data must be farther apart than the autocorrelation range (Trachsel and Telford, 2016). This range is thus critical information for spatial block cross-validation. It is traditionally estimated for residuals of the fitted model (Le Rest et al., 2014). However, fitting the model first precludes model selection, and residual autocorrelation may be underestimated for flexible models overfitted to spatial structures (Roberts et al., 2017). Furthermore, with three models and 1,426 synthetic subsets, this study involved over 4,000 fitted models. Exploring residual autocorrelation for all of them was impractical. Consequently, we followed Valavi et al. (2019) and examined spatial autocorrelation of the predictors, assuming that they reflect the spatial structure of relevant environmental variables. Spatial autocorrelation can be examined, e.g., through variograms or correlograms, which provide similar information (Dormann et al., 2007). While variograms are a fundamental tool of geostatistics, correlograms are common in other fields like ecology and can be more robust when data are clustered (Wilde and Deutsch, 2006). Here, some clustering of available predictor data might have occurred because of differences in cloud cover across the study region. We hence calculated variograms as well as correlograms.

Spatiotemporal sample variograms were calculated for each predictor in two selected years (2005 and 2018) with the R package gstat (Gräler et al., 2016; Pebesma, 2012; Pebesma, 2004). For computational efficiency, each variogram calculation used a sample consisting of 5% of the pixels with data from the respective year. We calculated and averaged spatial correlograms with Moran’s I as a measure of spatial dependence for 100 randomly selected days during the study period with the R package ncf (Bjornstad, 2022).
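
A sketch of such a correlogram for one day's pixel values with ncf (toy coordinates and reflectances; the 50 km distance increment and the skipped permutation test are illustrative choices, not necessarily the study's settings):

    library(ncf)

    # Toy daily field: pixel coordinates in km and one reflectance band.
    set.seed(1)
    px <- data.frame(x = runif(500, 0, 1000), y = runif(500, 0, 1000),
                     rrs443 = runif(500, 0.001, 0.01))

    # Moran's I in 50 km distance bins; resamp = 0 skips the permutation test.
    cg <- correlog(px$x, px$y, px$rrs443, increment = 50, resamp = 0)
    plot(cg$mean.of.class, cg$correlation, type = "l",
         xlab = "Distance (km)", ylab = "Moran's I")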

2.6 “True” errors vs. cross-validation errors

Predictive models should be tested with data reflecting their target application (Kapoor and Narayanan, 2023). Because the target application was to create maps for the whole Baltic Sea, we compared cross-validated error estimates calculated with the spatial block options described in Section 2.4 and with standard 10-fold cross-validation to “true” prediction errors calculated for the whole study region and period. These “true” errors were calculated in three steps, as described below. Importantly, all prediction errors were calculated with the synthetic chlorophyll concentrations (which are known everywhere) as response variable. Hence, “true” refers to errors that are valid for the whole study region and period, not errors that reflect the real-world chlorophyll a concentration (which are only known where in situ data exist).

First, we trained each model (MBR, RF, RFXY) with each complete subset, i.e., without withholding any data from the subset (Kuhn and Johnson, 2013). Each subset contained synthetic chlorophyll a values as the response variable and the predictor variables as described in Section 2.2. This process yielded 4,278 trained models (three kinds of models trained on 1,426 subsets). Because the subsets were sampled from a database of field campaigns (see Section 2.2), training the models relied exclusively on locations and times where real in situ data existed.

Second, we created validation data covering the whole study region and period to calculate the “true” errors. Because making pixel-by-pixel predictions for 18 years of daily satellite data with over 4,000 models was computationally too expensive, we randomly sampled 1% of pixels in each daily satellite image. This sample comprised over 380,000 observations. Each observation contained a synthetic chlorophyll a value as the response and predictor variables as described in Section 2.2. Hence, the data used to calculate “true” errors, in contrast to the test sets of the various cross-validation methods, contained observations from randomly sampled locations and times and covered the whole study region and period (as far as cloud cover allowed).

Third, with each of the 4,278 trained models, we made predictions for this test set covering the whole study region and period, yielding “true” error estimates in the sense that they reflected the purpose of broad-scale, satellite-based mapping precisely (making daily maps for the whole study region and period).

Finally, we applied the various cross-validation methods (Section 2.4) to each model and subset, resulting in 4,278 error estimates from each cross-validation method. Comparing these cross-validation estimates to the “true” errors revealed how well each method estimated the models’ prediction accuracy in the intended application.

As error measures, we used the root mean squared error (RMSE) and the absolute percentage difference (APD), calculated with the standard equations (like in Stock, 2022).
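
For reference, with y_i the synthetic chlorophyll a concentration and ŷ_i the prediction for observation i of n, these measures take their usual forms (the percentage form of the APD shown here is the common definition; the exact formulation follows Stock, 2022):

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}, \qquad \mathrm{APD} = \frac{100}{n}\sum_{i=1}^{n}\frac{\left|\hat{y}_i - y_i\right|}{y_i} \]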

3 Results

3.1 Error estimates and model selection

Synthetic chlorophyll a concentrations predicted with the MBR model had smaller “true” errors than those of the random forests (RF and RFXY) in 99% (RMSE) and 97% (APD) of subsets. Prediction errors were highest (1) in the Bothnian Bay, where the fewest training data were available (RMSE and APD), and (2) in the eastern Gulf of Finland, the Gulf of Riga, and some smaller areas with very high synthetic chlorophyll a concentrations (RMSE only) (Figure 4). The APD’s comparatively small values in these high-chlorophyll areas might reflect this error measure’s low sensitivity to differences between larger numbers. Moderate “true” errors also occurred in large offshore areas where relatively low chlorophyll a concentrations and sparse data coverage coincided, like the Bothnian Sea (APD and RMSE).


Figure 4. Spatial distribution of mean “true” errors (RMSE and APD of the three model types predicting synthetic chlorophyll a) averaged over 300 randomly sampled, partly cloud-free days across the whole study period. For each of the three model types, each subset yielded a different trained model, and prediction errors were averaged across subsets for this figure. “True” errors refer to errors when predicting synthetic chlorophyll a concentrations for the whole study region and period, not real-world concentrations.

The tested cross-validation methods often underestimated errors, especially for the RFXY model (Figure 5; Table 1). Overall, spatial block cross-validation yielded better error estimates than 10-fold cross-validation but sometimes overestimated errors. Error estimates from the blockCV package depended on the specific options, especially block size (see Section 3.2). They were larger than estimates from 10-fold cross-validation and smaller than estimates from large, manually created blocks (subbasins). Blocks generated with the blockCV package and good options led to a stronger underestimation than large manually created blocks in some cases but avoided an overestimation in others.


Figure 5. Estimated errors generated with different options in the blockCV R package as a function of block size. The solid black lines show the models’ “true” errors (mean error predicting synthetic Chl a concentration for the whole study region and period across all subsets). The dashed black line shows errors estimated with 10-fold cross-validation. The dotted line shows errors estimated using subbasins as spatial blocks.


Table 1. “True” errors and estimated errors with different cross-validation approaches. With the blockCV package’s various settings, there were too many combinations to show in the table. Instead, the table shows the best estimate obtained with the package (i.e., the one closest to the “true” error, representing an optimal choice of parameters) and the 25th percentile of absolute difference to the “true” error (P25, representing a good but not optimal choice of parameters). The estimates closest to the “true” errors are highlighted in bold font.

Depending on the model and error measure, 10-fold cross-validation underestimated prediction errors by 5% (RMSE of MBR) to 54% (APD of RFXY). The different block cross-validation methods yielded more accurate error estimates than 10-fold cross-validation, but the RMSE was sometimes overestimated. The best RMSE and APD estimates for RFXY were achieved with subbasins as blocks. The best APD estimates for MBR and RF were achieved by blocks generated with blockCV when optimal options were chosen; with a solid but not optimal choice of options, the 80 km north-south blocks estimated the APD of these models best.

When choosing between the MBR and the RF models, all spatial cross-validation methods with large block sizes led to correct model selection for >98% of subsets. Ten-fold cross-validation selected the best model for fewer subsets (APD: 86%, RMSE: 93%). In contrast, model selection failed even with the best methods when choosing between all three models (MBR, RF, and RFXY). Ten-fold cross-validation incorrectly chose RFXY for over 99% of subsets. Spatial cross-validation with subbasins as blocks worked best, but RFXY was still incorrectly chosen in over 50% (APD) and 80% (RMSE) of subsets.

3.2 Options when generating blocks with blockCV

When creating square or hexagonal blocks automatically, choosing a sufficiently large block size was the most important decision (Table 2). On average, cross-validation with ten folds yielded slightly better error estimates than five folds, square blocks yielded slightly better error estimates than hexagonal blocks, and systematic or checkerboard assignment of blocks to folds yielded slightly better error estimates than random assignment. However, except for the block size, the differences between the options were small. For example, averaged over all subsets and block sizes ≥200 km, the random forest’s APD was underestimated by 25% with hexagonal blocks and 24% with square blocks. Nevertheless, large square blocks with systematic assignment to folds were always among the best choices, and often the best, across models and error measures (Figure 5).


Table 2. Percentage of subsets for which different options were in the set of parameters yielding the most accurate blockCV-based error estimate. The highest percentages in each parameter group are shown in bold font.

3.3 Spatiotemporal autocorrelation

Spatiotemporal variograms (Figure 6) showed that all predictors were spatially autocorrelated over several hundred kilometers, yet none of the variograms reached their sill within 500 km (already beyond a practical block size). Variograms calculated for 2005 (not shown) were similar to those for 2018. While correctly suggesting the need for large blocks to achieve independent training and testing data, the variograms did not suggest an optimal block size.


Figure 6. Empirical spatiotemporal variograms of the predictors for 2018.

Correlograms showed the predictors’ autocorrelation ranges more clearly (Figure 7). The spatial correlation dropped sharply within the first 100 km. It plateaued near 200 km for the 412 nm, 443 nm, and 490 nm wavelength bands and near 300 km for the 555 nm and 670 nm wavelength bands. Hence, the correlograms suggested a sound range for the block size in this application.


Figure 7. Correlograms for the predictors on 100 randomly sampled days (thin gray lines) and their average (thick black lines). On some days, there is no data for the largest distances shown due to cloud cover.

4 Discussion

4.1 Block size and spatial distribution of data explain contradictions between prior studies evaluating spatial cross-validation methods

Several past studies have evaluated cross-validation methods with sometimes contradictory results.

On the one hand, several studies found that separating training and testing data spatially yields higher estimated errors than random data splits (Bahn and McGill, 2013; Karasiak et al., 2022; Meyer et al., 2018; 2019; Stock et al., 2018; Stock and Subramaniam, 2020). For example, Ploton et al. (2020) evaluated a random forest predicting above-ground forest biomass with random splits and two spatial cross-validation approaches. Random splits suggested good predictive skill, but spatial cross-validation suggested no predictive skill, reflecting the known effects of data leakage when training and testing data are insufficiently separated (Kapoor and Narayanan, 2023). Other tests with synthetic, autocorrelated data also show that error estimates from spatial block cross-validation are more accurate than those from random splits (Roberts et al., 2017; Stock, 2022). Furthermore, models selected with spatial block cross-validation can transfer better to new geographic locations (Tziachris et al., 2023). These prior results are consistent with this study.

On the other hand, several studies found that differences between spatial and random cross-validation were small and supported the same conclusions (Lyons et al., 2018; Valavi et al., 2023; Zhang et al., 2023). For example, Valavi et al. (2023) found that random and spatial block cross-validation yielded a similar ranking of models and that flexible models transferred well to new locations, contrary to, e.g., Gregr et al. (2019), where more flexible models failed when applied to independent data.

These prima facie contradictory results are explained by two aspects of the studies’ design. First, the studies used different block sizes, a critical choice according to our results. For example, Valavi et al. (2023) used a block size of 75 km to mimic extrapolation over comparatively short distances. As these authors correctly argue, results for extrapolation over larger distances might have been different. Second, spatial cross-validation is most important when data are unevenly distributed in space and time. For example, Lyons et al. (2018) compared cross-validation methods in a terrestrial vegetation mapping case study. They had a small study area (50 km²) and collected data specifically for their study with sound spatial sampling methods. Yet, with sound spatial sampling covering the whole study region, the biases of random cross-validation demonstrated in this and other studies become negligible, because locations for which predictions are needed are not systematically farther from the training data than randomly held-out test observations (Ramezan et al., 2019; Stock, 2022; Wadoux et al., 2021). In contrast, with data resembling the synthetic data here (i.e., databases that compile data from various sources without an overarching sampling strategy), cross-validation with random splits or too small blocks yields wrong error estimates.

Together, the importance of block size highlighted here and the spatiotemporal distribution of data adequately explain these contradictions in previously published research.

4.2 How to choose blocks for spatial cross-validation

The most important parameter when automatically generating square or hexagonal blocks for spatial cross-validation was the block size. This choice is implicit but equally important when using existing regions as blocks (for example, when choosing between broad biogeographical regions or finer-scale subregions).

The first step in choosing a block size is analyzing spatial autocorrelation (Le Rest et al., 2014; Roberts et al., 2017). Here, correlograms showed autocorrelation ranges reflecting a suitable block size, whereas sample variograms showed that large blocks were needed but did not allow choosing a specific size. Hence, determining a good block size can require data exploration with several analytical tools. In addition, modelers must choose a cross-validation strategy that reflects the model’s intended application (Christin et al., 2020; Kapoor and Narayanan, 2023; Stock et al., 2023) – especially whether predictions beyond locations that are well-covered by data are needed.

Iterating over a plausible range of block sizes can yield additional insights, for example, exploring how error estimates change with increasing separation distance (Pohjankukka et al., 2017; Stock and Subramaniam, 2022). While a single set of manually crafted blocks is computationally more efficient and can reflect characteristics of the study region (such as biogeographical boundaries), an iterative approach avoids the need to select a block size a priori. Thus, it helps resolve situations where geostatistical analyses and domain knowledge do not clearly suggest which block size to use.
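
As a sketch of such an iteration in R, reusing the toy table df (with columns x, y, R, and log_chl) from the Section 2 sketches; spatial_cv_rmse is a hypothetical helper written for this illustration, not a blockCV function:

    # Cross-validated RMSE of the MBR model for a given square block size.
    spatial_cv_rmse <- function(df, block_size, k = 5) {
      block_id <- paste(floor(df$x / block_size), floor(df$y / block_size))
      fold <- (match(block_id, sort(unique(block_id))) - 1) %% k + 1
      errs <- sapply(1:k, function(i) {
        fit <- lm(log_chl ~ R, data = df[fold != i, ])
        sqrt(mean((predict(fit, df[fold == i, ]) - df$log_chl[fold == i])^2))
      })
      mean(errs)
    }

    sizes <- c(25e3, 50e3, 100e3, 200e3, 300e3)         # candidate block sizes in metres
    sapply(sizes, function(s) spatial_cv_rmse(df, s))   # one error estimate per block size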

The block shape, the number of folds, and the assignment of blocks to folds were less important here, likely because they did not directly influence how the model testing reflected the target application. For example, while the spatial boundaries of statistical analysis units can affect results (the modifiable areal unit problem; Openshaw and Taylor, 1979), the shape of the blocks had minor effects on whether model testing reflected extrapolation to subregions without data. As another example, the number of folds influences the size of the training sets and, thus, the estimated prediction errors. The smallest data sets in this study had 200 observations. With 10 folds, each training set had 180 observations, and with 5 folds, 160 observations, with minor effects on the error estimates. While these options were unimportant here, they can matter in other applications. For example, for very small data sets, it can be best to keep the training sets as large as possible by using many folds or spatial buffers around single held-out observations. Without such special considerations, square blocks, 10 folds, and systematic assignment of blocks to folds are good default choices when using a blocking strategy like those in the blockCV R package.

4.3 Limitations and generalizability

This study’s main limitation is that it presents a single supervised learning application in one study region. Nevertheless, the results are theoretically plausible and broad enough to explain apparent contradictions between prior studies (see Section 4.1). This study’s marine remote sensing example can, therefore, inform other supervised learning applications with spatially biased point data. However, as the conflicting past results discussed above show, the relevance of our recommendations must be judged carefully for other applications and data contexts.

Environmental data might be autocorrelated in space and time, but this study tested only spatial blocks. Sweet et al. (2023) found that using clusters in predictor space as blocks worked best in a crop modeling example with spatiotemporal autocorrelation. In contrast, for synthetic chlorophyll a data like those used here, spatial blocks produced better error estimates than blocks in time or predictor space (Stock, 2022). Exploring the nuances of choosing spatial blocks was thus most critical for this study’s example application.

Basing the study on synthetic data allowed the evaluation of error estimates across the whole study region (not only locations where in situ data existed); such “simulation experiments” are a common tool to evaluate statistical methods (e.g., Dormann et al., 2012; Strobl et al., 2007; Roberts et al., 2017). However, the simulated data from the biogeochemical model used to build the synthetic data were only weakly correlated with in situ chlorophyll a and with the maximum band ratio, a key predictor in many chlorophyll remote sensing algorithms. This was alleviated by using a weighted average with an independent satellite data product, rather than the biogeochemical simulation results alone, as the synthetic response variable. The synthetic data represented “real” marine remote sensing applications realistically for three reasons, hence allowing relevant insights into the performance of cross-validation methods. First, the remote sensing reflectances and the band ratio serving as predictors were the same data used in many ocean color remote sensing studies. Second, the locations and dates of observations for model training and testing came from actual field campaigns, resampled to reflect the campaign-by-campaign growth of oceanographic databases. Third, the synthetic chlorophyll concentrations (averaged from biogeochemical simulations and a different satellite data product) had statistical properties similar to the in situ chlorophyll concentrations. Therefore, the synthetic data were realistic regarding the predictors and the spatial and temporal distribution of the data.

While this study focused on a single region, the Baltic Sea is typical of Case 2 waters, where remote sensing often relies on supervised learning with local to regional-scale data (Hafeez et al., 2019). Remote sensing reflectance is the foundation of many satellite algorithms besides those mapping chlorophyll a. Therefore, the results are most relevant for other marine remote sensing applications in Case 2 waters.

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://figshare.com/s/132c0a410cc2800ca68f.

Author contributions

AS: Conceptualization, Data curation, Formal Analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing–original draft, Writing–review and editing.

Funding

The author(s) declare that no financial support was received for the research and/or publication of this article.

Conflict of interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Generative AI was used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Araújo, M. B., Pearson, R. G., Thuiller, W., and Erhard, M. (2005). Validation of species–climate impact models under climate change. Glob. Change Biol. 11 (9), 1504–1513. doi:10.1111/j.1365-2486.2005.01000.x

Arlot, S., and Celisse, A. (2010). A survey of cross-validation procedures for model selection. Stat. Surv. 4, 40–79. doi:10.1214/09-SS054A

Bahn, V., and McGill, B. J. (2013). Testing the predictive performance of distribution models. Oikos 122 (3), 321–331. doi:10.1111/j.1600-0706.2012.00299.x

Bald, L., Gottwald, J., and Zeuss, D. (2023). spatialMaxent: adapting species distribution modeling to spatial data. Ecol. Evol. 13 (10), e10635. doi:10.1002/ece3.10635

Beery, S., Van Horn, G., and Perona, P. (2018). “Recognition in terra incognita,” in Computer vision – eccv 2018. Editors V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Springer International Publishing), 472–489. doi:10.1007/978-3-030-01270-0_28

Belgiu, M., and Drăgu, L. (2016). Random forest in remote sensing: a review of applications and future directions. ISPRS J. Photogrammetry Remote Sens. 114, 24–31. doi:10.1016/j.isprsjprs.2016.01.011

Bjornstad, O. N. (2022). Ncf: spatial covariance functions. Available online at: https://CRAN.R-project.org/package=ncf.

Boakes, E. H., McGowan, P. J. K., Fuller, R. A., Chang-qing, D., Clark, N. E., O’Connor, K., et al. (2010). Distorted views of biodiversity: spatial and temporal bias in species occurrence data. PLOS Biol. 8 (6), e1000385. doi:10.1371/journal.pbio.1000385

Bowler, D. E., Callaghan, C. T., Bhandari, N., Henle, K., Benjamin Barth, M., Koppitz, C., et al. (2022). Temporal trends in the spatial bias of species occurrence records. Ecography 2022 (8), e06219. doi:10.1111/ecog.06219

Breiman, L. (2001). Random forests. Mach. Learn. 45 (1), 5–32. Article 1. doi:10.1023/a:1010933404324

Christin, S., Hervet, É., and Lecomte, N. (2020). Going further with model verification and deep learning. Methods Ecol. Evol. 12 (1), 130–134. doi:10.1111/2041-210X.13494

Crego, R. D., Stabach, J. A., and Connette, G. (2022). Implementation of species distribution models in google earth engine. Divers. Distributions 28 (5), 904–916. doi:10.1111/ddi.13491

Darecki, M., Kaczmarek, S., and Olszewski, J. (2005). SeaWiFS ocean colour chlorophyll algorithms for the southern Baltic Sea. Int. J. Remote Sens. 26 (2), 247–260. doi:10.1080/01431160410001720298

Darecki, M., and Stramski, D. (2004). An evaluation of MODIS and SeaWiFS bio-optical algorithms in the Baltic Sea. Remote Sens. Environ. 89 (3), 326–350. doi:10.1016/j.rse.2003.10.012

Dormann, C. F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., et al. (2012). Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography 36 (1), 27–46. doi:10.1111/j.1600-0587.2012.07348.x

Dormann, C. F., McPherson, J. M., Araújo, M. B., Bivand, R., Bolliger, J., Carl, G., et al. (2007). Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography 30 (5), 609–628. doi:10.1111/j.2007.0906-7590.05171.x

El-Gabbas, A., Van Opzeeland, I., Burkhardt, E., and Boebel, O. (2021). Static species distribution models in the marine realm: the case of baleen whales in the Southern Ocean. Divers. Distributions 27 (8), 1536–1552. doi:10.1111/ddi.13300

Elith, J., and Leathwick, J. R. (2009). Species distribution models: ecological explanation and prediction across space and time. Annu. Rev. Ecol. Evol. Syst. 40 (1), 677–697. doi:10.1146/annurev.ecolsys.110308.120159

Fanton d’Andon, O., Mangin, A., Lavender, S., Antoine, D., Maritorena, S., Morel, A., et al. (2009). GlobColour—the European service for ocean colour. Proc. 2009 IEEE Int. Geoscience and Remote Sens. Symposium. doi:10.1029/2006JC004007

Fourcade, Y., Besnard, A. G., and Secondi, J. (2018). Paintings predict the distribution of species, or the challenge of selecting environmental predictors and evaluation statistics. Glob. Ecol. Biogeogr. 27 (2), 245–256. doi:10.1111/geb.12684

Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., et al. (2020). Shortcut learning in deep neural networks. Nat. Mach. Intell. 2, 665–673. doi:10.1038/s42256-020-00257-z

Gräler, B., Pebesma, E., and Heuvelink, G. (2016). Spatio-temporal geostatistics using gstat. R J. 8 (1), 204–218. doi:10.1007/978-3-319-17885-1

Gregr, E. J., Palacios, D. M., Thompson, A., and Chan, K. M. A. (2019). Why less complexity produces better forecasts: an independent data evaluation of kelp habitat models. Ecography 42, 428–443. doi:10.1111/ecog.03470

Hafeez, S., Wong, M., Ho, H., Nazeer, M., Nichol, J., Abbas, S., et al. (2019). Comparison of machine learning algorithms for retrieval of water quality indicators in case-II waters: a case study of Hong Kong. Remote Sens. 11 (6), 617. Article 6. doi:10.3390/rs11060617

Kapoor, S., and Narayanan, A. (2023). Leakage and the reproducibility crisis in machine-learning-based science. Patterns 4 (9), 100804. doi:10.1016/j.patter.2023.100804

Karasiak, N., Dejoux, J.-F., Monteil, C., and Sheeren, D. (2022). Spatial dependence between training and test sets: another pitfall of classification accuracy assessment in remote sensing. Mach. Learn. 111 (7), 2715–2740. doi:10.1007/s10994-021-05972-1

Kattenborn, T., Leitloff, J., Schiefer, F., and Hinz, S. (2021). Review on convolutional neural networks (CNN) in vegetation remote sensing. ISPRS J. Photogrammetry Remote Sens. 173 (July 2020), 24–49. doi:10.1016/j.isprsjprs.2020.12.010

Kerr, J. T., and Ostrovsky, M. (2003). From space to species: ecological applications for remote sensing. Trends Ecol. Evol. 18 (6), 299–305. doi:10.1016/S0169-5347(03)00071-5

Kratzer, S., Håkansson, B., and Sahlin, C. (2003). Assessing secchi and photic zone depth in the Baltic Sea from satellite data. Ambio 32 (8), 577–585. doi:10.1579/0044-7447-32.8.577

Kuhn, M., and Johnson, K. (2013). Applied predictive modeling. Springer.

Legendre, P. (1993). Spatial autocorrelation: trouble or new paradigm? Ecology 74 (6), 1659–1673. doi:10.2307/1939924

Le Rest, K., Pinaud, D., and Bretagnolle, V. (2013). Accounting for spatial autocorrelation from model selection to statistical inference: application to a national survey of a diurnal raptor. Ecol. Inf. 14, 17–24. doi:10.1016/j.ecoinf.2012.11.008

Le Rest, K., Pinaud, D., Monestiez, P., Chadoeuf, J., and Bretagnolle, V. (2014). Spatial leave-one-out cross-validation for variable selection in the presence of spatial autocorrelation. Glob. Ecol. Biogeogr. 23 (7), 811–820. doi:10.1111/geb.12161

Lyons, M. B., Keith, D. A., Phinn, S. R., Mason, T. J., and Elith, J. (2018). A comparison of resampling methods for remote sensing classification and accuracy assessment. Remote Sens. Environ. 208 (February), 145–153. doi:10.1016/j.rse.2018.02.026

Maritorena, S., d’Andon, O. H. F., Mangin, A., and Siegel, D. A. (2010). Merged satellite ocean color data products using a bio-optical model: characteristics, benefits and issues. Remote Sens. Environ. 114 (8), 1791–1804. doi:10.1016/J.RSE.2010.04.002

Meyer, H., and Pebesma, E. (2021). Predicting into unknown space? Estimating the area of applicability of spatial prediction models. Methods Ecol. Evol. 12 (9), 1620–1633. doi:10.1111/2041-210x.13650

Meyer, H., Reudenbach, C., Hengl, T., Katurji, M., and Nauss, T. (2018). Improving performance of spatio-temporal machine learning models using forward feature selection and target-oriented validation. Environ. Model. and Softw. 101, 1–9. doi:10.1016/j.envsoft.2017.12.001

Meyer, H., Reudenbach, C., Wöllauer, S., and Nauss, T. (2019). Importance of spatial predictor variable selection in machine learning applications – moving from data reproduction to spatial prediction. Ecol. Model. 411, 108815. doi:10.1016/j.ecolmodel.2019.108815

Nikparvar, B., and Thill, J.-C. (2021). Machine learning of spatial data. ISPRS Int. J. Geo-Information 10 (9), 600. Article 9. doi:10.3390/ijgi10090600

Openshaw, S., and Taylor, P. (1979). “A million or so correlation coefficients: three experiments on the modifiable areal unit problem,” in Statistical applications in the spatial sciences. Editor N. Wrigley (London: Pion), 127–144.

O’Reilly, J. E., Maritorena, S., Mitchell, B. G., Siegel, D. A., Carder, K. L., Garver, S. A., et al. (1998). Ocean color chlorophyll algorithms for SeaWiFS. J. Geophys. Res. Oceans 103 (C11), 24937–24953. doi:10.1029/98jc02160

O’Reilly, J. E., and Werdell, P. J. (2019). Chlorophyll algorithms for ocean color sensors—OC4, OC5 and OC6. Remote Sens. Environ. 229, 32–47. doi:10.1016/j.rse.2019.04.021

Pebesma, E. (2004). Multivariable geostatistics in S: the gstat package. Comput. Geosciences 30 (7), 683–691. doi:10.1016/j.cageo.2004.03.012

Pebesma, E. (2012). spacetime: spatio-temporal data in R. J. Stat. Softw. 51 (7), 1–30. doi:10.18637/jss.v051.i07

Peterson, A. T., Papeş, M., and Eaton, M. (2007). Transferability and model evaluation in ecological niche modeling: a comparison of GARP and Maxent. Ecography 30 (4), 550–560. doi:10.1111/j.0906-7590.2007.05102.x

Pichler, M., and Hartig, F. (2023). Machine learning and deep learning—a review for ecologists. Methods Ecol. Evol. 14 (4), 994–1016. doi:10.1111/2041-210X.14061

Ploton, P., Mortier, F., Réjou-Méchain, M., Barbier, N., Picard, N., Rossi, V., et al. (2020). Spatial validation reveals poor predictive performance of large-scale ecological mapping models. Nat. Commun. 11 (1), 4540. Article 1. doi:10.1038/s41467-020-18321-y

Pohjankukka, J., Pahikkala, T., Nevalainen, P., and Heikkonen, J. (2017). Estimating the prediction performance of spatial models via spatial k-fold cross validation. Int. J. Geogr. Inf. Sci. 31 (10), 2001–2019. doi:10.1080/13658816.2017.1346255

Qiao, H., Feng, X., Escobar, L. E., Peterson, A. T., Soberón, J., Zhu, G., et al. (2019). An evaluation of transferability of ecological niche models. Ecography 42 (3), 521–534. doi:10.1111/ecog.03986

Ramezan, C., Warner, A., and Maxwell, E. (2019). Evaluation of sampling and cross-validation tuning strategies for regional-scale machine learning classification. Remote Sens. 11 (2), 185. Article 2. doi:10.3390/rs11020185

Roberts, D. R., Bahn, V., Ciuti, S., Boyce, M. S., Elith, J., Guillera-Arroita, G., et al. (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 40 (8), 913–929. doi:10.1111/ecog.02881

Rocha, A. D., Groen, T. A., Skidmore, A. K., and Willemen, L. (2020). Role of sampling design when predicting spatially dependent ecological data with remote sensing. IEEE Trans. Geoscience Remote Sens. 59 (1), 663–674. doi:10.1109/tgrs.2020.2989216

Siegel, H., and Gerth, M. (2008). “Optical remote sensing applications in the Baltic Sea,” in Remote sensing of the European seas. Editors V. Barale,, and M. Gade (Springer), 91–102.

Smith, J. N., Kelly, N., and Renner, I. W. (2021). Validation of presence-only models for conservation planning and the application to whales in a multiple-use marine park. Ecol. Appl. 31 (1), e02214. doi:10.1002/eap.2214

Stock, A. (2015). Satellite mapping of Baltic Sea Secchi depth with multiple regression models. Int. J. Appl. Earth Observation Geoinformation 40, 55–64. doi:10.1016/j.jag.2015.04.002

Stock, A. (2022). Spatiotemporal distribution of labeled data can bias the validation and selection of supervised learning algorithms: a marine remote sensing example. ISPRS J. Photogrammetry Remote Sens. 187, 46–60. doi:10.1016/j.isprsjprs.2022.02.023

Stock, A., Gregr, E. J., and Chan, K. M. A. (2023). Data leakage jeopardizes ecological applications of machine learning. Nat. Ecol. and Evol. 7, 1743–1745. doi:10.1038/s41559-023-02162-1

Stock, A., Haupt, A. J., Mach, M. E., and Micheli, F. (2018). Mapping ecological indicators of human impact with statistical and machine learning methods: tests on the California coast. Ecol. Inf. 48, 37–47. doi:10.1016/j.ecoinf.2018.07.007

Stock, A., and Subramaniam, A. (2020). Accuracy of empirical satellite algorithms for mapping phytoplankton diagnostic pigments in the open ocean: a supervised learning perspective. Front. Mar. Sci. 7 (599). doi:10.3389/fmars.2020.00599

Stock, A., and Subramaniam, A. (2022). Iterative spatial leave-one-out cross-validation and gap-filling based data augmentation for supervised learning applications in marine remote sensing. GIScience and Remote Sens. 59 (1), 1281–1300. doi:10.1080/15481603.2022.2107113

Stock, A., Subramaniam, A., Van Dijken, G. L., Wedding, L. M., Arrigo, K. R., Mills, M. M., et al. (2020). Comparison of cloud-filling algorithms for marine satellite data. Remote Sens. 12 (20), 3313. doi:10.3390/rs12203313

Strobl, C., Boulesteix, A. L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC bioinformatics, 8, 1–21.

Sweet, L., Müller, C., Anand, M., and Zscheischler, J. (2023). Cross-validation strategy impacts the performance and interpretation of machine learning models. Artif. Intell. Earth Syst. 2 (4). doi:10.1175/AIES-D-23-0026.1

Trachsel, M., and Telford, R. J. (2016). Technical note: estimating unbiased transfer-function performances in spatially structured environments. Clim. Past 12, 1215–1223. doi:10.5194/cp-12-1215-2016

Tuia, D., Kellenberger, B., Beery, S., Costelloe, B. R., Zuffi, S., Risse, B., et al. (2022). Perspectives in machine learning for wildlife conservation. Nat. Commun. 13 (1), 792. Article 1. doi:10.1038/s41467-022-27980-y

Tziachris, P., Nikou, M., Aschonitis, V., Kallioras, A., Sachsamanoglou, K., Fidelibus, M. D., et al. (2023). Spatial or random cross-validation? The effect of resampling methods in predicting groundwater salinity with machine learning in mediterranean region. Water 15 (12), 2278. Article 12. doi:10.3390/w15122278

Valavi, R., Elith, J., Lahoz-Monfort, J. J., and Guillera-Arroita, G. (2019). blockCV: an r package for generating spatially or environmentally separated folds for k-fold cross-validation of species distribution models. Methods Ecol. Evol. 10 (2), 225–232. doi:10.1111/2041-210X.13107

Valavi, R., Elith, J., Lahoz-Monfort, J. J., and Guillera-Arroita, G. (2023). Flexible species distribution modelling methods perform well on spatially separated testing data. Glob. Ecol. Biogeogr. 32 (3), 369–383. doi:10.1111/geb.13639

Wadoux, A. M. J.-C., Heuvelink, G. B. M., de Bruin, S., and Brus, D. J. (2021). Spatial cross-validation is not the right way to evaluate map accuracy. Ecol. Model. 457, 109692. doi:10.1016/j.ecolmodel.2021.109692

Wenger, S. J., and Olden, J. D. (2012). Assessing transferability of ecological models: an underappreciated aspect of statistical validation. Methods Ecol. Evol. 3 (2), 260–267. doi:10.1111/j.2041-210X.2011.00170.x

Wilde, B., and Deutsch, C. V. (2006). Robust alternatives to the traditional variogram. CCG Annu. Rep. 116.

Yates, K. L., Bouchet, P. J., Caley, M. J., Mengersen, K., Randin, C. F., Parnell, S., et al. (2018). Outstanding challenges in the transferability of ecological models. Trends Ecol. and Evol. 33 (10), 790–802. doi:10.1016/j.tree.2018.08.001

Yuan, Q., Shen, H., Li, T., Li, Z., Li, S., Jiang, Y., et al. (2020). Deep learning in environmental remote sensing: achievements and challenges. Remote Sens. Environ. 241 (February), 111716. doi:10.1016/j.rse.2020.111716

Zhang, Y., Shen, F., Sun, X., and Tan, K. (2023). Marine big data-driven ensemble learning for estimating global phytoplankton group composition over two decades (1997–2020). Remote Sens. Environ. 294, 113596. doi:10.1016/j.rse.2023.113596

Keywords: machine learning, satellite, ocean color, random forest, Baltic Sea, accuracy, autocorrelation, supervised learning

Citation: Stock A (2025) Choosing blocks for spatial cross-validation: lessons from a marine remote sensing case study. Front. Remote Sens. 6:1531097. doi: 10.3389/frsen.2025.1531097

Received: 19 November 2024; Accepted: 13 March 2025;
Published: 21 March 2025.

Edited by:

Tongwen Li, Sun Yat-sen University, China

Reviewed by:

Thomas Groen, University of Twente, Netherlands
Qianqian Yang, Wuhan University, China

Copyright © 2025 Stock. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Andy Stock, anc@niva.no
