How does spatial or spatiotemporal correlation affect cross-validation results?

I attended the OpenGeoHub summer school (https://opengeohub.org/summer_school_2020) from 2018 to 2019, where a lecture stated that spatial correlation leads to overly optimistic cross-validation results. Ever since, I have appreciated the thoughts it prompted about the effects of spatial and spatiotemporal correlation in ensemble tree-based (random forest or boosting) classification methods. Is it OK to use it for spatial prediction? Is it a problem when the covariates (predictors) are spatially correlated, or when the response is spatially correlated, or both?

My answer is: it is not OK to use spatial validation! And it is not a problem if the samples are spatially correlated!

There is a very important fact related to the problem: space is nothing special. It is just one of the features, and the so-called “dependency” exists in every dimension, not just the spatial one. Once this is understood, the reason for my answer is almost clear.

In the lecture from the OpenGeoHub summer school, the example does not illustrate the problem very well, because it used latitude and longitude as predictors. Although redundant predictor variables usually do not hurt random forest or boosting methods, these two variables are a common pitfall when thrown into a model. The reason is that they can easily cause extrapolation! Think about it: they register exactly the area of each class. Pixels are commonly clustered within a class and lie far away from the pixels of another class. The pixels that make up a cropland are in general far away from the pixels that make up a forest. When latitude and longitude are used, the model is going to say: OK, this area is crop and the other is forest, no need to use any other predictor variables.
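
To make the pitfall concrete, here is a minimal sketch in plain Python. The pixel coordinates, the class labels, and the helper fit_stump are all hypothetical, made up for illustration; a single threshold on longitude stands in for what a tree could learn. It shows a split on longitude alone fitting the training pixels perfectly, and then mislabelling a location outside the area it has registered.

```python
def fit_stump(values, labels):
    """Return (threshold, left_label, right_label, accuracy) of the best single split."""
    classes = sorted(set(labels))
    ordered = sorted(set(values))
    best = None
    for lo, hi in zip(ordered, ordered[1:]):
        threshold = (lo + hi) / 2
        for left in classes:
            for right in classes:
                hits = sum(
                    (left if v <= threshold else right) == lab
                    for v, lab in zip(values, labels)
                )
                acc = hits / len(labels)
                if best is None or acc > best[3]:
                    best = (threshold, left, right, acc)
    return best


# Hypothetical pixels as (longitude, latitude, class):
# cropland clustered in the west, forest clustered in the east.
pixels = [
    (4.1, 52.0, "crop"), (4.2, 52.1, "crop"), (4.3, 52.0, "crop"),
    (6.8, 52.1, "forest"), (6.9, 52.0, "forest"), (7.0, 52.2, "forest"),
]
lons = [p[0] for p in pixels]
labels = [p[2] for p in pixels]

threshold, west_label, east_label, train_acc = fit_stump(lons, labels)


def predict(lon):
    """Classify a location using longitude only."""
    return west_label if lon <= threshold else east_label


print("training accuracy with longitude only:", train_acc)
# A cropland pixel further east than any training pixel is still called forest,
# because the model only knows which side of the threshold it is on.
print("prediction at longitude 7.5:", predict(7.5))
```

No spectral or environmental covariate is needed to separate the two classes here, which is exactly why the coordinates look so attractive to the model.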

The solution it gives is to use spatial validation, which means dividing the data into spatial blocks and, each time, using one block of data for validation. In this way the model has a harder time predicting, because it has never seen the test area before, and the longitude and latitude of course won’t know what is there. They can only make the best guess based on the nearby areas.
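
The block-wise splitting described above can be sketched as follows. This is only an illustration under simple assumptions (square blocks on a regular grid, made-up coordinates, hypothetical function names), not the procedure used in the lecture or in any particular package.

```python
from collections import defaultdict


def block_id(x, y, block_size):
    """Map a coordinate to the index of the square block that contains it."""
    return (int(x // block_size), int(y // block_size))


def spatial_block_folds(coords, block_size):
    """Yield (block, train_idx, test_idx), holding out one spatial block at a time."""
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        blocks[block_id(x, y, block_size)].append(i)
    for held_out in sorted(blocks):
        test_idx = blocks[held_out]
        train_idx = [i for b in sorted(blocks) if b != held_out for i in blocks[b]]
        yield held_out, train_idx, test_idx


# Hypothetical sample locations in a 10 x 10 study area.
coords = [(0.5, 0.5), (1.5, 0.2), (9.0, 9.5), (8.2, 8.8), (0.1, 9.9), (9.7, 0.3)]

folds = list(spatial_block_folds(coords, block_size=5.0))
for block, train_idx, test_idx in folds:
    print(block, "train:", train_idx, "test:", test_idx)
```

Each fold trains on every block except one and tests on the block left out, so no test sample shares a block with a training sample.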

Then, in the lecture, a solution that can automatically get rid of this kind of predictor is given: use spatial validation in the model training process, and select only 2 variables each time when growing the tree. This, of course, has a high potential of getting rid of longitude and latitude, and also elevation, for the same reason.

However, until now we have not talked about the influence of correlated data. The key point is that we do not want the training and test datasets to overlap.
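
One simple way to look at how close the test data sit to the training data is the distance from each test sample to its nearest training sample; a distance of zero would mean the two sets overlap. The sketch below is an assumption-laden illustration (a made-up regular grid, a hypothetical helper mean_gap, one random hold-out and one block hold-out of equal size), not a method taken from the lecture.

```python
import math
import random


def nearest_train_distance(test_pt, train_pts):
    """Distance from one test location to the closest training location."""
    return min(math.dist(test_pt, p) for p in train_pts)


def mean_gap(coords, test_idx):
    """Mean nearest-training-sample distance over all test samples."""
    test_set = set(test_idx)
    train_pts = [c for i, c in enumerate(coords) if i not in test_set]
    gaps = [nearest_train_distance(coords[i], train_pts) for i in test_idx]
    return sum(gaps) / len(gaps)


# Hypothetical 10 x 10 grid of sample locations with unit spacing.
coords = [(float(x), float(y)) for x in range(10) for y in range(10)]

# Random hold-out: 25 locations drawn anywhere in the study area.
rng = random.Random(42)
random_test = rng.sample(range(len(coords)), 25)

# Spatial hold-out: the 5 x 5 block in the lower-left corner (also 25 locations).
block_test = [i for i, (x, y) in enumerate(coords) if x < 5 and y < 5]

random_gap = mean_gap(coords, random_test)
block_gap = mean_gap(coords, block_test)
print("mean gap, random hold-out:", random_gap)
print("mean gap, block hold-out:", block_gap)
```

With a random hold-out the test samples stay next to training samples, while a block hold-out pushes them further away, which is the trade-off between correlated test data and extrapolation discussed in this post.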

Let’s first look at a spatially correlated response. Another study, by Brenning (2012), implemented in https://mlr-org.com/docs/2018-07-25-visualize-spatial-cv/, stresses the correlated response. Notably, it also used spatial correlation. This paper, as well as the implementation, again does not give a very good quantification of how exactly, and to what extent, the correlation in the response causes a biased error estimate.

To delve into the effects of spatiotemporal correlation on an ensemble tree-based method, say random forest for regression, we can break the procedure into three parts. The data are split into a training set and a test set; the training set is used in parts 1 and 2, and the test set in part 3:

  1. model fitting using bootstrapped samples from the training set,
  2. obtaining the OOB (out of bag) error from the rest of the training set,
  3. cross-validation (CV).

  • For 1: Think about growing a single tree using highly correlated data: what problems will it bring? 1) Universal kriging and spatial autoregressive models are concerned with spatially correlated residuals, but the tree model is very flexible, so this may turn out to be a smaller concern. 2) The tree model looks for the best split point by means of (y - y_mean)^2 within each chunk divided by the covariates; if lots of points are sampled close together, the split point may be missed and the best variables may not be chosen. This means that even though we grow hundreds or thousands of trees, it might be beneficial for the data to be sampled more randomly over space.
  • For 2: If the bootstrap samples used for training are correlated with the OOB samples, the OOB error is not accurate. This, however, is less important, because in practice we rely on CV to evaluate the model and less on the OOB error, as the OOB error is for each tree and CV is for the forest.
  • For 3: It is absolutely normal to have correlations between training and test data! Otherwise we have the extrapolation problem! In other words, the test and training data may come from different distributions! While the mentioned studies are happy that the CV result now looks more realistic because it drops, they may be unfairly asking the model to predict things it has never seen before. Although we use many folds, remember that we take the average. The harm that spatial validation can do may be much greater than the harm from a spatially correlated response.
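
Parts 1 and 2 can be sketched in a few lines of plain Python. This is a toy illustration, not a random forest implementation: the data are made up, the names sse, best_split and bootstrap_and_oob are hypothetical, and only one covariate is searched. It shows the squared-deviation split search on evenly spread samples versus samples clustered on one side of a step in the response, followed by one bootstrap draw with its out-of-bag indices.

```python
import random


def sse(ys):
    """Sum of squared deviations from the mean, i.e. the sum of (y - y_mean)^2."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(xs, ys):
    """Return (threshold, total_sse) of the split on one covariate with the smallest within-chunk SSE."""
    best = None
    ordered = sorted(set(xs))
    for lo, hi in zip(ordered, ordered[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


def bootstrap_and_oob(n, rng):
    """Draw n indices with replacement; indices never drawn are out of bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob


# Hypothetical covariate and response with a step between x = 4 and x = 6.
xs = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
ys = [10.0, 10.0, 10.0, 10.0, 20.0, 20.0, 20.0, 20.0]
spread_split = best_split(xs, ys)
print("evenly spread samples:", spread_split)

# The same process, but every sample is clustered on one side of the step,
# so no candidate threshold can reveal it.
clustered_xs = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7]
clustered_ys = [10.0] * 8
clustered_split = best_split(clustered_xs, clustered_ys)
print("clustered samples:", clustered_split)

# Part 2: one bootstrap sample and the samples left out of it.
rng = random.Random(0)
in_bag, oob = bootstrap_and_oob(len(xs), rng)
print("in bag:", in_bag, "out of bag:", oob)
```

In the clustered case every candidate split looks equally good, so the threshold that is returned is arbitrary and says nothing about where the response actually changes.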

References:

Brenning, A. (2012). Spatial cross-validation and bootstrap for the assessment of prediction rules in remote sensing: The R package sperrorest. In 2012 IEEE International Geoscience and Remote Sensing Symposium. IEEE. https://doi.org/10.1109/igarss.2012.6352393