For clustered data and interpolative and predictive use cases, it generally leads to overoptimistic performance metrics when the data has significant spatial autocorrelation.
For randomly and regularly distributed data and interpolative and predictive use cases, it correctly ranked the models even when the data had significant spatial autocorrelation.
Generalized-Least-Squares-style Random Forest (RF–GLS) - Relaxes the independence assumption of the RF model
Accounts for spatial dependencies in several ways:
Using a global dependency-adjusted split criterion and node representatives instead of the classification and regression tree (CART) criterion used in standard RF models
Employing contrast resampling rather than the bootstrap method used in a standard RF model
Applying residual kriging with covariance modeled using a Gaussian process framework
From the spatial proxies paper:
“Outperformed or was on a par with the best-performing standard RF model with and without proxies for all parameter combinations in both the interpolation and extrapolation areas of the simulation study.”
“The most relevant performance gains when comparing RF–GLS to RFs with and without proxies were observed in the ‘autocorrelated error’ scenario for the interpolation area with regular and random samples, where the RMSE was substantially lower.”
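As a rough sketch of what a "dependency-adjusted split criterion and node representatives" means, the CART criterion can be written as an ordinary least squares loss and then swapped for a generalized least squares loss. The notation here is assumed for illustration and is not taken from these notes: y is the response vector, Z is the n × K leaf-membership matrix for a candidate partition, β holds the K node representatives, and Σ is the working covariance of the spatially dependent errors.

```latex
% CART / OLS loss for a candidate partition (observations treated as independent)
L_{\mathrm{OLS}}(\boldsymbol{\beta})
  = (\mathbf{y} - \mathbf{Z}\boldsymbol{\beta})^{\top}
    (\mathbf{y} - \mathbf{Z}\boldsymbol{\beta})

% GLS-style loss: residuals are weighted by Q = \Sigma^{-1}
L_{\mathrm{GLS}}(\boldsymbol{\beta})
  = (\mathbf{y} - \mathbf{Z}\boldsymbol{\beta})^{\top}
    \mathbf{Q}\,
    (\mathbf{y} - \mathbf{Z}\boldsymbol{\beta}),
  \qquad \mathbf{Q} = \boldsymbol{\Sigma}^{-1}

% Node representatives: the GLS estimate replaces the per-node means
\hat{\boldsymbol{\beta}}
  = (\mathbf{Z}^{\top}\mathbf{Q}\,\mathbf{Z})^{-1}
    \mathbf{Z}^{\top}\mathbf{Q}\,\mathbf{y}
```

With Q equal to the identity matrix, the estimate reduces to the per-node means of a standard regression tree. With a non-diagonal Q, the loss no longer separates node by node, which is why the criterion is described as global.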
Spatial Proxies
Spatial proxies are a set of spatially indexed variables with long or infinite autocorrelation ranges that are not causally related to the response.
They are called “proxies” because these predictors act as surrogates for unobserved factors that can cause residual autocorrelation, such as missing predictors or an autocorrelated error term.
Adding a distance field for each of the sampling locations (presumably the distance from every location to that sampling location), i.e. the number of added predictors equals the sample size.
RFsp
Tends to give worse results than coordinates when use of spatial proxies is inappropriate for either interpolation or extrapolation.
However, together with Euclidean distance fields (EDFs), it is likely to yield the largest gains when the use of proxies is beneficial.
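A minimal sketch of how RFsp-style distance fields could be built, assuming plain Euclidean distance on projected coordinates. The function name `distance_fields` and the sample coordinates are hypothetical. The point it illustrates is that each sampling location contributes one predictor column, so the number of added predictors equals the sample size.

```python
import math


def distance_fields(sample_xy, target_xy):
    """Build distance-field predictors.

    Returns one row per target location; column j holds the Euclidean
    distance from that target to sampling location j.
    """
    return [[math.dist(t, s) for s in sample_xy] for t in target_xy]


# Hypothetical sampling locations as (x, y) pairs in projected units.
samples = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

# Proxies for the training rows: an n x n matrix with a zero diagonal.
train_proxies = distance_fields(samples, samples)

# Proxies for a new prediction location use the same n columns.
new_proxies = distance_fields(samples, [(3.0, 0.0)])

for row in train_proxies:
    print([round(d, 2) for d in row])
```

These columns would then be appended to the ordinary covariates before fitting the random forest. Prediction locations need the same columns, computed against the original sampling locations.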
Factors that could affect the effectiveness of spatial proxies
Model Objectives
Interpolation - There is a geographical overlap between the sampling and prediction areas
The addition of spatial proxies to tree models such as RFs may be beneficial in terms of enhancing predictive accuracy, and they might outperform geostatistical or hybrid methods
For random or regular spatial distributions of locations, the model will likely benefit, especially when spatial autocorrelation is strong.
For clustered spatial distributions of locations
For weakly clustered data with strong spatial autocorrelation, and with only a subset of informative predictors or no predictors at all, models can expect some benefit.
For other cases of weakly clustered data, there’s likely no effect or slightly worse performance.
For strongly clustered data, it probably worsens performance.
Prediction (aka Extrapolation, Spatial-Model Transferability) - The model is applied to a new disjoint area
The use of spatial proxies appears to worsen performance in all cases.
Inference (aka Predictive Inference) - Knowledge discovery is the main focus
Inclusion of spatial proxies has been discouraged
Proxies typically rank highly in variable-importance statistics
High-ranking proxies could hinder the correct interpretation of importance statistics for the remaining predictors, undermining the possibility of deriving hypotheses from the model and hampering residual analysis
Large Residual Autocorrelation
Better performance of models with spatial proxies is expected when residual dependencies are strong.
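One common way to gauge how strong the residual dependencies are is Moran's I computed on the model residuals. The notes do not name a statistic, so this choice is an assumption. The sketch below also assumes binary neighbor weights (1 if two locations lie within `max_dist`, else 0); the function name and toy data are hypothetical.

```python
import math


def morans_i(coords, values, max_dist):
    """Moran's I with binary distance-band weights.

    w_ij = 1 when i != j and the two locations are within max_dist,
    otherwise 0. Values near +1 suggest strong positive autocorrelation,
    values near 0 suggest none, and negative values suggest alternation.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]

    cross = 0.0   # sum of w_ij * z_i * z_j
    w_sum = 0.0   # sum of all weights
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= max_dist:
                cross += z[i] * z[j]
                w_sum += 1.0

    ss = sum(d * d for d in z)
    if w_sum == 0 or ss == 0:
        raise ValueError("no neighbor pairs or zero variance")
    return (n / w_sum) * (cross / ss)


# Toy residuals at four locations along a line, one unit apart.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
clumped = morans_i(coords, [1.0, 1.0, -1.0, -1.0], max_dist=1.0)
alternating = morans_i(coords, [1.0, -1.0, 1.0, -1.0], max_dist=1.0)
print(clumped, alternating)
```

In this toy case, the clumped residuals give a positive value and the alternating residuals give a negative one. A clearly positive value on the residuals of a proxy-free model would indicate the kind of leftover spatial structure that proxies are meant to absorb.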
Spatial Distribution
Clustered samples are frequently shown to be potentially problematic for models with proxy predictors
Including highly autocorrelated variables, such as coordinates with clustered samples, can result in spatial overfitting