Modeling

Misc

  • Packages
    • {spatialsample} - {tidymodels} cross-validation
    • {spatialreg} - Various methods of spatial regression, Bivand’s package
      • Spatial Autoregressive Combined (SAC) models combine both a Spatial Autoregression (SAR) model and a Spatial Error (SEM) model
    • {RandomForestsGLS} - Generalized Least Squares RF
      • Takes into account the correlation structure of the data. Has functions for spatial RFs and time series RFs
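For reference, the SAC specification mentioned under {spatialreg} is usually written as a spatial lag on the response plus a spatially autocorrelated error. The notation below is an assumption added for illustration and is not from these notes: W is a spatial weights matrix, ρ is the lag coefficient, and λ is the error coefficient. The same W is used in both equations only for simplicity; the two weights matrices may differ.

```latex
\begin{aligned}
y &= \rho W y + X\beta + u \\
u &= \lambda W u + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I)
\end{aligned}
```

Setting λ = 0 leaves the SAR (spatial lag) model, and setting ρ = 0 leaves the SEM (spatial error) model, which is the sense in which SAC combines the two.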
  • Notes from
  • CV
    • Standard CV methods
      • For clustered data in interpolative and predictive use cases, standard CV generally yields overoptimistic performance metrics when the data has significant spatial autocorrelation.
      • For randomly and regularly distributed data in interpolative and predictive use cases, standard CV correctly ranked the models even when the data had significant spatial autocorrelation.
    • Spatial CV Types: Spatial Blocking, Clustering, Sampling-Intensity-Weighted CV, Model-based Geostatistical Approaches, k-fold nearest neighbour distance matching (kNNDM) CV
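As a rough illustration of the first spatial CV type listed above (spatial blocking), here is a minimal sketch in plain Python. It assumes square grid blocks and a round-robin assignment of blocks to folds; the function name `spatial_block_folds` and its parameters are hypothetical and do not mirror the {spatialsample} API.

```python
import random
from collections import defaultdict

def spatial_block_folds(coords, block_size, n_folds, seed=0):
    """Assign point indices to CV folds so that each grid block stays in one fold."""
    # Group point indices by the square grid block that contains them.
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        blocks[(int(x // block_size), int(y // block_size))].append(i)
    # Shuffle the blocks, then deal them out to folds round-robin.
    block_ids = sorted(blocks)
    random.Random(seed).shuffle(block_ids)
    folds = [[] for _ in range(n_folds)]
    for k, block_id in enumerate(block_ids):
        folds[k % n_folds].extend(blocks[block_id])
    return folds

# Simulated sampling locations in a 100 x 100 study area (illustrative data only).
rng = random.Random(42)
coords = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(200)]
folds = spatial_block_folds(coords, block_size=25.0, n_folds=4)

for k, test_idx in enumerate(folds):
    # Everything outside the held-out fold is used for training.
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    print(f"fold {k}: train={len(train_idx)} test={len(test_idx)}")
```

Because whole blocks are held out together, nearby points are less likely to be split between training and test sets than under random k-fold CV. The block size is a tuning choice that this sketch does not address.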
  • Ways to Account for Spatial Autocorrelation
    • Add spatial proxies as predictors
    • Models
      • Generalized-Least-Squares-style Random Forest (RF–GLS) - Relaxes the independence assumption of the RF model
        • Accounts for spatial dependencies in several ways:
          • Using a global dependency-adjusted split criterion and node representatives instead of the classification and regression tree (CART) criterion used in standard RF models
          • Employing contrast resampling rather than the bootstrap method used in a standard RF model
          • Applying residual kriging with covariance modeled using a Gaussian process framework
        • From the spatial proxies paper:
          • “Outperformed or was on a par with the best-performing standard RF model with and without proxies for all parameter combinations in both the interpolation and extrapolation areas of the simulation study.”
          • “The most relevant performance gains when comparing RF–GLS to RFs with and without proxies were observed in the ‘autocorrelated error’ scenario for the interpolation area with regular and random samples, where the RMSE was substantially lower.”

Spatial Proxies

  • Spatial proxies are a set of spatially indexed variables with long or infinite autocorrelation ranges that are not causally related to the response.
  • They are called “proxies” because these predictors act as surrogates for unobserved factors that can cause residual autocorrelation, such as missing predictors or an autocorrelated error term.
  • Types
    • Geographical or Projected Coordinates
    • Euclidean Distance Fields (EDFs) (i.e. distance-to variables?)
    • RFsp
      • Adds a distance field for each sampling location (the distance from every location to that sampling location), i.e. the number of added predictors equals the sample size.
      • Tends to give worse results than coordinates when use of spatial proxies is inappropriate for either interpolation or extrapolation.
      • But, together with EDFs, it is likely to yield the largest gains when the use of proxies is beneficial.
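The per-sample distance fields described in the Types list above (one added predictor per sampling location) can be sketched as follows. This is a minimal stdlib-only illustration, not a package implementation; `distance_fields` is a hypothetical helper, and the coordinates are made up.

```python
import math

def distance_fields(sample_coords, target_coords):
    """One row per target location; column j is the distance to sampling location j."""
    return [[math.dist(t, s) for s in sample_coords] for t in target_coords]

# Three made-up sampling locations in projected coordinates.
samples = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

# Proxy predictors at the sampling locations themselves: an n x n matrix.
proxies = distance_fields(samples, samples)

# Coordinates plus distance fields, ready to append to the other predictors.
augmented = [list(xy) + row for xy, row in zip(samples, proxies)]
print(augmented[0])
```

The number of proxy columns grows with the sample size, which is one practical cost of this approach compared with using coordinates alone. For prediction, the same function would be called with the prediction locations as `target_coords`.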
  • Factors that could affect the effectiveness of spatial proxies
    • Model Objectives
      • Interpolation - There is a geographical overlap between the sampling and prediction areas
        • The addition of spatial proxies to tree models such as RFs may be beneficial in terms of enhancing predictive accuracy, and they might outperform geostatistical or hybrid methods
        • For Random or Regular spatial distributions of locations, the model should likely benefit, especially if there’s a large amount of spatial autocorrelation.
        • For clustered spatial distributions of locations
          • For weakly clustered data with strong spatial autocorrelation, and when there’s only a subset of informative predictors or no predictors at all, models can expect some benefit.
            • For other cases of weakly clustered data, there’s likely no effect or slightly worse performance.
          • For strongly clustered data, it probably worsens performance.
      • Prediction (aka Extrapolation, Spatial-Model Transferability) - The model is applied to a new disjoint area
        • The use of spatial proxies appears to worsen performance in all cases.
      • Inference (aka Predictive Inference) - Knowledge discovery is the main focus
        • Inclusion of spatial proxies has been discouraged
        • Proxies typically rank highly in variable-importance statistics
        • High-ranking proxies could hinder the correct interpretation of importance statistics for the rest of predictors, undermining the possibility of deriving hypotheses from the model and hampering residual analysis
    • Large Residual Autocorrelation
      • Better performance of models with spatial proxies is expected when residual dependencies are strong.
    • Spatial Distribution
      • Clustered samples are frequently shown to be potentially problematic for models with proxy predictors
      • Including highly autocorrelated variables, such as coordinates with clustered samples, can result in spatial overfitting
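Since the strength of residual autocorrelation matters for whether proxies help, one common way to quantify it is Moran's I computed on model residuals. The notes do not prescribe a measure, so this is an assumption added for illustration: a minimal sketch with binary distance-threshold weights, where `morans_i` and `max_dist` are hypothetical names.

```python
import math

def morans_i(values, coords, max_dist):
    """Moran's I with binary weights: w_ij = 1 if 0 < dist(i, j) <= max_dist."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    cross = 0.0   # sum of w_ij * z_i * z_j over neighbouring pairs
    s0 = 0.0      # sum of all weights
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= max_dist:
                cross += z[i] * z[j]
                s0 += 1.0
    return (n / s0) * cross / sum(zi * zi for zi in z)

# Six made-up locations on a line, one unit apart.
coords = [(float(i), 0.0) for i in range(6)]

# Residuals that vary smoothly in space: positive autocorrelation.
i_smooth = morans_i([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], coords, max_dist=1.0)

# Residuals that alternate in sign between neighbours: negative autocorrelation.
i_alternating = morans_i([1.0, -1.0, 1.0, -1.0, 1.0, -1.0], coords, max_dist=1.0)

print(i_smooth, i_alternating)
```

Values well above zero suggest neighbouring residuals are similar, which is the situation in which the notes expect proxies to help most. In practice the neighbour definition (here a fixed distance threshold) strongly affects the result.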