Modeling
Misc
- Packages
- {loopevd} - Performs extreme value analysis at multiple locations using functions from the ‘evd’ package. Supports both point-based and gridded input data using the ‘terra’ package, enabling flexible looping across spatial datasets for batch processing of generalised extreme value, Gumbel fits
- {spatialsample} - {tidymodels} cross-validation
- {spatialreg} - OG; Various methods of spatial regression, Bivand’s package
- Spatial Autoregressive Combined (SAC) models combine both a Spatial Autoregression (SAR) model and a Spatial Error (SEM) model
- {RandomForestsGLS} - Generalizaed Least Squares RF
- Takes into account the correlation structure of the data. Has functions for spatial RFs and time series RFs
- {vmsae} - Variational Autoencoded Multivariate Spatial Fay-Herriot model for efficiently estimating population parameters in small area estimation
- {GWnnegPCA} - Geographically Weighted Non-Negative Principal Components Analysis
- {varycoef} - Implements a maximum likelihood estimation (MLE) method for estimation and prediction of Gaussian process-based spatially varying coefficient (SVC) models
- {spatemR} - Generalized Spatial Autoregressive Models for Mean and Variance
- Extends classical methods like logistic Spatial Autoregresive Models (SAR), probit Spatial Autoregresive Models (SAR), and Poisson Spatial Autoregresive Models (SAR),
- Built on top of {gamlss} so splines can be included.
- {sfclust} - Bayesian Spatial Functional Clustering
- {spatialfusion} - Multivariate Analysis of Spatial Data Using a Unifying Spatial Fusion Framework
- {spNNGP} (Vignette) - Provides a suite of spatial regression models for Gaussian and non-Gaussian point-referenced outcomes that are spatially indexed. The package implements several Markov chain Monte Carlo (MCMC) and MCMC-free nearest neighbor Gaussian process (NNGP) models for inference about large spatial data.
- {spVarBayes} (Paper, Tutorial) - Provides scalable Bayesian inference for spatial data using Variational Bayes (VB) and Nearest Neighbor Gaussian Processes (NNGP). All methods are designed to work efficiently even with 100,000 spatial locations, offering a practical alternative to traditional MCMC.
- Models
- spVB-MFA, spVB-MFA-LR, and spVB-NNGP. All three use NNGP priors for the spatial random effects, but differ in the choice of the variational families.
- spVB-NNGP uses a correlated Gaussian variational family with a nearest neighbor-based sparse Cholesky factor of the precision matrix.
- spVB-MFA and spVB-MFA-LR use a mean-field approximation variational family.
- Additionally, spVB-MFA-LR applies a one-step linear response correction to spVB-MFA for improved estimation of posterior covariance matrix, mitigating the well-known variance underestimation issue of mean-field approximation
- Allows for covariates (i.e., fixed effects), enabling inference on the association of these with the outcome via the variational distribution of the regression coefficients. Variational distributions for the spatial variance and random error variance are also estimated, as in a fully Bayesian practice
- Simulation studies demonstrate that spVB-NNGP and spVB-MFA-LR outperform the existing variational inference methods in terms of both accuracy and computational efficiency. both spVB-NNGP and spVB-MFA-LR produce inference on the fixed and random effects that align closely with those obtained from the MCMC-based spNNGP, but at reduced computational cost.
- Models
- Notes from
- Resources
- Spatial Microsimulation with R - Teaches techniques for generating and analyzing spatial microdata to get the ‘best of both worlds’ from real individual and geographically-aggregated data
- Spatial Microdata Simulations are:
- Approximations of individual level data at high spatial resolution by combining individual and geographically aggregated datasets.: people allocated to places. (i.e. population synthesis)
- An approach to understanding multi level phenomena based on spatial microdata — simulated or real
- Geographically aggregated data are called constraints and the individual level survey data are called microdata
- Spatial Microdata Simulations are:
- Applied Machine Learning Using mlr3 in R, Ch.13.5
- Spatial Microsimulation with R - Teaches techniques for generating and analyzing spatial microdata to get the ‘best of both worlds’ from real individual and geographically-aggregated data
- Papers
- INLA-RF: A Hybrid Modeling Strategy for Spatio-Temporal Environmental Data
- Integrates a statistical spatio-temporal model with RF in an iterative two-stage framework.
- The first algorithm (INLA-RF1) incorporates RF predictions as an offset in the INLA-SPDE model, while the second (INLA-RF2) uses RF to directly correct selected latent field nodes. Both hybrid strategies enable uncertainty propagation between modeling stages, an aspect often overlooked in existing hybrid approaches.
- A Kullback-Leibler divergence-based stopping criterion.
- INLA-RF: A Hybrid Modeling Strategy for Spatio-Temporal Environmental Data
- CV
- Standard CV methods
- For clustered data and interpolative and predictive use cases, it generally leads to overoptimistic performance metrics when the data has significant spatial autocorrelation.
- For random and regular distributed data and interpolative and predictive use cases, correctly ranked the models even when the data has significant spatial autocorrelation.
- Spatial CV Types: Spatial Blocking, Clustering, Sampling-Intensity-Weighted CV, Model-based Geostatistical Approaches, k-fold nearest neighbour distance matching (kNNDM) CV
- Standard CV methods
- Ways to Account for Spatial Autocorrelation
- Add spatial proxies as predictors
- Models
- Generalized-Least-Squares-style Random Gorest (RF–GLS) - Relaxes the independence assumption of the RF model
- Accounts for spatial dependencies in several ways:
- Using a global dependency-adjusted split criterion and node representatives instead of the classification and regression tree (CART) criterion used in standard RF models
- Employing contrast resampling rather than the bootstrap method used in a standard RF model
- Applying residual kriging with covariance modeled using a Gaussian process framework
- From the spatial proxies paper:
- “Outperformed or was on a par with the best-performing standard RF model with and without proxies for all parameter combinations in both the interpolation and extrapolation areas of the simulation study.”
- “The most relevant performance gains when comparing RF–GLS to RFs with and without proxies were observed in the ‘autocorrelated error’ scenario for the interpolation area with regular and random samples, where the RMSE was substantially lower.”
- Accounts for spatial dependencies in several ways:
- Generalized-Least-Squares-style Random Gorest (RF–GLS) - Relaxes the independence assumption of the RF model
Spatial Proxies
- Spatial proxies are a set of spatially indexed variables with long or infinite autocorrelation ranges that are not causally related to the response.
- They are “proxy” since these predictors act as surrogates for unobserved factors that can cause residual autocorrelation, such as missing predictors or an autocorrelated error term.
- Types
- Geographical or Projected Coordinates
- Euclidean Distance Fields (EDFs)(i.e. distance-to variables?)
- Adding distance fields for each of the sampling locations (distance from one location to the other?), i.e. the number of added predictors equals the sample size.
- RFsp
- Tends to give worse results than coordinates when use of spatial proxies is inappropriate for either interpolation or extrapolation.
- But, together with EDFs, it is likely to yield the largest gains when the use of proxies is beneficial.
- Factors that could affect the effectiveness of spatial proxies
- Model Objectives
- Interpolation - There is a geographical overlap between the sampling and prediction areas
- The addition of spatial proxies to tree models such as RFs may be beneficial in terms of enhancing predictive accuracy, and they might outperform geostatistical or hybrid methods
- For Random or Regular spatial distributions of locations, the model should likely benefit, especially if there’s a large amount of spatial autocorrelation.
- For clustered spatial distributions of locations
- For weakly clustered data, strong spatial autocorrelation, and when there’s only a subset of informative predictors or no predictors at all, then models can expect some benefit.
- For other cases of weakly clustered data, there’s likely no affect or a little worse performance.
- For strongly clustered data , it probably worsens performance.
- For weakly clustered data, strong spatial autocorrelation, and when there’s only a subset of informative predictors or no predictors at all, then models can expect some benefit.
- Prediction (aka Extrapolation, Spatial-Model Transferability) - The model is applied to a new disjoint area
- The use of spatial proxies appears to worsen performance in all cases.
- Inference (aka Predictive Inference) - Knowledge discovery is the main focus
- Inclusion of spatial proxies has been discouraged
- Proxies typically rank highly in variable-importance statistics
- High-ranking proxies could hinder the correct interpretation of importance statistics for the rest of predictors, undermining the possibility of deriving hypotheses from the model and hampering residual analysis
- Interpolation - There is a geographical overlap between the sampling and prediction areas
- Large Residual Autocorrelation
- Better performance of models with spatial proxies is expected when residual dependencies are strong.
- Spatial Distribution
- Clustered samples frequently shown as potentially problematic for models with proxy predictors
- Including highly autocorrelated variables, such as coordinates with clustered samples, can result in spatial overfitting
- Model Objectives
Potential Modeling
- Similar to interpolation but keeps the original spatial units as interpretive framework. Hence, the map reader can still rely on a known territorial division to develop its analyses
- They produce understandable maps by smoothing complex spatial patterns
- They enrich variables with contextual spatial information.
- {potential}: Spatial interaction modeling via Stewart Potentials. Also capable of interpolation
- There are two main ways of modeling spatial interactions: the first one focuses on links between places (flows), the second one focuses on places and their influence at a distance (potentials).
- Use Cases
- Retail Market Analysis: A national retail chain planning expansion in a metropolitan area would use potential models to identify optimal new store locations. The analysis would calculate the “retail potential” at various candidate sites
- Based on:
- Population density in surrounding neighborhoods
- Income levels (as a proxy for purchasing power)
- Distance decay function (people are less likely to travel farther to shop)
- Existing competitor locations
- Calculate distance matrix from potential site to competitor location
- Result:
- A retail chain might discover that a location with moderate nearby population but situated along major commuter routes has higher potential than a site with higher immediate population but poor accessibility. T
- The potential model would create a continuous surface showing retail opportunity across the entire region, with the highest values indicating prime locations for new stores
- Based on:
- Transportation Planning: Regional transportation authorities use potential models to optimize transit networks.
- They would:
- Map population centers (origins) and employment centers (destinations)
- Calculate interaction potential between all origin-destination pairs
- Weight these by population size and employment opportunities
- Apply distance decay functions reflecting willingness to commute
- Result:
- The resulting model might reveal corridors with high interaction potential but inadequate transit service. For instance, a growing suburban area might show strong potential interaction with a developing business district, but existing transit routes might follow outdated commuting patterns.
- The potential model helps planners identify where new bus routes, increased service frequency, or even rail lines would best serve actual travel demand.
- They would:
- Hospital Accessibility Study: Public health departments use potential models to identify healthcare deserts.
- They would:
- Map existing hospital and clinic locations with their capacity (beds, specialties)
- Calculate accessibility potential across the region
- Weight by demographic factors (elderly population, income levels)
- Apply appropriate distance decay functions (emergency vs. specialist care have different distance tolerances)
- Result:
- The model might reveal that while a rural county appears to have adequate healthcare on paper (population-to-provider ratio), the spatial distribution creates areas with extremely low healthcare potential.
- This analysis could justify mobile clinics, telemedicine investments, or targeted subsidies for new healthcare facilities in specific locations to maximize population benefi
- They would:
- Retail Market Analysis: A national retail chain planning expansion in a metropolitan area would use potential models to identify optimal new store locations. The analysis would calculate the “retail potential” at various candidate sites
- Comparisons ({potential}article)
- GDP per capita (cloropleth)
- Typical cloropleth at the municipality level
- Values have been binned
- Potential GDP per Capita (interaction)
- Stewart Potentials have smoothed the values
- Municipality boundaries still intact, so you could perform an analysis based on these GDP regions
- Smoothed GDP per Capita (interpolation)
- Similar results as the interaction model except there are no boundaries
- GDP per capita (cloropleth)
Interpolation
Misc
- The process of using points with known values to estimate values at other points. In GIS applications, spatial interpolation is typically applied to a raster with estimates made for all cells. Spatial interpolation is therefore a means of creating surface data from sample points.
- Packages
- {gstat} - Has various interpolation methods.
- {intamap} (Paper) - Procedures for Automated Interpolation (Edzer Pebesma et al)
- One needs to choose a model of spatial variability before one can interpolate, and experts disagree on which models are most useful.
- {interpolateR} - Includes a range of methods, from traditional techniques to advanced machine learning approaches
- Inverse Distance Weighting (IDW): A widely used method that assigns weights to observations based on their inverse distance.
- Cressman Objective Analysis Method (Cressman): A powerful approach that iteratively refines interpolated values based on surrounding observations.
- RFmerge: A Random Forest-based method designed to merge multiple precipitation products and ground observations to enhance interpolation accuracy.
- RFplus: A novel approach that combines Random Forest with Quantile Mapping to refine interpolated estimates and minimize bias.
- {mbg} - Interface to run spatial machine learning models and geostatistical models that estimate a continuous (raster) surface from point-referenced outcomes and, optionally, a set of raster covariates
- {BMEmapping} - Spatial Interpolation using Bayesian Maximum Entropy (BME)
- Ssupports optimal estimation using both precise (hard) and uncertain (soft) data, such as intervals or probability distributions
- spatialize (Paper) - Implements ensemble spatial interpolation, a novel method that combines the simplicity of basic interpolation methods with the power of classical geoestatistical tools
- Stochastic modelling and ensemble learning, making it robust, scalable and suitable for large datasets.
- Provides a powerful framework for uncertainty quantification, offering both point estimates and empirical posterior distributions.
- It is implemented in Python 3.x, with a C++ core for improved performance.
- Measurements can have strong regional variance, so the geographical distribution of measurements can have a strong influence on statistical estimates.
- Example: Temperature
- Two different geographical distributions of sensors
- A concentration of sensors in North can lead to a cooler average regional temperature and vice versa for the South.
- Distribution of temperatures across the region for 1 day.
- With this much variance in temperature, a density of sensors in one area can distort the overall average.
- Interpolation evens out the geographical distribution of measurments
- Two different geographical distributions of sensors
- Example: Temperature
Kriging
- Considers not only the distances but also the other variables that have a linear relationship with the estimation variables
- Misc
- Packages
- {psgp} - Implements projected sparse Gaussian process Kriging
- Resources
- Spatial Sampling With R: Chapter 21 Introduction to kriging
- In Ordinary Kriging (OK) it is assumed that the mean of the study variable is the same everywhere
- Packages
- Uses a correlated Gaussian process to guess at values between data points
- Uncorrelated - white noise
- Correlated - smooth
- The closer in distance two points are to each other, the more likely they are to have a similar value (i.e. geospatially correlated)
- Example: Temperature
- Fewer known points means greater uncertainty
- Inputs:
- The measured values at the sampling points,
- The geometric coordinates of the sampling points,
- The geometric coordinates of the target points to interpolate,
- The “calibrated” probabilistic model, with the spatial correlation obtained by data
- Outputs:
- The estimated values at the target points,
- The estimated uncertainty (variance) at the target points.
- Inputs:
- Components
Experimental Variogram - A semivariogram that models the dissimilarity of the study variable at two locations, as a function of the vector.
\[ \gamma(h) = \frac{1}{2n} \sum (z(x_i) - z(x_i + h)) \]
- \(h\): Geographical distance between two observed points
- \(z\): Function calculating the variable you’re interested in. (e.g. Temperature)
- \(x_i\): Location
Theoretical Variogram - Used to model the experimental variogram
- Options
Gaussian
\[ \gamma' = p(1 - e^{-(d^2/ (\frac{4}{7}r)^2)}) + n \]
Exponential
\[ \gamma' = p(1- e^{-3d / r}) + n \]
Spherical
\[ \gamma' = \left\{ \begin{array}{lcl} p\left(\frac{3d}{2r} - \frac{d^3}{2r^3}\right) +n & \mbox{if} \; d \le r \\ p + n & \mbox{if} \; d \gt r \end{array}\right. \]
- Where:
- \(p\): Partial Sill
- \(d\): Distance (same as \(h\))
- \(n\): Nugget
- \(r\): Range
- Options
Process: Optimize the parameters of the parameters of theoretical variogram to fit to the experimental one. Then predict values using the estimated theoretical variogram.