Point Patterns
Misc
- Notes from Ch.9 - Ch.15, An Introduction to Spatial Data Analysis and Statistics: A Course in R
- Packages
- {spatstat} - Spatial Point Pattern Analysis, Model-Fitting, Simulation, Tests
- CRAN page has some vignettes, but the manual is a kind of cheatsheet that lists names of functions that perform certain spatial operations
- Loading {spatstat} loads Spatstat-verse:
- {spatstat.data} - Datasets
- {spatstat.univar} - Estimation of one-dimensional probability distributions including kernel density estimation, weighted empirical cumulative distribution functions, Kaplan-Meier and reduced-sample estimators for right-censored data, heat kernels, kernel properties, quantiles and integration.
- {spatstat.geom} - Defines spatial data types and supports geometrical operations on them. Data types include point patterns, windows (domains), pixel images, line segment patterns, tessellations and hyperframes
- {spatstat.random} - Generates random spatial patterns of points
- {spatstat.explore} - Exploratory data analysis and nonparametric analysis of spatial data, mainly spatial point patterns
- {spatstat.model} - Parametric statistical modelling and inference for spatial data, mainly spatial point patterns
- {spatstat.linnet} - Defines types of spatial data on a linear network and provides functionality for geometrical operations, data analysis and modelling of data on a linear network
- Resources
- Spatial Point Patterns: Methodology and Applications with R (R >> Documents >> Geospatial)
- An Introduction to Spatial Data Analysis and Statistics: A Course in R
Terms
- Intensity - The expected number of events per unit area — conventionally denoted by \(\lambda\). In most cases the process is not known, so its intensity cannot be directly measured.
- Density - Empirical estimate of Intensity — \(\hat \lambda = n / a\) , where \(a\) is the area of the region
- Quadrats - Cells of a gridded area representing subregions. Useful for analyzing how density varies across a region
- Rules of Thumb for choosing the number of quadrats
- Each quadrat should have a minimum of two events
- Formula for the quadrat size \(Q\) based on the area (\(A\)) and number of events (\(N\)) — at this size each quadrat contains two events on average, which also fixes the number of quadrats at \(A/Q = N/2\) (see the worked example after this list)
\[ Q = \frac{2A}{N} \]
- Regularity (or Dispersion) - A pattern in which points tend to be located at similar distances from each other.
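As a quick worked example of these quantities (a sketch using the numbers from the PointPatterns data in the Basics section below: \(n = 60\) events per pattern in a unit-square window):

```r
n <- 60  # events in one pattern
a <- 1   # area of the unit-square window

# density: the empirical estimate of intensity
lambda_hat <- n / a  # 60 events per square unit

# rule-of-thumb quadrat size: Q = 2A/N, i.e. ~2 events per quadrat on average
q_size <- 2 * a / n  # ~0.033 square units per quadrat
a / q_size           # => 30 quadrats, e.g. roughly a 5 x 6 grid
```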
Basics
Example 1: Plot a point pattern with binned counts
```r
pacman::p_load(dplyr, ggplot2, spatstat)

data("PointPatterns", package = "isdas")

summary(PointPatterns)
##        x                 y               Pattern   
##  Min.   :0.0169   Min.   :0.005306   Pattern 1:60  
##  1st Qu.:0.2731   1st Qu.:0.289020   Pattern 2:60  
##  Median :0.4854   Median :0.550000   Pattern 3:60  
##  Mean   :0.5074   Mean   :0.538733   Pattern 4:60  
##  3rd Qu.:0.7616   3rd Qu.:0.797850                 
##  Max.   :0.9990   Max.   :0.999808                 

ggplot() +
  geom_bin2d(data = filter(PointPatterns, Pattern == "Pattern 1"),
             aes(x = x, y = y),
             binwidth = c(0.25, 0.25)) +
  geom_point(data = filter(PointPatterns, Pattern == "Pattern 1"),
             aes(x = x, y = y)) +
  scale_fill_distiller(palette = "RdBu") +
  coord_fixed()
```
- `geom_bin2d` is called to plot a map of counts of events in the space defined by the bins.
- PointPatterns contains x, y coordinates that range from 0 to 1 and a categorical variable Pattern indicating each of the four different density patterns
Example 2: Create a ppp object
```r
# define a window
wnd <- owin(c(0, 1), c(0, 1))
ppp1 <- as.ppp(PointPatterns, wnd)

summary(ppp1)
## Marked planar point pattern: 240 points
## Average intensity 240 points per square unit
##
## Coordinates are given to 16 decimal places
##
## Multitype:
##           frequency proportion intensity
## Pattern 1        60       0.25        60
## Pattern 2        60       0.25        60
## Pattern 3        60       0.25        60
## Pattern 4        60       0.25        60
##
## Window: rectangle = [0, 1] x [0, 1] units
## Window area = 1 square unit

# plot a specific category of point
plot(split.ppp(ppp1)$`Pattern 3`)
```
- The window defined in `owin` should define a region for analysis that is consistent with the pattern of interest
- `ppp` (planar point pattern) is the fundamental {spatstat} object
- frequency is the number of points in that category (e.g. each Pattern level)
- proportion is the proportion of points in that category relative to the overall dataset
- intensity is the number of points divided by the window area (1 x 1 = 1)
Example 3: Get point counts for each quadrat by region/subregion
```r
quadratcount(split(ppp1), nx = 4, ny = 4)
## List of spatial objects
##
## Pattern 1:
##             x
## y            [0,0.25) [0.25,0.5) [0.5,0.75) [0.75,1]
##   [0.75,1]          3          5          1        6
##   [0.5,0.75)        2          3          4        6
##   [0.25,0.5)        5          4          2        3
##   [0,0.25)          2          4          4        6
##
## Pattern 2:
##             x
## y            [0,0.25) [0.25,0.5) [0.5,0.75) [0.75,1]
##   [0.75,1]         14          2          2        6
##   [0.5,0.75)        0          0          4        6
##   [0.25,0.5)        6          3          1        2
##   [0,0.25)          4          6          2        2
##
## Pattern 3:
##             x
## y            [0,0.25) [0.25,0.5) [0.5,0.75) [0.75,1]
##   [0.75,1]          2         11          5        7
##   [0.5,0.75)        1          1          6        4
##   [0.25,0.5)        1         10          3        2
##   [0,0.25)          2          1          2        2
##
## Pattern 4:
##             x
## y            [0,0.25) [0.25,0.5) [0.5,0.75) [0.75,1]
##   [0.75,1]          4          5          6        3
##   [0.5,0.75)        3          3          4        2
##   [0.25,0.5)        3          3          4        2
##   [0,0.25)          5          4          6        3
```
- nx and ny specify how many quadrats (i.e. cells) you want per row and per column respectively
- `split` divides the dataset by the region variable or event type (see the density sketch after this list)
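If you want densities rather than raw counts per quadrat, a short sketch (`intensity` has a method for quadratcount objects):

```r
# events per unit area in each 4 x 4 quadrat of Pattern 1
qc1 <- quadratcount(split(ppp1)$"Pattern 1", nx = 4, ny = 4)
intensity(qc1)
```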
Example 4: Quadrat Count for Toronto Fast Food
```r
library(spatstat)
library(sf)

data("Fast_Food", package = "isdas")
data("Toronto", package = "isdas")

head(Fast_Food)
#> Simple feature collection with 6 features and 1 field
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 620173 ymin: 4840698 xmax: 638544.7 ymax: 4853394
#> Projected CRS: NAD83 / UTM zone 17N
#>     Class                 geometry
#> 1 Chicken POINT (635575.8 4853394)
#> 2 Chicken POINT (636724.5 4842644)
#> 3 Chicken POINT (622524.7 4840698)
#> 4 Chicken POINT (638544.7 4846541)
#> 5 Chicken POINT (627850.5 4843178)
#> 6 Chicken   POINT (620173 4841782)

head(Toronto)
#> Simple feature collection with 1 feature and 0 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: 609550.5 ymin: 4826375 xmax: 651611.8 ymax: 4857439
#> Projected CRS: NAD83 / UTM zone 17N
#>                         geometry
#> 1 MULTIPOLYGON (((609550.5 48...

ppp_ff <- as.ppp(Fast_Food, as.owin(Toronto))              # (1)
ppp_ff
#> Marked planar point pattern: 614 points
#> Multitype, with levels = Chicken, Hamburger, Pizza, Sub   # (2)
#> window: polygonal boundary
#> enclosing rectangle: [609550.5, 651611.8] x [4826375, 4857439] units

qct_ff <- quadratcount(ppp_ff, nx = 3, ny = 3)             # (3)
table(qct_ff)                                              # (4)
#>   0   6  44  48  60  64  85 144 163
#>   1   1   1   1   1   1   1   1   1

plot(qct_ff)
```
- (1) To automatically create a window object using the boundaries from an sf object, {sf} needs to be loaded.
- (2) The categories in the Class variable are captured as levels.
- (3) A 3 x 3 grid seems to be a good starting point for regions such as a city if your data isn't too sparse.
- (4) There is 1 quadrat with 0 points. Given that we're using a 3 x 3 grid, it's very small and probably located underneath the quadrat with 44 points.
Example 5: Approximate a window using coordinates
```r
data(bear_df, package = "isdas")

summary(bear_df)
##        x                y                 marks    
##  Min.   :515743   Min.   :6812138   Day Time  :502  
##  1st Qu.:518994   1st Qu.:6813396   Night Time:498  
##  Median :519526   Median :6816724                   
##  Mean   :519321   Mean   :6816474                   
##  3rd Qu.:519982   3rd Qu.:6818111                   
##  Max.   :522999   Max.   :6821440                   

W <- owin(xrange = c(515000, 523500),
          yrange = c(6812000, 6822000))
bear.ppp <- as.ppp(bear_df, W = W)

summary(bear.ppp)
## Marked planar point pattern: 1000 points
## Average intensity 1.176471e-05 points per square unit
##
## Coordinates are given to 10 decimal places
##
## Multitype:
##            frequency proportion    intensity
## Day Time         502      0.502 5.905882e-06
## Night Time       498      0.498 5.858824e-06
##
## Window: rectangle = [515000, 523500] x [6812000, 6822000] units
##         (8500 x 10000 units)
## Window area = 8.5e+07 square units
```
- Uses the minimum and maximum values of each coordinate, rounded outward so all events fall inside the window (see the sketch below)
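If you'd rather compute the window from the data than hard-code rounded values, a sketch that pads the observed coordinate ranges (the 500-unit buffer is an arbitrary choice for illustration):

```r
# pad the min/max of each coordinate so no event sits exactly on the boundary
W_buf <- owin(xrange = range(bear_df$x) + c(-500, 500),
              yrange = range(bear_df$y) + c(-500, 500))
bear.ppp <- as.ppp(bear_df, W = W_buf)
```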
Tests
Quadrat-based Chi-Square Test
A Pearson \(\chi^2\) goodness-of-fit test that compares the empirical distribution of events by quadrats to the distribution of events as expected under the hypothesis that the underlying process is random.
\[
\begin{align}
&\chi^2 = \sum_i^Q r_i^2 \\
&\text{where} \;\; r_i = \frac{\text{observed}_i - \text{expected}_i}{\sqrt{\text{expected}_i}}
\end{align}
\]
- \(r_i\) is the Pearson residual and \(Q\) is the number of quadrats
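To make the formula concrete, a minimal hand computation (a sketch assuming equal-area quadrats, so every expected count is \(n/Q\); uses ppp1 from the Basics section):

```r
# observed quadrat counts for one pattern (4 x 4 grid as in Example 3)
obs <- as.vector(quadratcount(split(ppp1)$"Pattern 1", nx = 4, ny = 4))

expected <- sum(obs) / length(obs)        # n/Q under CSR with equal areas
r_i <- (obs - expected) / sqrt(expected)  # Pearson residuals
sum(r_i^2)                                # the chi-square statistic

# should match quadrat.test(split(ppp1)$"Pattern 1", nx = 4, ny = 4)
```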
Issues
- Test results are affected by the chosen quadrat grid.
- Count-based, so the size of the quadrats matters. With irregularly shaped quadrats (e.g. within a city boundary), it might be difficult to create a grid with roughly homogeneous counts.
- The test is not sensitive to the relative position of the events within the quadrats. So, there could be extreme clustering happening within the quadrats and the test might not reject the Null.
Example: `quadrat.test` using a ppp object
```r
q_test <- quadrat.test(ppp_ff, nx = 3, ny = 3)
q_test
## Warning: Some expected counts are small; chi^2 approximation may be inaccurate
##
##  Chi-squared test of CSR using quadrat counts
##
## data:  Fast_Food.ppp
## X2 = 213.74, df = 8, p-value < 2.2e-16
## alternative hypothesis: two.sided
##
## Quadrats: 9 tiles (irregular windows)
```
- All expected counts (under the assumed uniform Poisson point process) should be greater than 5, which is why there's a warning.
- The docs of this function sound very much like the stuff in my Discrete Analysis notebook, so see that for further details.
- Options seem to be to use method = "MonteCarlo", which relaxes the expected count > 5 condition, or to use a coarser grid (fewer, larger quadrats). My DA notebook also suggests Exact Tests if that's available for this sort of thing. (See the sketch after this list.)
- p-value < 0.05 suggests that this is not a CSR (completely spatially random — aka uniform Poisson point process) pattern
- If lambda is provided, then the Null is a Poisson point process with that \(\lambda\) (i.e. intensity, or probably also the Poisson mean)
- Can also use `split.ppp(ppp_ff)` and have each point category tested.
- This function has methods for point patterns (class "ppp"), split point patterns (class "splitppp"), point process models (class "ppm" or "slrm") and quadrat count tables (class "quadratcount").
- Plotting the object will display the quadrats, annotated by their observed and expected counts and the Pearson residuals.
- Pearson residuals can be extracted with `residuals`. They evidently aren't standardized, but if standardized, they can be treated as z-scores with values > 1.96 indicating which quadrats are causing the rejection of the Null. (See Regression, Diagnostics >> Residuals >> Standardized Residuals)
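A sketch of those two follow-ups, assuming q_test from the example above (the 1.96 cutoff is only a rough screen given that these residuals aren't standardized):

```r
# Monte Carlo p-value relaxes the expected-count > 5 condition
q_test_mc <- quadrat.test(ppp_ff, nx = 3, ny = 3,
                          method = "MonteCarlo", nsim = 999)

# Pearson residuals; large |r_i| flags the quadrats driving the rejection
res <- residuals(q_test)
res[abs(res) > 1.96]
```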
G-Function
A cumulative distribution function of distances that tells you the proportion of events that have a nearest neighbor at a distance less than some value. (i.e. ecdf)
- e.g. At a distance of 1.5 km, 32% of events/locations have a nearest neighbor at that distance or less.
Empirical G-Function
\[ \hat G(r) = \frac{\#\{i : d_{i} \le r\}}{n} \]
- Where \(d_{i}\) is the distance from event/location \(i\) to its nearest neighbor, \(r\) is a distance value, and \(\#\{\cdot\}\) counts the events that satisfy the condition
Theoretical G-Function
\[ G(r) = 1 - e^{-\lambda \pi r^2} \]
- Represents the Null point generating process — a homogeneous Poisson process (CSR) with intensity \(\lambda\)
Guidelines
- \(\hat G(r) \gt G(r)\) says events are closer together than expected from a random process, i.e. Clustered
- \(\hat G(r) \approx G(r)\) says event/location pattern resembles a random process
- \(\hat G(r) \lt G(r)\) says events are further away from each other than expected from a random process, i.e. Dispersed or Regular
Example: {spatstat.explore::Gest}
```r
# Use split to calculate the G-function only for "Pattern 1"
g_pattern1 <- Gest(split(ppp1)$"Pattern 1", correction = "none")

plot(g_pattern1)
lines(x = c(0.04, 0.04), y = c(-0.1, 0.5), lty = "dotted")
lines(x = c(-0.1, 0.04), y = c(0.5, 0.5), lty = "dotted")
lines(x = c(-0.1, 0.04), y = c(0.16, 0.16), lty = "dotted", col = "red")
```
- See Basics >> Example 1 for what Pattern 1 looks like
- \(\hat G(r) \gt G(r)\) indicates clustering
- The line shows about 50% of events have a nearest neighbor at a distance of less than approximately 0.04
- correction: Optional. The edge correction(s) to be used to estimate \(G(r)\). A vector of character strings selected from “none”, “rs”, “km”, “Hanisch” and “best”. Alternatively correction = “all” selects all options.
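Rather than eyeballing the dotted lines, the same distances can be read off the returned fv object directly (a sketch; with correction = "none" the empirical column is named raw):

```r
# smallest distance at which at least 50% of events have a nearest neighbor
min(g_pattern1$r[g_pattern1$raw >= 0.5])
```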
F-Function
Same as the G-function except the distance isn't measured from event to event but from point to event.
The point is an arbitrary location on a map and not necessarily a location of an event
Same empirical and theoretical formulas — again, only the \(d_i\) measurement is different.
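Spelled out under the same notation as the G-function, with \(m\) sample points and \(d_j\) the distance from sample point \(j\) to its nearest event:

\[ \hat F(r) = \frac{\#\{j : d_{j} \le r\}}{m}, \qquad F(r) = 1 - e^{-\lambda \pi r^2} \]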
Guidelines (Opposite of \(G\))
\(\hat F(r) \gt F(r)\) says empty spaces are closer to events than expected from a random process, i.e. regular or dispersed
\(\hat F(r) \approx F(r)\) says event/location pattern resembles a random process
\(\hat F(r) \lt F(r)\) says empty spaces are further from events than expected from a random process, i.e. clustered
Example: {spatstat.explore::Fest}
data("pp2_df", package = "isdas") <- owin(c(0, 1), c(0, 1)) W <- as.ppp(pp2_df, W = W) pp2.ppp plot(pp2.ppp) <- Fest(pp2.ppp, correction = "none") f_pattern2 plot(f_pattern2) lines(x = c(0, 0.097), y = c(0.4, 0.4), col = "blue", lty = "dotted") lines(x = c(0.045, 0.045), y = c(0.0, 0.4), col = "blue", lty = "dotted") lines(x = c(0.097, 0.097), y = c(0.0, 0.4), col = "blue", lty = "dotted")
\(\hat F \lt F\) indicates clustering
Under the theoretical function, 40% of points have a nearest event at a distance of approximately 0.045 or less; under the empirical function, 40% isn't reached until roughly 0.097, i.e. the events are generally more distant from the points
K-Function
AKA Ripley’s K-Function
Instead of only using distances to first order neighbors, it takes into account multiple orders of neighbors.
Different types of patterns can be present at different scales (pp3.ppp plot)
- Overall there’s clustering but the clusters are regularly spaced.
- At the cluster-level, events look to possibly have a random pattern.
- F and G will indicate clustering but not recognize the regular spacing of the clusters
Process: At each event, neighbors are counted at some radius, r. Then r is increased and neighbors are counted again. This continues until all events have been counted as a neighbor. (See Geospatial, Spatial Weights >> Diagnostics >> Connectedness for details on higher order neighbors)
Empirical Formula
\[ \hat K(r) = \frac{1}{\hat \lambda \, n} \sum_i \sum_{j \ne i} \mathbf{1}(d_{ij} \le r) \]
- Not explicitly defined in the book, but this is the standard (uncorrected) estimator
- \(\hat \lambda = n/A\) is the estimated intensity (i.e. density), where \(A\) is the overall area and \(n\) is the number of events
- The double sum counts, for each event \(i\), all other events \(j\) within distance \(r\) (i.e. \(d_{ij} \le r\)); dividing by \(n\) averages that count over events, and dividing by \(\hat \lambda\) converts it to an area comparable with \(\pi r^2\)
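A hand-rolled version of this (uncorrected) estimator, as a sanity check on the formula (a sketch assuming pp3.ppp from the Kest example below):

```r
r <- 0.1
n <- npoints(pp3.ppp)
A <- area(Window(pp3.ppp))  # 1 for the unit square
lambda_hat <- n / A

d <- pairdist(pp3.ppp)  # n x n matrix of pairwise distances
diag(d) <- Inf          # exclude the i = j pairs

sum(d <= r) / (lambda_hat * n)  # empirical K at r = 0.1
pi * r^2                        # theoretical value under CSR
```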
Theoretical Formula
\[ K(r) = \pi r^2 \]
Guidelines
- \(\hat K(r) \gt K(r)\) says events are closer together than expected from a random process, i.e. Clustered
- \(\hat K(r) \approx K(r)\) says event/location pattern resembles a random process
- \(\hat K(r) \lt K(r)\) says events are further away from each other than expected from a random process, i.e. Dispersed or Regular
Example: {spatstat.explore::Kest}
data("pp3_df", package = "isdas") <- owin(c(0, 1), c(0, 1)) W <- as.ppp(pp3_df, W = W) pp3.ppp <- Kest(pp3.ppp, correction = "none") k_pattern3 plot(k_pattern3)
- Plot of pp3.ppp shown earlier
- \(\hat K(r) \gt K(r)\) at a smaller scale which indicates clustering,
- But also, \(\hat K(r) \lt K(r)\) at a larger scale which indicates regularity.
Kernel Density
Kernel density is a smooth estimate of the underlying intensity of the process, and the degree of smoothing is controlled by the bandwidth
- A map of the kernel density is better able to capture the variations in density across the region.
Process
- Unlike quadrat counting, where each quadrat is treated as independent of the others in the window, there isn't a fixed grid of quadrats but, in essence, a single window that slides around the study area.
- It gives greater weight to events that are close to the center of the window, and less weight to events that are more distant from the center of the window.
- The kernel function visits each point on a fine grid and obtains an estimate of the density by summing the weights of all events.
- The shape of the Gaussian kernel depends on the standard deviation, which controls how “big” the window is, or alternatively, how quickly the function decays via decreasing weights. We will call the standard deviation the kernel bandwidth of the function.
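To see the bandwidth's effect directly, a quick sketch (assumes ppp1 from the Basics section; unmark() pools all 240 points by dropping the Pattern labels):

```r
par(mfrow = c(1, 2))
plot(density(unmark(ppp1), sigma = 0.05), main = "sigma = 0.05")  # lumpy
plot(density(unmark(ppp1), sigma = 0.20), main = "sigma = 0.20")  # smooth
```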
{spatstat.explore::density.ppp} - Kernel smoothed intensity function from a point pattern
- kernel: “gaussian”, “epanechnikov”, “quartic” or “disc”
- weights: Optional weights to be attached to the points
- diggle: Logical. If TRUE, use the Jones-Diggle improved edge correction, which is more accurate but slower to compute than the default correction.
Example: Kernel densities for each fast food type in Toronto
```r
library(spatstat)
library(sf)

# sf dfs
data("Fast_Food", package = "isdas")
data("Toronto", package = "isdas")

# create a ppp obj
ppp_ff <- as.ppp(Fast_Food, as.owin(Toronto))

# calculate densities for each type of fast food
kernel_density <- density(split(ppp_ff), sigma = bw.diggle)

par(mfrow = c(2, 2), mar = c(0, 0, 1.1, 2))
purrr::pwalk(
  list(
    kernel_density,
    split.ppp(ppp_ff), # add pts
    names(kernel_density)
  ),
  \(x1, x2, x3) {
    plot(x1, main = x3)
    plot(x2, add = TRUE)
  }
)
```
- `bw.diggle` calculates the bandwidth using cross-validation. It's part of {spatstat.explore}, one of the sub-packages that automatically gets loaded. There do seem to be other options, but I chose this based on the example in the Pebesma-Bivand book.
- If you just want to plot the densities without the points, it's `plot(kernel_density)`.
- See `spatstat.geom::plot.im` for beaucoup styling options for the density plot. I didn't add it here, but when there aren't any other labels, I kind of liked addcontour = TRUE as an extra density cue, especially for the darker colors. (See the sketch below.)
- Note that the density color scales have different ranges in the legends.
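For instance, a minimal sketch of that contour option applied to one of the density surfaces (assumes kernel_density from the example above):

```r
# contour lines as an extra density cue on top of the color ribbon
plot(kernel_density$Chicken, main = "Chicken", addcontour = TRUE)
```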