General

Misc

Also see Surveys, Census Data >> Mapping
Packages
- CRAN Task View
Resources
- Geocomputation with * (R, Python, Julia)
  - Notes and Solutions to Exercises
- Spatial sampling with R
- Spatial Statistics for Data Science: Theory and Practice with R
- Spatial Data Science with R and “terra”
- Geodata & Spatial Regression
- rspatialdata: A collection of data sources and tutorials on visualizing spatial data using R
- Spatial Modelling for Data Scientists
- Geospatial Analysis with Python and R
QGIS - free and open source
ArcGIS - expensive and industry-standard
Spatiotemporal Data — data cubes with spatial and regular temporal dimensions — such as
- e.g. gridded temperature values (raster time series) and vector data with temporal records at regular temporal instances (e.g. election results in states).
  - {stars} - regular intervals
  - {sftime} - irregular intervals

Spatial Resampling

Creates cross-validation folds by k-means clustering coordinate variables

library(tidymodels)
library(spatialsample)
set.seed(123)
spatial_splits <- spatial_clustering_cv(landslides, coords = c("x", "y"), v = 5)

# fit a logistic model
glm_spec <- logistic_reg()
lsl_form <- lslpts ~ slope + cplan + cprof + elev + log10_carea 
lsl_wf <- workflow(lsl_form, glm_spec)
doParallel::registerDoParallel() 
regular_rs <- fit_resamples(lsl_wf, bad_folds)

Terms

Census Block Groups - ~600–3,000 population; the smallest geography reported; Wiki
Census Tract - ~4,000 average population; Docs
- Also see Survey, Census Data >> Geographies
Conflation - Combining and integrating multiple data sources that describe the same real-world features. (i.e. geospatial joins)
Graticules - a network of lines on a map that delineate the geographic coordinates (degrees of latitude and longitude.)
- Use of graticules is not advised, unless the graphical output will be used for measurement or navigation, or the direction of North is important for the interpretation of the content, or the content is intended to display distortions and artifacts created by projection. Unnecessary use of graticules only adds visual clutter but little relevant information. Use of coastlines, administrative boundaries or place names permits most viewers of the output to orient themselves better than a graticule
- {sf::st_graticule}
Raster Data - Grid data (instead of point/polygon data in Vector Data) where each square on this grid is a small cell, and each cell holds a single value representing some real-world phenomenon, e.g. elevation, temperature, land cover type, rainfall amount, or color of a pixel in a satellite image. The entire collection of these cells and their values is what we call raster data. Raster data is better for continuous phenomena like elevation, soil moisture, or temperature. Most data from satellites and aerial photography comes in raster form.
Vector Data - Data that uses points, lines, and polygons (instead of grid cells like Raster Data) to represent features like roads, buildings, or country borders. Vector data is precise and good for discrete objects.
VRT - File format that allows a virtual GDAL dataset to be composed from other GDAL datasets with repositioning, and algorithms potentially applied as well as various kinds of metadata altered or added. VRT descriptions of datasets can be saved in an XML format normally given the extension .vrt.
- Basically a metadata XML file describing various properties of the actual raster file, like pixel dimensions, geolocation, etc..

Optimization

Cloud Optimized GeoTiff (COG)
- A regular GeoTIFF file, aimed at being hosted on a HTTP file server, with an internal organization that enables more efficient workflows on the cloud. It does this by leveraging the ability of clients issuing HTTP GET range requests to ask for just the parts of a file they need.
Vector Tiles
- Misc
  - Notes from Push the limits of interactive mapping in R with vector tiles
    - McBain goes through a complete example with plenty of tips on simplification strategies and hosting mbtiles files
  - Issues (solution: Vector Tiles)
    - Limited number of features with DOM canvas
      - There’s a limit to how many features leaflet maps can handle, because at some point the DOM gets too full and your browser stops being able to parse it.
    - Limited number of maps on same webpage
      - Once you start rendering spatial data on WebGL canvasses instead of the DOM you’ll find there is a low number of WebGL contexts that can co-exist on any one web page, typically limiting you to only around 8 maps.
    - File sizes blow up to hundreds of MB
      - Trying to reuse WebGL maps by toggling on and off different layers of data for the user at opportune times. This is an improvement, but data for all those layers piles up, and your toolchain wants to embed this in your page as reams of base64 encoded text. Page file sizes are completely blowing out.
- Use Cases
  - Simplification of geometry is not desirable, e.g. because of alignment issues
    - e.g. The zoomed-in road network has to align with the road network on the basemap, so that viewers can see features that lie along sections of road.
  - Simplification of geometry doesn’t really help, you still have too many features
  - Cumulatively your datasets are too large to handle.
- Vector Tiles - Contain arrays of annotated spatial coordinate data which is combined with a separately transmitted stylesheet to produce the tile image.
  - i.e. The edges of the roads, the boundaries of buildings etc. Not an image, but the building blocks for one
  - Different stylesheets can use the same vector data to produce radically different looking maps that either highlight or omit data with certain attributes
  - Mapbox Vector Tiles (MVT) - specification; the de-facto standard for vector tile files
    - stored as a Google protocol buffer - a tightly packed binary format.
- PMTiles - Open source and {mapgl} supports it natively for both MapLibre and Mapbox maps. Use tippecanoe to convert your data to tiles in just the way you want.
- MBTiles - Proprietory format by Mapbox; describe a method of storing an entire MVT tileset inside a single file.
  - Internally .mbtiles files are SQLlite databases containing two tables: metadata and tiles.
    - Tiles table
      - Indexed by z,x,y
      - Contains a tile_data column for the vector tile protocol buffers, which are compressed using gzip
  - SQLite format and gzip compression help with efficient retrieval and transmission
- Using vector tiles we can have unlimited reference layers. Each one contributes nothing to the report file size since it is only streamed on demand when required.
- Workflow to convert data to .tbtiles
  - In R, read source data as an sf, and wrangle
    - Tippecanoe expects epsg 4326 by default
  - Write data out to geojson
  - On the command line, convert geojson to .pmtiles/.mbtiles using the tippecanoe command line utility.
    - Tippecanoe sources
      - Mapbox version - repo
        
        McBain says, he uses this version and hasn’t had any problems
        
        README has helpful cookbook section
      - Actively maintained community forked version - repo
      - May be a headache to get dependencies if using Windows
    - Alternatively it can output a folder structure full of protocol buffer files.
  - Example
```
tippecanoe -zg \
          -o abs_mesh_blocks.mbtiles \
          --coalesce-densest-as-needed \
          --extend-zooms-if-still-dropping \
          mb_shapes.geojson
```
- Mapping
  - Example
```
library(mvtview)
library(rdeck)

# Fire up the server
serve_mvt("abs_mesh_blocks.mbtiles", port = 8765)
# Serving your tile data from http://0.0.0.0:8765/abs_mesh_blocks.json.
# Run clean_mvt() to remove all server sessions.

mesh_blocks <- jsonlite::fromJSON("http://0.0.0.0:8765/abs_mesh_blocks.json")

# Map the data
rdeck(
    initial_bounds = structure(meshblocks$bounds, crs = 4326, class = "bbox") # set map limits using the tilejson
) |>
  add_mvt_layer(
    data = rdeck::tile_json("http://0.0.0.0:8765/abs_mesh_blocks.json"),
    get_fill_color = scale_color_linear(
      random_attribute
    ),
    opacity = 0.6
  )
```
    - See McBain article for options on hosting .mbtiles files
    - Regarding “abs_mesh_blocks”: {mvtview} provides a way to fetch the metadata table from .mbtiles as json by querying a json file with the same name as the .mbitles file.
    - The structure of ‘tilejson’ is yet another specification created by Mapbox, and is supported in deck.gl (and therefore {rdeck}) to describe tile endpoints.

Grid Systems

Misc
- Packages
  - {dggridR} - Builds discrete global grids which partition the surface of the Earth into hexagonal, triangular, or diamond cells, all of which have the same size. Vignettes available.
  - {zonebuilder} (Video)- Breaks up large geographic regions such as cities into manageable zones
    - Most cities have a centre, the ‘heart’ of the city. Therefore, the zones are distributed around the center.
    - Typically, the population is much denser in and around the centre and also the traffic intensity is higher. Therefore, the zones are smaller in and around the centre.
    - The rings (so A, B, C, D, etc) reflect the proximity to the centre point. The distances from the outer borders of the rings A, B, C, D, etc. follow the triangular number sequence 1, 3, 6, 10, etc.
      - This means that in everyday life use, within zone A everything is in walking distance, from ring B to the centre requires a bike, from zone C and further to the center typically requires public transport.
    - Regarding direction relative to the centre, the clock analogy is used, since most people are familiar with that. So each ring (annuli) is divided into 12 segments, where segment 12 is directed at 12:00, segment 1 at 1:00 etc.
- Explainer: Why using hexbins to visualize Australian electoral map is better than a typical provincial map.
  - tl;dr: Geographical size distorts what the value is trying to measure. The value is the party that wins the parliamentary seat
    - The bar graph shows the values the map is trying to visualize geographically. The hexabins better represent the close race by removing the distorting element which is the geographical size of the provinces.
    - Each voting district (hexabin) is voting for 1 representative and has the same number of voters, but districts can have vastly different areas depending on population density.
  - Keep unit at constant size (like hexabins) but alter hex shape to keep state shape.
    - A better U.S. house election results map?
    - Results
      - state size depends on number of districts which depends on population and therefore correctly conveys voting results visually across the whole country
      - Districts get distorted but the states retain their shape and so distortion of the overall visualization is minimized
    - {tmap.cartogram} can do something similar but much less sophisticated.
Uber’s H3 grid system
- Misc
  - Packages:
    - {h3r}, {h3-r}
    - {h3jsr} (article) - Provides access to Uber’s H3 library for geospatial indexing via its JavaScript transpile ‘h3-js’
      - Article recommends this package over the others including {dggridrR}, mostly because of it’s speed.
  - Uber docs
  - Add census data to H3 hexagons, calculate overlaps (article)
  - For large areas, you can reduce the number of hexagons by merging some hexagons into larger hexagons.
    - Reduces storage size
    - Issue: leaves small gaps between hexagons
      - might not matter for your use case
    - Solution: use Microsoft’s Quadkeys approach (see article)
- Each hexagon has a series of smaller hexagons that sit (mostly) inside of another, which creates a hierarchy that can be used for consistent referencing and analysis, all the way down to lengths of 2 feet for the edges.
- “Hexagons were an important choice because people in a city are often in motion, and hexagons minimize the quantization error introduced when users move through a city. Hexagons also allow us to approximate radiuses easily.”
- Re other shapes: “We could use postal code areas, but such areas have unusual shapes and sizes which are not helpful for analysis, and are subject to change for reasons entirely unrelated to what we would use them for. Zones could also be drawn by Uber operations teams based on their knowledge of the city, but such zones require frequent updating as cities change and often define the edges of areas arbitrarily”
- Grid systems can have comparable shapes and sizes across the cities that Uber operates in and are not subject to arbitrary changes. While grid systems do not align to streets and neighborhoods in cities, they can be used to efficiently represent neighborhoods by clustering grid cells. Clustering can be done using objective functions, producing shapes much more useful for analysis. Determining membership of a cluster is as efficient as a set lookup operation.
- 16 Resolutions
  - 0 - 15 (0 being coarsest and 15 being finest)
  - Each finer resolution has cells with one seventh the area of the coarser resolution. Hexagons cannot be perfectly subdivided into seven hexagons, so the finer cells are only approximately contained within a parent cell.
  - The identifiers for these child cells can be easily truncated to find their ancestor cell at a coarser resolution, enabling efficient indexing. Because the children cells are only approximately contained, the truncation process produces a fixed amount of shape distortion. This distortion is only present when performing truncation of a cell identifier; when indexing locations at a specific resolution, the cell boundaries are exact.
  - Want a resolution granular enough to introduce variability and wide enough to capture the effects of an area
  - Example of resolution 6 in Iowa

Features

Carto Spatial Features dataset ($) - https://carto.com/spatial-data-catalog/browser/?country=usa&category=derived&provider=carto
- Resolution: Quadgrid level 15 (with cells of approximately 1x1km) and Quadgrid level 18 (with cells of approximately 100x100m).
  - Guessing if the areas you’re interested in have high population density, then maybe 100 x 100 m cells would be more useful
- Features
  - Total population
  - Population by gender
  - Population by age and gender (e.g. female_0_to_19)
  - POIs by category
    - Retail Stores
    - Education
      - Number of education related POIs, incuding schools, universities, academies, etc.
    - Financial
      - Number of financial sector POIs, including ATMs and banks.
    - Food, Drink
      - Number of sustenance related POIs, including restaurants, bars, cafes and pubs.
    - Healthcare
      - Number of healthcare related POIs, including hospitals
    - Leisure
      - Number of POIs related to leisure activities, such as theaters, stadiums and sport centers.
    - Tourism
      - Number of POIs related to tourism attractions
    - Transportation
      - Number of transportation related POIs, including parking lots, car rentals, train stations and public transport stations.
Carto Data Observatory ($) - https://carto.com/spatial-data-catalog/browser/dataset/mc_geographic\_\_4a11e98c/
- Features
  - Geo id
  - Region id
  - Industry
  - Total Transactions Amount Index
  - Transaction Count Index
  - Account Count Index
  - Average Ticket Size Index
  - Average Frequency of Transaction per Card Index
  - Average Spend Amount by Account Index