Preprocessing

Misc

  • Packages
    • {photon} - High-Performance Geocoding using photon
    • {apache.sedona} - Sparklyr extension for processing geospatial data
  • Conflation Types (i.e. geospatial joins)
    • Notes from Conflating Overture Places Using DuckDB, Ollama, Embeddings, and More
      • Recommends using String Similarity first, then using Embeddings on the remaining unmatched records. This showed an 80% match rate. Can get about 90% by raising the similarity threshold, but it increases the number of false positive matches. (A minimal string-similarity sketch follows this list.)
    • Exact Name Matching (e.g. by name variable)
      • Article example had a ~31% match rate
    • String Similarity (e.g. address variable + Jaro-Winkler distance)
      • Article example had a ~68% match rate
    • Embeddings (e.g. embed name variable + cosine similarity)
      • Article example had a ~71% match rate
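    • Minimal string-similarity sketch ({stringdist} and the 0.9 threshold are illustrative, not from the article)

      library(stringdist)
      
      # toy candidate pairs standing in for the two place datasets being conflated
      a <- c("Joe's Coffee", "Main St Bakery", "City Diner")
      b <- c("Joes Coffee", "Main Street Bakery", "Central Diner")
      
      # Jaro-Winkler similarity in [0, 1]; 1 = identical
      sim <- stringsim(a, b, method = "jw")
      data.frame(a, b, sim, matched = sim >= 0.9)
      # pairs left unmatched here would go to the embeddings + cosine similarity pass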
  • Beware statistical computations of tibbles/sf_tibbles with geometry columns
    • Could result in an expensive union operation over identical geometries and an R session crash

      • Example with 100K rows crashed R.
    • Notes from thread

    • Option 1 (slower): Set do_union = FALSE in summarize

      library(dplyr)
      library(tidycensus)
      
      tx_income_groups <- 
        get_acs(
          geography = "tract",
          table = "B19001",
          state = "TX",
          year = 2020,
          geometry = TRUE
        ) |> 
        filter(variable != "B19001_001") |> 
        mutate(bracket = case_when(
          variable > "B19001_012" ~ "Above $100k",
          TRUE ~ "Below $100k"
        )) |> 
        group_by(GEOID, bracket) |> 
        summarize(n_households = sum(estimate, na.rm = TRUE),
                  do_union = FALSE)
    • Option 2 (faster): Perform calculation without geometries then join

      library(tigris)
      
      tx_tracts <- tracts("TX", cb = TRUE, year = 2020) |> 
        select(GEOID)
      
      tx_income_groups <- 
        get_acs(
          geography = "tract",
          table = "B19001",
          state = "TX",
          year = 2020,
          geometry = FALSE # skip downloading geometries; join them on from tx_tracts below
        ) |> 
        filter(variable != "B19001_001") |> 
        mutate(bracket = case_when(
          variable > "B19001_012" ~ "Above $100k",
          TRUE ~ "Below $100k"
        )) |> 
        group_by(GEOID, bracket) |> 
        summarize(n_households = sum(estimate, na.rm = TRUE))
      
      tx_income_groups <- tx_tracts |> 
        left_join(tx_income_groups, by = "GEOID")
      • {tidycensus} has an arg to bypass d/ling the geometries (geometry = FALSE), and {tigris} has a separate tracts() function to get the census tract geometries

File Types

  • PMTiles - A single-file archive format for tiled data. A PMTiles archive can be hosted on a commodity storage platform such as S3, and enables low-cost, zero-maintenance map applications that are “serverless” - free of a custom tile backend or third party provider. (Docs)

    • Run your interactive, smooth-zooming vector map from any storage that supports HTTP range requests: S3, a Caddy server running on your Wi-Fi router, or even GitHub Pages (if tiles < 1GB).
    • Cloudflare R2 is the recommended storage platform for PMTiles because it does not have bandwidth fees, only per-request fees: see R2 Pricing.
  • Shape Files

    • D/L and Load a shapefile
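
      • A minimal sketch (the Census cartographic boundary URL is illustrative)

        library(sf)
        
        # download and unzip a zipped shapefile, then read it with st_read
        url <- "https://www2.census.gov/geo/tiger/GENZ2022/shp/cb_2022_us_state_20m.zip"
        zipfile <- tempfile(fileext = ".zip")
        download.file(url, zipfile, mode = "wb")
        unzip(zipfile, exdir = tempdir())
        us_states <- st_read(file.path(tempdir(), "cb_2022_us_state_20m.shp"))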

    • May need API key from Census Bureau (see {tigris} docs)

    • Validate geometries when loading a shape file

      NY8_sf <- 
        st_read(system.file(
                  "shapes/NY8_bna_utm18.gpkg", 
                  package = "spData"), 
                quiet = TRUE)
      # validate geometries
      table(st_is_valid(NY8_sf))
      #> TRUE 
      #>  281 
      • If st_is_valid returns FALSE, then you can run st_make_valid
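        • Repair sketch, reusing NY8_sf from above

          NY8_sf <- st_make_valid(NY8_sf)
          table(st_is_valid(NY8_sf)) # re-check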
    • Example: Counties in California

      tbl <- tigris::counties(state = "CA") %>%
          # tigris data arrive in NAD83 (EPSG:4269); st_transform reprojects,
          # while st_set_crs would only relabel the CRS without reprojecting
          sf::st_transform(4326)
    • {tigris} - US data

      library(tigris)
      
      us_states <- states(resolution = "20m", year = 2022, cb = TRUE)
      
      lower_48 <- us_states %>%
        filter(!(NAME %in% c("Alaska", "Hawaii", "Puerto Rico")))
    • {rnaturalearth} - World data

      # Via URL
      # Medium scale data, 1:50m Admin 0 - Countries
      # Download from https://www.naturalearthdata.com/downloads/50m-cultural-vectors/
      world_map <- read_sf("ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp") %>%
        filter(iso_a3 != "ATA") # Remove Antarctica
      
      # Via Package
      library(rnaturalearth)
      
      # returnclass = "sf" makes it so the resulting dataframe has the special
      # sf-enabled geometry column
      world_map <- ne_countries(scale = 50, returnclass = "sf") %>%
        filter(iso_a3 != "ATA") # Remove Antarctica
  • GeoJSON

    • Write data to geojson

      data %>%
          st_write("mb_shapes.geojson")
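    • Read it back with st_read (returns an sf object)

      shapes <- sf::st_read("mb_shapes.geojson")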
  • CSV to sf Dataframe (source)

    # points
    localidad <- 
      st_as_sf(localidad, 
               coords = c("longitude", "latitude"), 
               crs = 4326)
    
    # polygons
    departamentos <- 
      st_as_sf(departamentos, 
               wkt = "geomTDepartamento")
    • The wkt argument names the column that holds the WKT-encoded geometries

Big Data

  • Reduce Size Via SQL Query

    • Uses OGR SQL (SQLite might also be accepted)

    • Choose Layer

      library("sf")
      st_layers("data/Lower_layer_Super_Output_Areas_2021_EW_BGC_V3.gpkg")
      ## Driver: GPKG 
      ## Available layers:
      ##            layer_name geometry_type features fields crs_name
      ## 1 LSOA_2021_EW_BGC_V3 Multi Polygon    35672      7  OSGB36 / British National Grid         
      • This file only has one layer
    • Get an idea of the columns in the layer by looking at the first row.

      st_read("data/Lower_layer_Super_Output_Areas_2021_EW_BGC_V3.gpkg",
              query = "SELECT * FROM LSOA_2021_EW_BGC_V3 WHERE FID = 1",
              quiet = TRUE)
      
      ## Simple feature collection with 1 feature and 7 fields
      ## Geometry type: MULTIPOLYGON
      ## Dimension:     XY
      ## Bounding box:  xmin: 531948.3 ymin: 181263.5 xmax: 532308.9 ymax: 182011.9
      ## Projected CRS: OSGB36 / British National Grid
      ##    LSOA21CD            LSOA21NM  BNG_E  BNG_N     LONG      LAT
      ## 1 E01000001 City of London 001A 532123 181632 -0.09714 51.51816
      ##                                 GlobalID                          SHAPE
      ## 1 {1A259A13-A525-4858-9CB0-E4952BA01AF6} MULTIPOLYGON (((532105.3 18...
      • FID = 1 says to look at the first row. FID is the feature ID attribute; I don’t think it’s an actual column in the dataset.
    • Query the layer and filter

      st_read("data/Lower_layer_Super_Output_Areas_2021_EW_BGC_V3.gpkg",
              query = "SELECT * FROM LSOA_2021_EW_BGC_V3 WHERE LSOA21CD LIKE 'W%'",
              quiet = TRUE)
      • W% says to look for values that start with “W” (Wales) in the LSOA21CD column
        • Based on the OGR SQL docs, % is a wildcard matching multiple characters.
  • Use a bounding box or polygon (via wkt_filter) to read only overlapping geometries

    • Example: Filter polygons overlapping the boundaries of Wales

      • Filter Wales from a UK shapefile dataset

        uk <- sf::st_read("data/Countries_December_2022_GB_BGC.gpkg")
        wales <- dplyr::filter(uk, CTRY22NM == "Wales")
      • Create Wales polygon

        wales_wkt <-  
          wales |>
          sf::st_geometry() |>
          sf::st_as_text()
      • Filter overlapping geometries

        wales_lsoa <- 
          sf::st_read("data/Lower_layer_Super_Output_Areas_2021_EW_BGC_V3.gpkg",
                      wkt_filter = wales_wkt)
        • Some English LSOAs along the Wales/England border are read in along with the Welsh LSOAs, because these technically overlap with the Wales polygon on the border itself. Not perfect, but it still reduces the data being read into memory. (A cleanup sketch follows.)
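        • Cleanup sketch, using the fact that Welsh LSOA codes start with “W” (as in the LIKE 'W%' query above)

          # drop the English borderline LSOAs
          wales_lsoa <- dplyr::filter(wales_lsoa, startsWith(LSOA21CD, "W"))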

Projections

Misc

  • Resources
    • Google “epsg code” + “your region name” to find a reasonable projection code to use
    • Spatial Reference
      • Search
        • Example: Polar region
          • Ellis used a combination of WGS 84 (south) and NSIDC (north).
      • Explorer
        • Click a place on the map, choose “projected” (other options available), choose EPSG (other options available), and it outputs a list of projections
        • Example: Berlin
  • Formats
    • epsg: The EPSG code (if available), a standardized code for coordinate systems.
    • proj4string: A text string defining the projection parameters in PROJ.4 format.
    • wkt: The Well-Known Text representation of the CRS.
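    • Sketch: all three formats off one st_crs object

      crs <- sf::st_crs(4326)
      crs$epsg        # [1] 4326
      crs$proj4string # "+proj=longlat +datum=WGS84 +no_defs"
      crs$wkt         # the full WKT string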
  • OGC (Open Geospatial Consortium) vs WGS (World Geodetic System)
    • Both are the same except for axis order
    • WGS84 (EPSG:4326) uses latitude/longitude order (y,x)
      • Traditional axis order used by cartographers and geographers throughout history
    • CRS84 (OGC:CRS84) uses longitude/latitude order (x,y)
      • Adopted since many mathematical and computational systems conventionally list x first, then y.
      • Example: st_transform(data, "OGC:CRS84")
  • With data that’s in an older coordinate system, transformation instructions to a more modern system might be included in the output (WKT text) of st_crs
    • SOURCECRS: Your actual current coordinate system (e.g. UTM Zone 18N with Clarke 1866 datum)
    • TARGETCRS: The reference system you’ll be in after the transformation (e.g. WGS84)
    • ABRIDGEDTRANSFORMATION: Instructions for converting between the source and target systems
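    • Sketch for printing those blocks (old_data is a hypothetical sf object in an older CRS)

      # print the full WKT, including any ABRIDGEDTRANSFORMATION block
      cat(sf::st_crs(old_data)$wkt)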

Geographic Coordinates

  • Required for Great Circle and Spherical Distance calculations

  • WGS 84

    • EPSG value is 4326

    • Most common (and required by leaflet)

    • Formats

      st_transform(data, "EPSG:4326")    # Most common
      st_transform(data, "WGS84")        # Also works
      st_transform(data, "+proj=longlat +datum=WGS84")  # Seen in older code before epsg
      st_transform(data, 4326)           # epsg value
  • Convert a dataframe with latitude and longitude to an sf object with geometries

    # old_tbl contains latitude and longitude variables
    new_tbl <- old_tbl %>%
        # convert to a simple features object
        sf::st_as_sf(
            coords = c("<longitude_var>", "<latitude_var>"), # order matters
            crs = 4326
        ) %>%
        mapview::mapview()
  • Transform an sf object to WGS84 coordinates

    us_states <- us_states %>% # df with geometries
      sf::st_transform("EPSG:4326") # WGS 84

Projected Coordinates

  • Used for Planar Distance calculations

  • United States:

    • EPSG:26901 to EPSG:26923 for NAD83 UTM zones are very common
    • Each state also has its own State Plane coordinate system
  • Europe:

    • ETRS89-LAEA (EPSG:3035) is commonly used for pan-European analysis
    • Individual countries often have their own systems
      • e.g. British National Grid EPSG:27700
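  • Transform sketch: pick the regional EPSG code, then project before planar calculations (data is a hypothetical sf object)

    # NAD83 / UTM zone 14N, e.g. for central-US data
    data_planar <- sf::st_transform(data, 26914)
    # or British National Grid for GB data
    data_planar <- sf::st_transform(data, 27700)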
  • NAD83, Albers, Mercator, Robinson

    library(ggplot2)
    library(patchwork)
    library(sf)
    
    p1 <- ggplot() +
      geom_sf(data = lower_48, fill = "#0074D9", color = "white", linewidth = 0.25) +
      coord_sf(crs = st_crs("EPSG:4269")) +  # NAD83
      labs(title = "NAD83 projection") +
      theme_void() +
      theme(plot.title = element_text(hjust = 0.5, family = "Overpass Light"))
    
    p2 <- ggplot() +
      geom_sf(data = lower_48, fill = "#0074D9", color = "white", linewidth = 0.25) +
      coord_sf(crs = st_crs("ESRI:102003")) +  # Albers
      labs(title = "Albers projection") +
      theme_void() +
      theme(plot.title = element_text(hjust = 0.5, family = "Overpass Light"))
    
    p3 <- ggplot() +
      geom_sf(data = world_map, fill = "#FF4136", color = "white", linewidth = 0.1) +
      coord_sf(crs = st_crs("EPSG:3395")) +  # Mercator
      labs(title = "Mercator projection") +
      theme_void() +
      theme(plot.title = element_text(hjust = 0.5, family = "Overpass Light"))
    
    p4 <- ggplot() +
      geom_sf(data = world_map, fill = "#FF4136", color = "white", linewidth = 0.1) +
      coord_sf(crs = st_crs("ESRI:54030")) +  # Robinson
      labs(title = "Robinson projection") +
      theme_void() +
      theme(plot.title = element_text(hjust = 0.5, family = "Overpass Light"))
    
    (p1 | p2) / (p3 | p4)

Python

  • Example: Filter Data based on a polygon using latitude and longitude data

    • Get California’s polygon

      import osmnx
      import geopandas as gpd
      
      place = "California, USA"
      gdf = osmnx.geocode_to_gdf(place)
      # Get the target geometry
      gdf = gdf[["geometry", "bbox_north", "bbox_south", "bbox_east", "bbox_west"]]
    • Filter data according to the polygon geometry

      from shapely.geometry import Point
      
      # Convert to a GeoDataFrame with Point geometry
      geometry = [Point(xy) for xy in zip(df['Longitude'], df['Latitude'])]
      earthquake_gdf = gpd.GeoDataFrame(df, geometry=geometry, crs='EPSG:4326')
      
      # Filter to keep only points within the California polygon
      points_within_california = gpd.sjoin(earthquake_gdf, gdf, how='inner', predicate='within')
      
      # Select latitude, longitude etc. columns
      df = points_within_california[['id', 'Latitude', 'Longitude', 'datetime', 'properties.mag']]
      • Latitude and longitude are converted to point geometry to match the polygon geometry
      • An inner spatial join between the points and the California polygon keeps only the points that fall inside California.

sf

  • Misc
    • GEOS (planar) and S2 (spherical) are the geometry engines sf uses
      • Some arguments only work with certain engine versions

      • Check versions

        sf_extSoftVersion()["GEOS"]
        #>     GEOS 
        #> "3.12.1"
        packageVersion("s2")
        #> [1] ‘1.1.7’
  • Distances (see the st_distance sketch at the end of this list)
    • Planar Distances (Euclidean, feet or meters)
      • If your study area is small (e.g., a city block, a small park), the difference between planar and Great Circle distances will be minimal, and planar distances are usually sufficient
      • “For a city-sized area like Chicago, which spans roughly 15 km east-west and 25 km north-south, the distortion from using planar distance with raw lat/lon coordinates would be relatively small. At Chicago’s latitude (approximately 41.8°N), and over such short distances, the error compared to Great Circle distance would typically be less than 1%.” Over larger extents, though, the same approach can produce errors of several kilometers for distances measured across a state.
      • Uses Planar (projected) Coordinate Systems
        • UTM (Universal Transverse Mercator): Divides the Earth into 60 zones, each 6 degrees of longitude wide. Each zone has its own origin and projection parameters designed to minimize distortion within that zone.
          • e.g. UTM Zone 18N
        • State Plane Coordinate System: Used in the United States, each state (or sometimes a portion of a state) has its own projection and parameters optimized for that specific area
    • Great Circle Distances (latitude, longitude, degrees)
      • The shortest distance between two points along a great circle (a circle that divides the sphere into two equal hemispheres). It’s the most accurate for Earth-based calculations.
      • If your study area is large (e.g., a state, a country, multiple continents), the Earth’s curvature becomes significant, and you must use Great Circle distances.
      • Uses Geographic (angular) Coordinate Systems
        • A latitude and longitude system in degrees, which are angles measured on the Earth’s surface. They are inherently spherical (or more accurately, ellipsoidal).
        • e.g. WGS 84 (World Geodetic System 1984), a widely used geographic coordinate system based on the WGS 84 ellipsoid with EPSG code 4326
    • Spherical Distances (latitude, longitude, km)
      • Measured along a small circle parallel to the equator (constant latitude). This is simpler to calculate (euclidean vs trigonometric) than Great Circle distance, but less accurate, especially at high latitudes.
        • Unless you’re working with real-time calculations or massive datasets (millions of pairs of points), the computational savings probably wouldn’t justify the trade-off in accuracy.
      • Latitude is treated as a straight-line distance rather than accounting for the curvature of the Earth, which is why it’s less accurate than Great Circle distance at high latitudes.
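    • st_distance sketch (the two cities and the UTM zone are illustrative)

      library(sf)
      
      # two lon/lat points (WGS 84): Chicago and St. Louis
      pts <- st_sfc(
        st_point(c(-87.63, 41.88)), # Chicago
        st_point(c(-90.20, 38.63)), # St. Louis
        crs = 4326
      )
      
      # geographic CRS: st_distance() returns great circle distances (meters, spherical via S2)
      st_distance(pts)
      
      # planar CRS: project first (EPSG:32616, WGS 84 / UTM zone 16N), giving Euclidean distances
      st_distance(st_transform(pts, 32616))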
  • Points in Polygons
    • st_centroid - Finds the point in the center of a polygon

      • This could be in a body of water (i.e. outside the polygon) for some oddly shaped boundaries such as counties with coastal inlets or bays, or gerrymandered districts, etc.

      • Example

        NY8_sf <- 
          st_read(
            system.file("shapes/NY8_bna_utm18.gpkg", 
                        package = "spData"), 
            quiet = TRUE
          )
        
        # planar centroids: the data are in UTM zone 18N, a projected CRS
        NY8_ct_sf <- 
          st_centroid(
            st_geometry(NY8_sf), 
            of_largest_polygon = TRUE
          )
        st_is_longlat(NY8_ct_sf)
        #> [1] FALSE
        
        # spherical centroids: transform to geographic coordinates first
        NY8_ct_sf_sph <- 
          st_centroid(
            st_geometry(st_transform(NY8_sf, "OGC:CRS84")), 
            of_largest_polygon = TRUE
          )
        st_is_longlat(NY8_ct_sf_sph)
        #> [1] TRUE
    • st_point_on_surface - Guarantees that the point will fall on the surface of a member polygon. (Call pattern is the same as st_centroid; sketch below)

      • The point won’t necessarily be near the center.
      • Can generate random points in a polygon for labels
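      • Sketch, reusing NY8_sf from the st_centroid example

        NY8_pos_sf <- st_point_on_surface(st_geometry(NY8_sf))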
    • Generate a (potentially) more central point guaranteed to be within the polygon

      ny8_circs <- 
        st_inscribed_circle(st_geometry(NY8_sf), 
                            nQuadSegs = 0)
      
      NY8_cic_sf <- 
        st_cast(ny8_circs, 
                to = "POINT")
      • st_inscribed_circle: Calculates the largest circle that can fit entirely within each geometry in NY8_sf.
        • nQuadSegs controls the number of segments used to approximate the circle. When set to 0, it returns a 2-point LINESTRING (center point and a point on the circle). This representation is more accurate (center + radius) and efficient than using a polygon. This setting is only available for GEOS >= “3.9.0”. In general, a higher value (e.g. default = 30) results in a smoother polygon.
      • st_cast: Converts the 2-point LINESTRING to just a (center) point (to = “POINT”)