Preprocessing
Misc
- Packages
- {photon} - High-Performance Geocoding using photon
- {apache.sedona} - Sparklyr extension for processing geospatial data
- Conflation Types (i.e. geospatial joins)
- Notes fromConflating Overture Places Using DuckDB, Ollama, Embeddings, and More
- Recommends using String Similarity. Then use Embeddings on the remaining unmatched. Shows an 80% match. Can get about 90% by raising the similarity threshold, but it increases the amount of false positive matches.
- Exact Name Matching (e.g. by name variable
- Article example had a ~31% match rate
- String Similarity (e.g. address variable + Jaro-Winkler distance)
- Article example had a ~68% match rate
- Embeddings (e.g. embed name variable + cosine similarity)
- Article example had a ~71% match rated
- Notes fromConflating Overture Places Using DuckDB, Ollama, Embeddings, and More
- Beware statistical computations of tibbles/sf_tibbles with geometry columns
Could result in an expensive union operation over identical geometries and an R session crash
- Example with 100K rows crashed R.
Notes from thread
Option 1 (slower): Set do_union = FALSE in
summarize
<- tx_income_groups get_acs( geography = "tract", table = "B19001", state = "TX", year = 2020, geometry = TRUE |> ) filter(variable != "B19001_001") |> mutate(bracket = case_when( > "B19001_012" ~ "Above $100k", variable TRUE ~ "Below $100k" |> )) group_by(GEOID, bracket) |> summarize(n_households = sum(estimate, na.rm = TRUE), do_union = FALSE)
Option 2 (faster): Perform calculation without geometries then join
<- tracts("TX", cb = TRUE, year = 2020) |> tx_tracts select(GEOID) <- tx_income_groups get_acs( geography = "tract", table = "B19001", state = "TX", year = 2020, geometry = TRUE |> ) filter(variable != "B19001_001") |> mutate(bracket = case_when( > "B19001_012" ~ "Above $100k", variable TRUE ~ "Below $100k" |> )) group_by(GEOID, bracket) |> summarize(n_households = sum(estimate, na.rm = TRUE)) <- tx_tracts |> tx_income_groups left_join(tx_income_groups, by = "GEOID")
- {tidycensus} has an arg to bypass d/ling the geometries, geometry = FALSE and a separate
tracts
function to get the census tract geometries
- {tidycensus} has an arg to bypass d/ling the geometries, geometry = FALSE and a separate
File Types
PMTiles - A single-file archive format for tiled data. A PMTiles archive can be hosted on a commodity storage platform such as S3, and enables low-cost, zero-maintenance map applications that are “serverless” - free of a custom tile backend or third party provider. (Docs)
- Run your interactive, smooth-zooming vector map from any storage like S3 that supports http requests; a Caddy server running on your Wi-Fi router, or even GitHub pages (if tiles < 1GB).
- Cloudflare R2 is the recommended storage platform for PMTiles because it does not have bandwidth fees, only per-request fees: see R2 Pricing.
Shape Files
D/L and Load a shapefile
May need API key from Census Bureau (see {tigris} docs)
Validate geometries when loading a shape file
<- NY8_sf st_read(system.file( "shapes/NY8_bna_utm18.gpkg", package = "spData"), quiet = TRUE) # validate geometries table(st_is_valid(NY8_sf)) #> TRUE #> 281
- If
st_is_valid
returns FALSE, then you can runst_make_valid
- If
Example: Counties in California
<- tigris::counties(state = "CA") %>% tbl st_set_crs(4326)
{tigris} - US data
library(tigris) <- states(resolution = "20m", year = 2022, cb = TRUE) us_states <- us_states %>% lower_48 filter(!(NAME %in% c("Alaska", "Hawaii", "Puerto Rico")))
{rnaturalearth} - World data
# Via URL # Medium scale data, 1:50m Admin 0 - Countries # Download from https://www.naturalearthdata.com/downloads/50m-cultural-vectors/ <- read_sf("ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp") %>% world_map filter(iso_a3 != "ATA") # Remove Antarctica # Via Package library(rnaturalearth) # rerturnclass = "sf" makes it so the resulting dataframe has the special # sf-enabled geometry column <- ne_countries(scale = 50, returnclass = "sf") %>% world_map filter(iso_a3 != "ATA") # Remove Antarctica
GeoJSON
Write data to geojson
%>% data st_write("mb_shapes.geojson")
CSV to Dataframe (source)
# points <- localidad st_as_sf(localidad, coords = c("longitude", "latitude"), crs = 4326) # polygons <- departamentos st_as_sf(departamentos, wkt = "geomTDepartamento")
- wkt is the column that holds WKT encoded geometries
Big Data
Reduce Size Via SQL Query
Uses OGR SQL (SQLite might also be accepted)
Choose Layer
library("sf") st_layers("data/Lower_layer_Super_Output_Areas_2021_EW_BGC_V3.gpkg") ## Driver: GPKG ## Available layers: ## layer_name geometry_type features fields crs_name ## 1 LSOA_2021_EW_BGC_V3 Multi Polygon 35672 7 OSGB36 / British National Grid
- This file only has one layer
Get an idea of the columns in the layer by looking at the first row.
st_read("data/Lower_layer_Super_Output_Areas_2021_EW_BGC_V3.gpkg", query = "SELECT * FROM LSOA_2021_EW_BGC_V3 WHERE FID = 1", quiet = TRUE) ## Simple feature collection with 1 feature and 7 fields ## Geometry type: MULTIPOLYGON ## Dimension: XY ## Bounding box: xmin: 531948.3 ymin: 181263.5 xmax: 532308.9 ymax: 182011.9 ## Projected CRS: OSGB36 / British National Grid ## LSOA21CD LSOA21NM BNG_E BNG_N LONG LAT ## 1 E01000001 City of London 001A 532123 181632 -0.09714 51.51816 ## GlobalID SHAPE ## 1 {1A259A13-A525-4858-9CB0-E4952BA01AF6} MULTIPOLYGON (((532105.3 18...
- FID = 1 says look at the first row. FID is the feature ID attribute. I don’t think it’s actual column in the dataset.
Query the layer and filter
st_read("data/Lower_layer_Super_Output_Areas_2021_EW_BGC_V3.gpkg", query = "SELECT * FROM LSOA_2021_EW_BGC_V3 WHERE LSOA21CD LIKE 'W%'", quiet = TRUE)
- W% says looks for values that start with “W” (Wales) in the LSOA21CD column
- Based on the OCR SQL docs I think
%
is wildcard for multiple characters.
- Based on the OCR SQL docs I think
- W% says looks for values that start with “W” (Wales) in the LSOA21CD column
Use a bounding box to filter overlapping geometries
Example: Filter polygons overlapping the boundaries of Wales
Filter Wales from a UK shapefile dataset
<- sf::st_read("data/Countries_December_2022_GB_BGC.gpkg") uk <- dplyr::filter(uk, CTRY22NM == "Wales") wales
Create Wales polygon
<- wales_wkt |> wales ::st_geometry() |> sf::st_as_text() sf
Filter overlapping geometries
<- wales_lsoa ::st_read("data/Lower_layer_Super_Output_Areas_2021_EW_BGC_V3.gpkg", sfwkt_filter = wales_wkt)
- Some English LSOAs along the Wales/England border in addition to the Welsh LSOAs are read in, because these technically overlap with the Wales polygon on the border itself. Not perfect but still reduces the data being read into memory.
Projections
Misc
- Resources
- Google “epsg code” + “your region name” to find a reasonable projection code to use
- Spatial Reference
- Formats
- epsg: The EPSG code (if available), a standardized code for coordinate systems.
- proj4string: A text string defining the projection parameters in PROJ.4 format.
- wkt: The Well-Known Text representation of the CRS.
- OGC (Open Geospatial Consortium) vs WGS (World Geodetic System)
- Both are same except for axis order
- WGS84 (EPSG:4326) uses latitude/longitude order (y,x)
- Traditional axis order used by cartographers and geographers throughout history
- CRS84 (OGC:CRS84) uses longitude/latitude order (x,y)
- Adopted since many mathematical and computational systems conventionally list x first, then y.
- Example:
st_transform(data, "OGC:CRS84")
- With data that’s in an older coordinate system, transformation instructions to a more modern system might be included in the output (wkt text) of the
st_crs
- SOURCECRS: Your actual current coordinate system (e.g. UTM Zone 18N with Clarke 1866 datum)
- TARGETCRS: The reference system you’ll be in after the transformation (e.g. WGS84)
- ABRIDGEDTRANSFORMATION: Instructions for converting between the source and target systems
Geographic Coordinates
Required for Great Distance and Spherical Distance calculations
WGS 84
EPSG value is 4326
Most common (and required by leaflet)
Formats
st_transform(data, "EPSG:4326") # Most common st_transform(data, "WGS84") # Also works st_transform(data, "+proj=longlat +datum=WGS84") # Seen in older code before epsg st_transform(data, 4326) # epsg value
Convert dataframe with latitude and longitude to a sf object with geometries
<- old_tbl # contains latitude and longitude variables new_tbl # convert to simple features object ::st_as_sf( sfcoords = c("<longitude_var>", "<latitude_var>"), # order matters crs = 4326 %>% ) ::mapview() mapviw
Transform a sf object to WSG84 coordinates
<- us_states %>% # df with geometries us_states ::st_transform("EPSG:4326") # WGS 84 sf
Projected Coordinates
Used for Planar Distance calculations
United States:
- EPSG:26901 to EPSG:26923 for NAD83 UTM zones are very common
- Each state also has its own State Plane coordinate system
Europe:
- ETRS89-LAEA (EPSG:3035) is commonly used for pan-European analysis
- Individual countries often have their own systems
- e.g. British National Grid EPSG:27700
NAD83, Albers, Mercator, Robinson
library(patchwork) <- ggplot() + p1 geom_sf(data = lower_48, fill = "#0074D9", color = "white", linewidth = 0.25) + coord_sf(crs = st_crs("EPSG:4269")) + # NAD83 labs(title = "NAD83 projection") + theme_void() + theme(plot.title = element_text(hjust = 0.5, family = "Overpass Light")) <- ggplot() + p2 geom_sf(data = lower_48, fill = "#0074D9", color = "white", linewidth = 0.25) + coord_sf(crs = st_crs("ESRI:102003")) + # Albers labs(title = "Albers projection") + theme_void() + theme(plot.title = element_text(hjust = 0.5, family = "Overpass Light")) <- ggplot() + p3 geom_sf(data = world_map, fill = "#FF4136", color = "white", linewidth = 0.1) + coord_sf(crs = st_crs("EPSG:3395")) + # Mercator labs(title = "Mercator projection") + theme_void() + theme(plot.title = element_text(hjust = 0.5, family = "Overpass Light")) <- ggplot() + p4 geom_sf(data = world_map, fill = "#FF4136", color = "white", linewidth = 0.1) + coord_sf(crs = st_crs("ESRI:54030")) + # Robinson labs(title = "Robinson projection") + theme_void() + theme(plot.title = element_text(hjust = 0.5, family = "Overpass Light")) | p2) / (p3 | p4) (p1
Python
Example: Filter Data based on a polygon using latitude and longitude data
Get California’s polygon
import osmnx import geopandas as gpd = "California, USA" place = osmnx.geocode_to_gdf(place) gdf # Get the target geometry = gdf[["geometry", "bbox_north", "bbox_south", "bbox_east", "bbox_west"]] gdf
Filter data according the polygon geometry
from shapely.geometry import Point # Convert to a GeoDataFrame with Point geometry = [Point(xy) for xy in zip(df['Longitude'], df['Latitude'])] geometry = gpd.GeoDataFrame(df, geometry=geometry, crs='EPSG:4326') earthquake_gdf # Filter to keep only points within the California bounding box = gpd.sjoin(earthquake_gdf, gdf, how='inner', predicate='within') points_within_california # Select latitude, longitude etc. columns = points_within_california[['id', 'Latitude', 'Longitude', 'datetime', 'properties.mag']] df
- Latitude and longitude are converted to point geometry to match the polygon point geometry
- An inner join is used on the data and california polygon to get the points that are only in California.
sf
- Misc
- GEOS and S2 are engines
Some arguments work with different versions
Check versions
sf_extSoftVersion()["GEOS"] #> GEOS #> "3.12.1" packageVersion("s2") #> [1] ‘1.1.7’
- GEOS and S2 are engines
- Distances
- Planar Distances (Euclidean, feet or meters)
- If your study area is small (e.g., a city block, a small park), the difference between planar and Great Circle distances will be minimal, and planar distances are usually sufficient
- “For a city-sized area like Chicago, which spans roughly 15 km east-west and 25 km north-south, the distortion from using planar distance with raw lat/lon coordinates would be relatively small. At Chicago’s latitude (approximately 41.8°N), and over such short distances, the error compared to Great Circle distance would typically be less than 1%. errors of several kilometers for distances measured across the state.”
- Uses Planar (projected) Coordinate Systems,
- UTM (Universal Transverse Mercator): Divides the Earth into 60 zones, each 6 degrees of longitude wide. Each zone has its own origin and projection parameters designed to minimize distortion within that zone.
- e.g. UTM Zone 18N
- State Plane Coordinate System: Used in the United States, each state (or sometimes a portion of a state) has its own projection and parameters optimized for that specific area
- UTM (Universal Transverse Mercator): Divides the Earth into 60 zones, each 6 degrees of longitude wide. Each zone has its own origin and projection parameters designed to minimize distortion within that zone.
- Great Circle Distances (latitude, longitude, degrees)
- The shortest distance between two points along a great circle (a circle that divides the sphere into two equal hemispheres). It’s the most accurate for Earth-based calculations.
- If your study area is large (e.g., a state, a country, multiple continents), the Earth’s curvature becomes significant, and you must use Great Circle distances.
- Uses Geographic (angular) Coordinate Systems
- A latitude and longitude system in degrees, which are angles measured on the Earth’s surface. They are inherently spherical (or more accurately, ellipsoidal).
- e.g. WGS 84 (World Geodetic System 1984), a widely used geographic coordinate system based on the WGS 84 ellipsoid with EPSG code 4326
- Spherical Distances (latitude, longitude, km)
- Measured along a small circle parallel to the equator (constant latitude). This is simpler to calculate (euclidean vs trigonometric) than Great Circle distance, but less accurate, especially at high latitudes.
- Unless you’re working with real-time calculations or massive datasets (millions of pairs of points), the computational savings probably wouldn’t justify the trade-off in accuracy.
- Latitude is treated as a straight-line distance rather than accounting for the curvature of the Earth, which is why it’s less accurate than Great Circle distance at high latitudes.
- Measured along a small circle parallel to the equator (constant latitude). This is simpler to calculate (euclidean vs trigonometric) than Great Circle distance, but less accurate, especially at high latitudes.
- Planar Distances (Euclidean, feet or meters)
- Points in Polygons
st_centroid
- Finds the point in the center of a polygonThis could be in a body of water (i.e. outside the polygon) for some oddly shaped boundaries such as counties with coastal inlets or bays, or gerrymandered districts, etc.
Example
# spherical <- NY8_sf st_read( system.file("shapes/NY8_bna_utm18.gpkg", package = "spData"), quiet = TRUE )<- NY8_ct_sf st_centroid( st_geometry(NY8_sf), of_largest_polygon = TRUE )# planar st_is_longlat(NY8_ct_sf) #> [1] FALSE <- st_crs(NY8_sf) ny8_crs # Set the CRS of the centroid geometry <- NY8_ct_sf_sph st_set_crs(NY8_ct_sf, ny8_crs)# spherical st_is_longlat(NY8_ct_sf) #> [1] TRUE
st_point_the_surface
- Guarantees that the point will fall on the surface of a member polygon. (Code is the same asst_centroid
)- The point won’t necessarily be near the center.
- Can generate random points in a polygon for labels
Generate a (potentially) more central point guaranteed to be within the polgon
<- ny8_circs st_inscribed_circle(st_geometry(NY8_sf), nQuadSegs = 0) <- NY8_cic_sf st_cast(ny8_circs, to = "POINT")
st_inscribed_circle
: Calculates the largest circle that can fit entirely within each geometry in NY8_sf.- nQuadSegs controls the number of segments used to approximate the circle. When set to 0, it returns a 2-point LINESTRING (center point and a point on the circle). This representation is more accurate (center + radius) and efficient than using a polygon. This setting is only available for GEOS >= “3.9.0”. In general, a higher value (e.g. default = 30) results in a smoother polygon.
st_cast
: Converts the 2-point LINESTRING to just a (center) point (to = “POINT”)