Census Data

Misc

  • Notes from
  • Resources
  • Other Surveys
    • American Housing Survey (AHS)
    • General Social Survey (GSS)
      • Conducted since 1972. Contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.
      • Packages
        • {gssr} - Convenient interface that provides access to GSS data
        • {gssrdoc} - Companion package to {gssr}
          • Provides documentation for all GSS variables in the cumulative data file via R’s help system.
          • Browse variables by name in the package’s help file or type ? followed by the name of the variable at the console to get a standard R help page containing information on the variable, the values it takes and (in most cases) a crosstabulation of the variable’s values for each year of the GSS.
    • Demographic and Health Surveys (DHS) - The program has collected, analyzed, and disseminated accurate and representative data on population, health, HIV, and nutrition through more than 400 surveys in over 90 countries
      • Packages
        • {DHSr} - Mapping, spatial analysis, and statistical modeling of microdata from sources such as DHS and IPUMS (Integrated Public Use Microdata Series)
          • Supports spatial correlation index construction and visualization, along with empirical Bayes approximation of regression coefficients in a multistage setup
          • Create large scale repeated regression statistics
            • e.g. Run regression for n groups, the group ID should be vertically composed into the variable for the parameter ‘location_var’.
            • glms, logit, probit, and more.
  • For details on joining census data to other data, see Chapter 7.2 in Analyzing Census Data
  • FIPS GEOID
  • Census Geocoder (link)
    • Enter an address and codes for various geographies are returned
    • Batch geocoding available for up to 10K records
      • Codes for geographies returned in a .csv file
  • TIGERweb (link)
    • Allows you to get geography codes by searching for an area on a map
    • Once zoomed-in on your desired area, you turn on geography layers to find the geography code for your area.
  • US Census Regions

Geographies

  • Misc

    • {tidycensus} docs on various geographies, function arguments, and which surveys (ACS, Census) they’re available in.
    • ACS Geography Boundaries by Year (link)
  • Types

    • Legal/Administrative
      • Census gets boundaries from outside party (state, county, city, etc.)
      • e.g. election areas, school districts, counties, county subdivisions
    • Statistical
      • Census creates these boundaries
      • e.g. regions, census tracts, ZCTAs, block groups, MSAs, urban areas
  • Nested Areas

    • Census Tracts
      • Areas within a county
      • Around 1200 to 8000 people
      • Small towns, rural areas, neighborhoods
      • ** Census tracts may cross city boundaries **
    • Block Groups
      • Areas within a census tract
      • Around 600 to 3000 people
    • Census Blocks
      • Areas within a block group
      • Not for ACS, only for the 10-yr census
  • Places

    • Misc
      • One place cannot overlap another place
      • Expand and contract as population or commercial activity increases or decreases
      • Must represent an organized settlement of people living in close proximity.
    • Incorporated Places
      • cities, towns, villages
      • Updated through Boundary and Annexation Survey (BAS) yearly
    • Census Designated Places (CDPs)
      • Areas that can’t become Incorporated Places because of state or city regulations
      • Concentrations of population, housing, commericial structures
      • Updated through Boundary and Annexation Survey (BAS) yearly
  • County Subdivisions

    • Minor Civil Divisions (MCDs)
      • Legally defined by the state or county, stable entity. May have elected government
      • e.g. townships, charter townships, or districts
    • Census County Divisions (CCDs)
      • no population requirment
      • Subcounty units with stable boundaries and recognizable names
  • Zip Code Tabulation Areas (ZCTAs)

    • Misc
      • Regular zip codes are problematic — can cross state lines.
      • {crosswalkZCTA} - Contains the US Census Bureau’s 2020 ZCTA to County Relationship File, as well as convenience functions to translate between States, Counties and ZIP Code Tabulation Areas (ZCTAs)
    • Approximate USPS Code distribution for housing units
      • The most frequently occurring zip code within an census block is assigned to a census block
      • Then blocks are aggregated into areas (ZCTAs)
    • ZCTAs do NOT nest within any other geographies
      • I guess the aggregated ZCTA blocks can overlap block groups
    • 2010 ZCTAs exclude large bodies of water and unpopulated areas
    • 2020 ZCTAs that cross state lines (source, code)

American Community Survey (ACS)

  • Misc

    • Default MOE is a 90%CI
    • Popular variable calculations from variables in ACS
    • For variables, vars <- load_variables(2022, "acs5")
      • For the 2022 5-year ACS,
        • "acs5" for the Detailed Tables;
        • "acs5/profile" for the Data Profile; aggregated statistics for acs5
          • p (suffix): Percentage with appropriate denominator
        • "acs5/subject" for the Subject Tables; and
        • "acs5/cprofile" for the Comparison Profile
      • Geographies only shown for 5-year ACS
  • About

    • Yearly estimates based on samples of the population over a 5yr period
      • Therefore a Margin of Error (MoE) is included with the estimates.
    • Available as 1-year estimates (for geographies of population 65,000 and greater) and 5-year estimates (for geographies down to the block group)
    • Detailed social, economic, housing, and demographic characteristics. Variables covering e.g. income, education, language, housing characteristics
    • census.gov/acs
  • ACS Release Schedule (releases)

    • September - 1-Year Estimates (from previous year’s collection)
      • Estimates for areas with populations of >65K
    • October - 1-Year Supplemental Estimates
      • Estimates for areas with populations between 20K-64999
    • December - 5-Year Estimates
      • Estimates for areas including census tract and block groups
  • Data Collected

    • Population
      • Social
        • Ancestry, Citizenship, Citizen Voting Age  Population, Disability, Education Attainment, Fertility, Grandparents, Language, Marital Status, Migration, School Enrollment, Veterans
      • Demographic
        • Age, Hispanic Origin, Race, Relationship, Sex
      • Economic
        • Class of worker, Commuting, Employment Status, Food Stamps (SNAP), Health Insurance, Hours/Week, Weeks/Year, Income, Industry & Occupation
    • Housing
      • Computer & Internet Use, Costs (Mortgage, Taxes, Insurance), Heating Fuel, Home Value, Occupancy, Plumbing/Kitchen Facilities, Structure, Tenure (Own/Rent), Utilities, Vehicles, Year Built/Year Movied In
  • Example: Median Household Income for Texas Counties

    texas_income <- get_acs(
      geography = "county",
      variables = "B19013_001", # median household income
      state = "TX",
      year = 2022
    )
    • Default MOE is a 90%CI (i.e. estimate \(\pm\) MOE)
  • Example: Census Tract for Multiple Counties in NY

    nyc_income <- get_acs(
      geography = "tract",
      variables = "B19013_001",
      state = "NY",
      county = c("New York", "Kings", "Queens",
                 "Bronx", "Richmond"),
      year = 2022,
      geometry = TRUE
    )
    mapview(nyc_income, zcol = "estimate")
  • Example: Multiple Races Percentages for San Diego County

    san_diego_race_wide <- get_acs(
      geography = "tract",
      variables = c(
        Hispanic = "DP05_0073P",
        White = "DP05_0079P",
        Black = "DP05_0080P",
        Asian = "DP05_0082P"
      ),
      state = "CA",
      county = "San Diego",
      geometry = TRUE,
      output = "wide",
      year = 2022
    )
    faceted_choro <- ggplot(san_diego_race, aes(fill = estimate)) + 
      geom_sf(color = NA) + 
      theme_void() + 
      scale_fill_viridis_c(option = "rocket") + 
      facet_wrap(~variable) + 
      labs(title = "Race / ethnicity by Census tract",
           subtitle = "San Diego County, California",
           fill = "ACS estimate (%)",
           caption = "2018-2022 ACS | tidycensus R package")
    • Allows for comparison, but for groups with less variation as compared to other groups since scaled according to all groups
      • You’d want to make a separate map for the Black population in order to compare variation between counties.
  • Migrations Flows

    • Example:

      fulton_inflow <- 
        get_flows(
          geography = "county",
          state = "GA",
          county = "Fulton",
          geometry = TRUE,
          year = 2020
        ) %>%
        filter(variable == "MOVEDIN") %>%
        na.omit()
      
      fulton_top_origins <- 
        fulton_inflow %>%
          slice_max(estimate, 
                    n = 30) 
      
      library(rdeck)
      
      Sys.getenv("MAPBOX_ACCESS_TOKEN")
      
      fulton_top_origins$centroid1 <- 
        st_transform(fulton_top_origins$centroid1, 4326)
      fulton_top_origins$centroid2 <- 
        st_transform(fulton_top_origins$centroid2, 4326)
      
      flow_map <- 
        rdeck(
          map_style = mapbox_light(), 
          initial_view_state = view_state(center = c(-98.422, 38.606), 
                                          zoom = 3, 
                                          pitch = 45)
        ) %>%
        add_arc_layer(
          get_source_position = centroid2,
          get_target_position = centroid1,
          data = as_tibble(fulton_top_origins),
          get_source_color = "#274f8f",
          get_target_color = "#274f8f",
          get_height = 1,
          get_width = scale_linear(estimate, range = 1:5),
          great_circle = TRUE
        )
      • Width of lines is scaled to counts

Dicennial US Census

Misc

  • A complete count — not based on samples like the ACS
  • Applies differential privacy to preserve respondent confidentiality
    • Adds noise to data. Greater effect at lower levels (i.e. block level)
    • The exception is that is no differetial privacy for household-level data.

PL94-171

  • Population data which the government needs for redistricting

  • sumfile = “pl”

  • State Populations

    pop20 <- 
      get_decennial(
        geography = "state",
        variables = "P1_001N",
        year = 2020
      )
    • For 2020, default is sumfile = “pl”

DHC

  • Age, Sex, Race, Ethnicity, and Housing Tenure (most popular dataset)

  • sumfile = “dhc”

  • County

    tx_population <- 
      get_decennial(
        geography = "county",
        variables = "P1_001N",
        state = "TX",
        sumfile = "dhc",
        year = 2020
      )
  • Census Block (analogous to a city block)

    matagorda_blocks <- 
      get_decennial(
        geography = "block",
        variables = "P1_001N",
        state = "TX",
        county = "Matagorda",
        sumfile = "dhc",
        year = 2020
      )

Demographic Profile

  • Pretabulated percentages from dhc

  • sumfile = “dp”

    • Tabulations for 118th Congress and Island Areas (i.e. Congressional Districts)
      • sumfile = “cd118”
  • C suffix variables are counts while P suffix variables are percentages

    • 0.4 is 0.4% not 40%
  • Example: Same-sex married and partnered in California by County

    ca_samesex <- 
      get_decennial(
        geography = "county",
        state = "CA",
        variables = c(married = "DP1_0116P",
                      partnered = "DP1_0118P"),
        year = 2020,
        sumfile = "dp",
        output = "wide"
      )

Detailed DHC-A

  • Detailed demographic data; Thousands of racial and ethnic groups; Tabulation by sex and age.

  • Different groups are in different tables, so specific groups can be hard to locate.

  • Adaptive design means the demographic group (i.e. variable) will only be available in certain areas. For privacy, data gets supressed when the area has low population.

    • There’s typically considerable sparsity especially when going down census tract
  • Args

    • sumfile = “ddhca”
    • pop_group - Population group code (See get_pop_groups below)
      • “all” for all groups
      • pop_group_label = TRUE - Adds group labels
  • get_pop_groups(2020, "ddhca") - Gets group codes for ethnic groups

    • For various groups there could be at least two variables (e..g Somaili, Somali and any combination)
    • For time series analysis, analagous groups to 2020’s for 2000 is SF2/SF4 and for 2010 is SF2. (SF stands for Summary File)
  • check_ddhca_groups - Checks which variables are available for a specific group

    • Example: Somali

      check_ddhca_groups(
        geography = "county", 
        pop_group = "1325", 
        state = "MN", 
        county = "Hennepin"
      )
  • Example: Minnesota group populations

    load_variables(2020, "ddhca") %>% 
      View()
    mn_population_groups <- 
      get_decennial(
        geography = "state",
        variables = "T01001_001N", # total population
        state = "MN",
        year = 2020,
        sumfile = "ddhca",
        pop_group = "all", # for all groups
        pop_group_label = TRUE
      )
    • Includes aggregate categories like European Alone, Other White Alone, etc., so you can’t just aggregate the value column to get the total population in Minnesota.
      • So, in order to calculate ethnic group ratios of the total state or county, etc. population, you need to get those state/county totals from other tables (e.g. PL94-171)
  • Use dot density and not chloropleths to visualize these sparse datasets

    • Example: Somali populations by census tract in Minneapolis

      hennepin_somali <- 
        get_decennial(
          geography = "tract",
          variables = "T01001_001N", # total population
          state = "MN",
          county = "Hennepin",
          year = 2020,
          sumfile = "ddhca",
          pop_group = "1325", # somali
          pop_group_label = TRUE,
          geometry = TRUE
        )
      
      somali_dots <- 
        as_dot_density(
          hennepin_somali,
          value = "value", # column name which is by default, "value"
          values_per_dot = 25
        )
      
      mapview(somali_dots, 
              cex = 0.01, 
              layer.name = "Somali population<br>1 dot = 25 people",
              col.regions = "navy", 
              color = "navy")
      • values_per_dot = 25 says make each dot worth 25 units (e.g. people or housing units)

Time Series Analysis

  • {tidycensus} only has 2010 and 2020 censuses

  • Issue: county names and boundaries change over time (e.g. Alaska redraws a lot)

    • Census gives a different GeoID to counties that get renamed even though they’re the same county.
    • NA values showing up after you calculate how the value changes over time is a good indication of this problem. Check for NAs: filter(county_change, is.na(value10))
  • Example: Join 2010 and 2020 and Calculate Percent Change

    county_pop_10 <- 
      get_decennial(
        geography = "county",
        variables = "P001001", 
        year = 2010,
        sumfile = "sf1"
      )
    
    county_pop_10_clean <- 
      county_pop_10 %>%
        select(GEOID, value10 = value) 
    
    county_pop_20 <- 
      get_decennial(
        geography = "county",
        variables = "P1_001N",
        year = 2020,
        sumfile = "dhc"
      ) %>%
        select(GEOID, NAME, value20 = value)
    
    county_joined <- 
      county_pop_20 %>%
        left_join(county_pop_10_clean, by = "GEOID") 
    
    county_joined
    
    county_change <- 
      county_joined %>%
        mutate( 
          total_change = value20 - value10, 
          percent_change = 100 * (total_change / value10) 
        ) 
  • Example: Age distribution over time in Michigan

    • Code available in the github repo or R/Workshops/tidycensus-umich-workshop-2024-main/census-2020/bonus-chart.R
    • Distribution shape remains pretty much the same, but decreasing for most age cohorts, i.e. people are leaving the state across most age groups.
      • e.g. The large hump representing the group of people in there mid-40s in 2000 steadily decreases over time.

tidycensus

  • Get an API key

    • Request a key, then activate the key from the link in your email.(https://api.census.gov/data/key_signup.html)
      • Required for hitting the census API over 500 times per day which isn’t as hard as you’d think.
    • Set as an environment variable: census_api_key("<api key>", install = TRUE)
      • Or add this line to .Renviron file, CENSUS_API_KEY=‘<api key’
  • Search Variables

    • Columns

      • Name - ID of the variable (Use this in the survey functions)
      • Label - Detailed description of the variable
      • Context - Subject of the table that the variable is located in.
    • Prefixes (Variables can have combinations of prefixes)

      • P: i.e. Person; Data available at the census block and larger
      • CT: Data available at the census track and larger
      • H: Data available at the Housing Unit level
        • I think housing unit is an alternatve unit. So instead of the unit being a person, which I assume is the typical unit, it’s a housing unit (~family).

        • Not affected by Differential Privacy (i.e. no noise added; true value)

        • Example: Total Deleware housing units at census block level

          dp_households <- 
                get_decennial(
                      geography = "block",
                      variables = "H1_001N",
                      state = "DE",
                      sumfile = "dhc",
                      year = 2020
                )
    • Example: DHC data in census for 2020

      vars <- load_variables(2020, "dhc")
      
      View(vars)
      • View table, click filter, and then search for parameters (e.g. Age, Median, etc.) with the Label, Context boxes, and overall search box
  • summary_var - Argument for supplying an additional variable that you need to calculate some kind of summary statistic

    • Example: Race Percentage per Congressional District

      
      race_vars <- c(
        Hispanic = "P5_010N", # all races identified as hispanic
        White = "P5_003N", # white not hispanic
        Black = "P5_004N", # black not hispanic
        Native = "P5_005N", # native american not hispanic
        Asian = "P5_006N", # asian not hispanic
        HIPI = "P5_007N" # hawaiian, islander not hispanic
      )
      
      cd_race <- 
        get_decennial(
          geography = "congressional district",
          variables = race_vars,
          summary_var = "P5_001N", # total population for county
          year = 2020,
          sumfile = "cd118"
      )
      
      cd_race_percent <- 
        cd_race %>%
          mutate(percent = 100 * (value / summary_value)) %>% 
          select(NAME, variable, percent)

Mapping

Misc

  • Use geometry = TRUE for any of the get_* {tidycensus} functions, and it’ll join the shapefile with the census data. Returns a SF (Simple Features) dataframe for mapping.

  • If you only want the shape files without the demographic data, see {tigris}

  • For examples with {tmap}, see Chapter 6.3 of Analyzing US Census Data

  • {mapview} along with some other packages gives you some tools for comparing maps (useful for eda or exploratory reports, etc.) (m1 and m2 are mapview objects)

    • m1 + m2 - Creates layers that allows you click the layers button and cycle through multiple maps. So I assume you could compare more than two maps here.
    • m1 | m2 - Creates swipe map (need {leaflet.extras2}). There will be a vertical slider that you can interactively slide horizontally to gradually expose one map or the other.
    • sync(m1, m2) - Uses {leafsync} to create side by side maps. Zooming and cursor movement are synced on both maps.

Preprocessing

  • Remove water from geographies


    nyc_income_tiger <- 
      get_acs(
        geography = "tract",
        variables = "B19013_001",
        state = "NY",
        county = c("New York", "Kings", "Queens",
                   "Bronx", "Richmond"),
        year = 2022,
        cb = FALSE,
        geometry = TRUE
    )
    
    library(tigris)
    library(sf)
    sf_use_s2(FALSE)
    
    nyc_erase <- 
      erase_water(
        nyc_income_tiger,
        area_threshold = 0.5,
        year = 2022
    )
    
    mapview(nyc_erase, zcol = "estimate")
    • The left figure is before and the right figure is after water removal
      • At the center-bottom, you can see how a sharp point is exposed where it was squared off before
      • At the top-left, the point along the border which are piers/docks are now exposed.
      • At the upper-middle, extraneous boundary lines have been removed and islands in the waterway are more clearly visible.
    • Works only with regular tigris shapefiles from the US census bureau — so not OpenStreetMaps, etc. For other shapefiles, you’d need to do the manual overlay, see Chapter 7.1 in Analyzing US Census Data for details.
    • Can take a couple minutes to run
    • cb = FALSE says get the regular tigris line files which avoid sliver polygons which are caused by slight misalignment of layers (?)
    • area_threshold = 0.5 says that water areas below the 50th percentile in terms of size are removed. Probably a value you’ll have to play around with.

Choropleths

  • Best for continuous data like rates and percentages, but you can use for discrete variables

    • You can create a discrete color palette with the at argument in the mapview function.
      • Example

        # check min and max of your data to select range of bins
        min(iowa_over_65, 
            na.rm = TRUE) # 0
        max(iowa_over_65, 
            na.rm = TRUE) # 38.4
        
        m1 <- 
          mapview(
            iowa_over_65, 
            zcol = "value",
            layer.name = "% age 65 and up<br>Census tracts in Iowa",
            col.regions = inferno(100, direction = -1),
            at = c(0, 10, 20, 30, 40)
          )
        • This will result in a discrete palette with bins of 0-10, 10-20, etc. Looks like an overlap, so I’m sure which bin contains the endpoints.
  • Example: Over 65 in Iowa by census tract

    library(mapviw); library(viridisLite)
    
    iowa_over_65 <- 
      get_decennial(
        geography = "tract",
        variables = "DP1_0024P",
        state = "IA",
        geometry = TRUE,
        sumfile = "dp",
        year = 2020
      )
    m1 <- 
      mapview(
        iowa_over_65, zcol = "value",
        layer.name = "% age 65 and up<br>Census tracts in Iowa",
        col.regions = inferno(100, 
                              direction = -1))
    • {mapview} is interactive and great for exploration of data
  • Export as an HTML file

    htmlwidgets::saveWidget(m1@map, "iowa_over_65.html")
    • Can embed it elsewhere (html report or website) by adding it as an asset
  • In {ggplot}

    texas_income_sf <- 
      get_acs(
        geography = "county",
        variables = "B19013_001",
        state = "TX",
        year = 2022,
        geometry = TRUE
    )
    
    plot(texas_income_sf['estimate'])

Circle Maps

  • “Graduated Symbol” maps are better for count data. Even though using a choropleth is not as bad at the census tract level since all tracts have around 4000 people, the sizes of the tracts can be substantially different which can influence the interpretation. Using circles or bubbles, etc. focuses the user on the size of the symbol and less on the size of the geography polygons.

  • Example: Hispanic Counts in San Diego County at the Census Tract Level

    
    san_diego_race_counts <- get_acs(
      geography = "tract",
      variables = c(
        Hispanic = "DP05_0073",
        White = "DP05_0079",
        Black = "DP05_0080",
        Asian = "DP05_0082"
      ),
      state = "CA",
      county = "San Diego",
      geometry = TRUE,
      year = 2022
    )
    
    san_diego_hispanic <- 
      filter(
        san_diego_race_counts, 
        variable == "Hispanic"
      )
    
    centroids <- st_centroid(san_diego_hispanic)
    
    grad_symbol <- 
      ggplot() + 
      geom_sf(
        data = san_diego_hispanic, 
        color = "black", 
        fill = "lightgrey") + 
      geom_sf(
        data = centroids, 
        aes(size = estimate),
        alpha = 0.7, 
        color = "navy") + 
      theme_void() + 
      labs(
        title = "Hispanic population by Census tract",
        subtitle = "2018-2022 ACS, San Diego County, California",
        size = "ACS estimate") + 
      scale_size_area(max_size = 6)
    • st_centroid finds the center point of geography polygons which will be the location of the symbols. If you look at the geometry column it will say POINT, which only has a latitude and longitude, instead of POLYGON, which as multiple coordiates.

    • scale_size_area scales the size of the circle according to the count value.

      • max_size is the maximum diameter of the circle which you’ll want to adjust to be large enough so that you can differentiate the circles but small enough so you have the least amount of overlap between circles in neighboring geographies (although this is probably inevitable).

Dot Density

  • Useful to show heterogeneity and mixing between groups versus plotting group facet maps.

  • Example: Population by Race in San Diego County

    san_diego_race_dots <- 
      as_dot_density(
        san_diego_race_counts, # see circle maps example for code
        value = "estimate", # group population
        values_per_dot = 200,
        group = "variable" # races
      )
    
    dot_density_map <- 
      ggplot() + 
      annotation_map_tile(type = "cartolight", 
                          zoom = 9) + 
      geom_sf(
        data = san_diego_race_dots, 
        aes(color = variable), 
        size = 0.01) + 
      scale_color_brewer(palette = "Set1") + 
      guides(color = guide_legend(override.aes = list(size = 3))) + 
      theme_void() + 
      labs(
        color = "Race / ethnicity",
        caption = "2018-2022 ACS | 1 dot = approximately 200 people")
    • as_dot_density scatters the dots randomly within a geography. values_per_dotsays each dot is 200 units (e.g. people or households). Without shuffling, ggplot will layer each group’s dots on top of each other.

    • annotation_map_tile from {ggspatial} applies a base map layer as a reference for the user. Base maps have land marks and popular features labeled in the geography and surrounding areas to help the user identify the area being shown.