Census Data

Misc

  • Notes from
  • Packages
    • {tidycensus} - Allows users to easily interface with a select number of the US Census Bureau’s data APIs and return tidyverse-ready data frames, optionally with simple feature geometry included
      • Decennial US Census and ACS
      • Also Population Estimates API and Migration Flows API (See Other Datasets Vignette)
    • {censusapi} - Access all Census APIs
    • {lehdr} - Allows users to interface with the LEHD Origin-Destination Employment Statistics (LODES) dataset, returned as data frames (see the sketch below)
      • LODES data is part of the LEHD survey (See below)
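      • A minimal sketch of pulling LODES data (argument names assumed from the current {lehdr} docs):

        library(lehdr)

        # 2021 origin-destination LODES for Oregon, aggregated to tracts
        or_od <- grab_lodes(
          state = "or",
          year = 2021,
          lodes_type = "od",
          agg_geo = "tract"
        )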
  • Resources
  • Beware of Connecticut
    • Changed from having 8 counties to having 9 “Planning Regions” in 2022.
    • This affects census tracts as well
  • Other Surveys
    • American Housing Survey (AHS)
    • General Social Survey (GSS)
      • Conducted since 1972. Contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.
      • Packages
        • {gssr} - Convenient interface that provides access to GSS data
        • {gssrdoc} - Companion package to {gssr}
          • Provides documentation for all GSS variables in the cumulative data file via R’s help system.
          • Browse variables by name in the package’s help file, or type ? followed by the variable name at the console to get a standard R help page containing information on the variable, the values it takes, and (in most cases) a crosstabulation of the variable’s values for each year of the GSS (see the sketch below).
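          • A hedged sketch (assuming {gssr} still lazy-loads the cumulative file as gss_all):

            library(gssr)
            library(gssrdoc)

            data(gss_all) # cumulative GSS data file
            ?polviews     # {gssrdoc} help page for the polviews variable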
    • Demographic and Health Surveys (DHS) - The program has collected, analyzed, and disseminated accurate and representative data on population, health, HIV, and nutrition through more than 400 surveys in over 90 countries
      • Packages
        • {DHSr} - Mapping, spatial analysis, and statistical modeling of microdata from sources such as DHS and IPUMS (Integrated Public Use Microdata Series)
          • Supports spatial correlation index construction and visualization, along with empirical Bayes approximation of regression coefficients in a multistage setup
          • Create large scale repeated regression statistics
            • e.g. Run a regression for each of n groups; the group IDs should be stacked vertically in the variable passed to the ‘location_var’ parameter.
            • glms, logit, probit, and more.
    • U.S. National Health Interview Survey (NHIS) - Has monitored the health of the nation since 1957. NHIS data on a broad range of health topics are collected through personal household interviews. Survey results have been instrumental in providing data to track health status, health care access, and progress toward achieving national health objectives. (source). Uses a stratified, multistage sample design.
    • Current Population Survey (CPS) - From 1962 to the present. Data include demographic information, rich employment data, program participation and supplemental data on topics such as fertility, tobacco use, volunteer activities, voter registration, computer and internet use, food security, and more.
      • Alternative to the ACS for getting nationally representative demographic data.
        • Census Bureau’s FAQ on the differences
        • Notably, the ACS only surveys households, and it uses a shorter 8-question survey, while the CPS tries to reach the entire civilian noninstitutionalized population and uses a longer, more detailed survey.
      • A monthly survey of ≈100,000 US individuals. A more comprehensive annual version—the Annual Social and Economic Supplement (ASEC)—is published every March.
      • Heiss article on comparing your survey data estimates to CPS population estimates
    • Longitudinal Employer-Household Dynamics (LEHD) - Since 1999, the survey has explored the interactions between workers and firms. By linking employer and household data, the LEHD program has built a comprehensive database of longitudinally linked jobs data.
      • Access
      • Job-to-Job Flows (J2J)
      • Origin-Destination Employment Statistics (LODES) (See R package)
      • Post-Secondary Employment Outcomes (PSEO)
      • Quarterly Workforce Indicators (QWI)
      • Veteran Employment Outcomes (VEO)
      • Restricted-Use Microdata
  • For details on joining census data to other data, see Chapter 7.2 in Analyzing Census Data
  • FIPS GEOID
  • Census Geocoder (link)
    • Enter an address and codes for various geographies are returned
    • Batch geocoding available for up to 10K records
      • Codes for geographies returned in a .csv file
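    • A hedged sketch of querying the geocoder from R (endpoint and parameter values taken from the geocoder docs; they may change):

      library(httr2)

      resp <- request("https://geocoding.geo.census.gov/geocoder/geographies/onelineaddress") |>
        req_url_query(
          address = "4600 Silver Hill Rd, Washington, DC 20233",
          benchmark = "Public_AR_Current",
          vintage = "Current_Current",
          format = "json"
        ) |>
        req_perform()

      # geography codes (state/county FIPS, tract, block) for the matched address
      matches <- resp_body_json(resp)$result$addressMatches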
  • TIGERweb (link)
    • Allows you to get geography codes by searching for an area on a map
    • Once zoomed-in on your desired area, you turn on geography layers to find the geography code for your area.
  • US Census Regions

Geographies

  • Misc

    • {tidycensus} docs on various geographies, function arguments, and which surveys (ACS, Census) they’re available in.
    • ACS Geography Boundaries by Year (link)
    • Public Use Microdata Areas (PUMA)
      • See Walker’s book, Ch. 9.4
      • Redrawn after each decennial US Census but usually only go into effect in other surveys a couple of years afterwards
        • e.g. 2022 1-yr ACS was the first ACS to use 2020 PUMAs
      • Typically contain 100K-200K people
      • Useful when variables at the county level are sparse. By moving up to a larger geographic area, there are fewer NAs (probably none).
  • Types

    • Legal/Administrative
      • Census gets boundaries from outside party (state, county, city, etc.)
      • e.g. election areas, school districts, counties, county subdivisions
    • Statistical
      • Census creates these boundaries
      • e.g. regions, census tracts, ZCTAs, block groups, MSAs, urban areas
  • Nested Areas

    • Census Tracts
      • Areas within a county
      • Around 1200 to 8000 people
      • Small towns, rural areas, neighborhoods
      • Note: Census tracts may cross city boundaries
    • Block Groups
      • Areas within a census tract
      • Around 600 to 3000 people
    • Census Blocks
      • Areas within a block group
      • Could be a city block in dense areas or much larger in rural areas.
      • Not for the ACS, only for the Decennial (10-yr) census
  • Places

    • Misc
      • One place cannot overlap another place
      • Expand and contract as population or commercial activity increases or decreases
      • Must represent an organized settlement of people living in close proximity.
    • Incorporated Places
      • cities, towns, villages
      • Updated through Boundary and Annexation Survey (BAS) yearly
    • Census Designated Places (CDPs)
      • Areas that can’t become Incorporated Places because of state or city regulations
      • Concentrations of population, housing, commercial structures
      • Updated through Boundary and Annexation Survey (BAS) yearly
  • County Subdivisions

    • Minor Civil Divisions (MCDs)
      • Legally defined by the state or county, stable entity. May have elected government
      • e.g. townships, charter townships, or districts
    • Census County Divisions (CCDs)
      • No population requirement
      • Subcounty units with stable boundaries and recognizable names
  • Zip Code Tabulation Areas (ZCTAs)

    • Misc
      • Regular ZIP codes are problematic; they can cross state lines.
      • {crosswalkZCTA} - Contains the US Census Bureau’s 2020 ZCTA to County Relationship File, as well as convenience functions to translate between States, Counties and ZIP Code Tabulation Areas (ZCTAs)
    • Approximate USPS Code distribution for housing units
      • The most frequently occurring ZIP code among a census block’s housing units is assigned to that block
      • Then blocks with the same ZIP code are aggregated into areas (ZCTAs)
    • ZCTAs do NOT nest within any other geographies
      • I guess the aggregated ZCTA blocks can overlap block groups
    • 2010 ZCTAs exclude large bodies of water and unpopulated areas
    • 2020 ZCTAs that cross state lines (source, code)

American Community Survey (ACS)

Misc

Description

  • About

    • Yearly estimates based on samples of the population, pooled over 1- or 5-year periods
      • Therefore a Margin of Error (MoE) is included with the estimates.
    • Available as 1-year estimates (for geographies of population 65,000 and greater) and 5-year estimates (for geographies down to the block group)
      • 1-Year: Useful for getting the most up-to-date estimates
        • Not all variables will be available, even if your geography has over 65K people, due to confidentiality or accuracy concerns
        • Some MOEs might be larger than the estimate
        • If the geography you choose has a lot of NAs, try moving up to a larger geography
          • Also see Geographies >> Misc

          • Example: From County to PUMA

            tx_wfh <- get_acs(
              geography = "puma",
              variables = "DP03_0024P",
              state = "TX",
              survey = "acs1",
              year = 2023
            )
            • You still get decent granularity for densely populated regions, so patterns will be visible while also getting complete coverage of a study area
      • 5-Year: Useful when smaller geographies or lower MOEs are necessary.
    • Detailed social, economic, housing, and demographic characteristics. Variables covering e.g. income, education, language, housing characteristics
    • Smallest geography is the Block Group
    • census.gov/acs
  • ACS Release Schedule (releases)

    • September - 1-Year Estimates (from previous year’s collection)
      • Estimates for areas with populations of >65K
    • October - 1-Year Supplemental Estimates
      • Estimates for areas with populations between 20K and 64,999
    • December - 5-Year Estimates
      • Estimates for areas including census tract and block groups
  • Data Collected

    • Population
      • Social
        • Ancestry, Citizenship, Citizen Voting Age Population, Disability, Educational Attainment, Fertility, Grandparents, Language, Marital Status, Migration, School Enrollment, Veterans
      • Demographic
        • Age, Hispanic Origin, Race, Relationship, Sex
      • Economic
        • Class of worker, Commuting, Employment Status, Food Stamps (SNAP), Health Insurance, Hours/Week, Weeks/Year, Income, Industry & Occupation
    • Housing
      • Computer & Internet Use, Costs (Mortgage, Taxes, Insurance), Heating Fuel, Home Value, Occupancy, Plumbing/Kitchen Facilities, Structure, Tenure (Own/Rent), Utilities, Vehicles, Year Built/Year Moved In

Examples

  • Example 1: Median Household Income for Texas Counties

    texas_income <- get_acs(
      geography = "county",
      variables = "B19013_001", # median household income
      state = "TX",
      year = 2022
    )
    • Required: geography, variables
      • Variable format is table_variable, e.g. for B19013_001, B19013 is the table and 001 is the variable
    • Default is the 2018-2022 5-year ACS (survey = “acs5”) when year = 2022
      • 1-yr ACS only surveys counties with population > 65K
    • Returns: GEOID, Name (e.g. county and state), variable, estimate, moe
    • Default MOE is a 90% CI (i.e. estimate \(\pm\) MOE)
    • table = “B19001” - Specifying this argument gets you all the variables in this table (e.g. all income variables)
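    • A quick sketch of the table argument (same defaults as Example 1 assumed):

      texas_income_dist <- get_acs(
        geography = "county",
        table = "B19001", # every household-income-bracket variable in the table
        state = "TX",
        year = 2022
      )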
  • Example 2: Census Tract for Multiple Counties in NY

    nyc_income <- get_acs(
      geography = "tract",
      variables = "B19013_001",
      state = "NY",
      county = c("New York", "Kings", "Queens",
                 "Bronx", "Richmond"),
      year = 2022,
      geometry = TRUE
    )
    mapview(nyc_income, zcol = "estimate")
  • Example 3: Error-Bar

    utah_income <- get_acs(
      geography = "county",
      variables = "B19013_001",
      state = "UT",
      year = 2023
    ) 
    
    ggplot(utah_income, aes(x = estimate, 
                            y = reorder(NAME, estimate))) + 
      geom_errorbar(aes(xmin = estimate - moe, 
                        xmax = estimate + moe),
                    width = 0.5, 
                    linewidth = 0.5) +
      geom_point(color = "darkblue", size = 2) + 
      scale_x_continuous(labels = scales::label_dollar()) + 
      scale_y_discrete(labels = function(x) stringr::str_remove(x, " County, Utah")) + 
      labs(title = "Median household income, 2019-2023 ACS",
           subtitle = "Counties in Utah",
           caption = "Data acquired with R and tidycensus. Error bars represent margin of error around estimates.",
           x = "ACS estimate",
           y = "") + 
      theme_minimal(base_size = 12)
  • Example 4: Multiple Races Percentages for San Diego County

    san_diego_race <- get_acs(
      geography = "tract",
      variables = c(
        Hispanic = "DP05_0073P",
        White = "DP05_0079P",
        Black = "DP05_0080P",
        Asian = "DP05_0082P"
      ),
      state = "CA",
      county = "San Diego",
      geometry = TRUE,
      year = 2022
    )
    faceted_choro <- ggplot(san_diego_race, aes(fill = estimate)) + 
      geom_sf(color = NA) + 
      theme_void() + 
      scale_fill_viridis_c(option = "rocket") + 
      facet_wrap(~variable) + 
      labs(title = "Race / ethnicity by Census tract",
           subtitle = "San Diego County, California",
           fill = "ACS estimate (%)",
           caption = "2018-2022 ACS | tidycensus R package")
    • Allows for comparison, but groups with less variation than the others get washed out since the fill scale is shared across all facets
      • You’d want to make a separate map for the Black population in order to compare variation between tracts.

PUMS

  • Misc

    • ACS Public Use Microdata Series (PUMS)
    • Also see
    • 1-Year covers 1% of the US population
    • 5-Year covers 5% of the US population
    • Usage
      • Allows you to do within-household analysis
        • e.g. Given a certain demographic, how many kids does the average household have? (See the sketch after Example 2)
      • Allows you to get single Age statistics (the regular ACS has binned ages)
    • PUMS variables are located in the dataset, pums_variables
      • survey: e.g. acs1
      • year, data_type, value_min, value_max
      • var_code: Value for variables argument
      • var_label: Basic description
      • val_label: Describes the measured value
        • e.g. state name, dollar amounts, houses on less than an acre
      • level: e.g. household-level values or person-level values
      • recode: logical which says whether you can use recode = TRUE in get_pums, which will add columns with the category labels (original columns have numerical codes)
    • Filter data at the API to reduce size of downloaded data
      • Example:

        la_pums_filtered <- get_pums(
          variables = c("SEX", "AGEP", "HHT"),
          state = "LA",
          survey = "acs5",
          variables_filter = list(
            SEX = 2,
            AGEP = 30:49
          ),
          year = 2023
        )
    • Census Bureau recommends using Successive Difference Replication to calculate standard errors and provides 80 replicate weights (i.e. 80 extra columns) to do this. (See example)
  • Example 1: Louisiana

    la_pums <- get_pums(
      variables = c("SEX", "AGEP", "HHT"),
      state = "LA",
      survey = "acs1",
      year = 2023
    )
    
    head(la_pums)
    #> # A tibble: 45,026 × 8
    #>    SERIALNO      SPORDER  WGTP PWGTP  AGEP STATE HHT   SEX  
    #>    <chr>           <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
    #>  1 2023GQ0000080       1     0    22    35 22    b     1    
    #>  2 2023GQ0000129       1     0   138    58 22    b     2    
    #>  3 2023GQ0000171       1     0    37    19 22    b     1    
    #>  4 2023GQ0000183       1     0    97    22 22    b     2    
    #>  5 2023GQ0000200       1     0    62    19 22    b     2 
    • SERIALNO is the serial number of the household
      • GQ means “Group Quarters”, i.e. non-household dwellings (e.g. senior care facility, dorm room). They get a WGTP of 0 since they aren’t considered to be households.
    • SPORDER is the “order” of a household where 1 is the “head” of the household
      • Combined with SERIALNO uniquely identifies a person
    • WGTP is the household weight, i.e. how many households a household represents
    • PWGTP is the person weight, i.e. how many persons a record represents
      • e.g. 75 means this person represents approximately 75 people with similar demographic characteristics.
    • AGEP is the age of the person
    • HHT is the household type; you need to check the variable information to see what the numerical codes represent. Setting recode = TRUE will add columns with the category labels based on the information in pums_variables (it will also add a labeled SEX column with the sex categories; see the sketch below)
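    • A short sketch of recode = TRUE (same call as above assumed):

      la_pums_recoded <- get_pums(
        variables = c("SEX", "AGEP", "HHT"),
        state = "LA",
        survey = "acs1",
        year = 2023,
        recode = TRUE
      )
      # adds label columns (e.g. SEX_label, HHT_label) alongside the numeric codes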
  • Example 2: Get the number of 41 year olds in Louisiana

    # Number of 41 year-olds is 63826
    la_pums |> 
      filter(AGEP == 41) |> 
      summarize(pop_41 = sum(PWGTP))
    
    # total population is 4573749
    sum(la_pums$PWGTP)
    
    # compare with regular ACS... matches
    get_acs("state", 
            "B01003_001", 
            state = "LA", 
            survey = "acs1", 
            year = 2023)
    #> # A tibble: 1 × 5
    #>   GEOID NAME      variable   estimate   moe
    #>   <chr> <chr>     <chr>         <dbl> <dbl>
    #> 1 22    Louisiana B01003_001  4573749    NA
    • Summing the weights gives you total populations which shows you that PUMS data are weighted to align with regular ACS
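    • A hedged within-household sketch: every person in a sampled household appears in PUMS, so household size can be computed per SERIALNO and weighted with WGTP

      la_pums |>
        dplyr::filter(WGTP > 0) |> # drop group-quarters records
        dplyr::group_by(SERIALNO) |>
        dplyr::summarize(hh_size = dplyr::n(),
                         WGTP = dplyr::first(WGTP)) |>
        dplyr::summarize(avg_hh_size = weighted.mean(hh_size, WGTP))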
  • Example 3: Calculate MOE for Age Population Using Replicate Weights

    la_pums_replicate <- get_pums(
      variables = c("AGEP", "PUMA"),
      state = "LA",
      survey = "acs1",
      year = 2023,
      rep_weights = "person" 
    )
    
    # converts df to a survey design object
    la_survey <- tidycensus::to_survey(
      la_pums_replicate,
      type = "person"
    )
    
    library(srvyr)
    la_survey %>%
      filter(AGEP == 41) %>%
      survey_count() %>%
      mutate(n_moe = n_se * 1.645) # 90% CI
    
    #> # A tibble: 1 × 3
    #>       n  n_se n_moe
    #>   <dbl> <dbl> <dbl>
    #> 1 63826 3662. 6024.
  • Example 4: Median Age Estimates by PUMA with MOE

    library(srvyr)
    
    la_survey %>%
      group_by(PUMA) %>%
      summarize(median_age = survey_median(AGEP)) %>%
      mutate(median_age_moe = median_age_se * 1.645) # 90% CI
    
    #> # A tibble: 35 × 4
    #>    PUMA  median_age median_age_se median_age_moe
    #>    <chr>      <dbl>         <dbl>          <dbl>
    #>  1 00101         39         1.76           2.89 
    #>  2 00102         41         1.26           2.07 
    #>  3 00200         40         0.502          0.826
    #>  4 00300         40         0.754          1.24 
    #>  5 00400         36         0.502          0.826
    #>  6 00500         42         1.00           1.65 
    #>  7 00600         36         1.00           1.65 
    #>  8 00700         36         0.502          0.826
    #>  9 00801         40         1.00           1.65 
    
    # check calc with 1 yr ACS
    la_age_puma <- get_acs(
      geography = "puma",
      variables = "B01002_001",
      state = "LA",
      year = 2023,
      survey = "acs1"
    )
    • la_survey is a survey design object from the previous example
    • {srvyr} automatically calculates the standard errors
    • Calculated median ages via PUMS are not identical to regular ACS estimates, but are very close. Only use PUMS-calculated statistics (larger MOEs) if the statistic isn’t available in the regular ACS products.

Decennial US Census

Misc

  • A complete count — not based on samples like the ACS
  • Smallest geography is the census block
  • Applies differential privacy to preserve respondent confidentiality
    • Adds noise to data. Greater effect at lower levels (i.e. block level)
    • The exception is household-level data, which has no differential privacy applied.

PL94-171

  • Population data which the government needs for redistricting

  • sumfile = “pl”

  • State Populations

    pop20 <- 
      get_decennial(
        geography = "state",
        variables = "P1_001N",
        year = 2020
      )
    • For 2020, default is sumfile = “pl”

DHC

  • Age, Sex, Race, Ethnicity, and Housing Tenure (most popular dataset)

  • sumfile = “dhc”

  • County

    tx_population <- 
      get_decennial(
        geography = "county",
        variables = "P1_001N",
        state = "TX",
        sumfile = "dhc",
        year = 2020
      )
  • Census Block (analogous to a city block)

    matagorda_blocks <- 
      get_decennial(
        geography = "block",
        variables = "P1_001N",
        state = "TX",
        county = "Matagorda",
        sumfile = "dhc",
        year = 2020
      )

Demographic Profile

  • Pretabulated percentages from the DHC

  • sumfile = “dp”

    • Tabulations for 118th Congress and Island Areas (i.e. Congressional Districts)
      • sumfile = “cd118”
  • C suffix variables are counts while P suffix variables are percentages

    • 0.4 is 0.4% not 40%
  • Example: Same-sex married and partnered in California by County

    ca_samesex <- 
      get_decennial(
        geography = "county",
        state = "CA",
        variables = c(married = "DP1_0116P",
                      partnered = "DP1_0118P"),
        year = 2020,
        sumfile = "dp",
        output = "wide"
      )

Detailed DHC-A

  • Detailed demographic data; Thousands of racial and ethnic groups; Tabulation by sex and age.

  • Different groups are in different tables, so specific groups can be hard to locate.

  • Adaptive design means the demographic group (i.e. variable) will only be available in certain areas. For privacy, data gets suppressed when the area has low population.

    • There’s typically considerable sparsity, especially when going down to the census tract level
  • Args

    • sumfile = “ddhca”
    • pop_group - Population group code (See get_pop_groups below)
      • “all” for all groups
      • pop_group_label = TRUE - Adds group labels
  • get_pop_groups(2020, "ddhca") - Gets group codes for ethnic groups

    • For various groups there can be at least two variables (e.g. Somali alone vs. Somali alone or in any combination)
    • For time series analysis, the analogous products to 2020’s are SF2/SF4 for 2000 and SF2 for 2010. (SF stands for Summary File)
  • check_ddhca_groups - Checks which variables are available for a specific group

    • Example: Somali

      check_ddhca_groups(
        geography = "county", 
        pop_group = "1325", 
        state = "MN", 
        county = "Hennepin"
      )
  • Example: Minnesota group populations

    load_variables(2020, "ddhca") %>% 
      View()
    mn_population_groups <- 
      get_decennial(
        geography = "state",
        variables = "T01001_001N", # total population
        state = "MN",
        year = 2020,
        sumfile = "ddhca",
        pop_group = "all", # for all groups
        pop_group_label = TRUE
      )
    • Includes aggregate categories like European Alone, Other White Alone, etc., so you can’t just aggregate the value column to get the total population in Minnesota.
      • So, in order to calculate ethnic group ratios of the total state or county, etc. population, you need to get those state/county totals from other tables (e.g. PL94-171); see the sketch below
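      • A hedged sketch of computing group shares against the PL 94-171 state total:

        mn_total <- get_decennial(
          geography = "state",
          variables = "P1_001N",
          state = "MN",
          year = 2020,
          sumfile = "pl"
        )

        mn_population_groups |>
          dplyr::mutate(percent = 100 * value / mn_total$value)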
  • Use dot density and not choropleths to visualize these sparse datasets

    • Example: Somali populations by census tract in Minneapolis

      hennepin_somali <- 
        get_decennial(
          geography = "tract",
          variables = "T01001_001N", # total population
          state = "MN",
          county = "Hennepin",
          year = 2020,
          sumfile = "ddhca",
          pop_group = "1325", # somali
          pop_group_label = TRUE,
          geometry = TRUE
        )
      
      somali_dots <- 
        as_dot_density(
          hennepin_somali,
          value = "value", # column name which is by default, "value"
          values_per_dot = 25
        )
      
      mapview(somali_dots, 
              cex = 0.01, 
              layer.name = "Somali population<br>1 dot = 25 people",
              col.regions = "navy", 
              color = "navy")
      • values_per_dot = 25 says make each dot worth 25 units (e.g. people or housing units)

Time Series Analysis

  • {tidycensus} only has 2010 and 2020 censuses

  • Issue: county names and boundaries change over time (e.g. Alaska redraws a lot)

    • Census gives a different GeoID to counties that get renamed even though they’re the same county.
    • NA values showing up after you calculate how the value changes over time is a good indication of this problem. Check for NAs: filter(county_change, is.na(value10))
  • Example: Join 2010 and 2020 and Calculate Percent Change

    county_pop_10 <- 
      get_decennial(
        geography = "county",
        variables = "P001001", 
        year = 2010,
        sumfile = "sf1"
      )
    
    county_pop_10_clean <- 
      county_pop_10 %>%
        select(GEOID, value10 = value) 
    
    county_pop_20 <- 
      get_decennial(
        geography = "county",
        variables = "P1_001N",
        year = 2020,
        sumfile = "dhc"
      ) %>%
        select(GEOID, NAME, value20 = value)
    
    county_joined <- 
      county_pop_20 %>%
        left_join(county_pop_10_clean, by = "GEOID") 
    
    county_joined
    
    county_change <- 
      county_joined %>%
        mutate( 
          total_change = value20 - value10, 
          percent_change = 100 * (total_change / value10) 
        ) 
  • Example: Age distribution over time in Michigan

    • Code available in the github repo or R/Workshops/tidycensus-umich-workshop-2024-main/census-2020/bonus-chart.R
    • Distribution shape remains pretty much the same, but decreasing for most age cohorts, i.e. people are leaving the state across most age groups.
      • e.g. The large hump representing the group of people in their mid-40s in 2000 steadily decreases over time.

tidycensus

Misc

  • Get an API key

    • Request a key (https://api.census.gov/data/key_signup.html), then activate the key from the link in your email.
      • Required for hitting the census API over 500 times per day, which isn’t as hard as you’d think.
    • Set as an environment variable: census_api_key("<api key>", install = TRUE)
      • Or add this line to your .Renviron file: CENSUS_API_KEY='<api key>'
  • Workaround for Trump’s Executive Order that halts access to the Census Bureau’s shapefiles

    • Currently {tigris} pulls from the Census Bureau’s ftp server, but that’s down. The API still works though

    • {arcgislayers} - R interface to the ArcGIS REST API.

    • TIGERweb Rest Services

    • Example: Census Tracts for Cook County, IL and the 2023 ACS

      Go to TIGERweb and find your survey, year, and geography

      • Walker says he likes TIGERweb Generalized REST Services because of the tight boundaries around the coastlines
      Choose the Appropriate Resolution

      • 500K (1:500,000 scale) – Higher resolution, more detail. Suitable for regional and local analysis.
        • 1 inch on a 500K map = 500,000 inches in the real world (~7.89 miles)
      • 5M (1:5,000,000 scale) – Medium resolution, moderate generalization. Best for state-level mapping.
      • 20M (1:20,000,000 scale) – Lower resolution, highly generalized. Good for national or large-scale mapping.
      Copy URL

      • After clicking the resolution link, copy the URL of the webpage it takes you to.
      Pull shapefiles
      library(arcgislayers)
      cook_geo <- 
        arc_read("https://tigerweb.geo.census.gov/arcgis/rest/services/Generalized_ACS2023/Tracts_Blocks/MapServer/4", 
                 where = "COUNTY = '031' AND STATE = '17'")
      head(cook_geo)
      #> Simple feature collection with 1331 features and 17 fields
      #> Geometry type: POLYGON
      #> Dimension:     XY
      #> Bounding box:  xmin: -9825464 ymin: 5081873 xmax: -9743145 ymax: 5184116
      #> Projected CRS: WGS 84 / Pseudo-Mercator
      #> First 10 features:
      #>    MTFCC             OID       GEOID STATE COUNTY  TRACT BASENAME
      #> 1  G5020 207903704492010 17031030101    17    031 030101   301.01
      #> 2  G5020 207903704492021 17031030701    17    031 030701   307.01
      #> 3  G5020 207903704492051 17031070103    17    031 070103   701.03
      #> 4  G5020  20790197959947 17031807100    17    031 807100     8071
      #> 5  G5020  20790197959946 17031807200    17    031 807200     8072
      • Paste URL and enter COUNTY and STATE FIPS code components

      • If you look in the tidycensus data, the state and county FIPS components will be the first five digits of the GEOID column

      • For a group of geographies (e.g. counties):

        where = "STATE = '36' AND COUNTY IN ('005', '047', '061', '081', '085')"
      Join with tidycensus Data
      cook_education_geo <- cook_geo %>%
        select(GEOID) %>%
        left_join(cook_education, by = "GEOID")
      • Only the GEOID column is kept, but since this is an sf object, the geometry column is sticky and comes along automatically.

Variables

  • Search Variables

    • Code: vars <- load_variables(2022, "acs5")

      • Subtables (e.g. ACS)

        • Beware: variables in the Data Profile and Subject Tables can change names over time
        • "acs5" for the Detailed Tables;
        • "acs5/profile" for the Data Profile; aggregated statistics for acs5
          • p (suffix): Percentage with appropriate denominator
          • e.g. Educational Attainment: Percentage of the population 25yrs or older with at least a bachelor’s degree.
        • "acs5/subject" for the Subject Tables
        • "acs5/cprofile" for the Comparison Profile
    • Contains:

      • name: variable id, value for variables argument
      • label: type of estimate (e.g total) and fine details (e.g. males 30 to 34 yrs)
      • concept: general category information (e.g. sex by age)
      • geography: smallest geography e.g. block group
        • Geographies only shown for 5-year ACS
    • Top variables in a table (e.g. _001, 002) are typically the overall group estimates

      • Example: something_002 with concept “Sex by Age” is the total of all Males regardless of age group. something_001 might just be the Total which would be all males and females regardless of age group
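      • A quick sketch of eyeballing a table’s top rows (table B01001, “Sex by Age”, in the 2022 5-year ACS assumed):

        vars <- load_variables(2022, "acs5")
        dplyr::filter(vars, grepl("^B01001_00[123]", name))
        # _001 = Total, _002 = Male (all ages), _003 = Male under 5 years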
    • Prefixes (Variables can have combinations of prefixes)

      • P: i.e. Person; Data available at the census block and larger
      • CT: Data available at the census tract and larger
      • H: Data available at the Housing Unit level
        • I think housing unit is an alternative unit. So instead of the unit being a person, which I assume is the typical unit, it’s a housing unit (~family).

        • Not affected by Differential Privacy (i.e. no noise added; true value)

        • Example: Total Delaware housing units at the census block level

          dp_households <- 
            get_decennial(
              geography = "block",
              variables = "H1_001N",
              state = "DE",
              sumfile = "dhc",
              year = 2020
            )
    • Example: DHC data in census for 2020

      vars <- load_variables(2020, "dhc")
      
      View(vars)
      • View(vars) allows you to filter by column (click Filter button) or the whole table (search box)
        • Workflow: Use search box to filter for general concept (e.g. income). Then use column filter to get specific
  • summary_var - Argument for supplying an additional variable that you need to calculate some kind of summary statistic

    • Example: Race Percentage per Congressional District

      
      race_vars <- c(
        Hispanic = "P5_010N", # all races identified as hispanic
        White = "P5_003N", # white not hispanic
        Black = "P5_004N", # black not hispanic
        Native = "P5_005N", # native american not hispanic
        Asian = "P5_006N", # asian not hispanic
        HIPI = "P5_007N" # hawaiian, islander not hispanic
      )
      
      cd_race <- 
        get_decennial(
          geography = "congressional district",
          variables = race_vars,
          summary_var = "P5_001N", # total population for county
          year = 2020,
          sumfile = "cd118"
      )
      
      cd_race_percent <- 
        cd_race %>%
          mutate(percent = 100 * (value / summary_value)) %>% 
          select(NAME, variable, percent)

Migration Flows

  • Example:

    fulton_inflow <- 
      get_flows(
        geography = "county",
        state = "GA",
        county = "Fulton",
        geometry = TRUE,
        year = 2020
      ) %>%
      filter(variable == "MOVEDIN") %>%
      na.omit()
    
    fulton_top_origins <- 
      fulton_inflow %>%
        slice_max(estimate, 
                  n = 30) 
    
    library(rdeck)
    
    Sys.getenv("MAPBOX_ACCESS_TOKEN")
    
    fulton_top_origins$centroid1 <- 
      st_transform(fulton_top_origins$centroid1, 4326)
    fulton_top_origins$centroid2 <- 
      st_transform(fulton_top_origins$centroid2, 4326)
    
    flow_map <- 
      rdeck(
        map_style = mapbox_light(), 
        initial_view_state = view_state(center = c(-98.422, 38.606), 
                                        zoom = 3, 
                                        pitch = 45)
      ) %>%
      add_arc_layer(
        get_source_position = centroid2,
        get_target_position = centroid1,
        data = as_tibble(fulton_top_origins),
        get_source_color = "#274f8f",
        get_target_color = "#274f8f",
        get_height = 1,
        get_width = scale_linear(estimate, range = 1:5),
        great_circle = TRUE
      )
    • Width of lines is scaled to counts

Mapping

Misc

  • Use geometry = TRUE for any of the get_* {tidycensus} functions, and it’ll join the shapefile with the census data. Returns a SF (Simple Features) dataframe for mapping.

  • If you only want the shape files without the demographic data, see {tigris}

  • For examples with {tmap}, see Chapter 6.3 of Analyzing US Census Data

  • {mapview} along with some other packages gives you some tools for comparing maps (useful for eda or exploratory reports, etc.) (m1 and m2 are mapview objects)

    • m1 + m2 - Creates layers that let you click the layers button and cycle through multiple maps, so I assume you could compare more than two maps here.
    • m1 | m2 - Creates a swipe map (needs {leaflet.extras2}). There will be a vertical slider that you can drag horizontally to gradually expose one map or the other.
    • sync(m1, m2) - Uses {leafsync} to create side by side maps. Zooming and cursor movement are synced on both maps.
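    • A minimal sketch of the three comparison modes:

      library(mapview)
      library(leafsync)        # for sync()
      library(leaflet.extras2) # enables the | swipe map

      m1 + m2      # layered; cycle maps via the layers button
      m1 | m2      # swipe map with a draggable slider
      sync(m1, m2) # synced side-by-side maps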

Preprocessing

  • Remove water from geographies


    nyc_income_tiger <- 
      get_acs(
        geography = "tract",
        variables = "B19013_001",
        state = "NY",
        county = c("New York", "Kings", "Queens",
                   "Bronx", "Richmond"),
        year = 2022,
        cb = FALSE,
        geometry = TRUE
    )
    
    library(tigris)
    library(sf)
    sf_use_s2(FALSE)
    
    nyc_erase <- 
      erase_water(
        nyc_income_tiger,
        area_threshold = 0.5,
        year = 2022
    )
    
    mapview(nyc_erase, zcol = "estimate")
    • The left figure is before and the right figure is after water removal
      • At the center-bottom, you can see how a sharp point is exposed where it was squared off before
      • At the top-left, the point along the border which are piers/docks are now exposed.
      • At the upper-middle, extraneous boundary lines have been removed and islands in the waterway are more clearly visible.
    • Works only with regular tigris shapefiles from the US census bureau — so not OpenStreetMaps, etc. For other shapefiles, you’d need to do the manual overlay, see Chapter 7.1 in Analyzing US Census Data for details.
    • Can take a couple minutes to run
    • cb = FALSE says to get the regular TIGER/Line files, which avoids sliver polygons caused by slight misalignment of layers (?)
    • area_threshold = 0.5 says that water areas below the 50th percentile in terms of size are removed. Probably a value you’ll have to play around with.

Choropleths

  • Best for continuous data like rates and percentages, but you can use them for discrete variables

    • You can create a discrete color palette with the at argument in the mapview function.
      • Example

        # check min and max of your data to select range of bins
        min(iowa_over_65$value, 
            na.rm = TRUE) # 0
        max(iowa_over_65$value, 
            na.rm = TRUE) # 38.4
        
        m1 <- 
          mapview(
            iowa_over_65, 
            zcol = "value",
            layer.name = "% age 65 and up<br>Census tracts in Iowa",
            col.regions = inferno(100, direction = -1),
            at = c(0, 10, 20, 30, 40)
          )
        • This will result in a discrete palette with bins of 0-10, 10-20, etc. The bin edges look like they overlap, so I’m not sure which bin contains the endpoints.
  • Example: Over 65 in Iowa by census tract

    library(mapview); library(viridisLite)
    
    iowa_over_65 <- 
      get_decennial(
        geography = "tract",
        variables = "DP1_0024P",
        state = "IA",
        geometry = TRUE,
        sumfile = "dp",
        year = 2020
      )
    m1 <- 
      mapview(
        iowa_over_65, zcol = "value",
        layer.name = "% age 65 and up<br>Census tracts in Iowa",
        col.regions = inferno(100, 
                              direction = -1))
    • {mapview} is interactive and great for exploration of data
  • Export as an HTML file

    htmlwidgets::saveWidget(m1@map, "iowa_over_65.html")
    • Can embed it elsewhere (html report or website) by adding it as an asset
  • In {ggplot}

    texas_income_sf <- 
      get_acs(
        geography = "county",
        variables = "B19013_001",
        state = "TX",
        year = 2022,
        geometry = TRUE
    )
    
    plot(texas_income_sf['estimate'])

Circle Maps

  • “Graduated Symbol” maps are better for count data. Even though using a choropleth is not as bad at the census tract level since all tracts have around 4000 people, the sizes of the tracts can be substantially different which can influence the interpretation. Using circles or bubbles, etc. focuses the user on the size of the symbol and less on the size of the geography polygons.

  • Example: Hispanic Counts in San Diego County at the Census Tract Level

    
    san_diego_race_counts <- get_acs(
      geography = "tract",
      variables = c(
        Hispanic = "DP05_0073",
        White = "DP05_0079",
        Black = "DP05_0080",
        Asian = "DP05_0082"
      ),
      state = "CA",
      county = "San Diego",
      geometry = TRUE,
      year = 2022
    )
    
    san_diego_hispanic <- 
      filter(
        san_diego_race_counts, 
        variable == "Hispanic"
      )
    
    centroids <- st_centroid(san_diego_hispanic)
    
    grad_symbol <- 
      ggplot() + 
      geom_sf(
        data = san_diego_hispanic, 
        color = "black", 
        fill = "lightgrey") + 
      geom_sf(
        data = centroids, 
        aes(size = estimate),
        alpha = 0.7, 
        color = "navy") + 
      theme_void() + 
      labs(
        title = "Hispanic population by Census tract",
        subtitle = "2018-2022 ACS, San Diego County, California",
        size = "ACS estimate") + 
      scale_size_area(max_size = 6)
    • st_centroid finds the center point of geography polygons, which will be the location of the symbols. If you look at the geometry column, it will say POINT, which only has a latitude and longitude, instead of POLYGON, which has multiple coordinates.

    • scale_size_area scales the size of the circle according to the count value.

      • max_size is the largest symbol size, which you’ll want to adjust to be large enough that you can differentiate the circles but small enough that there’s minimal overlap between circles in neighboring geographies (although some overlap is probably inevitable).

Dot Density

  • Useful to show heterogeneity and mixing between groups versus plotting group facet maps.

  • Example: Population by Race in San Diego County

    san_diego_race_dots <- 
      as_dot_density(
        san_diego_race_counts, # see circle maps example for code
        value = "estimate", # group population
        values_per_dot = 200,
        group = "variable" # races
      )
    
    dot_density_map <- 
      ggplot() + 
      annotation_map_tile(type = "cartolight", 
                          zoom = 9) + 
      geom_sf(
        data = san_diego_race_dots, 
        aes(color = variable), 
        size = 0.01) + 
      scale_color_brewer(palette = "Set1") + 
      guides(color = guide_legend(override.aes = list(size = 3))) + 
      theme_void() + 
      labs(
        color = "Race / ethnicity",
        caption = "2018-2022 ACS | 1 dot = approximately 200 people")
    • as_dot_density scatters the dots randomly within a geography. values_per_dot = 200 says each dot is worth 200 units (e.g. people or households). Without shuffling the dot order, ggplot would layer each group’s dots on top of each other.

    • annotation_map_tile from {ggspatial} applies a base map layer as a reference for the user. Base maps have land marks and popular features labeled in the geography and surrounding areas to help the user identify the area being shown.

IPUMS

  • IPUMS - A University of Minnesota project that’s supported by a consortium of other institutions and companies. Provides access to many types of census and survey data through its website and API.
    • Includes the CPS, ACS, census microdata for 100+ countries, historical GIS shapefiles, time use surveys, and more.
    • Some, or maybe all, surveys require registering an account and agreeing to usage constraints.
    • Constraints
      • You will not redistribute the data without permission
      • You will use the data in these data files for statistical reporting and analysis only
      • You will not use the data to identify individuals
      • Cite the IPUMS data appropriately
        • For information on proper citation refer to citation and use. Publications and research reports making use of IPUMS NHIS should be added to our Bibliography.

Misc

  • Also see American Community Survey (ACS) >> PUMS
  • Packages
    • {ipumsr}
      • Values for the collection argument of currently available datasets. If you want to check whether this list has changed, use ipums_data_collections().
        • USA: “usa”
        • CPS: “cps”
        • International: “ipumsi”
        • IPUMS Time Use
          • ATUS: “atus”
          • AHTUS: “ahtus”
          • MTUS: “mtus”
        • IPUMS Health Surveys
          • NHIS: “nhis”
          • MEPS: “meps”
    • {{ipumspy}}
  • There are plenty of steps to get this data, which can seem a little daunting if you haven’t dealt with this sort of process before. Some of the tasks:
    • You have to set up an account where they ask who you are, what you do, and your type of research for each type of survey, but it’s just information collection and isn’t determinative of whether you gain access to the data.
      • Once you do it the first time, they’re nice enough to fill out the fields for you when you want to get data from another survey.
    • There can be oodles of variables to choose from.
    • Finding the exact design information needed to perform analyses.
    • Some of the interfaces are slightly different but the main surveys are very similar.
    • Need to get an API key, of course, to be able to use packages
      • For R, it should be stored in your .Renviron file under IPUMS_API_KEY
  • Available datasets through the API
  • A coded value labeled “NIU” means “Not in Universe”
    • Example: For the variable BMICALC, the Universe for survey years 2021 and 2023 is sample adults age 18+. NIU therefore means the person isn’t an adult and so didn’t have their BMI calculated.
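    • A hedged sketch of dropping NIU records before analysis (assuming a loaded NHIS extract dat_nhis; codes taken from the BMICALC codebook, so check your own variable’s Codes page):

      dat_nhis |>
        dplyr::filter(BMICALC != 0, BMICALC != 996) # 000 = NIU, 996 = suppressed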

Design Information

  • You need to know information about the study design so that your variance estimates will be correct. This information will be inputs for the study design object if you’re using {survey} or {srvyr}.

    • What is the sampling unit id variable (e.g. primary sampling unit)?
    • Is there stratification? If so, what is the stratification id variable?
    • Is there clustering? If so, what is the cluster id variable?
      • And have the cluster ids been nested inside the stratification id variable?
    • Have the weights been included or do they need to be calculated?
  • tl;dr

    • For the NHIS at least, the easiest way to figure out the inputs for your study design object is to look in the Variance Estimation guide where, lo and behold, they just give it to you!

      library(survey)
      library(srvyr)
      data <- 
        as_survey_design(data, 
                         ids = PSU, 
                         weights = PERWEIGHT, 
                         strata = STRATA, 
                         nest = TRUE)
      summarize(data, 
                var1_mean = survey_mean(var1, 
                                        na.rm = TRUE))
  • Other methods to find these variables/parameters (in case other surveys aren’t so nice)

    • Luckily, the weights, stratification, and clustering variables should be automatically included in the dataset (See Getting the Data >> Example 1: NHIS via Website >> Preselected Variables)

      • Our variables are SAMPWEIGHT, STRATA, and PSU. If you check out the description of STRATA and PSU, it says they’re what we require to handle stratification and clustering in our analysis.
    • Explore User Guide

      • There’s a lot of interesting information on the User Guide page, but to answer the nesting question, we can go to the Sample Design user note.

      • After searching for the word “nest,” I find a paragraph that says, “The current sample design selects clusters of households and non-institutional group quarters (such as college dormitories) that are nested within primary sampling units (PSUs). A PSU can consist of a county, a small group of adjacent counties, or a metropolitan statistical area.”

      • After searching for the word “strata,” I find this paragraph that says, “[After creating PSUs] The second step involves, for some states, dividing the PSUs into two strata defined by population density, generally urban and rural counties. All other states contain only one stratum. The third step involves defining clusters of approximately 2,500 addresses within each stratum, where each address cluster is located entirely within one of the originally defined 1,689 PSUs. Fourth, a specific number of address clusters in each stratum is systematically selected for the NHIS sample.”

        • So household clusters are nested within population-density-defined strata, which are nested within PSUs (groups of counties).
      • The documentation for nest = TRUE in a study design object is “relabel cluster ids to enforce nesting within strata,” and the definition for the ids argument is “variables specifying cluster ids from largest level to smallest level”

        • It’s confusing to me, because in this example, PSUs are used as cluster ids. So, the study design object is specified to say (from my understanding) that PSUs are nested inside the Strata which is opposite of what is happening.
        • I guess the key is from “largest level to smallest level” and since we have multiple levels of clustering, the PSU variable makes sense. It is what it is, but I think it could be worded better.

Getting the Data

  • The workflow is essentially: Select variables \(\rightarrow\) Select Years \(\rightarrow\) Select Data Format \(\rightarrow\) Download

  • Example 1: NHIS via Website

    I’m using this data to follow along with a tutorial. See Surveys, Analysis >> {srvyr}.

    Choose a Survey

    Homepage

    • There are two ways to get to the survey variables interface
    Search for Variables

    • The navbar shows various methods for selecting variables. I know what variables I’m after, so I’m just using the A-Z index. As the figure shows, there are many variables to choose from which can be overwhelming.

    • Actually, Search would’ve been the best option, as I could type the variables directly in and add them.

    Browsing by Category

    Read Variable Details

    • Clicking on the name of the variable sends you to its details page
    • Honestly everything here is potentially useful for your analysis.
      • Codes: Gives you the coded variable values
        • e.g. for AGE, the value 997 means the person refused to give their age
      • Description: Gives information such as how a variable is calculated or how its description has changed over the years (e.g. more categories added)
        • e.g. BMICALC: “[Weight in pounds/(Height in inches, squared)] multiplied by 703 and rounded to one digit past the implied decimal point.”
      • Comparability: Tells you if you’re comparing apples to apples when you compare a variable across different years of the survey.
        • e.g REGION: Comparable prior to 2019 when there was a redesign
      • Universe: Tells you the demographics of the people sampled for this variable
        • e.g. BMICALC:
          • 2020, 2022: Sample children ages 10-17 and sample adults age 18+.
          • 2021, 2023: Sample adults age 18+
      • Availability: Says which survey years contain this variable
      • Survey Text: Gives the exact text of the survey question
        • e.g. SEX has experienced many changes in how the question was asked and the responses available.
      • Weights: Gives the name of the weights variable depending on the survey year and links to that weight variable’s details page.
        • e.g. 1963-2018 : PERWEIGHT; 2019-2023 : SAMPWEIGHT
      • NHIS Sources: Says which version of the survey was used to obtain the variable
        • e.g. SEX
          • 2023 - In both sample adult and child questionnaires
          • 2018 - Person (core questionnaire)
    Select Variables

    • Here I’m selecting AGE by clicking on the green plus. Also notice that SERIAL has been preselected, so your dataset comes with some default variables
    • Proceed to select all the variables that you want.
    Select Years

    • Once you’ve selected all your variables, click the Select Samples button.
      • Or if you check your Cart first, it’ll tell you to Select Samples.
    • Then, select the years that you want.
    • Make sure you’ve checked each selected variable’s availability beforehand
    Review Variables

    • After selecting your samples (i.e. years) and you submit them, it takes you to your Cart where you look over your variables.
    • Then click Create Data Extract
    Preselected Variables
    • Currently it seems like PSU, STRATA, and maybe SERIAL are the only id variables I’m interested in.
    • A closer look at the ID variables:
      • PSU - Primary Sampling Unit that represents the impact of the sample design clustering on the estimates of variance and standard errors.
        • Researchers need to use PSU in conjunction with STRATA to account for stratification and clustering when computing variance estimates with IPUMS NHIS data.
        • 3-digit numeric variable
      • STRATA - Represents the impact of the sample design stratification on the estimates of variance and standard errors
        • Generally (not completely) comparable from 1985 forward but may be affected by the 2019 redesign. Says they’re working on a way to make it comparable prior to 1985.
          • Not exactly sure what “generally” actually means, but it also says, after 2019, “there is no longer a sampling weight to produce household population estimates.”
          • Might have to do with them switching from sampling everyone in the household to sampling 1 adult and 1 child.
        • 5-digit numeric variable
        • 000 indicates not in Universe
        • Universe:
          • 1963-2018: All households.
          • 2019-2023: Sample adults age 18+ and sample children age 0-17.
      • SERIAL - A unique id to each household in a given survey year. The combination of YEAR and SERIAL provides a unique identifier for every household in the IPUMS NHIS database.
      • NHISHID - A unique id to each household in a given survey year. Used to link IPUMS NHIS data with additional household-level data elements, from the NHIS source public use data files, that are not currently available in IPUMS.
        • So not currently a useful variable
        • If and when the public use data files become available, there is a guide for using these variables to merge data in the User Guide. The link to the User Guide is in the left side panel of the NHIS homepage (See Design Information for a pic)
      • NHISPID - Same as NHISHID but for persons
        • So not currently a useful variable (See NHISHID)
      • PERNUM - Either a 1 or a 2 beginning in 2019. A variable that numbers all persons within each family (prior to 2019) or within each household (2019 forward) consecutively (starting with “1”) in the order in which they appear in the original NHIS data.
        • A sample adult and sample child from each household are interviewed instead of collecting detailed information on all members of the household, so there is a maximum of two people per household as reported in PERNUM beginning in 2019.
        • Since a 1 or 2 isn’t consistent with an adult or a child, I’m not sure how useful this is.
      • HHX - For sample adults and sample children, HHX reports the person’s household number on the original NHIS data. HHX is unique within sample but only unique across samples when combined with information on year.
        • Beginning in 2019, HHX leads with the character “H” followed by six numeric digits. Prior to 2019, HHX consisted only of six numeric digits.
        • In the NHISPID, Comparability section, it says that HHX is a component of the 16-digit NHISPID value. So this variable is used for linking person-level data to household-level information.
        • Since NHISPID isn’t currently useful, neither is this one.
    Decide on File Format and Create Extract

    • The main thing here is the file format. The default is “Fixed-width text (.dat)”, where each variable occupies fixed column positions in every row. I changed it to .csv.
    • You can also change it to a Hierarchical structure, which interleaves household and person records in one file instead of the default rectangular (person-level) layout.
    Download Data and Codebook

    • Click to download the data
    • There’s a Basic and DDI — I’d get them both.
    • “Basic” is just a text file, and even though it has a .cbk extension, Notepad++ read it just fine
      • Some basic information from variable details page, e.g. List of Variables, Availability, Codes, etc.
      • The .xml DDI file contains more information. Almost as much as the Variable details webpage. See Example 2 >> Explore DDI
    • “DDI” stands for the Data Documentation Initiative (DDI) which is a standard for documenting social science datasets. It’s an XML file so it’s machine readable (e.g. {ipumsr})
  • Example 2: NHIS via {ipumsr}

    API key should be in your .Renviron

    tl;dr See the Download and Read Data tab for the entire piped workflow

    Find the collection name of the NHIS survey
    ipums_data_collections()
    #> # A tibble: 14 × 4
    #>    collection_name     collection_type code_for_api api_support
    #>    <chr>               <chr>           <chr>        <lgl>      
    #>  1 IPUMS USA           microdata       usa          TRUE       
    #>  2 IPUMS CPS           microdata       cps          TRUE       
    #>  3 IPUMS International microdata       ipumsi       TRUE       
    #>  4 IPUMS NHGIS         aggregate data  nhgis        TRUE       
    #>  5 IPUMS IHGIS         aggregate data  ihgis        FALSE      
    #>  6 IPUMS ATUS          microdata       atus         TRUE       
    #>  7 IPUMS AHTUS         microdata       ahtus        TRUE       
    #>  8 IPUMS MTUS          microdata       mtus         TRUE       
    #>  9 IPUMS DHS           microdata       dhs          FALSE      
    #> 10 IPUMS PMA           microdata       pma          FALSE      
    #> 11 IPUMS MICS          microdata       mics         FALSE      
    #> 12 IPUMS NHIS          microdata       nhis         TRUE       
    #> 13 IPUMS MEPS          microdata       meps         TRUE       
    #> 14 IPUMS Higher Ed     microdata       highered     FALSE           
    • Looking for the NHIS survey, so nhis is what I want
    Get the sample (aka survey year) names for the NHIS survey
    tail(get_sample_info("nhis"))
    #> # A tibble: 6 × 2
    #>   name   description
    #>   <chr>  <chr>      
    #> 1 ih2018 2018 NHIS  
    #> 2 ih2019 2019 NHIS  
    #> 3 ih2020 2020 NHIS  
    #> 4 ih2021 2021 NHIS  
    #> 5 ih2022 2022 NHIS  
    #> 6 ih2023 2023 NHIS 
    • I’m just pulling the 2023 survey, so I need ih2023
    Define Extract
    def_extract_nhis <- 
      define_extract_micro(
        collection = "nhis",
        description = "2023 NHIS via API", # whatever you want
        samples = "ih2023",
        variables = c("SEX", 
                      "AGE", 
                      "REGION", 
                      "BMICALC"),
        data_format = "csv"
      )
    
    def_extract_nhis
    #> Unsubmitted IPUMS NHIS extract 
    #> Description: 2023 NHIS via API
    #> 
    #> Samples: (1 total) ih2023
    #> Variables: (4 total) SEX, AGE, REGION, BMICALC          
    • There’s currently no programmatic way to search for variables, so you’ll have to do that via the website. See Example 1.
    • All the inputs are subsettable from the extract definition, e.g. names(def_extract_nhis$variables)
    Submit Request
    sub_extract_nhis <- submit_extract(def_extract_nhis)
    #> Successfully submitted IPUMS NHIS extract number 2
    sub_extract_nhis$status
    #> [1] "queued"
    names(sub_extract_nhis$variables)
    #>  [1] "YEAR"       "SERIAL"     "STRATA"     "PSU"        "NHISHID"    "REGION"     "PERNUM"     "NHISPID"    "HHX"       
    #> [10] "SAMPWEIGHT" "ASTATFLG"   "CSTATFLG"   "AGE"        "SEX"        "BMICALC"  
    status_sub_nhis <- wait_for_extract(sub_extract_nhis)
    #> Checking extract status...
    #> IPUMS NHIS extract 2 is ready to download.        
    • The extract won’t be downloadable immediately. They’ll notify you via email or you can check via function.
      • The output can be subset if your script is noninteractive, e.g. status_sub_nhis$status which will say “completed” when ready. Other possible values: “queued”, “started”, “produced”, “canceled”, and “failed”.
      • There’s also is_extract_ready that returns a logical
    • Your variable list has been updated by the API to include all the preselected variables (ID variables and YEAR). See Example 1 for a description of the ID variables.
    • ASTATFLG and CSTATFLG indicate whether the person is an Adult or Child (1 each is sampled per household in this survey design)
    Download and Read Data
    path_ext_nhis <- 
      download_extract(
        extract = sub_extract_nhis,
        download_dir = "../../Data/ipums/",
        lower_vars = TRUE
      )
    dat_nhis <- read_ipums_micro(path_ext_nhis)
    • “For microdata extracts, only the file path to the downloaded .xml DDI file will be returned, as it is sufficient for reading the data provided in the associated .dat.gz data file.”

      • See Example 1 for an explanation of the DDI file.
    • The data file name is nhis_00002_csv.gz. Since you can download previously requested extracts (e.g. download_extract("nhis:1") for the first nhis requested extract), I’m thinking the 2 in the file name refers to the fact that this is my second NHIS request.

    Piped Workflow
    dat_nhis <- 
      define_extract_micro(
        collection = "nhis",
        description = "2023 NHIS via API", # whatever you want
        samples = "ih2023",
        variables = c("SEX", 
                      "AGE", 
                      "REGION", 
                      "BMICALC"),
        data_format = "csv"
      ) |> 
      submit_extract() |> 
      wait_for_extract() |> 
      download_extract(
        download_dir = "../../Data/ipums/",
        lower_vars = TRUE
      ) |>       
      read_ipums_micro()
    Search History
    • Your extract history can be checked with get_extract_history which retrieves prior extracts, their survey years, and their variables.

    • That history can be used to search for past extracts that contain certain variables:

      extracts_past_nhis <- get_extract_history("nhis")
      purrr::keep(extracts_past_nhis, ~ "BMICALC" %in% names(.x$variables))
    Read a Variable Description
    ddi_nhis <- read_ipums_ddi(path_ext_nhis) # metadata from the downloaded DDI (.xml) file
    writeLines(ipums_var_desc(ddi_nhis, var = "BMICALC"))

    #> BMICALC reports the Body Mass Index, a measure of body fat based on height and weight, as calculated by IPUMS NHIS from the public use file data on height and weight. BMICALC was calculated using the following formula: [Weight in pounds/(Height in inches, squared)] multiplied by 703 and rounded to one digit past the implied decimal point. Individuals not asked their weight and height are coded “000” (for “Not in universe”) for BMICALC. Individuals whose weight or height was topcoded, bottomcoded, or, for 1997 forward, an outlying value that was suppressed for confidentiality reasons, receive a code of “996” for BMICALC.

    #> The website of the Centers for Disease Control and Prevention on BMI reports the following “standard weight status categories associated with BMI ranges for adults.” BMI below 18.5 is associated with “underweight” weight status; BMI 18.5 to 24.9 is associated with “normal” weight status; BMI 25.0 to 29.9 is associated with “overweight” weight status; BMI 30.0 and above is associated with “obese” weight status.

    #> The CDC website also notes, “the correlation between the BMI number and body fatness is fairly strong; however, the correlation varies by sex, race, and age.” For example, women tend to have more body fat than men at the same BMI; older people, on average, tend to have more body fat than young adults; and highly trained athletes may have a high BMI because of increased muscularity rather than increased body fatness. The CDC adds, “It is important to remember that BMI is only one factor related to risk for disease,” and that other predictors, such as waist circumference and levels of blood pressure and physical activity, are also important.

    Variable Information
    info_vars_nhis <- ipums_var_info(ddi_nhis, vars = c("AGE", "REGION"))
    
    dplyr::glimpse(info_vars_nhis)
    #> Rows: 2
    #> Columns: 10
    #> $ var_name   <chr> "REGION", "AGE"
    #> $ var_label  <chr> "Region of residence", "Age"
    #> $ var_desc   <chr> "REGION reports the region of the U.S. where the housing unit containing survey participants was located…
    #> $ val_labels <list> [<tbl_df[6 x 2]>], [<tbl_df[6 x 2]>]
    #> $ code_instr <chr> NA, "AGE is a 3-digit-numeric variable.\n\n085:\tTop code for 85 years or older (1963-1968 and 1997-forw…
    #> $ start      <dbl> 32, 73
    #> $ end        <dbl> 33, 75
    #> $ imp_decim  <dbl> 0, 0
    #> $ var_type   <chr> "integer", "numeric"
    #> $ rectypes   <lgl> NA, NA
    • Contains all variable information in a tibble
      • e.g. The variable description extracted using the helper function in the previous section can be extracted from the tibble.
    • The value labels seem to be the different categories and codes. There’s also a separate Codes column
    • start and end appear to be the column positions of each variable in the fixed-width .dat file; rectypes presumably relates to record types in hierarchical extracts