Lab 1: Census Data Quality for Policy Decisions

Evaluating Data Reliability for Algorithmic Decision-Making

Author

Liz Crouse

Published

January 30, 2026

Assignment Overview

Scenario

You are a data analyst for the Washington Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.

Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.

Learning Objectives

Apply dplyr functions to real census data for policy analysis
Evaluate data quality using margins of error
Connect technical analysis to algorithmic decision-making
Identify potential equity implications of data reliability issues
Create professional documentation for policy stakeholders

Submission Instructions

Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/labs/lab_1/

Make sure to update your _quarto.yml navigation to include this assignment under a “Labs” menu.

Part 1: Portfolio Integration

Create this assignment in your portfolio repository under a labs/lab_1/ folder structure. Update your navigation menu to include:

- text: Assignments
  menu:
    - href: labs/lab_1/your_file_name.qmd
      text: "Lab 1: Census Data Exploration"

If the text contains a special character such as a colon, wrap it in double quotation marks so that Quarto reads it as text.

Setup

# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidyverse)
library(tidycensus)
library(knitr)

# Set your Census API key (replace the placeholder with your own key; do not publish a real key)
census_api_key("YOUR_CENSUS_API_KEY", overwrite = TRUE, install = TRUE)

[1] "YOUR_CENSUS_API_KEY"

Sys.getenv("CENSUS_API_KEY")

[1] "YOUR_CENSUS_API_KEY"
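Because the chunk above echoes the key into the rendered page, a quieter check may be preferable. The following is a minimal sketch, assuming the key has already been stored in the CENSUS_API_KEY environment variable (which is what install = TRUE arranges); the name key_is_set is illustrative.

```r
# Confirm that a key is available without echoing its value.
key_is_set <- nzchar(Sys.getenv("CENSUS_API_KEY"))

if (key_is_set) {
  message("Census API key found.")
} else {
  message("CENSUS_API_KEY is not set; run census_api_key() once with install = TRUE.")
}
```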

Choose your state for analysis - assign it to a variable called my_state

my_state <- "WA"

State Selection: I have chosen Washington for this analysis because having worked in housing policy in the state, I understand nuances in both demographics and in public service administration.

Part 2: County-Level Resource Assessment

2.1 Data Retrieval

Your Task: Use get_acs() to retrieve county-level data for your chosen state.

Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide

Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.

# Write your get_acs() code here
variables <- c(mhi = "B19013_001",
               pop = "B01003_001")

wa_data <- get_acs(geography = "county",
                   state = my_state,
                   variables = variables,
                   year = 2022,
                   survey = "acs5",
                   output = "wide")

# Clean the county names to remove state name and "County" 
# Hint: use mutate() with str_remove()

wa_data <- wa_data %>%
  mutate(county = str_remove(NAME, " County, Washington")) %>%
  select(GEOID, county, mhiE, mhiM, popE)

# Display the first few rows

head(wa_data)

# A tibble: 6 × 5
  GEOID county   mhiE  mhiM   popE
  <chr> <chr>   <dbl> <dbl>  <dbl>
1 53001 Adams   63105  3509  20557
2 53003 Asotin  63724  6045  22370
3 53005 Benton  83778  1849 207560
4 53007 Chelan  71876  4147  79076
5 53009 Clallam 66108  2368  77333
6 53011 Clark   90115  1650 504091

2.2 Data Quality Assessment

Your Task: Calculate margin of error percentages and create reliability categories.

Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)

Hint: Use mutate() with case_when() for the categories.

# Calculate MOE percentage and reliability categories using mutate()

wa_data <- wa_data %>%
  mutate(
    moe_percentage = (mhiM / mhiE) * 100,
    confidence = case_when(
      moe_percentage < 5   ~ "high",
      moe_percentage <= 10 ~ "moderate",
      moe_percentage > 10  ~ "low"
    )
  )

# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages

confidence_summary <- wa_data %>%
  count(confidence, sort = TRUE, name = "county_confidence") %>%
  mutate(
    # Share of all counties that falls in each category
    percent_of_counties = round(100 * county_confidence / sum(county_confidence), 1),
    # MOE range that defines each category
    moe_range = case_when(confidence == "high" ~ "less than 5%",
                          confidence == "moderate" ~ "5-10%",
                          confidence == "low" ~ "more than 10%"))
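As a quick check of the thresholds, the formula can be applied by hand to the six counties printed by head(wa_data) above. This is a sketch rather than part of the graded pipeline; the object names first_six and checked, and the flag column unreliable, are illustrative.

```r
library(dplyr)

# Estimates and margins of error copied from the head(wa_data) output above.
first_six <- data.frame(
  county = c("Adams", "Asotin", "Benton", "Chelan", "Clallam", "Clark"),
  mhiE   = c(63105, 63724, 83778, 71876, 66108, 90115),
  mhiM   = c(3509, 6045, 1849, 4147, 2368, 1650)
)

checked <- first_six %>%
  mutate(
    moe_percentage = mhiM / mhiE * 100,
    confidence = case_when(
      moe_percentage < 5   ~ "high",
      moe_percentage <= 10 ~ "moderate",
      moe_percentage > 10  ~ "low"
    ),
    # Logical flag for estimates above the 10% threshold
    unreliable = moe_percentage > 10
  )

checked
```

Adams, for example, works out to 3509 / 63105 * 100, or about 5.6%, which places it in the moderate category.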

2.3 High Uncertainty Counties

Your Task: Identify the 5 counties with the highest MOE percentages.

Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()

Hint: Use arrange(), slice(), and select() functions.

# Create table of top 5 counties by MOE percentage

uncertain_counties <- wa_data %>%
  arrange(desc(moe_percentage)) %>%
  slice(1:5) %>%
  select(county, mhiE, mhiM, moe_percentage, confidence)


# Format as table with kable() - include appropriate column names and caption
uncertain_table <- uncertain_counties %>%
  kable(col.names = c("County", "Median Household Income", "Margin of Error", "MOE %", "Confidence Category"),
        caption = "Five Washington Counties with Least Reliable Income Data",
        digits = 1)

# Print the table so it appears in the rendered document (view() only opens an interactive viewer)
uncertain_table
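An equivalent way to take the top rows, offered as a sketch rather than a replacement for the code above, is slice_max(), which sorts and slices in one step. The small data frame reuses the six counties printed by head(wa_data), so the result is the top two among those six only, not the statewide ranking.

```r
library(dplyr)

# Values copied from the head(wa_data) output; not the full county table.
first_six <- data.frame(
  county = c("Adams", "Asotin", "Benton", "Chelan", "Clallam", "Clark"),
  mhiE   = c(63105, 63724, 83778, 71876, 66108, 90115),
  mhiM   = c(3509, 6045, 1849, 4147, 2368, 1650)
)

top_two <- first_six %>%
  mutate(moe_percentage = mhiM / mhiE * 100) %>%
  slice_max(moe_percentage, n = 2) %>%   # rows with the two largest MOE percentages
  select(county, mhiE, mhiM, moe_percentage)

top_two
```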

Data Quality Commentary:

The five counties with the least reliable data in Washington are Garfield, Pend Oreille, Wahkiakum, Asotin, and Ferry. All five are in the bottom 25% of counties by population, and all are rural and distant from the state's main population centers. They also have relatively low median household incomes: all are under $65,000, and four are under $60,000. An algorithm that relies on income data would therefore be working with its least reliable estimates in some of the state's poorest counties, which could bias its decisions. Counties with both low data reliability and low median incomes need closer attention.
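To make the uncertainty concrete, an ACS margin of error can be read as a 90 percent confidence interval around the estimate (ACS margins are published at the 90 percent level, which corresponds to a z-value of 1.645). The sketch below uses the Asotin County figures from the head(wa_data) output; the variable names are illustrative.

```r
estimate <- 63724   # Asotin County median household income (mhiE)
moe      <- 6045    # Asotin County margin of error (mhiM)

# 90 percent confidence interval: estimate plus or minus the MOE
ci_lower <- estimate - moe
ci_upper <- estimate + moe

# Standard error and coefficient of variation implied by the 90 percent MOE
std_error <- moe / 1.645
cv_pct    <- std_error / estimate * 100

c(lower = ci_lower, upper = ci_upper, cv_pct = round(cv_pct, 1))
```

For Asotin the interval runs from $57,679 to $69,769, a spread of roughly $12,000 around a $63,724 estimate.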

Part 3: Neighborhood-Level Analysis

3.1 Focus Area Selection

Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.

Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.

# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties

selected_counties <- wa_data %>%
  filter(county %in% c("Garfield", "Jefferson", "Pierce"))

# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
selected_counties <- selected_counties %>%
  select(county, mhiE, moe_percentage, confidence) %>%
  rename(MHI = mhiE)
selected_counties

Comment on the output: The selected counties increase in median household income as reliability becomes greater. Garfield County has the lowest median household income and lowest reliability; Pierce County has the highest of each.

3.2 Tract-Level Demographics

Your Task: Get demographic data for census tracts in your selected counties.

Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
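For the county-code challenge, a county GEOID is the two-digit state FIPS code followed by a three-digit county code, so the county code can be sliced out of the GEOID. A minimal sketch using GEOIDs from the head(wa_data) output (the object names are illustrative):

```r
# County GEOIDs copied from the head(wa_data) output.
geoids <- c(Adams = "53001", Asotin = "53003", Benton = "53005")

state_fips  <- substr(geoids, 1, 2)  # first two characters: state
county_fips <- substr(geoids, 3, 5)  # last three characters: county

county_fips
```

The same slicing applied to the GEOID column of the county table would give the codes for any selected county.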

# Define your race/ethnicity variables with descriptive names

race_variables <- c(white = "B03002_003", black = "B03002_004", latino = "B03002_012", totalpop = "B03002_001")
                    

# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter

selected_data <- get_acs(geography = "tract",
                          state = my_state,
                          county = c("Jefferson","Pierce","Garfield"),
                          variables = race_variables,
                          year = 2022, 
                          survey = "acs5",
                          output = "wide",
                          geometry = FALSE)

# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations

selected_data <- selected_data %>%
                  mutate(percent_white = (whiteE / totalpopE) * 100,
                          percent_black = (blackE / totalpopE) * 100,
                          percent_latino = (latinoE / totalpopE) * 100)

# Add readable tract and county name columns using str_extract() or similar

selected_data <- selected_data %>%
  mutate(
    tract = str_extract(NAME, "(?<=Tract )[^;]+"),
   county = NAME %>%
      str_extract("(?<=;).*") %>%
      str_remove(" County.*$") %>%
      str_trim())
  
selected_data <- selected_data %>%
  select(-NAME)

selected_data <- selected_data %>% 
  relocate(tract, county)
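To see what the regular expressions above do, it can help to run them on a single string. The sketch below assumes the semicolon-separated NAME format that the patterns expect; the tract number is a made-up example, not a value taken from the data.

```r
library(stringr)

# Illustrative tract name; the tract number is hypothetical.
example_name <- "Census Tract 9501; Jefferson County; Washington"

# Everything after "Tract " up to the first semicolon
tract <- str_extract(example_name, "(?<=Tract )[^;]+")

county <- example_name |>
  str_extract("(?<=;).*") |>     # everything after the first semicolon
  str_remove(" County.*$") |>    # drop " County" and the state that follows
  str_trim()                     # remove the leading space

c(tract = tract, county = county)
```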

3.3 Demographic Analysis

Your Task: Analyze the demographic patterns in your selected areas.

# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract

highest_latino <- selected_data %>%
  arrange(desc(percent_latino))
  
highest_latino <- highest_latino %>% slice(1)

# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group

selected_data <- selected_data %>%
  mutate(
    pct_white_tract  = (whiteE / totalpopE) * 100,
    pct_black_tract  = (blackE / totalpopE) * 100,
    pct_latino_tract = (latinoE / totalpopE) * 100)
demographics_by_county <- selected_data %>%
  group_by(county) %>%
  summarize(
    number_tracts       = n(),
    avg_white_per_tract = mean(pct_white_tract, na.rm = TRUE),
    avg_black_per_tract = mean(pct_black_tract, na.rm = TRUE),
    avg_latino_per_tract= mean(pct_latino_tract, na.rm = TRUE)
  )
demographics_by_county <- demographics_by_county %>%
  # Join on county name rather than relying on the two tables sharing a row order
  left_join(select(selected_counties, county, Confidence = confidence), by = "county")


# Create a nicely formatted table of your results using kable()
demographics_table <- demographics_by_county %>%
  kable(col.names = c("County", "# Tracts", "Average % White Alone per Tract", "Average % Black/African American per Tract", "Average % Hispanic/Latino per Tract", "Confidence"),
        caption = "Average Demographics by Census Tract in Garfield, Jefferson, and Pierce Counties, Washington",
        digits = 1)

# Print the table so it appears in the rendered document
demographics_table
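One caveat when reading these averages: the mean of tract percentages weights every tract equally, so it can differ from the county-wide share, which weights tracts by population. The sketch below uses two made-up tracts (all numbers are hypothetical) to show the difference.

```r
library(dplyr)

# Hypothetical tracts; values are invented for illustration only.
toy_tracts <- data.frame(
  county    = c("Example", "Example"),
  totalpopE = c(1000, 4000),
  latinoE   = c(100, 1200)
)

comparison <- toy_tracts %>%
  mutate(pct_latino_tract = latinoE / totalpopE * 100) %>%
  group_by(county) %>%
  summarize(
    mean_of_tract_pct = mean(pct_latino_tract),              # unweighted: (10 + 30) / 2
    pooled_pct        = sum(latinoE) / sum(totalpopE) * 100  # weighted: 1300 / 5000
  )

comparison
```

In this toy case the unweighted mean is 20% while the pooled share is 26%, because the larger tract has the higher percentage.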

Part 4: Comprehensive Data Quality Evaluation

4.1 MOE Analysis for Demographic Variables

Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.

Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics

# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)

selected_data <- selected_data %>%
  mutate(white_moe = (whiteM/whiteE) * 100,
         black_moe = (blackM/blackE) * 100,
         latino_moe = (latinoM/latinoE) * 100)

# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement

selected_data <- selected_data %>%
  mutate(high_error = case_when(white_moe > 15 |
                                  black_moe > 15 |
                                  latino_moe > 15 ~ "unreliable",
                                  TRUE ~ "reliable"))
selected_data <- selected_data %>% 
  mutate(totalpop_moe = (totalpopM/totalpopE) * 100)
selected_data <- selected_data %>%
  mutate(totalpop_reliability = case_when(totalpop_moe > 15 ~ "unreliable",
                                  TRUE ~ "reliable"))
selected_data <- selected_data %>%
  mutate(white_only_high_error = case_when(white_moe > 15 ~ "unreliable",
                                           TRUE ~ "reliable"))
  


# This flags every tract as unreliable. Many census tracts, particularly in Jefferson and Garfield Counties, have very small populations, which inflates the margin of error as a percentage of the estimate. The effect is especially pronounced for the Hispanic/Latino and Black/African American estimates. Because a large proportion of Washington's population is white, I isolated the white MOE to see whether it gave more varied results. I also calculated the MOE percentage for each tract's total population to assess reliability in aggregate.
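One mechanical reason a tract can be flagged: when a group's estimate is zero or very small, margin / estimate is infinite or extremely large. The sketch below, with invented numbers, shows how the percentage behaves and one possible way to label zero-estimate tracts separately instead of folding them into "unreliable". The data, the column black_flag, and the "no estimate" label are assumptions, not part of the analysis above.

```r
library(dplyr)

# Hypothetical tract-level estimates (E) and margins of error (M).
toy <- data.frame(
  tract  = c("A", "B", "C"),
  blackE = c(0, 12, 400),
  blackM = c(13, 15, 50)
)

flagged <- toy %>%
  mutate(
    black_moe = blackM / blackE * 100,   # Inf when the estimate is 0
    black_flag = case_when(
      blackE == 0    ~ "no estimate",
      black_moe > 15 ~ "unreliable",
      TRUE           ~ "reliable"
    )
  )

flagged
```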

selected_data %>% 
  count(white_only_high_error)

# A tibble: 2 × 2
  white_only_high_error     n
  <chr>                 <int>
1 reliable                 80
2 unreliable              125

selected_data %>% 
  count(totalpop_reliability)

# A tibble: 2 × 2
  totalpop_reliability     n
  <chr>                <int>
1 reliable               150
2 unreliable              55

# When only the white population estimate is considered, 80 of 205 tracts (about 39%) have reliable data. That is still a minority of tracts, but it shows that estimates for white populations are considerably more reliable than estimates for Black and Hispanic/Latino populations.
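The shares quoted here can be computed directly from the counts rather than by hand. A minimal sketch, using the counts printed by count(white_only_high_error) above (the object names are illustrative):

```r
library(dplyr)

# Counts copied from the count(white_only_high_error) output above.
white_counts <- data.frame(
  white_only_high_error = c("reliable", "unreliable"),
  n = c(80, 125)
)

white_shares <- white_counts %>%
  mutate(percent = round(n / sum(n) * 100, 1))

white_shares
```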
         


           
           

# Create summary statistics showing how many tracts have data quality issues
selected_data %>% 
  count(high_error)

# A tibble: 1 × 2
  high_error     n
  <chr>      <int>
1 unreliable   205

selected_data %>%
  count(latino_moe > 15)

# A tibble: 2 × 2
  `latino_moe > 15`     n
  <lgl>             <int>
1 FALSE                 1
2 TRUE                204

selected_data %>%
  count(black_moe > 15)

# A tibble: 1 × 2
  `black_moe > 15`     n
  <lgl>            <int>
1 TRUE               205

# All 205 tracts (100%) had unreliable data (MOE > 15%) for at least one racial or ethnic group. For the white population, 125 of 205 tracts (61%) were unreliable. For total tract population, 55 of 205 tracts (27%) were unreliable.
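The group percentages computed in Part 3 (for example percent_latino) are derived estimates, so they carry their own margins of error. Below is a minimal sketch of the standard approximation for the MOE of a proportion; the function name moe_of_proportion and the input numbers are hypothetical. tidycensus provides moe_prop() for this purpose, which would likely be the more convenient choice in practice.

```r
# Approximate MOE of a proportion p = num / denom.
moe_of_proportion <- function(num, denom, moe_num, moe_denom) {
  p <- num / denom
  radicand <- moe_num^2 - p^2 * moe_denom^2
  # When the radicand is negative, fall back to the ratio formula (plus sign).
  radicand <- ifelse(radicand < 0, moe_num^2 + p^2 * moe_denom^2, radicand)
  sqrt(radicand) / denom
}

# Hypothetical tract: 120 Hispanic/Latino residents (MOE 60) out of 2,400 (MOE 200)
p     <- 120 / 2400
moe_p <- moe_of_proportion(120, 2400, 60, 200)

c(percent = p * 100, moe_points = round(moe_p * 100, 2))
```

In this invented case a 5% share carries a margin of roughly 2.5 percentage points, about half the size of the estimate itself.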

4.2 Pattern Analysis

Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.

# Group tracts by whether they have high MOE issues
# Having established that margins of error are much higher for non-white groups, and that Washington, and specifically Garfield (89%) and Jefferson (85%) Counties, has a high proportion of white residents, I chose to evaluate patterns based on the reliability of each tract's total population estimate.
# Calculate average characteristics for each group:
# - population size, demographic percentages
# Use group_by() and summarize() to create this comparison

tract_averages <- selected_data %>%
  group_by(totalpop_reliability) %>%
    summarize(
      avg_tract_pop = mean(totalpopE, na.rm = TRUE),
      avg_tract_white = mean(pct_white_tract, na.rm = TRUE),
      avg_tract_black = mean(pct_black_tract, na.rm = TRUE),
      avg_tract_latino = mean(pct_latino_tract, na.rm = TRUE))
      


# Create a professional table showing the patterns
tract_averages_table <- tract_averages %>%
  kable(col.names = c("Reliability Category", "Average Population (count)", "Average White Percentage", "Average Black/African American Percentage", "Average Hispanic/Latino Percentage"),
        caption = "Tract-level statistics by reliability category",
        digits = 1)
tract_averages_table

Pattern Analysis: Yes. Estimates for non-white groups are far less reliable than estimates for the white population. The main reason is group size: because the Black and Hispanic/Latino populations are small relative to the white population, especially in Jefferson and Garfield Counties, a margin of error of a given size is a much larger share of the estimate, so the MOE percentage is much higher. Non-white communities may also not be surveyed equitably, and the areas where their populations are larger are particularly unreliable. Because every tract was flagged as unreliable when any one racial group had unreliable data, I grouped tracts by the reliability of the total population estimate instead, which may dilute the effect that each group's share has on reliability. Even so, the large margins of error show how unreliable these Washington counties’ data on non-white populations are.

Part 5: Policy Recommendations

5.1 Analysis Integration and Professional Summary

Your Task: Write an executive summary that integrates findings from all four analyses.

Executive Summary Requirements:
1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
4. Strategic Recommendations: What should the Department implement to address these systematic issues?

Executive Summary: Washington State Department of Human Services:

An analysis of data reliability in Washington reveals clear and consistent patterns across counties and census tracts. Of the 39 counties examined, 21 were classified as high confidence, 15 as moderate confidence, and only 3 as low confidence. The state’s four most populous counties (King, Snohomish, Pierce, and Spokane) also exhibited the highest levels of data reliability. These counties are major urban centers and, in most cases, have higher median household incomes. King, Snohomish, and Pierce Counties, which are all part of the Seattle-Tacoma-Bellevue MSA, have some of the highest median incomes in the state. Similarly, nine of the ten counties with the highest median incomes were classified as high confidence. In contrast, the counties with the lowest reliability were predominantly small, rural, and sparsely populated, including Garfield, Wahkiakum, and Pend Oreille Counties.

Findings indicate that communities with higher proportions of Black and Latino residents face the greatest risk of algorithmic bias due to data unreliability. Non-white populations are largely concentrated in urban areas, particularly in Pierce County, where average Latino and Black populations were highest at the tract level. However, nearly all tracts exhibited substantial margins of error for these groups. Only one tract had a Latino margin of error below 15%, and all tracts had Black margins of error exceeding 15%. White population estimates also showed notable uncertainty, with 125 tracts exceeding a 15% margin of error, but these rates were substantially lower than those observed for Black and Latino populations. Reliable tracts tended to have higher average White populations and lower average Black and Latino populations, indicating systematic disparities in data quality across demographic groups.

The primary drivers of data quality limitations and bias risk emerge from population size, sampling constraints, and geographic context. Smaller counties and rural areas tend to have limited survey samples, resulting in high margins of error. Similarly, demographic groups with smaller populations at the tract level are subject to disproportionately large margins of error. Because the American Community Survey relies on sample-based estimates rather than full population counts, areas with low population density or small subgroup populations experience greater unreliability in estimates. Higher-income and urban counties benefit from larger and more stable sample sizes, leading to more reliable data. Conversely, rural, low-population, and lower-income areas face structural disadvantages in data quality that propagate into analytic systems.
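The link between sample size and relative error can be illustrated with the textbook formula for a proportion under simple random sampling. This is a simplification: the ACS uses a more complex design, and the 10% share and the sample sizes below are invented for illustration.

```r
# Relative 90 percent MOE of an estimated proportion under simple random sampling.
relative_moe_pct <- function(p, n) {
  std_error <- sqrt(p * (1 - p) / n)
  moe <- 1.645 * std_error
  moe / p * 100
}

small_sample <- relative_moe_pct(p = 0.10, n = 50)    # hypothetical small rural sample
large_sample <- relative_moe_pct(p = 0.10, n = 5000)  # hypothetical large urban sample

round(c(small = small_sample, large = large_sample), 1)
```

With 100 times the sample, the relative margin shrinks by a factor of 10 (the square root of 100), which is the pattern the paragraph above describes for populous versus sparsely populated areas.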

To address these systematic issues, the Department should implement several targeted strategies. Reliability thresholds and subgroup population minimums should be formally incorporated into analytic workflows to flag and contextualize unstable estimates. Reporting frameworks should transparently communicate margins of error and data limitations, particularly when informing policy or resource allocation decisions. Finally, the Department should consider supplementing ACS data with administrative records, local surveys, and community-based data collection efforts in underrepresented areas. Together, these measures will reduce bias risk, improve analytical equity, and strengthen the foundation for data-informed decision-making.
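As one possible shape for the reliability thresholds and subgroup minimums recommended above, the sketch below wraps both checks in a single function. The function name flag_estimate, the 10% MOE cut-off reused from Part 2, and the minimum estimate of 50 are assumptions that the Department would need to set; they are not values derived from this analysis.

```r
library(dplyr)

# Hypothetical helper: label each estimate by whether it is safe to use.
flag_estimate <- function(estimate, moe, max_moe_pct = 10, min_estimate = 50) {
  moe_pct <- moe / estimate * 100
  case_when(
    is.na(estimate) | is.na(moe) ~ "missing",
    estimate < min_estimate      ~ "below minimum size",
    moe_pct > max_moe_pct        ~ "unstable - review manually",
    TRUE                         ~ "usable"
  )
}

# Hypothetical estimates and margins of error
flags <- flag_estimate(estimate = c(20, 400, 5000, NA), moe = c(15, 90, 150, 10))
flags
```

Checking the size minimum before the MOE threshold keeps very small subgroups from being reported as merely "unstable" when they may be too small to publish at all.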

5.2 Specific Recommendations

Your Task: Create a decision framework for algorithm implementation.

# Create a summary table using your county reliability data
## Include: county name, median income, MOE percentage, reliability category

county_reliability <- wa_data %>%
  select(county, mhiE, moe_percentage, confidence)

# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"
# - Low Confidence: "Requires manual review or additional data"
county_reliability <- county_reliability %>%
  mutate(recommendation =
           case_when(confidence == "high" ~ "Safe for algorithmic decisions",
                     confidence == "moderate" ~ "Use with caution - monitor outcomes",
                     confidence == "low" ~ "Requires manual review or additional data"))

# Format as a professional table with kable()

county_reliability_table <- county_reliability %>%
  kable(col.names = c("County Name", "Median Household Income", "MOE %", "Confidence", "Recommendation"),
        caption = "Washington Counties by Reliability of Median Household Income Data",
        digits = 1)

# Print the table so it appears in the rendered document
county_reliability_table

Key Recommendations:

Your Task: Use your analysis results to provide specific guidance to the department.

Counties suitable for immediate algorithmic implementation: [List counties with high confidence data and explain why they’re appropriate]
Counties requiring additional oversight: [List counties with moderate confidence data and describe what kind of monitoring would be needed]
Counties needing alternative approaches: [List counties with low confidence data and suggest specific alternatives - manual review, additional surveys, etc.]

high_confidence_counties <- county_reliability %>%
  filter(confidence == "high") %>%
  select(county)
print(high_confidence_counties)

# A tibble: 21 × 1
   county      
   <chr>       
 1 Benton      
 2 Clallam     
 3 Clark       
 4 Cowlitz     
 5 Douglas     
 6 Grays Harbor
 7 Island      
 8 King        
 9 Kitsap      
10 Kittitas    
# ℹ 11 more rows
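The tibble printout above stops at ten rows, so eleven of the high confidence counties are not visible. Two standard ways to show every row are sketched below on a stand-in tibble with 21 placeholder names (the real county names are not reproduced here).

```r
library(tibble)
library(dplyr)

# Stand-in for high_confidence_counties: 21 placeholder rows.
stand_in <- tibble(county = paste("County", 1:21))

# Option 1: ask the tibble print method for all rows
print(stand_in, n = Inf)

# Option 2: pull the column out as a plain character vector
all_names <- pull(stand_in, county)
all_names
```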

moderate_confidence_counties <- county_reliability %>%
  filter(confidence == "moderate") %>%
    select(county)
print(moderate_confidence_counties)

# A tibble: 15 × 1
   county     
   <chr>      
 1 Adams      
 2 Asotin     
 3 Chelan     
 4 Columbia   
 5 Ferry      
 6 Franklin   
 7 Grant      
 8 Jefferson  
 9 Klickitat  
10 Mason      
11 Pacific    
12 Skamania   
13 Stevens    
14 Walla Walla
15 Whitman

low_confidence_counties <- county_reliability %>%
  filter(confidence == "low") %>%
    select(county)
print(low_confidence_counties)

# A tibble: 3 × 1
  county      
  <chr>       
1 Garfield    
2 Pend Oreille
3 Wahkiakum

For the low confidence counties (Garfield, Pend Oreille, and Wahkiakum), additional survey methods may be needed. Wahkiakum and Garfield are two of Washington's three smallest counties, with populations of around 4,900 and 2,500, respectively. Their small size makes additional data collection more logistically feasible.

Questions for Further Investigation

[List 2-3 questions that your analysis raised that you’d like to explore further in future assignments. Consider questions about spatial patterns, time trends, or other demographic factors.]

Question 1: Are the counties with the lowest reliability clustered spatially or by shared characteristics? For example, are rural counties more likely to have less reliable data?

Question 2: Washington is home to several Indian reservations. How does the share of a county's land or population within a reservation relate to the reliability of county data?

Question 3: Do the counties with the largest Black/African American and Hispanic/Latino populations differ in data reliability from those with smaller non-white populations?

Technical Notes

Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on 2/2/2026

Reproducibility:
- All analysis conducted in R version 4.4.2
- Census API key required for replication
- Complete code and documentation available at: https://lizmcrouse.github.io/crouse

Methodology Notes: The main challenge in reproducing this analysis is that Washington's ACS estimates have high rates of unreliability, especially for non-white groups. When tracts were flagged for an MOE above 15% in any one of the three racial and ethnic groups analyzed (white, Hispanic/Latino, Black/African American), every tract came back as unreliable. For the remainder of the analysis, I therefore relied on the MOE of the total population estimate to assess reliability. In another state, the original method of examining each group separately may yield more nuanced results than it did for Washington.

Limitations: Washington has a relatively small non-white population, and non-white groups are concentrated in specific counties and areas. This could create stark outliers that distort the analysis. Washington is also growing rapidly, so 2018-2022 data could be out of date and undercount certain population groups.

Submission Checklist

Before submitting your portfolio link on Canvas:

All code chunks run without errors
All “[Fill this in]” prompts have been completed
Tables are properly formatted and readable
Executive summary addresses all four required components
Portfolio navigation includes this assignment
Census API key is properly set
Document renders correctly to HTML

Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/labs/lab_1/your_file_name.html