# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidyverse)
library(tidycensus)
library(knitr)
Lab 1: Census Data Quality for Policy Decisions
Evaluating Data Reliability for Algorithmic Decision-Making
Assignment Overview
Scenario
You are a data analyst for the Washington Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.
Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.
Learning Objectives
- Apply dplyr functions to real census data for policy analysis
- Evaluate data quality using margins of error
- Connect technical analysis to algorithmic decision-making
- Identify potential equity implications of data reliability issues
- Create professional documentation for policy stakeholders
Submission Instructions
Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/labs/lab_1/
Make sure to update your _quarto.yml navigation to include this assignment under a “Labs” menu.
Part 1: Portfolio Integration
Create this assignment in your portfolio repository under a labs/lab_1/ folder structure. Update your navigation menu to include:
- text: Assignments
  menu:
    - href: labs/lab_1/your_file_name.qmd
      text: "Lab 1: Census Data Exploration"
If the text contains a special character such as a colon, wrap it in double quotation marks so that Quarto reads it as text.
Setup
# Set your Census API key
census_api_key("fadfb87c0c7417766780c9bf52b8df9adef61e21", overwrite = TRUE, install = TRUE)
[1] "fadfb87c0c7417766780c9bf52b8df9adef61e21"
Sys.getenv("CENSUS_API_KEY")
[1] "fadfb87c0c7417766780c9bf52b8df9adef61e21"
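Writing the key directly into a published document exposes it to anyone who reads the page. A minimal sketch of one alternative, assuming the key has already been saved to the CENSUS_API_KEY environment variable (for example by an earlier census_api_key(..., install = TRUE) call); the helper name get_census_key is hypothetical:

```r
# Hypothetical helper: read the Census API key from the environment
# instead of typing it into the document.
get_census_key <- function(var = "CENSUS_API_KEY") {
  key <- Sys.getenv(var)
  if (!nzchar(key)) {
    stop("No key found in ", var, "; set it in .Renviron before rendering.")
  }
  key
}
```

With a helper like this, the setup chunk could call census_api_key(get_census_key()) so the key itself never appears in the rendered page.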
Choose your state for analysis - assign it to a variable called my_state
my_state <- "WA"
State Selection: I have chosen Washington for this analysis because, having worked in housing policy in the state, I understand the nuances of both its demographics and its public service administration.
Part 2: County-Level Resource Assessment
2.1 Data Retrieval
Your Task: Use get_acs() to retrieve county-level data for your chosen state.
Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide
Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.
# Write your get_acs() code here
variables <- c(mhi = "B19013_001",
pop = "B01003_001")
wa_data <- get_acs(geography = "county",
state = my_state,
variables = variables,
year = 2022,
survey = "acs5",
output = "wide")
# Clean the county names to remove state name and "County"
# Hint: use mutate() with str_remove()
wa_data <- wa_data %>%
  # Refer to the column directly; wa_data$NAME is unnecessary inside the pipe
  mutate(county = str_remove(NAME, " County, Washington")) %>%
  select(GEOID, county, mhiE, mhiM, popE)
# Display the first few rows
head(wa_data)
# A tibble: 6 × 5
  GEOID county    mhiE  mhiM   popE
  <chr> <chr>    <dbl> <dbl>  <dbl>
1 53001 Adams    63105  3509  20557
2 53003 Asotin   63724  6045  22370
3 53005 Benton   83778  1849 207560
4 53007 Chelan   71876  4147  79076
5 53009 Clallam  66108  2368  77333
6 53011 Clark    90115  1650 504091
2.2 Data Quality Assessment
Your Task: Calculate margin of error percentages and create reliability categories.
Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)
Hint: Use mutate() with case_when() for the categories.
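As a worked check of the formula and thresholds, the sketch below applies them to three of the counties printed in the head() output above. The small data frame is typed in by hand for illustration, and the unreliable column is one possible way to express the MOE > 10% flag.

```r
library(dplyr)

# Estimates and margins of error copied from the first rows of wa_data
example <- data.frame(
  county = c("Adams", "Asotin", "Benton"),
  mhiE   = c(63105, 63724, 83778),
  mhiM   = c(3509, 6045, 1849)
)

example <- example %>%
  mutate(
    # MOE as a percentage of the estimate
    moe_percentage = mhiM / mhiE * 100,
    # Same cut points as the requirements: < 5, 5-10, > 10
    confidence = case_when(
      moe_percentage < 5   ~ "high",
      moe_percentage <= 10 ~ "moderate",
      TRUE                 ~ "low"
    ),
    # Logical flag for estimates above the 10% threshold
    unreliable = moe_percentage > 10
  )
```

Adams works out to roughly 5.6% and Asotin to roughly 9.5% (both moderate), while Benton is roughly 2.2% (high).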
# Calculate MOE percentage and reliability categories using mutate()
wa_data <- wa_data %>%
  mutate(moe_percentage = (mhiM / mhiE) * 100)
wa_data <- wa_data %>%
mutate(confidence =
case_when(moe_percentage < 5 ~ "high",
moe_percentage >= 5 & moe_percentage <= 10 ~ "moderate",
moe_percentage > 10 ~ "low"))
# Create a summary showing count of counties in each reliability category
# Hint: use count() and mutate() to add percentages
confidence_summary <- wa_data %>%
  count(confidence, sort = TRUE, name = "county_confidence") %>%
  mutate(
    # Share of all counties that fall in each reliability category
    percentages = round(county_confidence / sum(county_confidence) * 100, 1),
    # MOE range that defines each category
    moe_range = case_when(confidence == "high" ~ "less than 5%",
                          confidence == "moderate" ~ "5-10%",
                          confidence == "low" ~ "more than 10%"))
2.3 High Uncertainty Counties
Your Task: Identify the 5 counties with the highest MOE percentages.
Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()
Hint: Use arrange(), slice(), and select() functions.
# Create table of top 5 counties by MOE percentage
uncertain_counties <- wa_data %>%
arrange(desc(moe_percentage))
uncertain_counties <- uncertain_counties %>%
slice(1:5)
uncertain_counties <- uncertain_counties %>%
select(county, mhiE, mhiM, moe_percentage, confidence)
# Format as table with kable() - include appropriate column names and caption
uncertain_table <- uncertain_counties %>%
  kable(col.names = c("County", "Median Household Income", "Margin of Error", "MoE %", "Confidence Category"),
        caption = "Five Washington Counties with Least Reliable Income Data")
# Print the table so it appears in the rendered page; view() only opens an interactive viewer
uncertain_table
Data Quality Commentary:
The five counties with the least reliable data in Washington are Garfield, Pend Oreille, Wahkiakum, Asotin, and Ferry. All five are in the bottom 25% of counties by population, and all are rural and distant from the state's main population centers. They also have relatively low median household incomes: all are under $65,000, and four are under $60,000. An algorithm that relies on income data would therefore be drawing its least reliable estimates from some of the state's poorest counties, which could bias its decisions. Counties that combine low data reliability with low median incomes need closer attention.
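The bottom-quartile claim can be checked with a quantile cutoff. The sketch below uses made-up populations for eight hypothetical counties (labelled A to H) purely to show the mechanics; for the real check, the popE column of wa_data would take the place of pop.

```r
# Hypothetical populations for eight counties (illustrative values, not ACS data)
pop <- c(A = 2300, B = 4500, C = 7200, D = 22000,
         E = 46000, F = 79000, G = 207000, H = 504000)

# 25th percentile of county population
cutoff <- quantile(pop, probs = 0.25)

# Counties at or below the cutoff fall in the bottom quarter by population
bottom_quartile <- names(pop)[pop <= cutoff]
```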
Part 3: Neighborhood-Level Analysis
3.1 Focus Area Selection
Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.
Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.
# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- wa_data %>%
  filter(county %in% c("Garfield", "Jefferson", "Pierce"))
# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
selected_counties <- selected_counties %>%
select(county, mhiE, moe_percentage, confidence)
selected_counties <- selected_counties %>%
  rename(MHI = mhiE)
# Print the result so it appears in the rendered page; view() only opens an interactive viewer
selected_counties
Comment on the output: Among the selected counties, median household income rises as reliability improves. Garfield County has the lowest median household income and the lowest reliability; Pierce County has the highest of each.
3.2 Tract-Level Demographics
Your Task: Get demographic data for census tracts in your selected counties.
Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
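On the GEOID hint: a county GEOID is the two-digit state FIPS code followed by the three-digit county code. A minimal sketch using three GEOIDs from the county table shown earlier (the vector is typed in by hand for illustration):

```r
# County GEOIDs as they appear in the county-level table
county_geoids <- c(Adams = "53001", Asotin = "53003", Benton = "53005")

# First two characters: state FIPS code; last three: county code
state_fips  <- substr(county_geoids, 1, 2)
county_fips <- substr(county_geoids, 3, 5)
```

The county argument of get_acs() accepts these three-digit codes as well as county names, which is why passing names also works in the code for this section.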
# Define your race/ethnicity variables with descriptive names
race_variables <- c(white = "B03002_003", black = "B03002_004", latino = "B03002_012", totalpop = "B03002_001")
# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
selected_data <- get_acs(geography = "tract",
state = my_state,
county = c("Jefferson","Pierce","Garfield"),
variables = race_variables,
year = 2022,
survey = "acs5",
output = "wide",
geometry = FALSE)
# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
selected_data <- selected_data %>%
mutate(percent_white = (whiteE / totalpopE) * 100,
percent_black = (blackE / totalpopE) * 100,
percent_latino = (latinoE / totalpopE) * 100)
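The name parsing in the next step relies on regular expressions, which are easy to get subtly wrong, so it can help to test them on a single string first. The sketch below assumes the 2022 tract NAME format is semicolon-separated; the example string, including its tract number, is hypothetical.

```r
library(stringr)

# Hypothetical NAME value in the assumed "Census Tract <id>; <county> County; <state>" format
name <- "Census Tract 101.02; Pierce County; Washington"

# Everything after "Tract " up to the first semicolon
tract <- str_extract(name, "(?<=Tract )[^;]+")

# Everything after the first semicolon, with " County; <state>" and padding removed
county <- str_trim(str_remove(str_extract(name, "(?<=;).*"), " County.*$"))
```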
# Add readable tract and county name columns using str_extract() or similar
selected_data <- selected_data %>%
mutate(
tract = str_extract(NAME, "(?<=Tract )[^;]+"),
county = NAME %>%
str_extract("(?<=;).*") %>%
str_remove(" County.*$") %>%
str_trim())
selected_data <- selected_data %>%
select(-NAME)
selected_data <- selected_data %>%
relocate(tract, county)
3.3 Demographic Analysis
Your Task: Analyze the demographic patterns in your selected areas.
# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
highest_latino <- selected_data %>%
arrange(desc(percent_latino))
highest_latino <- highest_latino %>% slice(1)
# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
selected_data <- selected_data %>%
mutate(
pct_white_tract = (whiteE / totalpopE) * 100,
pct_black_tract = (blackE / totalpopE) * 100,
pct_latino_tract = (latinoE / totalpopE) * 100)
demographics_by_county <- selected_data %>%
group_by(county) %>%
summarize(
number_tracts = n(),
avg_white_per_tract = mean(pct_white_tract, na.rm = TRUE),
avg_black_per_tract = mean(pct_black_tract, na.rm = TRUE),
avg_latino_per_tract= mean(pct_latino_tract, na.rm = TRUE)
)
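demographics_by_county <- demographics_by_county %>%
  # Join on county name rather than relying on both tables sharing the same row order
  left_join(select(selected_counties, county, Confidence = confidence), by = "county")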
# Create a nicely formatted table of your results using kable()
demographics_table <- demographics_by_county %>%
kable(col.names = c("County", "# Tracts", "Average White Alone per Tract","Average Black/African-American per Tract","Average Latino/Hispanic per Tract","Confidence"),caption = "Average Demographics by Census Tract in Garfield, Jefferson, and Pierce Counties, Washington")
# Print the table so it appears in the rendered page; view() only opens an interactive viewer
demographics_table
Part 4: Comprehensive Data Quality Evaluation
4.1 MOE Analysis for Demographic Variables
Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.
Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics
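One behavior worth knowing before applying an "any variable" flag: when a group's estimate is zero, margin / estimate is infinite, so the tract is flagged no matter how the other groups look. The sketch below illustrates this with hypothetical tract values (not ACS data); it is one possible reason a flag of this kind can mark every tract as unreliable.

```r
library(dplyr)

# Hypothetical tract estimates (E) and margins of error (M), not ACS values
tracts <- data.frame(
  tract  = c("A", "B", "C"),
  whiteE = c(3000, 1200, 400), whiteM = c(300, 250, 90),
  blackE = c(1000, 0, 20),     blackM = c(100, 13, 25)
)

tracts <- tracts %>%
  mutate(
    white_moe = whiteM / whiteE * 100,
    black_moe = blackM / blackE * 100,  # 13 / 0 gives Inf for tract B
    # Flag the tract if either group exceeds the 15% threshold
    high_error = if_else(white_moe > 15 | black_moe > 15, "unreliable", "reliable")
  )
```

Only tract A, where both groups sit at 10%, comes back as reliable.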
# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
selected_data <- selected_data %>%
mutate(white_moe = (whiteM/whiteE) * 100,
black_moe = (blackM/blackE) * 100,
latino_moe = (latinoM/latinoE) * 100)
# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement
selected_data <- selected_data %>%
mutate(high_error = case_when(white_moe > 15 |
black_moe > 15 |
latino_moe > 15 ~ "unreliable",
TRUE ~ "reliable"))
selected_data <- selected_data %>%
mutate(totalpop_moe = (totalpopM/totalpopE) * 100)
selected_data <- selected_data %>%
mutate(totalpop_reliability = case_when(totalpop_moe > 15 ~ "unreliable",
TRUE ~ "reliable"))
selected_data <- selected_data %>%
mutate(white_only_high_error = case_when(white_moe > 15 ~ "unreliable",
TRUE ~ "reliable"))
This returns all tracts flagged as unreliable. Many of the census tracts, particularly in Jefferson and Garfield Counties, have very small populations, which inflates the percentage margin of error. The effect is especially pronounced in the Hispanic/Latino and Black/African American counts. Because a large proportion of Washington's population is white, I isolated the white MOE to see whether it yielded more varied results. I also calculated the MOE percentage for each tract's total population to assess the aggregate.
selected_data %>%
  count(white_only_high_error)
# A tibble: 2 × 2
  white_only_high_error     n
  <chr>                 <int>
1 reliable                 80
2 unreliable              125
selected_data %>%
  count(totalpop_reliability)
# A tibble: 2 × 2
  totalpop_reliability     n
  <chr>                <int>
1 reliable               150
2 unreliable              55
When isolated to the white population only, 80 of 205 tracts (~39%) have reliable data. This is still relatively few, but it shows that estimates for white populations are considerably more reliable than those for minority populations.
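The shares quoted here follow directly from the counts printed above. A short sketch of the arithmetic, with the counts typed in by hand:

```r
# Tract counts taken from the count() output above
white_counts    <- c(reliable = 80,  unreliable = 125)
totalpop_counts <- c(reliable = 150, unreliable = 55)

# Share of tracts in each category, as a percentage rounded to one decimal
white_share    <- round(white_counts / sum(white_counts) * 100, 1)
totalpop_share <- round(totalpop_counts / sum(totalpop_counts) * 100, 1)
```

This gives 39.0% reliable and 61.0% unreliable for the white estimates, and 73.2% reliable and 26.8% unreliable for total population.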
# Create summary statistics showing how many tracts have data quality issues
selected_data %>%
  count(high_error)
# A tibble: 1 × 2
  high_error     n
  <chr>      <int>
1 unreliable   205
selected_data %>%
  count(white_only_high_error)
# A tibble: 2 × 2
  white_only_high_error     n
  <chr>                 <int>
1 reliable                 80
2 unreliable              125
selected_data %>%
  count(totalpop_reliability)
# A tibble: 2 × 2
  totalpop_reliability     n
  <chr>                <int>
1 reliable               150
2 unreliable              55
selected_data %>%
  count(latino_moe > 15)
# A tibble: 2 × 2
  `latino_moe > 15`     n
  <lgl>             <int>
1 FALSE                 1
2 TRUE                204
selected_data %>%
  count(black_moe > 15)
# A tibble: 1 × 2
  `black_moe > 15`     n
  <lgl>            <int>
1 TRUE               205
All 205 tracts (100%) had unreliable data (MOE > 15%) for at least one racial or ethnic group. 125 of 205 tracts (61%) had unreliable data for their white population. For total tract population, 55 of 205 tracts (27%) were unreliable.
4.2 Pattern Analysis
Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.
# Group tracts by whether they have high MOE issues
Having established that margins of error for non-white groups are considerably higher, and that Washington, and specifically Garfield (89%) and Jefferson (85%) Counties, has a high proportion of white residents, I chose to evaluate patterns based on the reliability of each tract's total population estimate.
# Calculate average characteristics for each group:
# - population size, demographic percentages
# Use group_by() and summarize() to create this comparison
tract_averages <- selected_data %>%
group_by(totalpop_reliability) %>%
summarize(
avg_tract_pop = mean(totalpopE, na.rm = TRUE),
avg_tract_white = mean(pct_white_tract, na.rm = TRUE),
avg_tract_black = mean(pct_black_tract, na.rm = TRUE),
avg_tract_latino = mean(pct_latino_tract, na.rm = TRUE))
# Create a professional table showing the patterns
tract_averages_table <- tract_averages %>%
kable(col.names = c("Reliability Category","Average Population (count)","Average White Percentage","Average Black/African-American Percentage","Average Hispanic/Latino Percentage"), caption = "Tract level statistics by reliability category")
# Print the table so it appears in the rendered page; view() only opens an interactive viewer
tract_averages_table
Pattern Analysis: [Describe any patterns you observe. Do certain types of communities have less reliable data? What might explain this?]
Yes, non-white groups have considerably less reliable data than their white counterparts. There are a few likely reasons. First, because minority groups are small relative to the white population, especially in Jefferson and Garfield Counties, the same absolute margin of error is a much larger share of a small estimate, so their MOE percentages are much higher. Second, non-white communities may not be surveyed equitably, and the areas where their populations are higher are particularly unreliable. Because every tract was flagged as unreliable when any one racial group had unreliable data, I grouped tracts by total population reliability instead, which may have diluted the effect that each group's share has on unreliability. Even so, the large margins of error show how unreliable these Washington counties’ data on non-white populations are.
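The effect of small populations can be made concrete with a short sketch. It uses hypothetical subgroup numbers and the Census Bureau's approximation for the margin of error of a sum of estimates (the square root of the sum of squared MOEs); tidycensus provides moe_sum() for the same calculation.

```r
# Hypothetical subgroup estimates and MOEs for four small tracts (not ACS data)
estimate <- c(40, 55, 30, 75)
moe      <- c(35, 40, 28, 45)

# Relative MOE of each tract on its own
tract_moe_pct <- moe / estimate * 100

# Combine the tracts: estimates add, MOEs combine as root-sum-of-squares
combined_estimate <- sum(estimate)
combined_moe      <- sqrt(sum(moe^2))
combined_moe_pct  <- combined_moe / combined_estimate * 100
```

Each tract alone has an MOE of 60% or more of its estimate, while the combined area comes out near 38%. Under these assumptions, aggregating small areas is one way to trade geographic detail for reliability.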
Part 5: Policy Recommendations
5.1 Analysis Integration and Professional Summary
Your Task: Write an executive summary that integrates findings from all four analyses.
Executive Summary Requirements: 1. Overall Pattern Identification: What are the systematic patterns across all your analyses? 2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings? 3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk? 4. Strategic Recommendations: What should the Department implement to address these systematic issues?
Executive Summary: Washington State Department of Human Services:
An analysis of data reliability in Washington state reveals clear and consistent patterns across counties and census tracts. Of the 39 counties examined, 21 were classified as high confidence, 15 as moderate confidence, and only 3 as low confidence. The state’s four most populous counties (King, Snohomish, Pierce, and Spokane) also exhibited the highest levels of data reliability. These counties are major urban centers and, in most cases, have higher median household incomes. King, Snohomish, and Pierce Counties are all part of the Seattle-Tacoma-Bellevue MSA, and they have some of the highest median incomes and the most reliable data in the state. Similarly, nine of the ten counties with the highest median incomes were classified as high confidence. In contrast, counties with the lowest reliability were predominantly small, rural, and sparsely populated, including Garfield, Wahkiakum, and Pend Oreille Counties.
Findings indicate that communities with higher proportions of Black and Latino residents face the greatest risk of algorithmic bias due to data unreliability. Non-white populations are largely concentrated in urban areas, particularly in Pierce County, where average Latino and Black populations were highest at the tract level. However, nearly all tracts exhibited substantial margins of error for these groups. Only one tract had a Latino margin of error below 15%, and all tracts had Black margins of error exceeding 15%. While white population estimates also showed notable uncertainty, with 125 tracts exceeding a 15% margin of error, these rates were substantially lower than those observed for Black and Latino populations. Reliable tracts tended to have higher average white populations and lower average Black and Latino populations, indicating systematic disparities in data quality across demographic groups.
The primary drivers of data quality limitations and bias risk emerge from population size, sampling constraints, and geographic context. Smaller counties and rural areas tend to have limited survey samples, resulting in high margins of error. Similarly, demographic groups with smaller populations at the tract level are subject to disproportionately large margins of error. Because the American Community Survey relies on sample-based estimates rather than full population counts, areas with low population density or small subgroup populations experience greater unreliability in estimates. Higher-income and urban counties benefit from larger and more stable sample sizes, leading to more reliable data. Conversely, rural, low-population, and lower-income areas face structural disadvantages in data quality that propagate into analytic systems.
To address these systematic issues, the Department should implement several targeted strategies. Reliability thresholds and subgroup population minimums should be formally incorporated into analytic workflows to flag and contextualize unstable estimates. Reporting frameworks should transparently communicate margins of error and data limitations, particularly when informing policy or resource allocation decisions. Finally, the Department should consider supplementing ACS data with administrative records, local surveys, and community-based data collection efforts in underrepresented areas. Together, these measures will reduce bias risk, improve analytical equity, and strengthen the foundation for data-informed decision-making.
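As an illustration of how reliability thresholds and subgroup minimums might be built into a workflow, here is a minimal sketch. The function name flag_estimate and both default thresholds are hypothetical choices for illustration, not Department policy or values from this analysis.

```r
# Hypothetical screening rule; thresholds are illustrative defaults only
flag_estimate <- function(estimate, moe, max_moe_pct = 10, min_estimate = 100) {
  moe_pct <- moe / estimate * 100
  ifelse(
    # Missing or very small subgroup estimates are withheld outright
    is.na(estimate) | estimate < min_estimate, "suppress: subgroup too small",
    # Larger estimates with a high relative MOE go to a human reviewer
    ifelse(moe_pct > max_moe_pct, "manual review", "usable")
  )
}

flags <- flag_estimate(estimate = c(5000, 800, 40), moe = c(200, 150, 35))
```

Here the first estimate (4% MOE) is usable, the second (about 19%) is routed to manual review, and the third is suppressed because the subgroup is below the minimum size.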
5.2 Specific Recommendations
Your Task: Create a decision framework for algorithm implementation.
# Create a summary table using your county reliability data
## Include: county name, median income, MOE percentage, reliability category
county_reliability <- wa_data %>%
select(county, mhiE, moe_percentage, confidence)
# Add a new column with algorithm recommendations using case_when():
## - High Confidence: "Safe for algorithmic decisions"
##- Moderate Confidence: "Use with caution - monitor outcomes"
## - Low Confidence: "Requires manual review or additional data"
county_reliability <- county_reliability %>%
mutate(recommendation =
case_when(confidence == "high" ~ "safe for algorithmic decisions",
confidence == "moderate" ~ "use with caution - monitor outcomes",
confidence == "low" ~ "requires manual review or additional data"))
# Format as a professional table with kable()
county_reliability_table <- county_reliability %>%
  kable(col.names = c("County Name", "Median Household Income", "MOE %", "Confidence", "Recommendation"),
        caption = "Washington Counties by Reliability of Median Household Income Data")
# Print the table so it appears in the rendered page
county_reliability_table
Key Recommendations:
Your Task: Use your analysis results to provide specific guidance to the department.
Counties suitable for immediate algorithmic implementation: [List counties with high confidence data and explain why they’re appropriate]
Counties requiring additional oversight: [List counties with moderate confidence data and describe what kind of monitoring would be needed]
Counties needing alternative approaches: [List counties with low confidence data and suggest specific alternatives - manual review, additional surveys, etc.]
high_confidence_counties <- county_reliability %>%
filter(confidence == "high") %>%
select(county)
print(high_confidence_counties)
# A tibble: 21 × 1
   county
   <chr>
 1 Benton
 2 Clallam
 3 Clark
 4 Cowlitz
 5 Douglas
 6 Grays Harbor
 7 Island
 8 King
 9 Kitsap
10 Kittitas
# ℹ 11 more rows
moderate_confidence_counties <- county_reliability %>%
filter(confidence == "moderate") %>%
select(county)
print(moderate_confidence_counties)
# A tibble: 15 × 1
   county
   <chr>
 1 Adams
 2 Asotin
 3 Chelan
 4 Columbia
 5 Ferry
 6 Franklin
 7 Grant
 8 Jefferson
 9 Klickitat
10 Mason
11 Pacific
12 Skamania
13 Stevens
14 Walla Walla
15 Whitman
low_confidence_counties <- county_reliability %>%
filter(confidence == "low") %>%
select(county)
print(low_confidence_counties)
# A tibble: 3 × 1
  county
  <chr>
1 Garfield
2 Pend Oreille
3 Wahkiakum
For the low confidence counties (Garfield, Pend Oreille, and Wahkiakum), additional survey methods may be needed. Wahkiakum and Garfield are two of Washington's three smallest counties, with populations of around 4,900 and 2,500, respectively. Their small size makes additional data collection more logistically feasible.
Questions for Further Investigation
[List 2-3 questions that your analysis raised that you’d like to explore further in future assignments. Consider questions about spatial patterns, time trends, or other demographic factors.]
Question 1: Are the counties with the lowest reliability clustered spatially or by shared characteristics? For example, are rural counties more likely to have less reliable data?
Question 2: Washington is home to several Indian Reservations. How does the share of a county (by land or population) within an Indian Reservation relate to the reliability of county data?
Question 3: Do the counties with the highest Black/African American and Hispanic/Latino populations show significant deviation in data reliability from those with smaller non-white populations?
Technical Notes
Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package on 2/2/2026
Reproducibility:
- All analysis conducted in R version 4.4.2
- Census API key required for replication
- Complete code and documentation available at: https://lizmcrouse.github.io/crouse
Methodology Notes: The key reproducibility challenge in this analysis is that Washington's data has high rates of unreliability, especially for non-white groups. When tracts were flagged for an MOE > 15% in any one of the three analyzed racial groups (white, Latino/Hispanic, Black/African American), every tract came back flagged as unreliable. Therefore, for the remainder of the analysis, I relied on the margin of error for total tract population to assess reliability. When the analysis is reproduced for another state, the original method of examining each racial group separately may yield more nuanced results than it did for Washington.
Limitations: Washington has a relatively small non-white population, and non-white groups are concentrated in specific counties and areas. This could create stark outliers that distort the analysis. Washington is also a rapidly growing state, so 2018-2022 data could be out of date and undercount certain population groups.
Submission Checklist
Before submitting your portfolio link on Canvas:
Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/labs/lab_1/your_file_name.html