Philadelphia Housing Model - Technical Appendix

Author

Daisy, Liz, Johnny, Parker

Published

March 17, 2026

Load Necessary Libraries

Code
# MASS is attached before the tidyverse so that dplyr::select() is not masked
library(MASS)
library(sf)
library(tidyverse)   # attaches dplyr, ggplot2, and readr
library(lubridate)
library(tidycensus)
library(tigris)
library(scales)
library(caret)
library(car)
library(knitr)
library(patchwork)
library(modelsummary)
library(broom)
library(ggtext)
library(here)

Phase 1: Data Preparation

Step 1: Load and Clean Philadelphia Sales Data

Load Primary Data

Code
#| message: false
#| warning: false
#| echo: false

opa <- st_read("lab3_crouse/data/opa_properties_public.geojson", quiet = TRUE) %>%
  st_transform(2272)

Filter to Residential Sales

Code
# Filter to single and multi-family residential sales in 2023-2024
# Remove records with missing or invalid key variables
phl_props_filtered <- opa %>%
  mutate(
    sale_date = as.Date(sale_date),
    sale_year = lubridate::year(sale_date)
  ) %>%
  filter(
    category_code_description %in% c("SINGLE FAMILY", "MULTI FAMILY"),
    sale_year %in% c(2023, 2024),
    !is.na(sale_price),
    !is.na(total_livable_area), total_livable_area > 0,
    !is.na(number_of_bedrooms), number_of_bedrooms > 0,
    !is.na(state_code), state_code == "PA"
  ) %>%
  mutate(
    log_sale_price = log(sale_price)
  )

# Check dimensions after filtering
cat("Rows after filtering:", nrow(phl_props_filtered), "\n")
Rows after filtering: 30859 

Filter outliers

  • The data contain many low-end outliers, so we compared several lower-bound price thresholds to find a cutoff that improves data quality while preserving a meaningful and defensible sample.
Code
# Plot the data to see distribution of sale_price
ggplot(phl_props_filtered, aes(x = sale_price)) +
  geom_histogram(bins = 100, fill = "steelblue", color = "white") +
  scale_x_log10(labels = scales::dollar) + 
  labs(title = "Distribution of Sale Price (Log Scale)",
       x = "Sale Price (log)",
       y = "Count") +
  theme_minimal()
Warning in scale_x_log10(labels = scales::dollar): log-10 transformation
introduced infinite values.
Warning: Removed 102 rows containing non-finite outside the scale range
(`stat_bin()`).

Code
# Above the low-price cluster, the log-scale distribution looks approximately normal

# Check which lower-bound threshold leaves log(sale_price) closest to symmetric
thresholds <- c(1000, 5000, 10000, 20000, 40000, 50000)

map_dfr(thresholds, function(t) {
  d <- phl_props_filtered %>% filter(sale_price > t) %>% pull(sale_price)
  tibble(
    threshold = t,
    n = length(d),
    skewness = (mean(log(d)) - median(log(d))) / sd(log(d)),  # nonparametric skew: (mean - median) / sd
    mean_log = mean(log(d)),
    sd_log = sd(log(d))
  )
})
# A tibble: 6 × 5
  threshold     n skewness mean_log sd_log
      <dbl> <int>    <dbl>    <dbl>  <dbl>
1      1000 24486 -0.0925      12.4  0.845
2      5000 24418 -0.0871      12.4  0.816
3     10000 24340 -0.0819      12.4  0.795
4     20000 24176 -0.0610      12.4  0.765
5     40000 23702 -0.0332      12.5  0.713
6     50000 23372 -0.00926     12.5  0.687
Code
# Skewness is closest to 0 at the $50,000 threshold,
# so we use $50,000 as the lower-bound cutoff

phl_props_filtered <- phl_props_filtered %>%
  filter(sale_price > 50000)

Explanation of outlier identification methodology:

  • We observed a cluster of extremely low sale_price values that are unlikely to represent arm’s-length market transactions (for example, $0, $1, $20, or $100).
  • We examined the raw and log-transformed sale_price distributions and tested several lower-bound thresholds to assess sensitivity.
  • We selected $50,000 as the cutoff because it removes the most implausible low-price records while preserving the overall distribution shape and a sufficient sample size.
  • We retained the high-value observations, as they are consistent with the upper tail of the distribution and do not appear to represent abnormal or implausible transactions.
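The threshold comparison above relies on the nonparametric skew, (mean - median) / sd. As a cross-check, a moment-based skewness can be computed alongside it. The sketch below is illustrative only: the prices are simulated (a lognormal market plus a cluster of nominal transfers), and the helper names moment_skew and nonparam_skew are hypothetical, not part of the analysis code.

```r
set.seed(42)

# Simulated prices: a lognormal "market" plus a cluster of nominal transfers (made-up data)
prices <- c(rlnorm(5000, meanlog = 12.4, sdlog = 0.7), runif(300, 1, 5000))

# Moment-based skewness: third central moment scaled by the variance^(3/2)
moment_skew <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Nonparametric skew, the statistic used in the threshold table
nonparam_skew <- function(x) (mean(x) - median(x)) / sd(x)

thresholds <- c(0, 1000, 50000)
check <- data.frame(
  threshold     = thresholds,
  n             = sapply(thresholds, function(t) sum(prices > t)),
  moment_skew   = sapply(thresholds, function(t) moment_skew(log(prices[prices > t]))),
  nonparam_skew = sapply(thresholds, function(t) nonparam_skew(log(prices[prices > t])))
)
print(check)
```

If both statistics move toward zero as the threshold rises, the choice of cutoff is less likely to be an artifact of one particular skewness measure.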

Data Cleaning Summary

Code
# Before/after comparison table documenting cleaning decisions
data.frame(
  Step = c(
    "Raw data",
    "Filter: residential only (single/multi family)",
    "Filter: 2023-2024 sales",
    "Filter: sale price > $50,000",
    "Filter: living area > 0, bedrooms > 0",
    "Filter: state = PA",
    "Final analytic sample"
  ),
  Rows = c(
    nrow(opa),
    nrow(opa %>% filter(category_code_description %in% c("SINGLE FAMILY", "MULTI FAMILY"))),
    nrow(opa %>% filter(category_code_description %in% c("SINGLE FAMILY", "MULTI FAMILY"),
                        lubridate::year(as.Date(sale_date)) %in% c(2023, 2024))),
    nrow(opa %>% filter(category_code_description %in% c("SINGLE FAMILY", "MULTI FAMILY"),
                        lubridate::year(as.Date(sale_date)) %in% c(2023, 2024),
                        !is.na(sale_price), sale_price > 50000)),
    nrow(opa %>% filter(category_code_description %in% c("SINGLE FAMILY", "MULTI FAMILY"),
                        lubridate::year(as.Date(sale_date)) %in% c(2023, 2024),
                        !is.na(sale_price), sale_price > 50000,
                        !is.na(total_livable_area), total_livable_area > 0,
                        !is.na(number_of_bedrooms), number_of_bedrooms > 0)),
    nrow(phl_props_filtered %>% filter(state_code == "PA")),
    nrow(phl_props_filtered)
  )
) %>%
  mutate(Removed = lag(Rows, default = first(Rows)) - Rows) %>%
  kable(caption = "Data Cleaning Summary: Rows Before and After Each Step")
Data Cleaning Summary: Rows Before and After Each Step

| Step                                           |   Rows | Removed |
|:-----------------------------------------------|-------:|--------:|
| Raw data                                       | 583588 |       0 |
| Filter: residential only (single/multi family) | 504018 |   79570 |
| Filter: 2023-2024 sales                        |  37820 |  466198 |
| Filter: sale price > $50,000                   |  27642 |   10178 |
| Filter: living area > 0, bedrooms > 0          |  26197 |    1445 |
| Filter: state = PA                             |  23372 |    2825 |
| Final analytic sample                          |  23372 |       0 |

Data Cleaning Decisions:

  • Residential only: We restricted the sample to single-family and multi-family properties, as they represent typical residential properties.
  • 2023–2024 sales: We focused on recent sales so that the sample reflects current market conditions.
  • Sale price > $50,000: Extremely low sale prices are unlikely to represent arm’s-length market transactions. Among the thresholds tested, $50,000 brings the skewness of the log-transformed distribution closest to zero (-0.009) while retaining 23,372 sales.
  • Living area > 0, bedrooms > 0: Records with zero or missing values for these structural variables are likely data entry errors and cannot contribute meaningful information to the model.
  • State = PA: Restricting to Pennsylvania ensures geographic consistency.
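The cleaning summary repeats every earlier condition inside each filter() call. One way to avoid that repetition is to apply the filters cumulatively and record the row count after each step. The sketch below is a minimal illustration on a toy table; attrition_table, the toy data, and the step labels are hypothetical and are not the code used to produce the table above.

```r
library(dplyr)

# Hypothetical helper: apply filters one after another and record rows remaining
attrition_table <- function(data, steps) {
  rows <- nrow(data)
  current <- data
  for (s in steps) {
    current <- filter(current, !!s)   # inject the stored filter expression
    rows <- c(rows, nrow(current))
  }
  tibble(
    Step    = c("Raw data", names(steps)),
    Rows    = rows,
    Removed = lag(Rows, default = first(Rows)) - Rows
  )
}

# Toy records (made-up values)
toy <- tibble(
  type       = c("SINGLE FAMILY", "MULTI FAMILY", "COMMERCIAL", "SINGLE FAMILY", "SINGLE FAMILY"),
  sale_price = c(250000, 1, 400000, 90000, NA),
  area       = c(1200, 900, 5000, 0, 1000)
)

# Named filter expressions, applied in order
steps <- rlang::exprs(
  "Residential only"     = type %in% c("SINGLE FAMILY", "MULTI FAMILY"),
  "Sale price > $50,000" = !is.na(sale_price) & sale_price > 50000,
  "Living area > 0"      = area > 0
)

result <- attrition_table(toy, steps)
print(result)
```

Because each step is applied to the output of the previous one, the Removed column is attributable to a single condition, and the order of the steps is explicit.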

Step 2: Load Secondary Data

ACS Census Data

Code
# Define census variables of interest
variables <- c(
  med_inc              = "B19013_001",  # Median household income
  burden_renter_total  = "B25070_001",  # Renter cost burden total
  burden_renter_30     = "B25070_007",  # 30-34.9%
  burden_renter_35     = "B25070_008",  # 35-39.9%
  burden_renter_40     = "B25070_009",  # 40-49.9%
  burden_renter_50     = "B25070_010",  # 50%+
  burden_owner_total   = "B25091_001",  # Owner cost burden total
  burden_owner_30      = "B25091_008",  # 30-34.9%
  burden_owner_35      = "B25091_009",  # 35-39.9%
  burden_owner_40      = "B25091_010",  # 40-49.9%
  burden_owner_50      = "B25091_011",  # 50%+
  poverty_total        = "B17001_001",  # Poverty total
  poverty_below        = "B17001_002",  # Below poverty line
  total_households     = "B11001_001",  # Total households
  single_family        = "B25024_002",  # 1-unit detached
  single_family_att    = "B25024_003",  # 1-unit attached
  multi_family         = "B25024_004",  # 2 units
  multi_family_3_4     = "B25024_005",  # 3-4 units
  multi_family_5_9     = "B25024_006",  # 5-9 units
  multi_family_10_19   = "B25024_007",  # 10-19 units
  multi_family_20_49   = "B25024_008",  # 20-49 units
  multi_family_50plus  = "B25024_009",  # 50+ units
  edu_total            = "B15003_001",  # Education total
  edu_bachelor         = "B15003_022",  # Bachelor's degree
  edu_master           = "B15003_023",  # Master's degree
  edu_professional     = "B15003_024",  # Professional degree
  edu_phd              = "B15003_025",  # Doctorate degree
  commute_total        = "B08301_001",  # Commute total
  commute_car          = "B08301_002",  # Car, truck, van
  commute_transit      = "B08301_010"   # Public transportation
)

# Pull ACS data at census tract level for Philadelphia County
acs_data <- get_acs(
  geography = "tract",
  variables = variables,
  state     = "PA",
  county    = "Philadelphia",
  year      = 2023,
  survey    = "acs5",
  geometry  = TRUE,
  output    = "wide"
) %>%
  st_transform(2272)

# Derive composite rates from raw counts
acs_data <- acs_data %>%
  mutate(
    poverty_rate           = if_else(poverty_totalE > 0, poverty_belowE / poverty_totalE, NA_real_),
    transit_share          = if_else(commute_totalE > 0, commute_transitE / commute_totalE, NA_real_),
    edu_ba_share           = if_else(edu_totalE > 0, edu_bachelorE / edu_totalE, NA_real_),
    burden_renter30_share  = if_else(burden_renter_totalE > 0, burden_renter_30E / burden_renter_totalE, NA_real_)
  )

cat("Census tracts loaded:", nrow(acs_data), "\n")

Spatial Amenity Data

Code
# Load SEPTA transit stops
septa_stops <- st_read("data/Transit_Stops_(Spring_2025).geojson", quiet = TRUE) %>%
  st_transform(2272)

# Load street tree inventory
phl_trees <- st_read("data/ppr_tree_inventory_2025.geojson", quiet = TRUE) %>%
  st_transform(2272)

# Load schools
schools <- st_read("data/Schools", quiet = TRUE) %>%
  st_transform(2272)

# Load neighborhood boundaries
phl_nhoods <- st_read("data/philadelphia-neighborhoods.geojson", quiet = TRUE) %>%
  st_transform(2272) %>%
  transmute(neighborhood = NAME)

# Transform property data to EPSG:2272 (PA State Plane) to match other layers
phl_props_filtered <- phl_props_filtered %>%
  st_transform(2272)

# Define Center City reference point (City Hall)
center_city <- st_sfc(
  st_point(c(-75.1638889, 39.95225)),
  crs = 4326
) %>%
  st_transform(2272)

cat("Transit stops:", nrow(septa_stops), "\n")
Transit stops: 22478 
Code
cat("Trees:", nrow(phl_trees), "\n")
Trees: 151726 
Code
cat("Schools:", nrow(schools), "\n")
Schools: 490 
Code
cat("Neighborhoods:", nrow(phl_nhoods), "\n")
Neighborhoods: 159 

Phase 2: Exploratory Data Analysis

1. Distribution of Sale Prices

Code
# Log-transform sale price to normalize the right-skewed distribution
phl_props_filtered <- phl_props_filtered %>%
  mutate(log_sale_price = log(sale_price))

ggplot(phl_props_filtered, aes(x = log_sale_price)) +
  geom_histogram(
    bins  = 40,
    color = "white",
    fill  = "steelblue"
  ) +
  labs(
    title = "Distribution of Log Sale Prices",
    x     = "Log Sale Price",
    y     = "Number of Properties"
  ) +
  theme_minimal()

  • The log-transformed sale price distribution is much more symmetric than the raw sale price distribution, with reduced right skew and less influence from extreme high-value sales. This supports using log sale price as the modeling outcome because it provides a more stable scale for regression-based prediction. It does not, however, by itself confirm that all model assumptions are satisfied.

2. Geographic Distribution of Sale Prices

Code
# Map log sale prices across Philadelphia to identify spatial patterns
ggplot(phl_props_filtered) +
  geom_sf(aes(color = log_sale_price), size = .001) +
  scale_color_viridis_c(option = "plasma", name = "Log Sale Price") +
  theme_void() +
  labs(title = "Geographic Distribution of Sale Prices")

  • Sale prices show clear spatial clustering across Philadelphia rather than random geographic variation. This indicates that market value is strongly shaped by location and that property-level characteristics alone are unlikely to explain prices fully. The pattern supports including neighborhood and spatial features in the model.

3. Price vs. Structural Features

Code
# Scatter plot of sale price vs number of bedrooms
ggplot(phl_props_filtered,
       aes(x = number_of_bedrooms, y = sale_price)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  labs(
    title = "Sale Price vs. Number of Bedrooms",
    x     = "Number of Bedrooms",
    y     = "Sale Price ($)"
  )

  • Sale price generally rises with bedroom count at lower and moderate values, but the relationship is noisy and weakens at higher counts. This suggests that bedroom count contains useful structural information but is not sufficient on its own to explain price variation. Other structural and contextual variables are needed to capture the full range of market differences.

4. Price vs. Spatial Features

See Feature Engineering section below for spatial feature construction.

Join Census Data to Properties

Code
# Keep the tract-level variables needed for modeling
# (the rate variables were already derived when acs_data was loaded)
acs_small <- acs_data %>%
  transmute(
    GEOID,
    med_inc = med_incE,
    poverty_rate,
    transit_share,
    edu_ba_share,
    burden_renter30_share
  )

# Spatially join census variables to each property via tract intersection
phl_props_filtered <- st_join(phl_props_filtered, acs_small, left = TRUE)

cat("Census variables joined. Sample size:", nrow(phl_props_filtered), "\n")
Census variables joined. Sample size: 23372 

5. Creative Visualization

Code
# Summarize number of properties sold per census tract
prop_summary <- phl_props_filtered %>%
  st_drop_geometry() %>%
  group_by(GEOID) %>%
  summarise(
    mean_price   = mean(sale_price, na.rm = TRUE),
    n_properties = n()
  )

# Join back to census tract geometry for mapping
tract_analysis <- acs_data %>%
  left_join(prop_summary, by = "GEOID")

# Map number of properties sold by tract
ggplot(tract_analysis) +
  geom_sf(aes(fill = n_properties)) +
  scale_fill_viridis_c(option = "magma", name = "Number of\nProperties") +
  theme_void() +
  labs(
    title    = "Properties Sold by Census Tract",
    subtitle = "2023–2024 Residential Sales, Philadelphia"
  )

  • Residential sales are unevenly distributed across census tracts, with some areas contributing many more observations than others. This means the model is estimated from denser market information in some neighborhoods than in others, which may affect how evenly it performs across space. Areas with fewer sales may also be associated with greater prediction uncertainty.
Code
library(spdep)
phl_props_filtered <- phl_props_filtered %>%
  filter(!st_is_empty(geometry))

# Build spatial weights matrix directly from sf object
lw <- nb2listw(knn2nb(knearneigh(phl_props_filtered, k = 5)), style = "W")

# Global Moran's I
moran_global <- moran.test(phl_props_filtered$log_sale_price, listw = lw, zero.policy = TRUE)

# Local Moran's I (LISA)
lisa <- localmoran(phl_props_filtered$log_sale_price, listw = lw, zero.policy = TRUE)

# Classify LISA clusters
m <- mean(phl_props_filtered$log_sale_price)
phl_props_filtered <- phl_props_filtered %>%
  mutate(
    local_I      = lisa[, "Ii"],
    local_pval   = lisa[, "Pr(z != E(Ii))"],
    lag_logprice = lag.listw(lw, log_sale_price),
    lisa_cluster = case_when(
      local_pval < 0.05 & log_sale_price >  m & lag_logprice >  mean(lag_logprice) ~ "High-High",
      local_pval < 0.05 & log_sale_price <  m & lag_logprice <  mean(lag_logprice) ~ "Low-Low",
      local_pval < 0.05 & log_sale_price >  m & lag_logprice <= mean(lag_logprice) ~ "High-Low",
      local_pval < 0.05 & log_sale_price <= m & lag_logprice >  mean(lag_logprice) ~ "Low-High",
      TRUE ~ "Not Significant"
    )
  )

# LISA map
lisa_map <- ggplot(phl_props_filtered) +
  geom_sf(aes(color = lisa_cluster), size = 0.4, alpha = 0.8) +
  scale_color_manual(
    values = c(
      "High-High"       = "#d73027",
      "Low-Low"         = "#4575b4",
      "High-Low"        = "#fc8d59",
      "Low-High"        = "#91bfdb",
      "Not Significant" = "#d9d9d9"
    ),
    name = "LISA Cluster"
  ) +
  labs(
    title = "Similar-Priced Properties Cluster Together Across Philadelphia",
    subtitle = paste0("Moran's I = ", round(moran_global$estimate[1], 3), ", p < 0.001")
  ) +
  theme_void() +
  theme(
    plot.title      = element_text(size = 12, face = "bold"),
    plot.subtitle   = element_text(size = 10),
    plot.caption    = element_markdown(size = 8, lineheight = 1.5), 
    plot.background = element_rect(fill = "white", color = NA)
  )

lisa_map


Phase 3: Feature Engineering

Spatial Features

Price vs. Spatial Features

Code
# 2x2 scatter plots of log sale price vs each spatial feature
p1 <- ggplot(phl_props_filtered, aes(x = transit_500ft, y = log_sale_price)) +
  geom_point(alpha = 0.2, size = 0.5, color = "steelblue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Transit Stops within 500ft", x = "Count", y = "Log Sale Price") +
  theme_minimal()

p2 <- ggplot(phl_props_filtered, aes(x = n_trees_500ft, y = log_sale_price)) +
  geom_point(alpha = 0.2, size = 0.5, color = "steelblue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Street Trees within 500ft", x = "Count", y = "Log Sale Price") +
  theme_minimal()

p3 <- ggplot(phl_props_filtered, aes(x = school_knn_3, y = log_sale_price)) +
  geom_point(alpha = 0.2, size = 0.5, color = "steelblue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Mean Distance to 3 Nearest Schools (ft)", x = "Distance (ft)", y = "Log Sale Price") +
  theme_minimal()

p4 <- ggplot(phl_props_filtered, aes(x = dist_core_mi, y = log_sale_price)) +
  geom_point(alpha = 0.2, size = 0.5, color = "steelblue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Distance to Center City (miles)", x = "Distance (miles)", y = "Log Sale Price") +
  theme_minimal()

(p1 | p2) / (p3 | p4)

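Interpretation: In these bivariate plots, properties closer to Center City and with better transit access tend to command higher prices. Street tree density shows a modest positive relationship, while greater distance to schools is associated with lower prices. These are unadjusted relationships and need not keep the same sign once other variables are controlled for.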

Explanation of spatial feature engineering:

  • transit_500ft: counts SEPTA stops within a 500ft buffer around each property, capturing walkable transit access.
  • n_trees_500ft: counts street trees within 500ft, used as a proxy for neighborhood greenery and quality.
  • school_knn_3: mean distance to the 3 nearest schools in feet, capturing school accessibility.
  • dist_core_mi: straight-line distance to Center City in miles, capturing urban centrality and accessibility to jobs and amenities.
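The code that builds these four features does not appear in this appendix, so the sketch below shows one way such features could be constructed with sf. It is an assumption-laden illustration, not the authors' implementation: the point layers are simulated, and every object with a _toy suffix is hypothetical. Distances are in feet because EPSG:2272 uses US survey feet.

```r
library(sf)

set.seed(1)

# Toy point layers in EPSG:2272 (units: US survey feet); all coordinates are simulated
toy_points <- function(n) {
  st_as_sf(
    data.frame(x = runif(n, 2690000, 2700000), y = runif(n, 230000, 240000)),
    coords = c("x", "y"),
    crs = 2272
  )
}
props_toy   <- toy_points(50)
stops_toy   <- toy_points(400)
trees_toy   <- toy_points(2000)
schools_toy <- toy_points(12)

# Buffer counts: features falling inside a 500 ft buffer around each property
buffers_500ft <- st_buffer(props_toy, dist = 500)
props_toy$transit_500ft <- lengths(st_intersects(buffers_500ft, stops_toy))
props_toy$n_trees_500ft <- lengths(st_intersects(buffers_500ft, trees_toy))

# kNN distance: mean distance (ft) to the 3 nearest schools
dist_schools <- units::drop_units(st_distance(props_toy, schools_toy))
props_toy$school_knn_3 <- apply(dist_schools, 1, function(d) mean(sort(d)[1:3]))

# Point distance: straight-line distance to City Hall, converted from feet to miles
center_city <- st_transform(
  st_sfc(st_point(c(-75.1638889, 39.95225)), crs = 4326),
  2272
)
props_toy$dist_core_ft <- as.numeric(st_distance(props_toy, center_city))
props_toy$dist_core_mi <- props_toy$dist_core_ft / 5280

summary(st_drop_geometry(props_toy))
```

A full pairwise distance matrix is fine for a toy example but grows quickly with sample size; for tens of thousands of properties a nearest-neighbor search would likely be preferable.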

Step 2: Prepare Model Data Frame

Code
# Create clean model data frame with all engineered variables
# Drop geometry for modeling, create age and log area variables
model_df <- phl_props_filtered %>%
  st_drop_geometry() %>%
  mutate(
    sale_year                  = as.factor(sale_year),
    category_code_description  = as.factor(category_code_description),
    log_area                   = log(total_livable_area),
    year_built                 = as.integer(year_built),
    sale_year_num              = as.integer(as.character(sale_year)),
    age                        = sale_year_num - year_built,
    age_c                      = age - mean(age, na.rm = TRUE)  # Mean-centered age
  ) %>%
  drop_na(
    log_sale_price, log_area,
    number_of_bedrooms, number_of_bathrooms,
    age_c,
    med_inc, poverty_rate, transit_share, edu_ba_share, burden_renter30_share,
    transit_500ft, n_trees_500ft, school_knn_3, dist_core_mi,
    sale_year, category_code_description
  )

cat("Final model sample size:", nrow(model_df), "\n")
Final model sample size: 22284 
Code
cat("Variables in model df:", ncol(model_df), "\n")
Variables in model df: 100 
Code
# Scatter plot of log sale price vs. tract-level median household income
ggplot(phl_props_filtered, aes(x = med_inc, y = log_sale_price)) +
  geom_point(alpha = 0.3, size = 0.5, color = "steelblue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  scale_x_continuous(labels = scales::dollar) +
  theme_minimal() +
  labs(
    title    = "Log Sale Price vs. Median Household Income",
    subtitle = "Each point is a property; income is the median for its census tract",
    x        = "Median Household Income ($)",
    y        = "Log Sale Price"
  )

Explanation of variable construction:

  • log_area: log-transformed total livable area to reduce right skew and capture diminishing returns of size.
  • age_c: mean-centered age of the property, which allows the intercept to be interpreted at the average age and improves interpretability of polynomial terms.
  • All rows with missing values in key model variables are dropped to ensure a clean, complete dataset for modeling and cross-validation.
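The effect of mean-centering on the polynomial terms can be illustrated with simulated data. The sketch below uses made-up building ages and a made-up U-shaped price relationship; it shows that centering sharply reduces the correlation between the linear and squared age terms while leaving the fitted curve unchanged.

```r
set.seed(7)

# Simulated building ages (years) and a U-shaped log-price relationship (made-up values)
age   <- round(runif(2000, 0, 150))
age_c <- age - mean(age)
log_price <- 12 - 0.002 * age_c + 0.00006 * age_c^2 + rnorm(2000, sd = 0.3)

# Correlation between the linear and squared terms, before and after centering
cor_raw      <- cor(age, age^2)
cor_centered <- cor(age_c, age_c^2)

# Both parameterizations describe the same curve, so fitted values agree
fit_raw      <- lm(log_price ~ age + I(age^2))
fit_centered <- lm(log_price ~ age_c + I(age_c^2))
same_fit <- isTRUE(all.equal(unname(fitted(fit_raw)), unname(fitted(fit_centered))))

cat("cor(age, age^2):     ", round(cor_raw, 3), "\n")
cat("cor(age_c, age_c^2): ", round(cor_centered, 3), "\n")
cat("Identical fitted values:", same_fit, "\n")
```

Centering therefore changes what the intercept and the linear age coefficient mean (they now refer to a home of average age), not how well the model fits.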

Feature Summary

Code
# Summary table of all engineered features
data.frame(
  Feature = c(
    "transit_500ft",
    "n_trees_500ft", 
    "school_knn_3",
    "dist_core_ft",
    "dist_core_mi",
    "med_inc",
    "poverty_rate",
    "transit_share",
    "edu_ba_share",
    "burden_renter30_share"
  ),
  Type = c(
    "Buffer count",
    "Buffer count",
    "kNN distance",
    "Point distance",
    "Point distance",
    "Census",
    "Census",
    "Census",
    "Census",
    "Census"
  ),
  Description = c(
    "Number of SEPTA stops within 500ft",
    "Number of street trees within 500ft",
    "Mean distance to 3 nearest schools (ft)",
    "Distance to Center City in feet",
    "Distance to Center City in miles",
    "Median household income by census tract",
    "Share of population below poverty line",
    "Share of commuters using public transit",
    "Share of adults (25+) whose highest degree is a bachelor's",
    "Share of renter households spending 30.0-34.9% of income on rent"
  ),
  Justification = c(
    "Transit access is a key driver of urban property values",
    "Green infrastructure signals neighborhood quality and investment",
    "School proximity affects family housing decisions and demand",
    "Accessibility to jobs and amenities drives urban price premiums",
    "Same as above, expressed in miles for interpretability",
    "Neighborhood income level is a strong proxy for housing demand",
    "Poverty concentration depresses local property values",
    "Transit-dependent neighborhoods differ systematically in price",
    "Education level reflects neighborhood socioeconomic character",
    "Cost burden signals housing affordability pressure in the area"
  )
) %>%
  kable(caption = "Summary of Engineered Features")
Summary of Engineered Features

| Feature               | Type           | Description                                                      | Justification                                                    |
|:----------------------|:---------------|:-----------------------------------------------------------------|:-----------------------------------------------------------------|
| transit_500ft         | Buffer count   | Number of SEPTA stops within 500ft                               | Transit access is a key driver of urban property values          |
| n_trees_500ft         | Buffer count   | Number of street trees within 500ft                              | Green infrastructure signals neighborhood quality and investment |
| school_knn_3          | kNN distance   | Mean distance to 3 nearest schools (ft)                          | School proximity affects family housing decisions and demand     |
| dist_core_ft          | Point distance | Distance to Center City in feet                                  | Accessibility to jobs and amenities drives urban price premiums  |
| dist_core_mi          | Point distance | Distance to Center City in miles                                 | Same as above, expressed in miles for interpretability           |
| med_inc               | Census         | Median household income by census tract                          | Neighborhood income level is a strong proxy for housing demand   |
| poverty_rate          | Census         | Share of population below poverty line                           | Poverty concentration depresses local property values            |
| transit_share         | Census         | Share of commuters using public transit                          | Transit-dependent neighborhoods differ systematically in price   |
| edu_ba_share          | Census         | Share of adults (25+) whose highest degree is a bachelor's       | Education level reflects neighborhood socioeconomic character    |
| burden_renter30_share | Census         | Share of renter households spending 30.0-34.9% of income on rent | Cost burden signals housing affordability pressure in the area   |

Phase 4: Model Building

Step 1: Structural Models (M1–M3)

Code
# M1: Living area only — simplest baseline
m1 <- lm(log_sale_price ~ log_area, data = model_df)

# M2: + bedrooms and bathrooms
m2 <- lm(
  log_sale_price ~ log_area + number_of_bedrooms + number_of_bathrooms,
  data = model_df
)

# M3: + age (polynomial to capture U-shaped depreciation curve)
m3 <- lm(
  log_sale_price ~ log_area + number_of_bedrooms + number_of_bathrooms +
    age_c + I(age_c^2),
  data = model_df
)

# Comparison between structural models
data.frame(
  Model = c("M1: Area", "M2: Structure", "M3: + Age"),
  Rsquared = c(summary(m1)$r.squared, summary(m2)$r.squared, summary(m3)$r.squared)
) %>%
  arrange(desc(Rsquared)) %>%
  slice(1) %>%
  pull(Model)
[1] "M3: + Age"
Code
summary(m3)

Call:
lm(formula = log_sale_price ~ log_area + number_of_bedrooms + 
    number_of_bathrooms + age_c + I(age_c^2), data = model_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6863 -0.2910  0.0310  0.3103  3.1843 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          6.215e+00  1.037e-01   59.94   <2e-16 ***
log_area             8.884e-01  1.611e-02   55.13   <2e-16 ***
number_of_bedrooms  -1.642e-01  5.375e-03  -30.55   <2e-16 ***
number_of_bathrooms  2.094e-01  6.103e-03   34.31   <2e-16 ***
age_c               -1.211e-03  1.149e-04  -10.54   <2e-16 ***
I(age_c^2)           5.881e-05  1.809e-06   32.51   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.537 on 22278 degrees of freedom
Multiple R-squared:  0.3882,    Adjusted R-squared:  0.388 
F-statistic:  2827 on 5 and 22278 DF,  p-value: < 2.2e-16
  • The modeling process begins with core housing attributes, using area, number of bedrooms, number of bathrooms, and housing age as baseline predictors. This specification captures the most direct physical characteristics of a home and shows that sale price is partly determined by the property itself: M3 reaches an R-squared of 0.388. In other words, housing size, layout, and age explain a little over a third of the variation in log sale prices, so their explanatory power remains limited. This also suggests that property-level features alone are not sufficient to account for the substantial price differences observed across Philadelphia neighborhoods.
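Because the outcome is log sale price, the M3 coefficients are easier to read after a short conversion. The sketch below takes the estimates printed by summary(m3) as given and treats them as conditional associations, not causal effects; the variable names are illustrative.

```r
# Coefficients copied from the summary(m3) output
b_log_area <- 0.8884      # log_area
b_bath     <- 0.2094      # number_of_bathrooms
b_age      <- -1.211e-03  # age_c
b_age2     <- 5.881e-05   # I(age_c^2)

# Log-log term: percent change in price associated with 10% more livable area
pct_area_10 <- (1.10^b_log_area - 1) * 100

# Log-level term: percent change in price associated with one more bathroom
pct_bath <- (exp(b_bath) - 1) * 100

# Quadratic in age: centered age at which the fitted U-shaped curve bottoms out
age_c_turn <- -b_age / (2 * b_age2)

cat("10% more area:      ", round(pct_area_10, 1), "% higher price\n")
cat("One more bathroom:  ", round(pct_bath, 1), "% higher price\n")
cat("Age curve minimum at", round(age_c_turn, 1), "years above the mean age\n")
```

Under these estimates, the fitted age curve declines until roughly ten years above the sample's mean age and rises after that, which is the U shape the quadratic term was added to capture.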

Step 2: Add Census and Spatial Features (M4–M5b)

Code
# M4: + census
m4 <- lm(
  log_sale_price ~ log_area + number_of_bedrooms + number_of_bathrooms +
    age_c + I(age_c^2) +
    med_inc + poverty_rate + transit_share + edu_ba_share + burden_renter30_share,
  data = model_df
)

summary(m4)

Call:
lm(formula = log_sale_price ~ log_area + number_of_bedrooms + 
    number_of_bathrooms + age_c + I(age_c^2) + med_inc + poverty_rate + 
    transit_share + edu_ba_share + burden_renter30_share, data = model_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4858 -0.2272  0.0220  0.2224  3.4079 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            7.711e+00  9.139e-02  84.379  < 2e-16 ***
log_area               5.797e-01  1.411e-02  41.085  < 2e-16 ***
number_of_bedrooms    -1.909e-02  4.829e-03  -3.954 7.71e-05 ***
number_of_bathrooms    1.624e-01  5.233e-03  31.037  < 2e-16 ***
age_c                 -9.434e-04  1.003e-04  -9.409  < 2e-16 ***
I(age_c^2)             2.756e-05  1.593e-06  17.299  < 2e-16 ***
med_inc                3.323e-06  1.800e-07  18.468  < 2e-16 ***
poverty_rate          -3.676e-01  3.595e-02 -10.225  < 2e-16 ***
transit_share         -4.851e-01  3.446e-02 -14.074  < 2e-16 ***
edu_ba_share           1.398e+00  4.254e-02  32.853  < 2e-16 ***
burden_renter30_share  2.653e-01  4.209e-02   6.303 2.98e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4563 on 22273 degrees of freedom
Multiple R-squared:  0.5583,    Adjusted R-squared:  0.5581 
F-statistic:  2815 on 10 and 22273 DF,  p-value: < 2.2e-16
  • The model then incorporates tract-level Census variables, including median income, poverty rate, educational attainment, renter cost burden, and transit commute share. After adding these neighborhood socioeconomic indicators, model fit improves substantially, with R-squared increasing from 0.388 in M3 to 0.558 in M4. This is the first major improvement in the modeling process and indicates that housing prices are shaped not only by the home itself, but also by broader neighborhood conditions. Put differently, two homes with similar physical characteristics may still sell at systematically different prices if they are located in different socioeconomic environments.
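The step from M3 to M4 adds a block of variables to a nested model, and an increase in R-squared alone does not show whether the block matters beyond chance. A nested-model F-test is one common check. The sketch below runs it on simulated data; the data frame sim and the variable tract_income are hypothetical stand-ins, not the project data.

```r
set.seed(123)
n <- 500

# Simulated stand-ins for a structural predictor and a tract-level predictor
sim <- data.frame(
  log_area     = rnorm(n, mean = 7.2, sd = 0.3),
  tract_income = rnorm(n, mean = 60, sd = 20)   # in $1,000s, made-up values
)
sim$log_price <- 6 + 0.6 * sim$log_area + 0.01 * sim$tract_income + rnorm(n, sd = 0.4)

# Smaller model is nested inside the larger one
fit_structural <- lm(log_price ~ log_area, data = sim)
fit_with_tract <- lm(log_price ~ log_area + tract_income, data = sim)

# F-test for the added block, plus the change in R-squared
block_test <- anova(fit_structural, fit_with_tract)
delta_r2   <- summary(fit_with_tract)$r.squared - summary(fit_structural)$r.squared

print(block_test)
cat("Change in R-squared:", round(delta_r2, 3), "\n")
```

The same call pattern, anova(m3, m4), would apply to the models above because both are fit to the same rows of model_df.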
Code
# M5: + spatial features
m5 <- lm(
  log_sale_price ~ log_area + number_of_bedrooms + number_of_bathrooms +
    age_c + I(age_c^2) +
    med_inc + poverty_rate + transit_share + edu_ba_share + burden_renter30_share +
    transit_500ft + n_trees_500ft + school_knn_3 + dist_core_mi,
  data = model_df
)

# M5b: + sale year and property type as controls
model_df_2 <- phl_props_filtered %>%
  st_drop_geometry() %>%
  mutate(
    sale_year                 = as.factor(sale_year),
    category_code_description = as.factor(category_code_description),
    log_area                  = log(total_livable_area),
    year_built                = as.integer(year_built),
    sale_year_num             = as.integer(as.character(sale_year)),
    age                       = sale_year_num - year_built,
    age_c                     = age - mean(age, na.rm = TRUE)
  ) %>%
  drop_na(
    log_sale_price, log_area,
    number_of_bedrooms, number_of_bathrooms,
    age_c,
    med_inc, poverty_rate, transit_share, edu_ba_share, burden_renter30_share,
    transit_500ft, n_trees_500ft, school_knn_3, dist_core_mi,
    sale_year, category_code_description
  )

m5b <- lm(
  log_sale_price ~ log_area + number_of_bedrooms + number_of_bathrooms +
    age_c + I(age_c^2) +
    med_inc + poverty_rate + transit_share + edu_ba_share + burden_renter30_share +
    transit_500ft + n_trees_500ft + school_knn_3 + dist_core_mi +
    sale_year + category_code_description,
  data = model_df_2
)

# Compare M5 and M5b
data.frame(
  Model = c("M5: + Spatial", "M5b: + Year/Type"),
  Rsquared = c(summary(m5)$r.squared, summary(m5b)$r.squared)
) %>%
  arrange(desc(Rsquared)) %>%
  slice(1) %>%
  pull(Model)
[1] "M5b: + Year/Type"
Code
summary(m5b)

Call:
lm(formula = log_sale_price ~ log_area + number_of_bedrooms + 
    number_of_bathrooms + age_c + I(age_c^2) + med_inc + poverty_rate + 
    transit_share + edu_ba_share + burden_renter30_share + transit_500ft + 
    n_trees_500ft + school_knn_3 + dist_core_mi + sale_year + 
    category_code_description, data = model_df_2)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3871 -0.2130  0.0236  0.2131  3.3840 

Coefficients:
                                         Estimate Std. Error t value Pr(>|t|)
(Intercept)                             7.651e+00  9.154e-02  83.578  < 2e-16
log_area                                5.466e-01  1.381e-02  39.578  < 2e-16
number_of_bedrooms                     -1.359e-03  4.753e-03  -0.286 0.775017
number_of_bathrooms                     1.692e-01  5.304e-03  31.901  < 2e-16
age_c                                  -1.515e-03  1.022e-04 -14.829  < 2e-16
I(age_c^2)                              2.075e-05  1.587e-06  13.075  < 2e-16
med_inc                                 2.144e-06  1.797e-07  11.927  < 2e-16
poverty_rate                           -4.334e-01  3.686e-02 -11.760  < 2e-16
transit_share                          -4.343e-01  3.703e-02 -11.730  < 2e-16
edu_ba_share                            1.040e+00  4.505e-02  23.082  < 2e-16
burden_renter30_share                   1.394e-01  4.123e-02   3.381 0.000723
transit_500ft                           2.462e-03  5.935e-04   4.148 3.37e-05
n_trees_500ft                           2.637e-03  8.255e-05  31.944  < 2e-16
school_knn_3                            1.968e-05  4.690e-06   4.196 2.73e-05
dist_core_mi                            1.169e-02  1.418e-03   8.238  < 2e-16
sale_year2024                           4.340e-02  5.949e-03   7.296 3.06e-13
category_code_descriptionSINGLE FAMILY  1.468e-01  1.250e-02  11.743  < 2e-16
                                          
(Intercept)                            ***
log_area                               ***
number_of_bedrooms                        
number_of_bathrooms                    ***
age_c                                  ***
I(age_c^2)                             ***
med_inc                                ***
poverty_rate                           ***
transit_share                          ***
edu_ba_share                           ***
burden_renter30_share                  ***
transit_500ft                          ***
n_trees_500ft                          ***
school_knn_3                           ***
dist_core_mi                           ***
sale_year2024                          ***
category_code_descriptionSINGLE FAMILY ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4437 on 22267 degrees of freedom
Multiple R-squared:  0.5824,    Adjusted R-squared:  0.5821 
F-statistic:  1941 on 16 and 22267 DF,  p-value: < 2.2e-16
  • M5 adds finer-scale spatial variables: the number of transit stops within 500 feet, the number of trees within 500 feet, the average distance to the three nearest schools, and distance to Center City. M5b then adds controls for sale year and property type. With these additions, R-squared rises from 0.512 in M4 to 0.582 in M5b, which suggests that local accessibility, environmental amenities, centrality, and year-to-year market differences all contribute additional explanatory power. The gain is smaller than the jump from M3 to M4, however, which implies that larger-scale neighborhood context remains more important than local spatial conditions in explaining price variation in Philadelphia.
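Because the outcome is log sale price, the M5b coefficients are easier to read as approximate percentage changes in price. The sketch below converts a few of them using exp(beta) - 1. The coefficient values are copied from the summary(m5b) output above; `pct_effect` is a hypothetical helper name introduced only for illustration, and the reading of SINGLE FAMILY relative to MULTI FAMILY assumes those are the only two property types left after filtering.

```r
# Coefficients copied from the summary(m5b) output (log-price scale)
m5b_coefs <- c(
  transit_500ft = 2.462e-03,  # per additional transit stop within 500 ft
  n_trees_500ft = 2.637e-03,  # per additional tree within 500 ft
  dist_core_mi  = 1.169e-02,  # per additional mile from Center City
  sale_year2024 = 4.340e-02,  # 2024 sale relative to a 2023 sale
  single_family = 1.468e-01   # SINGLE FAMILY relative to MULTI FAMILY
)

# Hypothetical helper: percent change in price for a one-unit change
# in a predictor, holding the other predictors fixed
pct_effect <- function(beta) 100 * (exp(beta) - 1)

round(pct_effect(m5b_coefs), 2)
```

For small coefficients the result is close to 100 times the coefficient itself, so the conversion matters mainly for larger terms such as the property-type indicator.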

Step 3: Add Interaction (M6)

Code
# M6: + interaction term (living area x education share)
# Tests whether the value of space varies by neighborhood education level
m6 <- lm(
  log_sale_price ~ log_area * edu_ba_share +
    number_of_bedrooms + number_of_bathrooms +
    age_c + I(age_c^2) +
    med_inc + poverty_rate + transit_share + burden_renter30_share +
    transit_500ft + n_trees_500ft + school_knn_3 + dist_core_mi +
    sale_year + category_code_description,
  data = model_df
)

summary(m6)

Call:
lm(formula = log_sale_price ~ log_area * edu_ba_share + number_of_bedrooms + 
    number_of_bathrooms + age_c + I(age_c^2) + med_inc + poverty_rate + 
    transit_share + burden_renter30_share + transit_500ft + n_trees_500ft + 
    school_knn_3 + dist_core_mi + sale_year + category_code_description, 
    data = model_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3871 -0.2127  0.0244  0.2137  3.3449 

Coefficients:
                                         Estimate Std. Error t value Pr(>|t|)
(Intercept)                             8.508e+00  1.650e-01  51.556  < 2e-16
log_area                                4.282e-01  2.347e-02  18.240  < 2e-16
edu_ba_share                           -2.179e+00  5.179e-01  -4.208 2.59e-05
number_of_bedrooms                      5.672e-05  4.755e-03   0.012  0.99048
number_of_bathrooms                     1.679e-01  5.304e-03  31.655  < 2e-16
age_c                                  -1.485e-03  1.022e-04 -14.528  < 2e-16
I(age_c^2)                              2.068e-05  1.586e-06  13.042  < 2e-16
med_inc                                 2.097e-06  1.797e-07  11.667  < 2e-16
poverty_rate                           -4.536e-01  3.697e-02 -12.271  < 2e-16
transit_share                          -4.244e-01  3.703e-02 -11.460  < 2e-16
burden_renter30_share                   1.334e-01  4.120e-02   3.236  0.00121
transit_500ft                           2.641e-03  5.937e-04   4.448 8.70e-06
n_trees_500ft                           2.655e-03  8.253e-05  32.168  < 2e-16
school_knn_3                            2.098e-05  4.691e-06   4.472 7.79e-06
dist_core_mi                            1.202e-02  1.418e-03   8.477  < 2e-16
sale_year2024                           4.319e-02  5.944e-03   7.267 3.81e-13
category_code_descriptionSINGLE FAMILY  1.337e-01  1.267e-02  10.549  < 2e-16
log_area:edu_ba_share                   4.489e-01  7.195e-02   6.239 4.47e-10
                                          
(Intercept)                            ***
log_area                               ***
edu_ba_share                           ***
number_of_bedrooms                        
number_of_bathrooms                    ***
age_c                                  ***
I(age_c^2)                             ***
med_inc                                ***
poverty_rate                           ***
transit_share                          ***
burden_renter30_share                  ** 
transit_500ft                          ***
n_trees_500ft                          ***
school_knn_3                           ***
dist_core_mi                           ***
sale_year2024                          ***
category_code_descriptionSINGLE FAMILY ***
log_area:edu_ba_share                  ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4434 on 22266 degrees of freedom
Multiple R-squared:  0.5831,    Adjusted R-squared:  0.5828 
F-statistic:  1832 on 17 and 22266 DF,  p-value: < 2.2e-16
  • M6 adds an interaction term, log_area × edu_ba_share, to test whether additional housing space commands a larger price premium in neighborhoods with higher shares of residents holding a bachelor’s degree or above. The interaction coefficient is positive and statistically significant (0.449, p < 0.001), but the gain in fit is minimal: in-sample R-squared increases only from 0.582 in M5b to 0.583 in M6. This suggests that, under the current specification, the return to housing size varies with neighborhood education, but not strongly enough to meaningfully improve model performance. For that reason, the interaction term was not retained as part of the preferred final specification.
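With an interaction in the model, neither main effect can be read on its own: the log_area coefficient is the area elasticity where edu_ba_share equals 0, and the edu_ba_share coefficient is the education effect where log_area equals 0, which is far outside the data. The following is a minimal sketch of the implied conditional effects, using coefficients copied from the summary(m6) output above. The function names, the example BA shares, and the example home sizes are assumptions chosen for illustration, and the sketch assumes total_livable_area is measured in square feet.

```r
# Coefficients copied from the summary(m6) output
b_area     <- 4.282e-01   # log_area main effect
b_edu      <- -2.179e+00  # edu_ba_share main effect
b_area_edu <- 4.489e-01   # log_area:edu_ba_share interaction

# Elasticity of price with respect to living area at a given BA share
area_elasticity <- function(edu_share) b_area + b_area_edu * edu_share

# Effect of edu_ba_share on log price at a given living area
edu_effect <- function(sqft) b_edu + b_area_edu * log(sqft)

# Illustrative BA shares (assumed values, not taken from the data)
data.frame(
  edu_ba_share    = c(0.10, 0.30, 0.60),
  area_elasticity = area_elasticity(c(0.10, 0.30, 0.60))
)

# Illustrative home sizes in square feet (assumed values)
data.frame(
  sqft       = c(800, 1200, 2000),
  edu_effect = edu_effect(c(800, 1200, 2000))
)
```

Evaluated at typical home sizes, the education effect is positive, so the negative edu_ba_share main effect in M6 does not contradict the positive coefficient in M5b.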

Step 4: Neighborhood Fixed Effects (M7)

Code
# Load neighborhood boundaries and join to properties
phl_nhoods <- st_read(
  "data/philadelphia-neighborhoods.geojson",
  quiet = TRUE
) %>%
  st_transform(2272) %>%
  transmute(neighborhood = NAME)

phl_props_filtered <- phl_props_filtered %>%
  st_join(phl_nhoods, join = st_intersects)

# Prepare model data frame with neighborhood variable
model_df_fe <- phl_props_filtered %>%
  st_drop_geometry() %>%
  mutate(
    neighborhood              = as.factor(neighborhood),
    sale_year                 = as.factor(sale_year),
    category_code_description = as.factor(category_code_description),
    log_area                  = log(total_livable_area),
    year_built                = as.integer(year_built),
    sale_year_num             = as.integer(as.character(sale_year)),
    age                       = sale_year_num - year_built,
    age_c                     = age - mean(age, na.rm = TRUE)
  ) %>%
  drop_na(
    log_sale_price, log_area,
    number_of_bedrooms, number_of_bathrooms,
    age_c,
    med_inc, poverty_rate, transit_share, edu_ba_share, burden_renter30_share,
    transit_500ft, n_trees_500ft, school_knn_3, dist_core_mi,
    sale_year, category_code_description, neighborhood
  )

# M7: + neighborhood fixed effects
# Each neighborhood gets its own intercept, capturing unmeasured locational qualities
m7 <- lm(
  log_sale_price ~ log_area +
    number_of_bedrooms + number_of_bathrooms +
    age_c + I(age_c^2) +
    med_inc + poverty_rate + transit_share + edu_ba_share + burden_renter30_share +
    transit_500ft + n_trees_500ft + school_knn_3 + dist_core_mi +
    sale_year + category_code_description +
    as.factor(neighborhood),
  data = model_df_fe
)

summary(m7)

Call:
lm(formula = log_sale_price ~ log_area + number_of_bedrooms + 
    number_of_bathrooms + age_c + I(age_c^2) + med_inc + poverty_rate + 
    transit_share + edu_ba_share + burden_renter30_share + transit_500ft + 
    n_trees_500ft + school_knn_3 + dist_core_mi + sale_year + 
    category_code_description + as.factor(neighborhood), data = model_df_fe)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3390 -0.1865  0.0290  0.1881  3.3763 

Coefficients:
                                              Estimate Std. Error t value
(Intercept)                                  7.861e+00  1.440e-01  54.598
log_area                                     5.384e-01  1.363e-02  39.491
number_of_bedrooms                           2.437e-02  4.790e-03   5.088
number_of_bathrooms                          1.434e-01  5.184e-03  27.657
age_c                                       -1.481e-03  1.049e-04 -14.116
I(age_c^2)                                   1.324e-05  1.630e-06   8.123
med_inc                                      5.224e-07  2.169e-07   2.408
poverty_rate                                -1.459e-01  4.766e-02  -3.060
transit_share                               -8.016e-02  4.730e-02  -1.695
edu_ba_share                                 4.280e-01  6.441e-02   6.644
burden_renter30_share                        1.668e-01  4.688e-02   3.557
transit_500ft                               -1.745e-03  6.289e-04  -2.774
n_trees_500ft                                1.127e-03  1.007e-04  11.194
school_knn_3                                 1.476e-05  6.034e-06   2.447
dist_core_mi                                 3.042e-02  8.988e-03   3.385
sale_year2024                                4.500e-02  5.669e-03   7.937
category_code_descriptionSINGLE FAMILY       1.207e-01  1.205e-02  10.013
as.factor(neighborhood)ALLEGHENY_WEST       -2.622e-01  8.828e-02  -2.970
as.factor(neighborhood)ANDORRA               3.551e-03  8.180e-02   0.043
as.factor(neighborhood)ASTON_WOODBRIDGE     -1.148e-01  7.130e-02  -1.610
as.factor(neighborhood)BARTRAM_VILLAGE      -9.628e-02  1.305e-01  -0.738
as.factor(neighborhood)BELLA_VISTA           5.129e-01  1.103e-01   4.649
as.factor(neighborhood)BELMONT              -2.465e-01  1.164e-01  -2.117
as.factor(neighborhood)BLUE_BELL_HILL        1.810e-01  1.490e-01   1.215
as.factor(neighborhood)BREWERYTOWN           4.277e-02  9.726e-02   0.440
as.factor(neighborhood)BRIDESBURG            3.972e-03  7.587e-02   0.052
as.factor(neighborhood)BURHOLME             -6.529e-02  9.151e-02  -0.713
as.factor(neighborhood)BUSTLETON            -1.244e-01  5.020e-02  -2.478
as.factor(neighborhood)BYBERRY              -2.379e-01  1.161e-01  -2.049
as.factor(neighborhood)CALLOWHILL            3.365e-01  1.364e-01   2.468
as.factor(neighborhood)CARROLL_PARK         -4.067e-01  8.527e-02  -4.769
as.factor(neighborhood)CEDAR_PARK            2.263e-01  9.594e-02   2.359
as.factor(neighborhood)CEDARBROOK           -8.597e-02  6.345e-02  -1.355
as.factor(neighborhood)CENTER_CITY           4.751e-01  1.587e-01   2.993
as.factor(neighborhood)CHESTNUT_HILL         4.599e-01  6.248e-02   7.361
as.factor(neighborhood)CHINATOWN             3.473e-01  1.554e-01   2.235
as.factor(neighborhood)CLEARVIEW            -2.236e-01  9.757e-02  -2.292
as.factor(neighborhood)COBBS_CREEK          -2.124e-01  8.123e-02  -2.615
as.factor(neighborhood)CRESCENTVILLE        -7.758e-02  2.188e-01  -0.355
as.factor(neighborhood)DEARNLEY_PARK        -1.292e-01  8.977e-02  -1.439
as.factor(neighborhood)DICKINSON_NARROWS     2.532e-01  1.027e-01   2.465
as.factor(neighborhood)DUNLAP               -3.544e-01  1.131e-01  -3.133
as.factor(neighborhood)EAST_FALLS            1.392e-01  8.259e-02   1.685
as.factor(neighborhood)EAST_KENSINGTON       1.910e-01  9.217e-02   2.072
as.factor(neighborhood)EAST_OAK_LANE        -3.033e-01  7.661e-02  -3.959
as.factor(neighborhood)EAST_PARK             6.304e-01  4.318e-01   1.460
as.factor(neighborhood)EAST_PARKSIDE        -2.270e-01  1.142e-01  -1.988
as.factor(neighborhood)EAST_PASSYUNK         4.743e-01  1.004e-01   4.726
as.factor(neighborhood)EAST_POPLAR           1.948e-01  1.510e-01   1.290
as.factor(neighborhood)EASTWICK             -1.011e-01  9.274e-02  -1.090
as.factor(neighborhood)ELMWOOD              -4.147e-01  8.211e-02  -5.051
as.factor(neighborhood)FAIRHILL             -1.773e-01  1.124e-01  -1.577
as.factor(neighborhood)FAIRMOUNT             4.689e-01  1.020e-01   4.595
as.factor(neighborhood)FELTONVILLE          -3.650e-01  7.640e-02  -4.777
as.factor(neighborhood)FERN_ROCK            -2.185e-01  8.892e-02  -2.457
as.factor(neighborhood)FISHTOWN              2.661e-01  9.124e-02   2.917
as.factor(neighborhood)FITLER_SQUARE         7.866e-01  1.227e-01   6.412
as.factor(neighborhood)FOX_CHASE            -1.825e-02  5.427e-02  -0.336
as.factor(neighborhood)FRANCISVILLE          3.005e-01  1.085e-01   2.770
as.factor(neighborhood)FRANKFORD            -4.240e-01  6.845e-02  -6.195
as.factor(neighborhood)FRANKLINVILLE        -4.460e-01  9.505e-02  -4.692
as.factor(neighborhood)GARDEN_COURT          2.465e-01  1.234e-01   1.997
as.factor(neighborhood)GERMANTOWN_EAST      -4.061e-01  7.062e-02  -5.751
as.factor(neighborhood)GERMANTOWN_MORTON    -3.438e-01  7.940e-02  -4.329
as.factor(neighborhood)GERMANTOWN_PENN_KNOX -2.055e-01  1.211e-01  -1.697
as.factor(neighborhood)GERMANTOWN_SOUTHWEST -1.738e-01  8.312e-02  -2.091
as.factor(neighborhood)GERMANTOWN_WEST_CENT  1.717e-02  8.726e-02   0.197
as.factor(neighborhood)GERMANTOWN_WESTSIDE  -2.821e-01  1.035e-01  -2.726
as.factor(neighborhood)GERMANY_HILL          7.130e-02  8.152e-02   0.875
as.factor(neighborhood)GIRARD_ESTATES        2.122e-01  9.574e-02   2.216
as.factor(neighborhood)GLENWOOD             -4.400e-01  1.052e-01  -4.182
as.factor(neighborhood)GRADUATE_HOSPITAL     5.555e-01  1.051e-01   5.286
as.factor(neighborhood)GRAYS_FERRY          -4.729e-02  9.844e-02  -0.480
as.factor(neighborhood)GREENWICH             2.095e-01  1.073e-01   1.952
as.factor(neighborhood)HADDINGTON           -4.675e-01  8.402e-02  -5.564
as.factor(neighborhood)HARROWGATE           -4.860e-01  8.211e-02  -5.919
as.factor(neighborhood)HARTRANFT            -5.614e-01  9.580e-02  -5.860
as.factor(neighborhood)HAVERFORD_NORTH      -2.820e-01  1.339e-01  -2.106
as.factor(neighborhood)HAWTHORNE             4.777e-01  1.138e-01   4.198
as.factor(neighborhood)HOLMESBURG           -2.036e-01  5.331e-02  -3.819
as.factor(neighborhood)HUNTING_PARK         -1.268e-01  8.094e-02  -1.567
as.factor(neighborhood)JUNIATA_PARK         -1.901e-01  7.559e-02  -2.515
as.factor(neighborhood)KINGSESSING          -1.475e-01  8.502e-02  -1.734
as.factor(neighborhood)LAWNDALE             -1.514e-01  6.228e-02  -2.432
as.factor(neighborhood)LEXINGTON_PARK        5.909e-02  7.682e-02   0.769
as.factor(neighborhood)LOGAN                -2.805e-01  7.499e-02  -3.741
as.factor(neighborhood)LOGAN_SQUARE          6.693e-01  1.109e-01   6.037
as.factor(neighborhood)LOWER_MOYAMENSING     4.868e-02  9.460e-02   0.515
as.factor(neighborhood)LUDLOW                8.915e-02  1.429e-01   0.624
as.factor(neighborhood)MANAYUNK              5.746e-02  7.046e-02   0.816
as.factor(neighborhood)MANTUA                3.994e-02  1.085e-01   0.368
as.factor(neighborhood)MAYFAIR              -1.371e-01  5.783e-02  -2.371
as.factor(neighborhood)MCGUIRE              -7.426e-01  1.123e-01  -6.614
as.factor(neighborhood)MECHANICSVILLE       -2.719e-01  2.175e-01  -1.250
as.factor(neighborhood)MELROSE_PARK_GARDENS -1.476e-01  9.158e-02  -1.612
as.factor(neighborhood)MILL_CREEK           -4.120e-01  9.932e-02  -4.148
as.factor(neighborhood)MILLBROOK            -1.034e-01  6.733e-02  -1.536
as.factor(neighborhood)MODENA               -1.076e-01  5.821e-02  -1.849
as.factor(neighborhood)MORRELL_PARK         -1.198e-01  5.782e-02  -2.071
as.factor(neighborhood)MOUNT_AIRY_EAST      -4.273e-02  6.321e-02  -0.676
as.factor(neighborhood)MOUNT_AIRY_WEST       1.199e-01  6.737e-02   1.780
as.factor(neighborhood)NEWBOLD               1.859e-01  1.020e-01   1.822
as.factor(neighborhood)NICETOWN              1.365e-01  1.387e-01   0.984
as.factor(neighborhood)NORMANDY_VILLAGE     -9.720e-02  1.106e-01  -0.879
as.factor(neighborhood)NORTH_CENTRAL        -1.204e-01  9.962e-02  -1.208
as.factor(neighborhood)NORTHERN_LIBERTIES    3.268e-01  1.040e-01   3.143
as.factor(neighborhood)NORTHWOOD            -3.088e-01  7.974e-02  -3.873
as.factor(neighborhood)OGONTZ               -3.147e-01  7.325e-02  -4.295
as.factor(neighborhood)OLD_CITY              4.918e-01  1.093e-01   4.501
as.factor(neighborhood)OLD_KENSINGTON        1.666e-01  1.019e-01   1.634
as.factor(neighborhood)OLNEY                -3.180e-01  6.762e-02  -4.703
as.factor(neighborhood)OVERBROOK            -1.998e-01  7.558e-02  -2.644
as.factor(neighborhood)OXFORD_CIRCLE        -1.772e-01  5.725e-02  -3.095
as.factor(neighborhood)PACKER_PARK           3.052e-01  1.018e-01   2.999
as.factor(neighborhood)PARKWOOD_MANOR       -1.519e-01  5.817e-02  -2.612
as.factor(neighborhood)PASCHALL             -3.613e-01  8.395e-02  -4.304
as.factor(neighborhood)PASSYUNK_SQUARE       3.658e-01  1.043e-01   3.508
as.factor(neighborhood)PENNSPORT             2.102e-01  1.013e-01   2.074
as.factor(neighborhood)PENNYPACK            -7.604e-02  5.723e-02  -1.329
as.factor(neighborhood)PENNYPACK_WOODS      -3.827e-02  9.020e-02  -0.424
as.factor(neighborhood)PENROSE              -2.158e-01  9.643e-02  -2.238
as.factor(neighborhood)POINT_BREEZE          1.668e-01  9.996e-02   1.668
as.factor(neighborhood)POWELTON              3.402e-01  1.651e-01   2.061
as.factor(neighborhood)QUEEN_VILLAGE         5.006e-01  1.063e-01   4.710
as.factor(neighborhood)RHAWNHURST           -3.761e-02  5.463e-02  -0.688
as.factor(neighborhood)RICHMOND             -4.106e-02  8.201e-02  -0.501
as.factor(neighborhood)RITTENHOUSE           8.368e-01  1.088e-01   7.690
as.factor(neighborhood)RIVERFRONT            2.933e-01  1.231e-01   2.383
as.factor(neighborhood)ROXBOROUGH            1.145e-01  7.048e-02   1.624
as.factor(neighborhood)ROXBOROUGH_PARK       5.259e-02  1.039e-01   0.506
as.factor(neighborhood)SHARSWOOD            -4.091e-02  1.096e-01  -0.373
as.factor(neighborhood)SOCIETY_HILL          6.544e-01  1.078e-01   6.069
as.factor(neighborhood)SOMERTON             -1.285e-01  5.481e-02  -2.345
as.factor(neighborhood)SOUTHWEST_SCHUYLKILL -2.834e-01  9.371e-02  -3.024
as.factor(neighborhood)SPRING_GARDEN         4.112e-01  1.086e-01   3.788
as.factor(neighborhood)SPRUCE_HILL           3.606e-01  1.057e-01   3.411
as.factor(neighborhood)STADIUM_DISTRICT      2.104e-01  1.012e-01   2.079
as.factor(neighborhood)STANTON              -2.901e-01  9.432e-02  -3.075
as.factor(neighborhood)STRAWBERRY_MANSION   -4.222e-01  9.270e-02  -4.554
as.factor(neighborhood)SUMMERDALE           -2.894e-01  7.176e-02  -4.032
as.factor(neighborhood)TACONY               -2.749e-01  5.916e-02  -4.647
as.factor(neighborhood)TIOGA                -5.451e-01  8.799e-02  -6.195
as.factor(neighborhood)TORRESDALE           -1.891e-01  5.393e-02  -3.506
as.factor(neighborhood)UNIVERSITY_CITY       1.599e-01  2.614e-01   0.612
as.factor(neighborhood)UPPER_KENSINGTON     -2.843e-01  8.494e-02  -3.347
as.factor(neighborhood)UPPER_ROXBOROUGH      7.832e-02  6.458e-02   1.213
as.factor(neighborhood)WALNUT_HILL          -2.496e-02  1.041e-01  -0.240
as.factor(neighborhood)WASHINGTON_SQUARE     5.724e-01  1.129e-01   5.068
as.factor(neighborhood)WEST_KENSINGTON      -1.415e-01  9.555e-02  -1.481
as.factor(neighborhood)WEST_OAK_LANE        -2.084e-01  5.902e-02  -3.531
as.factor(neighborhood)WEST_PARKSIDE        -3.917e-01  2.054e-01  -1.907
as.factor(neighborhood)WEST_PASSYUNK        -1.402e-02  9.808e-02  -0.143
as.factor(neighborhood)WEST_POPLAR           3.284e-01  1.469e-01   2.236
as.factor(neighborhood)WEST_POWELTON         1.138e-02  1.170e-01   0.097
as.factor(neighborhood)WHITMAN               1.237e-01  9.622e-02   1.286
as.factor(neighborhood)WINCHESTER_PARK       8.737e-02  9.248e-02   0.945
as.factor(neighborhood)WISSAHICKON           9.570e-02  8.074e-02   1.185
as.factor(neighborhood)WISSAHICKON_HILLS     1.979e-01  9.901e-02   1.999
as.factor(neighborhood)WISSINOMING          -2.805e-01  6.214e-02  -4.513
as.factor(neighborhood)WISTER               -2.764e-01  8.807e-02  -3.138
as.factor(neighborhood)WOODLAND_TERRACE      4.365e-01  1.950e-01   2.238
as.factor(neighborhood)WYNNEFIELD           -1.178e-01  8.387e-02  -1.404
as.factor(neighborhood)WYNNEFIELD_HEIGHTS   -2.411e-01  9.373e-02  -2.573
as.factor(neighborhood)YORKTOWN              1.401e-01  1.485e-01   0.943
                                            Pr(>|t|)    
(Intercept)                                  < 2e-16 ***
log_area                                     < 2e-16 ***
number_of_bedrooms                          3.65e-07 ***
number_of_bathrooms                          < 2e-16 ***
age_c                                        < 2e-16 ***
I(age_c^2)                                  4.78e-16 ***
med_inc                                     0.016031 *  
poverty_rate                                0.002213 ** 
transit_share                               0.090112 .  
edu_ba_share                                3.11e-11 ***
burden_renter30_share                       0.000375 ***
transit_500ft                               0.005538 ** 
n_trees_500ft                                < 2e-16 ***
school_knn_3                                0.014428 *  
dist_core_mi                                0.000713 ***
sale_year2024                               2.17e-15 ***
category_code_descriptionSINGLE FAMILY       < 2e-16 ***
as.factor(neighborhood)ALLEGHENY_WEST       0.002982 ** 
as.factor(neighborhood)ANDORRA              0.965374    
as.factor(neighborhood)ASTON_WOODBRIDGE     0.107325    
as.factor(neighborhood)BARTRAM_VILLAGE      0.460698    
as.factor(neighborhood)BELLA_VISTA          3.35e-06 ***
as.factor(neighborhood)BELMONT              0.034280 *  
as.factor(neighborhood)BLUE_BELL_HILL       0.224452    
as.factor(neighborhood)BREWERYTOWN          0.660162    
as.factor(neighborhood)BRIDESBURG           0.958250    
as.factor(neighborhood)BURHOLME             0.475581    
as.factor(neighborhood)BUSTLETON            0.013221 *  
as.factor(neighborhood)BYBERRY              0.040441 *  
as.factor(neighborhood)CALLOWHILL           0.013587 *  
as.factor(neighborhood)CARROLL_PARK         1.86e-06 ***
as.factor(neighborhood)CEDAR_PARK           0.018349 *  
as.factor(neighborhood)CEDARBROOK           0.175435    
as.factor(neighborhood)CENTER_CITY          0.002764 ** 
as.factor(neighborhood)CHESTNUT_HILL        1.90e-13 ***
as.factor(neighborhood)CHINATOWN            0.025416 *  
as.factor(neighborhood)CLEARVIEW            0.021936 *  
as.factor(neighborhood)COBBS_CREEK          0.008926 ** 
as.factor(neighborhood)CRESCENTVILLE        0.722918    
as.factor(neighborhood)DEARNLEY_PARK        0.150230    
as.factor(neighborhood)DICKINSON_NARROWS    0.013715 *  
as.factor(neighborhood)DUNLAP               0.001735 ** 
as.factor(neighborhood)EAST_FALLS           0.091977 .  
as.factor(neighborhood)EAST_KENSINGTON      0.038298 *  
as.factor(neighborhood)EAST_OAK_LANE        7.56e-05 ***
as.factor(neighborhood)EAST_PARK            0.144313    
as.factor(neighborhood)EAST_PARKSIDE        0.046836 *  
as.factor(neighborhood)EAST_PASSYUNK        2.31e-06 ***
as.factor(neighborhood)EAST_POPLAR          0.196895    
as.factor(neighborhood)EASTWICK             0.275575    
as.factor(neighborhood)ELMWOOD              4.44e-07 ***
as.factor(neighborhood)FAIRHILL             0.114770    
as.factor(neighborhood)FAIRMOUNT            4.35e-06 ***
as.factor(neighborhood)FELTONVILLE          1.79e-06 ***
as.factor(neighborhood)FERN_ROCK            0.014004 *  
as.factor(neighborhood)FISHTOWN             0.003540 ** 
as.factor(neighborhood)FITLER_SQUARE        1.46e-10 ***
as.factor(neighborhood)FOX_CHASE            0.736651    
as.factor(neighborhood)FRANCISVILLE         0.005618 ** 
as.factor(neighborhood)FRANKFORD            5.95e-10 ***
as.factor(neighborhood)FRANKLINVILLE        2.72e-06 ***
as.factor(neighborhood)GARDEN_COURT         0.045875 *  
as.factor(neighborhood)GERMANTOWN_EAST      8.97e-09 ***
as.factor(neighborhood)GERMANTOWN_MORTON    1.50e-05 ***
as.factor(neighborhood)GERMANTOWN_PENN_KNOX 0.089754 .  
as.factor(neighborhood)GERMANTOWN_SOUTHWEST 0.036577 *  
as.factor(neighborhood)GERMANTOWN_WEST_CENT 0.844036    
as.factor(neighborhood)GERMANTOWN_WESTSIDE  0.006409 ** 
as.factor(neighborhood)GERMANY_HILL         0.381787    
as.factor(neighborhood)GIRARD_ESTATES       0.026685 *  
as.factor(neighborhood)GLENWOOD             2.90e-05 ***
as.factor(neighborhood)GRADUATE_HOSPITAL    1.26e-07 ***
as.factor(neighborhood)GRAYS_FERRY          0.630921    
as.factor(neighborhood)GREENWICH            0.050958 .  
as.factor(neighborhood)HADDINGTON           2.67e-08 ***
as.factor(neighborhood)HARROWGATE           3.29e-09 ***
as.factor(neighborhood)HARTRANFT            4.68e-09 ***
as.factor(neighborhood)HAVERFORD_NORTH      0.035213 *  
as.factor(neighborhood)HAWTHORNE            2.70e-05 ***
as.factor(neighborhood)HOLMESBURG           0.000134 ***
as.factor(neighborhood)HUNTING_PARK         0.117156    
as.factor(neighborhood)JUNIATA_PARK         0.011920 *  
as.factor(neighborhood)KINGSESSING          0.082864 .  
as.factor(neighborhood)LAWNDALE             0.015037 *  
as.factor(neighborhood)LEXINGTON_PARK       0.441799    
as.factor(neighborhood)LOGAN                0.000184 ***
as.factor(neighborhood)LOGAN_SQUARE         1.60e-09 ***
as.factor(neighborhood)LOWER_MOYAMENSING    0.606845    
as.factor(neighborhood)LUDLOW               0.532647    
as.factor(neighborhood)MANAYUNK             0.414771    
as.factor(neighborhood)MANTUA               0.712787    
as.factor(neighborhood)MAYFAIR              0.017759 *  
as.factor(neighborhood)MCGUIRE              3.82e-11 ***
as.factor(neighborhood)MECHANICSVILLE       0.211374    
as.factor(neighborhood)MELROSE_PARK_GARDENS 0.107047    
as.factor(neighborhood)MILL_CREEK           3.37e-05 ***
as.factor(neighborhood)MILLBROOK            0.124616    
as.factor(neighborhood)MODENA               0.064446 .  
as.factor(neighborhood)MORRELL_PARK         0.038329 *  
as.factor(neighborhood)MOUNT_AIRY_EAST      0.498995    
as.factor(neighborhood)MOUNT_AIRY_WEST      0.075102 .  
as.factor(neighborhood)NEWBOLD              0.068399 .  
as.factor(neighborhood)NICETOWN             0.325062    
as.factor(neighborhood)NORMANDY_VILLAGE     0.379461    
as.factor(neighborhood)NORTH_CENTRAL        0.227005    
as.factor(neighborhood)NORTHERN_LIBERTIES   0.001677 ** 
as.factor(neighborhood)NORTHWOOD            0.000108 ***
as.factor(neighborhood)OGONTZ               1.75e-05 ***
as.factor(neighborhood)OLD_CITY             6.81e-06 ***
as.factor(neighborhood)OLD_KENSINGTON       0.102276    
as.factor(neighborhood)OLNEY                2.58e-06 ***
as.factor(neighborhood)OVERBROOK            0.008204 ** 
as.factor(neighborhood)OXFORD_CIRCLE        0.001969 ** 
as.factor(neighborhood)PACKER_PARK          0.002710 ** 
as.factor(neighborhood)PARKWOOD_MANOR       0.009005 ** 
as.factor(neighborhood)PASCHALL             1.68e-05 ***
as.factor(neighborhood)PASSYUNK_SQUARE      0.000452 ***
as.factor(neighborhood)PENNSPORT            0.038103 *  
as.factor(neighborhood)PENNYPACK            0.183995    
as.factor(neighborhood)PENNYPACK_WOODS      0.671332    
as.factor(neighborhood)PENROSE              0.025205 *  
as.factor(neighborhood)POINT_BREEZE         0.095272 .  
as.factor(neighborhood)POWELTON             0.039356 *  
as.factor(neighborhood)QUEEN_VILLAGE        2.49e-06 ***
as.factor(neighborhood)RHAWNHURST           0.491161    
as.factor(neighborhood)RICHMOND             0.616627    
as.factor(neighborhood)RITTENHOUSE          1.53e-14 ***
as.factor(neighborhood)RIVERFRONT           0.017169 *  
as.factor(neighborhood)ROXBOROUGH           0.104323    
as.factor(neighborhood)ROXBOROUGH_PARK      0.612613    
as.factor(neighborhood)SHARSWOOD            0.708940    
as.factor(neighborhood)SOCIETY_HILL         1.31e-09 ***
as.factor(neighborhood)SOMERTON             0.019034 *  
as.factor(neighborhood)SOUTHWEST_SCHUYLKILL 0.002496 ** 
as.factor(neighborhood)SPRING_GARDEN        0.000152 ***
as.factor(neighborhood)SPRUCE_HILL          0.000649 ***
as.factor(neighborhood)STADIUM_DISTRICT     0.037624 *  
as.factor(neighborhood)STANTON              0.002107 ** 
as.factor(neighborhood)STRAWBERRY_MANSION   5.29e-06 ***
as.factor(neighborhood)SUMMERDALE           5.54e-05 ***
as.factor(neighborhood)TACONY               3.38e-06 ***
as.factor(neighborhood)TIOGA                5.94e-10 ***
as.factor(neighborhood)TORRESDALE           0.000457 ***
as.factor(neighborhood)UNIVERSITY_CITY      0.540738    
as.factor(neighborhood)UPPER_KENSINGTON     0.000820 ***
as.factor(neighborhood)UPPER_ROXBOROUGH     0.225247    
as.factor(neighborhood)WALNUT_HILL          0.810480    
as.factor(neighborhood)WASHINGTON_SQUARE    4.06e-07 ***
as.factor(neighborhood)WEST_KENSINGTON      0.138585    
as.factor(neighborhood)WEST_OAK_LANE        0.000416 ***
as.factor(neighborhood)WEST_PARKSIDE        0.056596 .  
as.factor(neighborhood)WEST_PASSYUNK        0.886322    
as.factor(neighborhood)WEST_POPLAR          0.025385 *  
as.factor(neighborhood)WEST_POWELTON        0.922459    
as.factor(neighborhood)WHITMAN              0.198580    
as.factor(neighborhood)WINCHESTER_PARK      0.344805    
as.factor(neighborhood)WISSAHICKON          0.235940    
as.factor(neighborhood)WISSAHICKON_HILLS    0.045621 *  
as.factor(neighborhood)WISSINOMING          6.41e-06 ***
as.factor(neighborhood)WISTER               0.001702 ** 
as.factor(neighborhood)WOODLAND_TERRACE     0.025203 *  
as.factor(neighborhood)WYNNEFIELD           0.160261    
as.factor(neighborhood)WYNNEFIELD_HEIGHTS   0.010093 *  
as.factor(neighborhood)YORKTOWN             0.345579    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4207 on 22120 degrees of freedom
Multiple R-squared:  0.627, Adjusted R-squared:  0.6243 
F-statistic: 228.1 on 163 and 22120 DF,  p-value: < 2.2e-16
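Individual neighborhood coefficients are each measured against the omitted reference neighborhood, so their separate p-values do not show whether the fixed effects matter as a group. A rough joint test can be built from the reported R-squared values, as sketched below. This assumes M5b is nested in M7 and that both were fitted to the same rows (both outputs imply 22,284 observations). Because the inputs are rounded, the result is approximate; calling anova() on the two fitted models would give the exact version of this test under the same assumptions.

```r
# Values copied from the summary(m5b) and summary(m7) output
r2_restricted <- 0.5824   # M5b, without neighborhood fixed effects
r2_full       <- 0.627    # M7, with neighborhood fixed effects
k_restricted  <- 16       # model degrees of freedom in M5b
k_full        <- 163      # model degrees of freedom in M7
df_resid_full <- 22120    # residual degrees of freedom in M7

# Number of neighborhood indicator terms added in M7
q <- k_full - k_restricted

# F statistic for H0: all neighborhood coefficients equal zero
f_stat <- ((r2_full - r2_restricted) / q) /
  ((1 - r2_full) / df_resid_full)

# Upper-tail p-value from the F distribution
p_value <- pf(f_stat, df1 = q, df2 = df_resid_full, lower.tail = FALSE)

c(q = q, F = round(f_stat, 2), p = p_value)
```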
Code
# Improvement comparison: M3, M5b, M6, M7
# Labels describe what each model adds; M6 and M7 both build on M5b
modelsummary(
  list(
    "M3: Structural Only"               = m3,
    "M5b: + Census, Spatial, Year/Type" = m5b,
    "M6: M5b + Interaction"             = m6,
    "M7: M5b + Neighborhood FE"         = m7
  ),
  gof_map = c("nobs", "r.squared", "adj.r.squared")
)
M3: Structural Only M5b: + Census, Spatial, Year/Type M6: M5b + Interaction M7: M5b + Neighborhood FE
(Intercept) 6.215 7.651 8.508 7.861
(0.104) (0.092) (0.165) (0.144)
log_area 0.888 0.547 0.428 0.538
(0.016) (0.014) (0.023) (0.014)
number_of_bedrooms -0.164 -0.001 0.000 0.024
(0.005) (0.005) (0.005) (0.005)
number_of_bathrooms 0.209 0.169 0.168 0.143
(0.006) (0.005) (0.005) (0.005)
age_c -0.001 -0.002 -0.001 -0.001
(0.000) (0.000) (0.000) (0.000)
I(age_c^2) 0.000 0.000 0.000 0.000
(0.000) (0.000) (0.000) (0.000)
med_inc 0.000 0.000 0.000
(0.000) (0.000) (0.000)
poverty_rate -0.433 -0.454 -0.146
(0.037) (0.037) (0.048)
transit_share -0.434 -0.424 -0.080
(0.037) (0.037) (0.047)
edu_ba_share 1.040 -2.179 0.428
(0.045) (0.518) (0.064)
burden_renter30_share 0.139 0.133 0.167
(0.041) (0.041) (0.047)
transit_500ft 0.002 0.003 -0.002
(0.001) (0.001) (0.001)
n_trees_500ft 0.003 0.003 0.001
(0.000) (0.000) (0.000)
school_knn_3 0.000 0.000 0.000
(0.000) (0.000) (0.000)
dist_core_mi 0.012 0.012 0.030
(0.001) (0.001) (0.009)
sale_year2024 0.043 0.043 0.045
(0.006) (0.006) (0.006)
category_code_descriptionSINGLE FAMILY 0.147 0.134 0.121
(0.013) (0.013) (0.012)
log_area × edu_ba_share 0.449
(0.072)
as.factor(neighborhood)ALLEGHENY_WEST -0.262
(0.088)
as.factor(neighborhood)ANDORRA 0.004
(0.082)
as.factor(neighborhood)ASTON_WOODBRIDGE -0.115
(0.071)
as.factor(neighborhood)BARTRAM_VILLAGE -0.096
(0.131)
as.factor(neighborhood)BELLA_VISTA 0.513
(0.110)
as.factor(neighborhood)BELMONT -0.246
(0.116)
as.factor(neighborhood)BLUE_BELL_HILL 0.181
(0.149)
as.factor(neighborhood)BREWERYTOWN 0.043
(0.097)
as.factor(neighborhood)BRIDESBURG 0.004
(0.076)
as.factor(neighborhood)BURHOLME -0.065
(0.092)
as.factor(neighborhood)BUSTLETON -0.124
(0.050)
as.factor(neighborhood)BYBERRY -0.238
(0.116)
as.factor(neighborhood)CALLOWHILL 0.337
(0.136)
as.factor(neighborhood)CARROLL_PARK -0.407
(0.085)
as.factor(neighborhood)CEDAR_PARK 0.226
(0.096)
as.factor(neighborhood)CEDARBROOK -0.086
(0.063)
as.factor(neighborhood)CENTER_CITY 0.475
(0.159)
as.factor(neighborhood)CHESTNUT_HILL 0.460
(0.062)
as.factor(neighborhood)CHINATOWN 0.347
(0.155)
as.factor(neighborhood)CLEARVIEW -0.224
(0.098)
as.factor(neighborhood)COBBS_CREEK -0.212
(0.081)
as.factor(neighborhood)CRESCENTVILLE -0.078
(0.219)
as.factor(neighborhood)DEARNLEY_PARK -0.129
(0.090)
as.factor(neighborhood)DICKINSON_NARROWS 0.253
(0.103)
as.factor(neighborhood)DUNLAP -0.354
(0.113)
as.factor(neighborhood)EAST_FALLS 0.139
(0.083)
as.factor(neighborhood)EAST_KENSINGTON 0.191
(0.092)
as.factor(neighborhood)EAST_OAK_LANE -0.303
(0.077)
as.factor(neighborhood)EAST_PARK 0.630
(0.432)
as.factor(neighborhood)EAST_PARKSIDE -0.227
(0.114)
as.factor(neighborhood)EAST_PASSYUNK 0.474
(0.100)
as.factor(neighborhood)EAST_POPLAR 0.195
(0.151)
as.factor(neighborhood)EASTWICK -0.101
(0.093)
as.factor(neighborhood)ELMWOOD -0.415
(0.082)
as.factor(neighborhood)FAIRHILL -0.177
(0.112)
as.factor(neighborhood)FAIRMOUNT 0.469
(0.102)
as.factor(neighborhood)FELTONVILLE -0.365
(0.076)
as.factor(neighborhood)FERN_ROCK -0.219
(0.089)
as.factor(neighborhood)FISHTOWN 0.266
(0.091)
as.factor(neighborhood)FITLER_SQUARE 0.787
(0.123)
as.factor(neighborhood)FOX_CHASE -0.018
(0.054)
as.factor(neighborhood)FRANCISVILLE 0.300
(0.108)
as.factor(neighborhood)FRANKFORD -0.424
(0.068)
as.factor(neighborhood)FRANKLINVILLE -0.446
(0.095)
as.factor(neighborhood)GARDEN_COURT 0.246
(0.123)
as.factor(neighborhood)GERMANTOWN_EAST -0.406
(0.071)
as.factor(neighborhood)GERMANTOWN_MORTON -0.344
(0.079)
as.factor(neighborhood)GERMANTOWN_PENN_KNOX -0.205
(0.121)
as.factor(neighborhood)GERMANTOWN_SOUTHWEST -0.174
(0.083)
as.factor(neighborhood)GERMANTOWN_WEST_CENT 0.017
(0.087)
as.factor(neighborhood)GERMANTOWN_WESTSIDE -0.282
(0.103)
as.factor(neighborhood)GERMANY_HILL 0.071
(0.082)
as.factor(neighborhood)GIRARD_ESTATES 0.212
(0.096)
as.factor(neighborhood)GLENWOOD -0.440
(0.105)
as.factor(neighborhood)GRADUATE_HOSPITAL 0.555
(0.105)
as.factor(neighborhood)GRAYS_FERRY -0.047
(0.098)
as.factor(neighborhood)GREENWICH 0.209
(0.107)
as.factor(neighborhood)HADDINGTON -0.467
(0.084)
as.factor(neighborhood)HARROWGATE -0.486
(0.082)
as.factor(neighborhood)HARTRANFT -0.561
(0.096)
as.factor(neighborhood)HAVERFORD_NORTH -0.282
(0.134)
as.factor(neighborhood)HAWTHORNE 0.478
(0.114)
as.factor(neighborhood)HOLMESBURG -0.204
(0.053)
as.factor(neighborhood)HUNTING_PARK -0.127
(0.081)
as.factor(neighborhood)JUNIATA_PARK -0.190
(0.076)
as.factor(neighborhood)KINGSESSING -0.147
(0.085)
as.factor(neighborhood)LAWNDALE -0.151
(0.062)
as.factor(neighborhood)LEXINGTON_PARK 0.059
(0.077)
as.factor(neighborhood)LOGAN -0.281
(0.075)
as.factor(neighborhood)LOGAN_SQUARE 0.669
(0.111)
as.factor(neighborhood)LOWER_MOYAMENSING 0.049
(0.095)
as.factor(neighborhood)LUDLOW 0.089
(0.143)
as.factor(neighborhood)MANAYUNK 0.057
(0.070)
as.factor(neighborhood)MANTUA 0.040
(0.109)
as.factor(neighborhood)MAYFAIR -0.137
(0.058)
as.factor(neighborhood)MCGUIRE -0.743
(0.112)
as.factor(neighborhood)MECHANICSVILLE -0.272
(0.218)
as.factor(neighborhood)MELROSE_PARK_GARDENS -0.148
(0.092)
as.factor(neighborhood)MILL_CREEK -0.412
(0.099)
as.factor(neighborhood)MILLBROOK -0.103
(0.067)
as.factor(neighborhood)MODENA -0.108
(0.058)
as.factor(neighborhood)MORRELL_PARK -0.120
(0.058)
as.factor(neighborhood)MOUNT_AIRY_EAST -0.043
(0.063)
as.factor(neighborhood)MOUNT_AIRY_WEST 0.120
(0.067)
as.factor(neighborhood)NEWBOLD 0.186
(0.102)
as.factor(neighborhood)NICETOWN 0.136
(0.139)
as.factor(neighborhood)NORMANDY_VILLAGE -0.097
(0.111)
as.factor(neighborhood)NORTH_CENTRAL -0.120
(0.100)
as.factor(neighborhood)NORTHERN_LIBERTIES 0.327
(0.104)
as.factor(neighborhood)NORTHWOOD -0.309
(0.080)
as.factor(neighborhood)OGONTZ -0.315
(0.073)
as.factor(neighborhood)OLD_CITY 0.492
(0.109)
as.factor(neighborhood)OLD_KENSINGTON 0.167
(0.102)
as.factor(neighborhood)OLNEY -0.318
(0.068)
as.factor(neighborhood)OVERBROOK -0.200
(0.076)
as.factor(neighborhood)OXFORD_CIRCLE -0.177
(0.057)
as.factor(neighborhood)PACKER_PARK 0.305
(0.102)
as.factor(neighborhood)PARKWOOD_MANOR -0.152
(0.058)
as.factor(neighborhood)PASCHALL -0.361
(0.084)
as.factor(neighborhood)PASSYUNK_SQUARE 0.366
(0.104)
as.factor(neighborhood)PENNSPORT 0.210
(0.101)
as.factor(neighborhood)PENNYPACK -0.076
(0.057)
as.factor(neighborhood)PENNYPACK_WOODS -0.038
(0.090)
as.factor(neighborhood)PENROSE -0.216
(0.096)
as.factor(neighborhood)POINT_BREEZE 0.167
(0.100)
as.factor(neighborhood)POWELTON 0.340
(0.165)
as.factor(neighborhood)QUEEN_VILLAGE 0.501
(0.106)
as.factor(neighborhood)RHAWNHURST -0.038
(0.055)
as.factor(neighborhood)RICHMOND -0.041
(0.082)
as.factor(neighborhood)RITTENHOUSE 0.837
(0.109)
as.factor(neighborhood)RIVERFRONT 0.293
(0.123)
as.factor(neighborhood)ROXBOROUGH 0.114
(0.070)
as.factor(neighborhood)ROXBOROUGH_PARK 0.053
(0.104)
as.factor(neighborhood)SHARSWOOD -0.041
(0.110)
as.factor(neighborhood)SOCIETY_HILL 0.654
(0.108)
as.factor(neighborhood)SOMERTON -0.129
(0.055)
as.factor(neighborhood)SOUTHWEST_SCHUYLKILL -0.283
(0.094)
as.factor(neighborhood)SPRING_GARDEN 0.411
(0.109)
as.factor(neighborhood)SPRUCE_HILL 0.361
(0.106)
as.factor(neighborhood)STADIUM_DISTRICT 0.210
(0.101)
as.factor(neighborhood)STANTON -0.290
(0.094)
as.factor(neighborhood)STRAWBERRY_MANSION -0.422
(0.093)
as.factor(neighborhood)SUMMERDALE -0.289
(0.072)
as.factor(neighborhood)TACONY -0.275
(0.059)
as.factor(neighborhood)TIOGA -0.545
(0.088)
as.factor(neighborhood)TORRESDALE -0.189
(0.054)
as.factor(neighborhood)UNIVERSITY_CITY 0.160
(0.261)
as.factor(neighborhood)UPPER_KENSINGTON -0.284
(0.085)
as.factor(neighborhood)UPPER_ROXBOROUGH 0.078
(0.065)
as.factor(neighborhood)WALNUT_HILL -0.025
(0.104)
as.factor(neighborhood)WASHINGTON_SQUARE 0.572
(0.113)
as.factor(neighborhood)WEST_KENSINGTON -0.142
(0.096)
as.factor(neighborhood)WEST_OAK_LANE -0.208
(0.059)
as.factor(neighborhood)WEST_PARKSIDE -0.392
(0.205)
as.factor(neighborhood)WEST_PASSYUNK -0.014
(0.098)
as.factor(neighborhood)WEST_POPLAR 0.328
(0.147)
as.factor(neighborhood)WEST_POWELTON 0.011
(0.117)
as.factor(neighborhood)WHITMAN 0.124
(0.096)
as.factor(neighborhood)WINCHESTER_PARK 0.087
(0.092)
as.factor(neighborhood)WISSAHICKON 0.096
(0.081)
as.factor(neighborhood)WISSAHICKON_HILLS 0.198
(0.099)
as.factor(neighborhood)WISSINOMING -0.280
(0.062)
as.factor(neighborhood)WISTER -0.276
(0.088)
as.factor(neighborhood)WOODLAND_TERRACE 0.437
(0.195)
as.factor(neighborhood)WYNNEFIELD -0.118
(0.084)
as.factor(neighborhood)WYNNEFIELD_HEIGHTS -0.241
(0.094)
as.factor(neighborhood)YORKTOWN 0.140
(0.149)
Num.Obs. 22284 22284 22284 22284
R2 0.388 0.582 0.583 0.627
R2 Adj. 0.388 0.582 0.583 0.624
  • Another major improvement appears when neighborhood fixed effects are introduced. Starting from M5b, M7 adds neighborhood fixed effects and raises R-squared from 0.582 to 0.627, with adjusted R-squared reaching 0.624. This result indicates that even after controlling for housing structure, Census-based neighborhood characteristics, and spatial accessibility measures, there is still substantial neighborhood-specific variation that the observed variables do not capture. The fixed effects likely absorb place-based factors such as neighborhood reputation, built environment quality, market perception, historical image, and other locally specific characteristics.

Takeaway:

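Taken together, the results suggest that location, as reflected in both the Census variables and the neighborhood fixed effects, explains a large share of the variation in Philadelphia sale prices beyond what physical property characteristics capture on their own: R-squared rises from 0.388 with structural variables only to 0.627 once location is fully accounted for.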
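
The role of the fixed effects can be illustrated with a small simulation. The sketch below is not the Philadelphia data: the data frame `toy`, the variable `hood`, and the premium values are all hypothetical. It gives each simulated neighborhood an unobserved price premium, then compares a structural-only fit with one that adds neighborhood dummies, and uses a nested F-test to check whether the dummies jointly add explanatory power.

```r
# Hypothetical simulated data: 20 "neighborhoods", each with an unobserved premium
set.seed(42)
n_hoods  <- 20
n_per    <- 50
hood     <- factor(rep(paste0("hood_", seq_len(n_hoods)), each = n_per))
premium  <- rnorm(n_hoods, mean = 0, sd = 0.4)          # unobserved place effect
log_area <- rnorm(n_hoods * n_per, mean = 7, sd = 0.3)
log_price <- 6 + 0.6 * log_area + premium[as.integer(hood)] +
  rnorm(n_hoods * n_per, sd = 0.3)
toy <- data.frame(log_price, log_area, hood)

# Structural-only model vs. the same model plus neighborhood dummies
fit_structural <- lm(log_price ~ log_area, data = toy)
fit_fe         <- lm(log_price ~ log_area + hood, data = toy)

r2 <- c(
  structural = summary(fit_structural)$r.squared,
  with_fe    = summary(fit_fe)$r.squared
)
round(r2, 3)

# Nested F-test: do the neighborhood dummies jointly improve the fit?
fe_test <- anova(fit_structural, fit_fe)
fe_test
```

On the real models, the analogous comparison would be `anova(m5b, m7)`, assuming both were fit on the same rows.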


Phase 5: Model Validation

Step 1: Prepare Cross-Validation Data Frame

Code
# Prepare CV data frame with neighborhood variable
model_df_cv <- phl_props_filtered %>%
  st_drop_geometry() %>%
  mutate(
    neighborhood               = as.factor(neighborhood),
    sale_year                  = as.factor(sale_year),
    category_code_description  = as.factor(category_code_description),
    log_area                   = log(total_livable_area),
    year_built                 = as.integer(year_built),
    sale_year_num              = as.integer(as.character(sale_year)),
    age                        = sale_year_num - year_built,
    age_c                      = age - mean(age, na.rm = TRUE)
  ) %>%
  drop_na(
    log_sale_price, log_area,
    number_of_bedrooms, number_of_bathrooms,
    age_c,
    med_inc, poverty_rate, transit_share, edu_ba_share, burden_renter30_share,
    transit_500ft, n_trees_500ft, school_knn_3, dist_core_mi,
    sale_year, category_code_description, neighborhood
  )

# Group sparse neighborhoods (n < 10) to avoid CV errors
# Neighborhoods with too few observations cause "new levels" error in CV
model_df_cv <- model_df_cv %>%
  add_count(neighborhood) %>%
  mutate(
    neighborhood_cv = if_else(
      n < 10,
      "Small_Neighborhoods",
      as.character(neighborhood)
    ),
    neighborhood_cv = as.factor(neighborhood_cv)
  )

cat("CV sample size:", nrow(model_df_cv), "\n")
CV sample size: 22284 
Code
cat("Neighborhoods after grouping:", n_distinct(model_df_cv$neighborhood_cv), "\n")
Neighborhoods after grouping: 143 
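
The "new levels" problem that motivates the grouping can be reproduced on a tiny example. This is a sketch with made-up prices and a hypothetical `hood` column: if every sale from a sparse neighborhood lands in the held-out fold, the fitted model has no coefficient for that level and `predict()` stops with an error. Pooling sparse levels, as done above, removes the unseen level.

```r
# Hypothetical toy data: the neighborhood "rare" has only two sales
toy <- data.frame(
  log_price = c(12.1, 12.3, 11.8, 12.0, 12.4, 11.9, 12.2, 12.6, 13.0, 13.1),
  hood      = c(rep("A", 4), rep("B", 4), rep("rare", 2))
)

# Suppose a fold happens to hold out both "rare" sales
train_rows <- toy$hood != "rare"
fit <- lm(log_price ~ hood, data = toy[train_rows, ])

# predict() fails because "rare" was never seen during fitting
err <- tryCatch(
  predict(fit, newdata = toy[!train_rows, ]),
  error = function(e) conditionMessage(e)
)
err

# Pool levels with fewer than min_n sales into a single level
min_n   <- 3
n_sales <- ave(seq_along(toy$hood), toy$hood, FUN = length)
toy$hood_cv <- ifelse(n_sales < min_n, "Small_Neighborhoods", toy$hood)
table(toy$hood_cv)
```

The pooled level only helps if it is itself large enough to appear in every training fold, so the threshold should be checked against the number of folds.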

Step 2: Run 10-Fold Cross-Validation

Code
# Set up 10-fold CV control
ctrl <- trainControl(
  method          = "cv",
  number          = 10,
  savePredictions = "final"
)

# M3: Structure + Age
cv_m3 <- train(
  log_sale_price ~ log_area +
    number_of_bedrooms + number_of_bathrooms +
    age_c + I(age_c^2),
  data      = model_df_cv,
  method    = "lm",
  trControl = ctrl
)

# M4: + Census
cv_m4 <- train(
  log_sale_price ~ log_area +
    number_of_bedrooms + number_of_bathrooms +
    age_c + I(age_c^2) +
    med_inc + poverty_rate + transit_share + edu_ba_share + burden_renter30_share,
  data      = model_df_cv,
  method    = "lm",
  trControl = ctrl
)

# M5b: + Spatial + Year/Type
cv_m5b <- train(
  log_sale_price ~ log_area +
    number_of_bedrooms + number_of_bathrooms +
    age_c + I(age_c^2) +
    med_inc + poverty_rate + transit_share + edu_ba_share + burden_renter30_share +
    transit_500ft + n_trees_500ft + school_knn_3 + dist_core_mi +
    sale_year + category_code_description,
  data      = model_df_cv,
  method    = "lm",
  trControl = ctrl
)

# M7: + Neighborhood Fixed Effects
cv_m7 <- train(
  log_sale_price ~ log_area +
    number_of_bedrooms + number_of_bathrooms +
    age_c + I(age_c^2) +
    med_inc + poverty_rate + transit_share + edu_ba_share + burden_renter30_share +
    transit_500ft + n_trees_500ft + school_knn_3 + dist_core_mi +
    sale_year + category_code_description +
    as.factor(neighborhood_cv),
  data      = model_df_cv,
  method    = "lm",
  trControl = ctrl
)
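
As written, no seed or fold index is set before the `train()` calls, so each model may be evaluated on a different random partition. The sketch below shows the idea of sharing one fold assignment across models, using base R and simulated data (`toy`, `income`, and `cv_rmse()` are hypothetical names, not part of the analysis). In caret, a comparable effect can be obtained by passing a list of training-row indices to `trainControl(index = ...)`, for example one built with `createFolds(..., returnTrain = TRUE)`.

```r
# Hypothetical toy data standing in for model_df_cv
set.seed(2026)
n   <- 500
toy <- data.frame(
  log_area = rnorm(n, 7, 0.3),
  income   = rnorm(n, 0, 1)
)
toy$log_price <- 6 + 0.6 * toy$log_area + 0.2 * toy$income + rnorm(n, sd = 0.4)

# Assign every row to one of k folds once, then reuse the assignment for all models
k       <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, score on the held-out fold, and average the fold RMSEs
cv_rmse <- function(formula, data, fold_id) {
  fold_rmse <- sapply(sort(unique(fold_id)), function(f) {
    fit  <- lm(formula, data = data[fold_id != f, ])
    held <- data[fold_id == f, ]
    sqrt(mean((held$log_price - predict(fit, newdata = held))^2))
  })
  mean(fold_rmse)
}

rmse_structural <- cv_rmse(log_price ~ log_area, toy, fold_id)
rmse_census     <- cv_rmse(log_price ~ log_area + income, toy, fold_id)
c(structural = rmse_structural, plus_income = rmse_census)
```

Because both models are scored on identical held-out rows, the difference in RMSE reflects the added predictor rather than a different split.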

Step 3: CV Results Table

Code
# Compile cross-validation results for all four models
data.frame(
  Model = c(
    "M3: Structural Only",
    "M4: + Census",
    "M5b: + Spatial",
    "M7: + Neighborhood FE"
  ),
  RMSE_log = c(
    cv_m3$results$RMSE,
    cv_m4$results$RMSE,
    cv_m5b$results$RMSE,
    cv_m7$results$RMSE
  ),
  # Convert log-scale RMSE to an approximate percent error: 100 * (exp(RMSE) - 1)
  Approx_Percent_Error = 100 * (exp(c(
    cv_m3$results$RMSE,
    cv_m4$results$RMSE,
    cv_m5b$results$RMSE,
    cv_m7$results$RMSE
  )) - 1),
  MAE_log = c(
    cv_m3$results$MAE,
    cv_m4$results$MAE,
    cv_m5b$results$MAE,
    cv_m7$results$MAE
  ),
  Rsquared = c(
    cv_m3$results$Rsquared,
    cv_m4$results$Rsquared,
    cv_m5b$results$Rsquared,
    cv_m7$results$Rsquared
  )
) %>%
  arrange(RMSE_log) %>%
  kable(
    # digits has four entries for five columns; kable recycles the vector, so
    # Rsquared is rounded to 0 decimals below (c(0, 4, 1, 4, 4) would print four)
    digits = c(0, 4, 1, 4),
    caption = "Model Performance Improves with Each Layer"
  )
Model Performance Improves with Each Layer
Model                    RMSE_log  Approx_Percent_Error  MAE_log  Rsquared
M7: + Neighborhood FE      0.4221                  52.5   0.2736         1
M5b: + Spatial             0.4436                  55.8   0.2974         1
M4: + Census               0.4561                  57.8   0.3088         1
M3: Structural Only        0.5369                  71.1   0.3949         0

Interpretation:

Cross-validation results show that each additional layer of features improves out-of-sample prediction. The baseline structural model (M3) yields an RMSE of 0.537 log points, which corresponds to a typical multiplicative error of roughly 71% in dollar terms (exp(0.537) − 1 ≈ 0.71). Adding census variables (M4) reduces the RMSE to 0.456, and further incorporating spatial features with sale-year and property-type controls (M5b) brings it to 0.444. The full model with neighborhood fixed effects (M7) achieves the lowest RMSE of 0.422 and the highest R² of 0.622, which suggests that neighborhood identity captures unmeasured locational value that the measured features do not fully replicate. All results are based on 10-fold cross-validation, so the performance estimates reflect generalization to held-out data rather than in-sample fit.
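
The conversion from a log-scale RMSE to a percent error can be worked through directly. The sketch below takes the four RMSE values from the table above and applies the same `exp(RMSE) - 1` transformation used for `Approx_Percent_Error`; the `under_pct` column is an addition for illustration, showing that the same log error is a smaller percentage when the model under-predicts.

```r
# RMSE values in log points, copied from the CV results table
rmse_log <- c(M3 = 0.5369, M4 = 0.4561, M5b = 0.4436, M7 = 0.4221)

# A log-scale error of r means the prediction is off by a factor of exp(r)
over_pct  <- 100 * (exp(rmse_log) - 1)    # percent error when over-predicting
under_pct <- 100 * (1 - exp(-rmse_log))   # percent error when under-predicting

data.frame(
  rmse_log,
  over_pct  = round(over_pct, 1),
  under_pct = round(under_pct, 1)
)
```

The asymmetry between the two columns is why the percent figure is best read as an approximation rather than a symmetric error band.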

Step 4: Predicted vs. Actual Plot

Code
# Visualize predicted vs actual log sale price for best model (M7)
ggplot(cv_m7$pred, aes(x = obs, y = pred)) +
  geom_point(alpha = 0.3, size = 0.5, color = "steelblue") +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  theme_minimal() +
  labs(
    title    = "Predicted vs. Actual Log Sale Price",
    subtitle = "M7: Neighborhood Fixed Effects Model, 10-Fold CV",
    x        = "Actual Log Sale Price",
    y        = "Predicted Log Sale Price",
    caption  = "Red dashed line = perfect prediction"
  )
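
The held-out predictions can also be summarized in dollars rather than log points. The sketch below uses a hypothetical five-row data frame `pred_df` in place of `cv_m7$pred`, with the same `obs` and `pred` column names, and back-transforms both columns before computing errors. Exponentiating a log-scale prediction ignores retransformation bias, so the dollar figures should be read as approximate.

```r
# Hypothetical stand-in for cv_m7$pred: observed and predicted log sale prices
pred_df <- data.frame(
  obs  = log(c(150000, 220000, 310000, 95000, 480000)),
  pred = log(c(165000, 200000, 330000, 120000, 430000))
)

# Back-transform to dollars before measuring error
pred_df$obs_usd  <- exp(pred_df$obs)
pred_df$pred_usd <- exp(pred_df$pred)
pred_df$abs_err  <- abs(pred_df$pred_usd - pred_df$obs_usd)
pred_df$pct_err  <- 100 * pred_df$abs_err / pred_df$obs_usd

mae_usd    <- mean(pred_df$abs_err)     # mean absolute error in dollars
median_ape <- median(pred_df$pct_err)   # median absolute percentage error
c(mae_usd = mae_usd, median_ape = median_ape)
```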


Phase 6: Model Diagnostics

Step 1: Add Residuals and Fitted Values

Code
# Add fitted values and residuals from the final model (M7)
# Assumes m7 was fit on the same rows, in the same order, as model_df_cv
model_df_cv <- model_df_cv %>%
  mutate(
    fitted_m7    = fitted(m7),
    residuals_m7 = residuals(m7)
  )

Step 2: Residuals vs. Fitted Plots

Code
# Helper function to create residual vs fitted plot
make_rp <- function(fitted_col, resid_col, title) {
  ggplot(model_df_cv, aes(x = .data[[fitted_col]], y = .data[[resid_col]])) +
    geom_point(alpha = 0.3, size = 0.5, color = "steelblue") +
    geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
    labs(title = title, x = "Fitted Values", y = "Residuals") +
    theme_minimal()
}

rp_m7  <- make_rp("fitted_m7",  "residuals_m7",  "M7: + Neighborhood FE")

rp_m7

Step 3: Q-Q Plots

Code
# Helper function to create Q-Q plot
make_qq <- function(resid_col, title) {
  ggplot(model_df_cv, aes(sample = .data[[resid_col]])) +
    stat_qq(alpha = 0.3, size = 0.5, color = "steelblue") +
    stat_qq_line(color = "red") +
    labs(title = title, x = "Theoretical Quantiles", y = "Sample Quantiles") +
    theme_minimal()
}

qq_m7  <- make_qq("residuals_m7",  "M7: + Neighborhood FE")

qq_m7

Step 4: Cook’s Distance

Code
# Calculate Cook's distance for M7 and flag observations above the 4/n threshold
model_df_cv <- model_df_cv %>%
  mutate(
    cd_m7             = cooks.distance(m7),
    is_influential_m7 = cd_m7 > 4 / nrow(model_df_cv)
  )

# Helper function to create Cook's distance plot
make_cd <- function(cd_col, influential_col, title) {
  ggplot(model_df_cv,
         aes(x = seq_len(nrow(model_df_cv)), y = .data[[cd_col]])) +
    geom_point(aes(color = .data[[influential_col]]), size = 0.8) +
    geom_hline(yintercept = 4 / nrow(model_df_cv),
               linetype = "dashed", color = "red") +
    # Name the colors so FALSE/TRUE map correctly even if one value is absent
    scale_color_manual(values = c("FALSE" = "grey60", "TRUE" = "red")) +
    labs(title = title, x = "Observation", y = "Cook's D") +
    theme_minimal() +
    theme(legend.position = "none")
}

cd_m7  <- make_cd("cd_m7",  "is_influential_m7",  "M7: + Neighborhood FE")

cd_m7

Code
# Combine the three diagnostic plots side by side with patchwork
rp_m7 | qq_m7 | cd_m7

Interpretation of diagnostics:

  • Residual plots show whether the model captures the linear relationship well. Patterns in residuals suggest remaining non-linearity or heteroscedasticity.
  • Q-Q plots assess normality of residuals. Deviations at the tails are common with large real estate datasets due to luxury properties.
  • Cook’s distance identifies influential observations that disproportionately affect model estimates. Points above the threshold (4/n) warrant further inspection.
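
One way to follow up on points above the 4/n threshold is to refit without them and see how far the coefficients move. The sketch below does this on simulated data with one deliberately distorted observation; `toy`, `fit_all`, and `fit_trim` are hypothetical names, and the comparison is a sensitivity check, not a recommendation to drop real sales.

```r
# Hypothetical simulated data with one deliberately distorted observation
set.seed(7)
n <- 200
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.3)
x[n] <- 4     # high leverage ...
y[n] <- -3    # ... and far from the trend
toy <- data.frame(x, y)

fit_all <- lm(y ~ x, data = toy)
cd      <- cooks.distance(fit_all)
flagged <- cd > 4 / n                 # same 4/n rule used above

# Refit without the flagged rows and compare slopes
fit_trim <- lm(y ~ x, data = toy[!flagged, ])
slopes <- c(
  all_rows = unname(coef(fit_all)["x"]),
  trimmed  = unname(coef(fit_trim)["x"])
)
sum(flagged)
slopes
```

If the estimates barely change after trimming, the flagged points are unlikely to be driving the conclusions; a large shift would justify inspecting those sales individually.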

Phase 7: Conclusions & Recommendations


Save Objects for Slides


References

Data Sources

  • City of Philadelphia, Office of Property Assessment. OPA Properties Public Dataset. Retrieved March 2026. https://opendataphilly.org
  • U.S. Census Bureau. American Community Survey 5-Year Estimates, 2023 (Tables B19013, B17001, B15003, B25070, B25091, B08301). Retrieved via tidycensus R package.
  • Southeastern Pennsylvania Transportation Authority (SEPTA). Transit Stops, Spring 2025. Retrieved from OpenDataPhilly.
  • City of Philadelphia, Parks & Recreation. Street Tree Inventory, 2025. Retrieved from OpenDataPhilly.
  • School District of Philadelphia. Schools Dataset. Retrieved from OpenDataPhilly.
  • City of Philadelphia. Neighborhood Boundaries. Retrieved from OpenDataPhilly.

Methods

  • Moran, P.A.P. (1950). Notes on continuous stochastic phenomena. Biometrika, 37(1/2), 17-23.
  • Bivand, R., Pebesma, E., & Gomez-Rubio, V. (2013). Applied Spatial Data Analysis with R (2nd ed.). Springer.
  • Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), 1-26.
  • Pebesma, E. (2018). Simple features for R: Standardized support for spatial vector data. The R Journal, 10(1), 439-446.