---
title: "Bikeshare_SpaceTime_Prediction"
subtitle: "CPLN 5080 - Spring 2026"
author: "Liz Crouse"
date: today
format:
html:
code-fold: show
code-tools: true
toc: true
toc-depth: 3
toc-location: left
theme: cosmo
embed-resources: true
editor: visual
execute:
warning: false
message: false
---
```{r setup}
#| message: false
#| warning: false
#| echo: FALSE
library(tidyverse)
library(lubridate)
library(sf)
library(tigris)
library(tidycensus)
library(riem)
library(viridis)
library(knitr)
library(kableExtra)
library(patchwork)
library(here)
options(scipen = 999)
```
```{r theme}
##Create plot theme
plotTheme <- theme_minimal() +
theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(size = 8),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title = element_text(size = 11, face = "bold"),
panel.grid.major = element_line(colour = "#D0D0D0", linewidth = 0.2),
panel.grid.minor = element_blank(),
axis.ticks = element_blank(),
legend.position = "right"
)
mapTheme <- theme_void() +
theme(
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10),
plot.caption = element_text(size = 8),
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, "cm"),
legend.key.height = unit(1, "cm"),
legend.key.width = unit(0.2, "cm")
)
palette5 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c")
```
```{r set census key}
#| echo: FALSE
census_api_key(Sys.getenv("CENSUS_API_KEY"))
```
```{r data setup}
indego_q3 <- read_csv(here("labs/lab_5/data/indego-trips-2025-q3.csv"))
indego_q1 <- read_csv(here("labs/lab_5/data/indego-trips-2025-q1.csv"))
```
```{r data exploration}
#Initial data exploration and analysis
# How many trips?
cat("Total trips in Q3 2025:", nrow(indego_q3), "\n")
##465,464
# Date range
cat("Date range:",
min(mdy_hm(indego_q3$start_time)), "to",
max(mdy_hm(indego_q3$start_time)), "\n")
##July 1 to Sept 30
# How many unique stations?
cat("Unique start stations:", length(unique(indego_q3$start_station)), "\n")
##284 unique stations
# Trip types
table(indego_q3$trip_route_category)
# Passholder types
table(indego_q3$passholder_type)
# Bike types
table(indego_q3$bike_type)
```
```{r create bins}
#Creating hourlong bins for start time
indego_q3 <- indego_q3 %>%
mutate(
# Parse datetime
start_datetime = mdy_hm(start_time),
end_datetime = mdy_hm(end_time),
# Create hourly bins
interval60 = floor_date(start_datetime, unit = "hour"),
# Extract time features
week = week(interval60),
month = month(interval60, label = TRUE),
dotw = wday(interval60, label = TRUE),
hour = hour(interval60),
date = as.Date(interval60),
# Create useful indicators
weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
)
# Look at temporal features
head(indego_q3 %>% select(start_datetime, interval60, week, dotw, hour, weekend))
indego_q1 <- indego_q1 %>%
mutate(
# Parse datetime
start_datetime = mdy_hm(start_time),
end_datetime = mdy_hm(end_time),
# Create hourly bins
interval60 = floor_date(start_datetime, unit = "hour"),
# Extract time features
week = week(interval60),
month = month(interval60, label = TRUE),
dotw = wday(interval60, label = TRUE),
hour = hour(interval60),
date = as.Date(interval60),
# Create useful indicators
weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
)
# Look at temporal features
head(indego_q1 %>% select(start_datetime, interval60, week, dotw, hour, weekend))
```
```{r trips over time}
#| message: false
#| warning: false
# Daily trip counts Q3
daily_trips_q3 <- indego_q3 %>%
group_by(date) %>%
summarize(trips = n())
ggplot(daily_trips_q3, aes(x = date, y = trips)) +
geom_line(color = "#3182bd", linewidth = 1) +
geom_smooth(se = FALSE, color = "red", linetype = "dashed") +
labs(
title = "Indego Daily Ridership - Q3 2025",
subtitle = "Late summer demand patterns in Philadelphia",
x = "Date",
y = "Daily Trips",
caption = "Source: Indego bike share"
) +
plotTheme
# Daily trip counts Q1
daily_trips_q1 <- indego_q1 %>%
group_by(date) %>%
summarize(trips = n())
ggplot(daily_trips_q1, aes(x = date, y = trips)) +
geom_line(color = "#3182bd", linewidth = 1) +
geom_smooth(se = FALSE, color = "red", linetype = "dashed") +
labs(
title = "Indego Daily Ridership - Q1 2025",
subtitle = "Winter demand patterns in Philadelphia",
x = "Date",
y = "Daily Trips",
caption = "Source: Indego bike share"
) +
plotTheme
#Q1 shows a steady increase over time likely do to warming weather conditions; Q3 has more variation but an overall upward trend. This could be due to the weather being too hot, school being out of session, etc. Overall, Q3 has a significantly greater volume of trips.
```
```{r time patterns}
# Average trips by hour and day type
hourly_patterns_q3 <- indego_q3 %>%
group_by(hour, weekend) %>%
summarize(avg_trips = n() / n_distinct(date)) %>%
mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))
ggplot(hourly_patterns_q3, aes(x = hour, y = avg_trips, color = day_type)) +
geom_line(linewidth = 1.2) +
scale_color_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
labs(
title = "Average Hourly Ridership Patterns Q3",
subtitle = "Clear commute patterns on weekdays",
x = "Hour of Day",
y = "Average Trips per Hour",
color = "Day Type"
) +
plotTheme
hourly_patterns_q1 <- indego_q1 %>%
group_by(hour, weekend) %>%
summarize(avg_trips = n() / n_distinct(date)) %>%
mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))
ggplot(hourly_patterns_q1, aes(x = hour, y = avg_trips, color = day_type)) +
geom_line(linewidth = 1.2) +
scale_color_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
labs(
title = "Average Hourly Ridership Patterns Q1",
subtitle = "Clear commute patterns on weekdays",
x = "Hour of Day",
y = "Average Trips per Hour",
color = "Day Type"
) +
plotTheme
#Both Q1 and Q3 show clear patterns of commutes on weekdays. Ridership is higher overall in Q3, but the trends mirror each other almost exactly.
```
```{r top stations}
# Most popular origin stations
top_stations_q3 <- indego_q3 %>%
count(start_station, start_lat, start_lon, name = "trips") %>%
arrange(desc(trips)) %>%
head(20)
kable(top_stations_q3,
caption = "Top 20 Indego Stations by Trip Origins Q3",
format.args = list(big.mark = ",")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
top_stations_q1 <- indego_q1 %>%
count(start_station, start_lat, start_lon, name = "trips") %>%
arrange(desc(trips)) %>%
head(20)
kable(top_stations_q1,
caption = "Top 20 Indego Stations by Trip Origins Q1",
format.args = list(big.mark = ",")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
```{r census data}
#| message: false
#| warning: false
# Get Philadelphia census data
philly_census <- get_acs(
geography = "tract",
variables = c(
"B01003_001", # Total population
"B19013_001", # Median household income
"B08301_001", # Total commuters
"B08301_010", # Commute by transit
"B02001_002", # White alone
"B25077_001" # Median home value
),
state = "PA",
county = "Philadelphia",
year = 2022,
geometry = TRUE,
output = "wide"
) %>%
rename(
Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Total_Commuters = B08301_001E,
Transit_Commuters = B08301_010E,
White_Pop = B02001_002E,
Med_Home_Value = B25077_001E
) %>%
mutate(
Percent_Taking_Transit = (Transit_Commuters / Total_Commuters) * 100,
Percent_White = (White_Pop / Total_Pop) * 100
) %>%
st_transform(crs = 4326) #transforming to WGS84 for lat/lon matching, currently in NAD83
# Check the data
glimpse(philly_census)
```
```{r map philly}
# Map median income
ggplot() +
geom_sf(data = philly_census, aes(fill = Med_Inc), color = NA) +
scale_fill_viridis(
option = "viridis",
name = "Median\nIncome",
labels = scales::dollar
) +
labs(
title = "Philadelphia Median Household Income by Census Tract",
subtitle = "Context for understanding bike share demand patterns"
) +
# Stations
geom_point(
data = indego_q3,
aes(x = start_lon, y = start_lat),
color = "red", size = 0.25, alpha = 0.6
) +
mapTheme
#Stations cluster around areas with higher median income, but those also tend to be in Center City, which has more bike lanes, greater population density, etc.
```
```{r join to station data}
# Create sf object for stations
stations_sf <- indego_q3 %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326)
# Spatial join to get census tract for each station
stations_census <- st_join(stations_sf, philly_census, left = TRUE) %>%
st_drop_geometry()
stations_for_map <- indego_q3 %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
left_join(
stations_census %>% select(start_station, Med_Inc),
by = "start_station"
) %>%
mutate(has_census = !is.na(Med_Inc))
# Add back to trip data
indego_census_q3 <- indego_q3 %>%
left_join(
stations_census %>%
select(start_station, Med_Inc, Percent_Taking_Transit,
Percent_White, Total_Pop),
by = "start_station"
)
indego_census_q1 <- indego_q1 %>%
left_join(
stations_census %>%
select(start_station, Med_Inc, Percent_Taking_Transit,
Percent_White, Total_Pop),
by = "start_station"
)
# Prepare data for visualization
stations_for_map <- indego_q3 %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat), !is.na(start_lon)) %>%
left_join(
stations_census %>% select(start_station, Med_Inc),
by = "start_station"
) %>%
mutate(has_census = !is.na(Med_Inc))
# Create the map showing problem stations
ggplot() +
geom_sf(data = philly_census, aes(fill = Med_Inc), color = "white", size = 0.05) +
scale_fill_viridis(
option = "viridis",
name = "Median\nIncome",
labels = scales::dollar,
na.value = "grey90"
) + geom_point(
data = stations_for_map %>% filter(has_census),
aes(x = start_lon, y = start_lat),
color = "grey30", size = 1, alpha = 0.6
) +
geom_point(
data = stations_for_map %>% filter(!has_census),
aes(x = start_lon, y = start_lat),
color = "red", size = 1, shape = 4, stroke = 1.5
) +
labs(
title = "Philadelphia Median Household Income by Census Tract",
subtitle = "Indego stations shown (RED = no census data match)",
caption = "Red X marks indicate stations that didn't join to census tracts"
) +
mapTheme
#Stations along the river and peripheral, non-residential parts of the city must be accounted for when doing demographic research.
```
```{r valid stations}
# Identify which stations to keep
valid_stations <- stations_census %>%
filter(!is.na(Med_Inc)) %>%
pull(start_station)
# Filter trip data to valid stations only
indego_census <- indego_q3 %>%
filter(start_station %in% valid_stations) %>%
left_join(
stations_census %>%
select(start_station, Med_Inc, Percent_Taking_Transit,
Percent_White, Total_Pop),
by = "start_station"
)
#Here we identify and filter out those stations that do not match a demographic census area.
```
```{r weather}
# Get weather from Philadelphia International Airport (KPHL)
# This covers Q3 2025: July 1 - September 30
weather_data_q3 <- riem_measures(
station = "PHL", # Philadelphia International Airport
date_start = "2025-07-01",
date_end = "2025-09-30"
)
# Process weather data
weather_processed_q3 <- weather_data_q3 %>%
mutate(
interval60 = floor_date(valid, unit = "hour"),
Temperature = tmpf, # Temperature in Fahrenheit
Precipitation = ifelse(is.na(p01i), 0, p01i), # Hourly precip in inches
Wind_Speed = sknt # Wind speed in knots
) %>%
select(interval60, Temperature, Precipitation, Wind_Speed) %>%
distinct(interval60, .keep_all = TRUE)
# Check for missing hours and interpolate if needed
weather_complete_q3 <- weather_processed_q3 %>%
complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
fill(Temperature, Precipitation, Wind_Speed, .direction = "down")
# Look at the weather
summary(weather_complete_q3 %>% select(Temperature, Precipitation, Wind_Speed))
# This covers Q1 2025: January 1 - March 31
weather_data_q1 <- riem_measures(
station = "PHL", # Philadelphia International Airport
date_start = "2025-01-01",
date_end = "2025-03-31"
)
# Process weather data
weather_processed_q1 <- weather_data_q1 %>%
mutate(
interval60 = floor_date(valid, unit = "hour"),
Temperature = tmpf, # Temperature in Fahrenheit
Precipitation = ifelse(is.na(p01i), 0, p01i), # Hourly precip in inches
Wind_Speed = sknt # Wind speed in knots
) %>%
select(interval60, Temperature, Precipitation, Wind_Speed) %>%
distinct(interval60, .keep_all = TRUE)
# Check for missing hours and interpolate if needed
weather_complete_q1 <- weather_processed_q1 %>%
complete(interval60 = seq(min(interval60), max(interval60), by = "hour")) %>%
fill(Temperature, Precipitation, Wind_Speed, .direction = "down")
# Look at the weather
summary(weather_complete_q1 %>% select(Temperature, Precipitation, Wind_Speed))
```
```{r visualize weather}
#| message: false
#| warning: false
#|
ggplot(weather_complete_q3, aes(x = interval60, y = Temperature)) +
geom_line(color = "#3182bd", alpha = 0.7) +
geom_smooth(se = FALSE, color = "red") +
labs(
title = "Philadelphia Temperature - Q3 2025",
subtitle = "Summer to early fall transition",
x = "Date",
y = "Temperature (°F)"
) +
plotTheme
ggplot(weather_complete_q1, aes(x = interval60, y = Temperature)) +
geom_line(color = "#3182bd", alpha = 0.7) +
geom_smooth(se = FALSE, color = "red") +
labs(
title = "Philadelphia Temperature - Q1 2025",
subtitle = "Winter to early spring transition",
x = "Date",
y = "Temperature (°F)"
) +
plotTheme
#Clear temperature trend upward in Q1 with most days averaging below 60. Q3 patterns have less of a trend, with the line fit mostly staying between 70 and 80 on average.
```
```{r aggregate data}
#| message: false
#| warning: false
# Count trips by station-hour
trips_panel_q3 <- indego_census_q3 %>%
group_by(interval60, start_station, start_lat, start_lon,
Med_Inc, Percent_Taking_Transit, Percent_White, Total_Pop) %>%
summarize(Trip_Count = n()) %>%
ungroup()
trips_panel_q1 <- indego_census_q1 %>%
group_by(interval60, start_station, start_lat, start_lon,
Med_Inc, Percent_Taking_Transit, Percent_White, Total_Pop) %>%
summarize(Trip_Count = n()) %>%
ungroup()
# How many station-hour observations?
nrow(trips_panel_q3)
# How many unique stations?
length(unique(trips_panel_q3$start_station))
# How many unique hours?
length(unique(trips_panel_q3$interval60))
hist(trips_panel_q3$Trip_Count)
```
```{r complete panel}
# Calculate expected panel size
n_stations_q1 <- length(unique(trips_panel_q1$start_station))
n_hours_q1 <- length(unique(trips_panel_q1$interval60))
expected_rows_q1 <- n_stations_q1 * n_hours_q1
cat("Expected panel rows:", format(expected_rows_q1, big.mark = ","), "\n")
cat("Current rows:", format(nrow(trips_panel_q1), big.mark = ","), "\n")
cat("Missing rows:", format(expected_rows_q1 - nrow(trips_panel_q1), big.mark = ","), "\n")
n_stations_q3 <- length(unique(trips_panel_q3$start_station))
n_hours_q3 <- length(unique(trips_panel_q3$interval60))
expected_rows_q3 <- n_stations_q3 * n_hours_q3
cat("Expected panel rows:", format(expected_rows_q3, big.mark = ","), "\n")
cat("Current rows:", format(nrow(trips_panel_q3), big.mark = ","), "\n")
cat("Missing rows:", format(expected_rows_q3 - nrow(trips_panel_q3), big.mark = ","), "\n")
# Create complete panel Q3
study_panel_q3 <- expand_grid(
interval60 = unique(trips_panel_q3$interval60),
start_station = unique(trips_panel_q3$start_station)
) %>%
# Join trip counts
left_join(trips_panel_q3 %>%
select(interval60, start_station, Trip_Count),
by = c("interval60", "start_station")) %>%
# Replace NA trip counts with 0
mutate(Trip_Count = replace_na(Trip_Count, 0))
glimpse(study_panel_q3)
hist(study_panel_q3$Trip_Count)
# Create complete panel Q1
study_panel_q1 <- expand_grid(
interval60 = unique(trips_panel_q1$interval60),
start_station = unique(trips_panel_q1$start_station)
) %>%
left_join(trips_panel_q1 %>%
select(interval60, start_station, Trip_Count),
by = c("interval60", "start_station")) %>%
mutate(Trip_Count = replace_na(Trip_Count, 0))
glimpse(study_panel_q1)
hist(study_panel_q1$Trip_Count)
# Fill in station attributes (they're the same for all hours)
station_attributes_q3 <- trips_panel_q3 %>%
group_by(start_station) %>%
summarize(
start_lat = first(start_lat),
start_lon = first(start_lon),
Med_Inc = first(Med_Inc),
Percent_Taking_Transit = first(Percent_Taking_Transit),
Percent_White = first(Percent_White),
Total_Pop = first(Total_Pop)
)
station_attributes_q1 <- trips_panel_q1 %>%
group_by(start_station) %>%
summarize(
start_lat = first(start_lat),
start_lon = first(start_lon),
Med_Inc = first(Med_Inc),
Percent_Taking_Transit = first(Percent_Taking_Transit),
Percent_White = first(Percent_White),
Total_Pop = first(Total_Pop)
)
glimpse(station_attributes_q3)
study_panel_q1 <- study_panel_q1 %>%
left_join(station_attributes_q1, by = "start_station")
study_panel_q3 <- study_panel_q3 %>%
left_join(station_attributes_q3, by = "start_station")
glimpse(study_panel_q1)
glimpse(study_panel_q3)
# Verify we have complete panel
cat("Complete panel rows:", format(nrow(study_panel_q3), big.mark = ","), "\n")
cat("Complete panel rows:", format(nrow(study_panel_q1), big.mark = ","), "\n")
#The trends on both Q1 and Q3 look similar, but Q3 operates on a much higher scale.
```
```{r add time features}
study_panel_q3 <- study_panel_q3 %>%
mutate(
week = week(interval60),
month = month(interval60, label = TRUE),
dotw = wday(interval60, label = TRUE),
hour = hour(interval60),
date = as.Date(interval60),
weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
)
glimpse(study_panel_q3)
study_panel_q1 <- study_panel_q1 %>%
mutate(
week = week(interval60),
month = month(interval60, label = TRUE),
dotw = wday(interval60, label = TRUE),
hour = hour(interval60),
date = as.Date(interval60),
weekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
rush_hour = ifelse(hour %in% c(7, 8, 9, 16, 17, 18), 1, 0)
)
glimpse(study_panel_q1)
```
```{r join weather}
study_panel_q3 <- study_panel_q3 %>%
left_join(weather_complete_q3, by = "interval60")
glimpse(study_panel_q3)
study_panel_q1 <- study_panel_q1 %>%
left_join(weather_complete_q1, by = "interval60")
glimpse(study_panel_q1)
# Check for missing values
summary(study_panel_q3 %>% select(Trip_Count, Temperature, Precipitation))
summary(study_panel_q1 %>% select(Trip_Count, Temperature, Precipitation))
```
```{r create lags}
# Sort by station and time
study_panel_q3 <- study_panel_q3 %>%
arrange(start_station, interval60)
# Create lag variables WITHIN each station
study_panel_q3 <- study_panel_q3 %>%
group_by(start_station) %>%
mutate(
lag1Hour = lag(Trip_Count, 1),
lag2Hours = lag(Trip_Count, 2),
lag3Hours = lag(Trip_Count, 3),
lag12Hours = lag(Trip_Count, 12),
lag1day = lag(Trip_Count, 24)
) %>%
ungroup()
# Sort by station and time
study_panel_q1 <- study_panel_q1 %>%
arrange(start_station, interval60)
# Create lag variables WITHIN each station
study_panel_q1 <- study_panel_q1 %>%
group_by(start_station) %>%
mutate(
lag1Hour = lag(Trip_Count, 1),
lag2Hours = lag(Trip_Count, 2),
lag3Hours = lag(Trip_Count, 3),
lag12Hours = lag(Trip_Count, 12),
lag1day = lag(Trip_Count, 24)
) %>%
ungroup()
# Remove rows with NA lags (first 24 hours for each station)
study_panel_complete_q3 <- study_panel_q3 %>%
filter(!is.na(lag1day))
study_panel_complete_q1 <- study_panel_q1 %>%
filter(!is.na(lag1day))
cat("Rows after removing NA lags:", format(nrow(study_panel_complete_q1), big.mark = ","), "\n")
cat("Rows after removing NA lags:", format(nrow(study_panel_complete_q3), big.mark = ","), "\n")
```
```{r lag visualizations}
# Sample one station to visualize
example_station <- study_panel_complete_q3 %>%
filter(start_station == 3188) %>%
head(168) # One week
# Plot actual vs lagged demand
ggplot(example_station, aes(x = interval60)) +
geom_line(aes(y = Trip_Count, color = "Current"), linewidth = 1) +
geom_line(aes(y = lag1Hour, color = "1 Hour Ago"), linewidth = 1, alpha = 0.7) +
geom_line(aes(y = lag1day, color = "24 Hours Ago"), linewidth = 1, alpha = 0.7) +
scale_color_manual(values = c(
"Current" = "#08519c",
"1 Hour Ago" = "#3182bd",
"24 Hours Ago" = "#6baed6"
)) +
labs(
title = "Temporal Lag Patterns at One Station",
subtitle = "Past demand predicts future demand",
x = "Date-Time",
y = "Trip Count",
color = "Time Period"
) +
plotTheme
```
```{r temporal test train}
# Split by week
# Q3 has weeks 26-39 (Jul-Sep)
# Train on weeks 26-35 (Jul 1- early September)
# Test on weeks 36-39 (rest of September)
# Split by week
# Q1 has weeks 1-13 (Jan-Mar)
# Train on weeks 1-9 (Jan 1 - early March)
# Test on weeks 10-13 (rest of March)
# Which stations have trips in BOTH early and late periods?
early_stations_q3 <- study_panel_complete_q3 %>%
filter(week < 36) %>%
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
late_stations_q3 <- study_panel_complete_q3 %>%
filter(week >= 36) %>%
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
early_stations_q1 <- study_panel_complete_q1 %>%
filter(week < 10) %>%
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
late_stations_q1 <- study_panel_complete_q1 %>%
filter(week >= 10) %>%
filter(Trip_Count > 0) %>%
distinct(start_station) %>%
pull(start_station)
# Keep only stations that appear in BOTH periods
common_stations_q3 <- intersect(early_stations_q3, late_stations_q3)
common_stations_q1 <- intersect(early_stations_q1, late_stations_q1)
# Filter panel to only common stations
study_panel_complete_q3 <- study_panel_complete_q3 %>%
filter(start_station %in% common_stations_q3)
study_panel_complete_q1 <- study_panel_complete_q1 %>%
filter(start_station %in% common_stations_q1)
# Create train/test split
train_q3 <- study_panel_complete_q3 %>%
filter(week < 36)
test_q3 <- study_panel_complete_q3 %>%
filter(week >= 36)
train_q1 <- study_panel_complete_q1 %>%
filter(week < 10)
test_q1 <- study_panel_complete_q1 %>%
filter(week >= 10)
# Convert start_station to factor in training data first,
# then align test levels to training levels so predict() doesn't error
# Convert to factor FIRST
train_q3 <- train_q3 %>%
mutate(start_station = factor(start_station))
train_q1 <- train_q1 %>%
mutate(start_station = factor(start_station))
train_q3 <- train_q3 %>% droplevels()
train_q1 <- train_q1 %>% droplevels()
cat("Training observations Q3:", format(nrow(train_q3), big.mark = ","), "\n")
cat("Testing observations Q3:", format(nrow(test_q3), big.mark = ","), "\n")
cat("Training date range Q3:", min(train_q3$date), "to", max(train_q3$date), "\n")
cat("Testing date range Q3:", min(test_q3$date), "to", max(test_q3$date), "\n")
cat("Training observations Q1:", format(nrow(train_q1), big.mark = ","), "\n")
cat("Testing observations Q1:", format(nrow(test_q1), big.mark = ","), "\n")
cat("Training date range Q1:", min(train_q1$date), "to", max(train_q1$date), "\n")
cat("Testing date range Q1:", min(test_q1$date), "to", max(test_q1$date), "\n")
```
```{r station volume}
# Q3
station_volume_q3 <- train_q3 %>%
mutate(start_station = as.character(start_station)) %>%
group_by(start_station) %>%
summarize(avg_trips = mean(Trip_Count)) %>%
mutate(volume_tier = case_when(
avg_trips >= quantile(avg_trips, 0.67) ~ "High",
avg_trips >= quantile(avg_trips, 0.33) ~ "Medium",
TRUE ~ "Low"
)) %>%
select(start_station, volume_tier)
train_q3 <- train_q3 %>%
mutate(start_station = as.character(start_station)) %>%
left_join(station_volume_q3, by = "start_station") %>%
mutate(start_station = factor(start_station))
test_q3 <- test_q3 %>%
mutate(start_station = as.character(start_station)) %>%
left_join(station_volume_q3, by = "start_station") %>%
mutate(
volume_tier = replace_na(volume_tier, "Low"),
start_station = factor(start_station, levels = levels(train_q3$start_station))
)
# Q1
station_volume_q1 <- train_q1 %>%
mutate(start_station = as.character(start_station)) %>%
group_by(start_station) %>%
summarize(avg_trips = mean(Trip_Count)) %>%
mutate(volume_tier = case_when(
avg_trips >= quantile(avg_trips, 0.67) ~ "High",
avg_trips >= quantile(avg_trips, 0.33) ~ "Medium",
TRUE ~ "Low"
)) %>%
select(start_station, volume_tier)
train_q1 <- train_q1 %>%
mutate(start_station = as.character(start_station)) %>%
left_join(station_volume_q1, by = "start_station") %>%
mutate(start_station = factor(start_station))
test_q1 <- test_q1 %>%
mutate(start_station = as.character(start_station)) %>%
left_join(station_volume_q1, by = "start_station") %>%
mutate(
volume_tier = replace_na(volume_tier, "Low"),
start_station = factor(start_station, levels = levels(train_q1$start_station))
)
# Factor levels
train_q3 <- train_q3 %>% mutate(volume_tier = factor(volume_tier, levels = c("Low", "Medium", "High")))
test_q3 <- test_q3 %>% mutate(volume_tier = factor(volume_tier, levels = c("Low", "Medium", "High")))
train_q1 <- train_q1 %>% mutate(volume_tier = factor(volume_tier, levels = c("Low", "Medium", "High")))
test_q1 <- test_q1 %>% mutate(volume_tier = factor(volume_tier, levels = c("Low", "Medium", "High")))
```
```{r distance to city hall}
# Center City Philadelphia coordinates (City Hall)
center_city_lat <- 39.9526
center_city_lon <- -75.1652
# Haversine function
haversine_miles <- function(lat1, lon1, lat2, lon2) {
R <- 3958.8
dlat <- (lat2 - lat1) * pi / 180
dlon <- (lon2 - lon1) * pi / 180
a <- sin(dlat/2)^2 + cos(lat1 * pi/180) * cos(lat2 * pi/180) * sin(dlon/2)^2
2 * R * asin(sqrt(a))
}
# Calculate distances from training data only
station_distances_q3 <- train_q3 %>%
mutate(start_station = as.character(start_station)) %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat)) %>%
mutate(dist_to_center = haversine_miles(start_lat, start_lon,
center_city_lat, center_city_lon))
station_distances_q1 <- train_q1 %>%
mutate(start_station = as.character(start_station)) %>%
distinct(start_station, start_lat, start_lon) %>%
filter(!is.na(start_lat)) %>%
mutate(dist_to_center = haversine_miles(start_lat, start_lon,
center_city_lat, center_city_lon))
# Join to train (remove first in case chunk is re-run)
train_q3 <- train_q3 %>%
mutate(start_station = as.character(start_station)) %>%
select(-any_of("dist_to_center")) %>%
left_join(station_distances_q3 %>% select(start_station, dist_to_center),
by = "start_station") %>%
mutate(start_station = factor(start_station))
train_q1 <- train_q1 %>%
mutate(start_station = as.character(start_station)) %>%
select(-any_of("dist_to_center")) %>%
left_join(station_distances_q1 %>% select(start_station, dist_to_center),
by = "start_station") %>%
mutate(start_station = factor(start_station))
# Join to test
test_q3 <- test_q3 %>%
mutate(start_station = as.character(start_station)) %>%
select(-any_of("dist_to_center")) %>%
left_join(station_distances_q3 %>% select(start_station, dist_to_center),
by = "start_station") %>%
mutate(
dist_to_center = replace_na(dist_to_center,
median(station_distances_q3$dist_to_center)),
start_station = factor(start_station, levels = levels(train_q3$start_station))
)
test_q1 <- test_q1 %>%
mutate(start_station = as.character(start_station)) %>%
select(-any_of("dist_to_center")) %>%
left_join(station_distances_q1 %>% select(start_station, dist_to_center),
by = "start_station") %>%
mutate(
dist_to_center = replace_na(dist_to_center,
median(station_distances_q1$dist_to_center)),
start_station = factor(start_station, levels = levels(train_q1$start_station))
)
# Verify
cat("Q3 train dist_to_center - min:", round(min(train_q3$dist_to_center, na.rm=TRUE), 2),
"max:", round(max(train_q3$dist_to_center, na.rm=TRUE), 2), "miles\n")
cat("Q1 train dist_to_center - min:", round(min(train_q1$dist_to_center, na.rm=TRUE), 2),
"max:", round(max(train_q1$dist_to_center, na.rm=TRUE), 2), "miles\n")
cat("Q3 test NAs in dist_to_center:", sum(is.na(test_q3$dist_to_center)), "\n")
cat("Q1 test NAs in dist_to_center:", sum(is.na(test_q1$dist_to_center)), "\n")
```
```{r model 1}
# Create day of week factor with treatment (dummy) coding
train_q3 <- train_q3 %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(train_q3$dotw_simple) <- contr.treatment(7)
# Now run the model
model1_q3 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation,
data = train_q3
)
summary(model1_q3)
# Create day of week factor with treatment (dummy) coding
train_q1 <- train_q1 %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(train_q1$dotw_simple) <- contr.treatment(7)
# Now run the model
model1_q1 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation,
data = train_q1
)
summary(model1_q1)
```
```{r model 2}
model2_q3 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day,
data = train_q3
)
summary(model2_q3)
model2_q1 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day,
data = train_q1
)
summary(model2_q1)
```
```{r model 3}
model3_q3 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc + Percent_Taking_Transit + Percent_White,
data = train_q3
)
summary(model3_q3)
model3_q1 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc + Percent_Taking_Transit + Percent_White,
data = train_q1
)
summary(model3_q1)
```
```{r model 4}
model4_q3 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc + Percent_Taking_Transit + Percent_White +
start_station,
data = train_q3
)
cat("Model 4 R-squared Q3:", summary(model4_q3)$r.squared, "\n")
cat("Model 4 Adj R-squared Q3:", summary(model4_q3)$adj.r.squared, "\n")
model4_q1 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day +
Med_Inc + Percent_Taking_Transit + Percent_White +
start_station,
data = train_q1
)
cat("Model 4 R-squared Q1:", summary(model4_q1)$r.squared, "\n")
cat("Model 4 Adj R-squared Q1:", summary(model4_q1)$adj.r.squared, "\n")
```
```{r model 5}
model5_q3 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day + rush_hour + as.factor(month) +
Med_Inc + Percent_Taking_Transit + Percent_White +
start_station +
rush_hour * weekend, # Rush hour effects different on weekends
data = train_q3
)
cat("Model 5 R-squared Q3:", summary(model5_q3)$r.squared, "\n")
cat("Model 5 Adj R-squared Q3:", summary(model5_q3)$adj.r.squared, "\n")
model5_q1 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day + rush_hour + as.factor(month) +
Med_Inc + Percent_Taking_Transit + Percent_White +
start_station +
rush_hour * weekend, # Rush hour effects different on weekends
data = train_q1
)
cat("Model 5 R-squared Q1:", summary(model5_q1)$r.squared, "\n")
cat("Model 5 Adj R-squared Q1:", summary(model5_q1)$adj.r.squared, "\n")
```
```{r mae}
# Get predictions on test set
# Create day of week factor with treatment (dummy) coding
test_q3 <- test_q3 %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(test_q3$dotw_simple) <- contr.treatment(7)
test_q3$start_station <- factor(
test_q3$start_station,
levels = model4_q3$xlevels$start_station
)
test_q1$start_station <- factor(
test_q1$start_station,
levels = model4_q1$xlevels$start_station
)
test_q3 <- test_q3 %>% droplevels()
test_q1 <- test_q1 %>% droplevels()
test_q3 <- test_q3 %>%
mutate(
pred1 = predict(model1_q3, newdata = test_q3),
pred2 = predict(model2_q3, newdata = test_q3),
pred3 = predict(model3_q3, newdata = test_q3),
pred4 = predict(model4_q3, newdata = test_q3),
pred5 = predict(model5_q3, newdata = test_q3)
)
# Calculate MAE for each model
mae_results_q3 <- data.frame(
Model = c(
"1. Time + Weather",
"2. + Temporal Lags",
"3. + Demographics",
"4. + Station FE",
"5. + Rush Hour Interaction"
),
MAE = c(
mean(abs(test_q3$Trip_Count - test_q3$pred1), na.rm = TRUE),
mean(abs(test_q3$Trip_Count - test_q3$pred2), na.rm = TRUE),
mean(abs(test_q3$Trip_Count - test_q3$pred3), na.rm = TRUE),
mean(abs(test_q3$Trip_Count - test_q3$pred4), na.rm = TRUE),
mean(abs(test_q3$Trip_Count - test_q3$pred5), na.rm = TRUE)
)
)
kable(mae_results_q3,
digits = 2,
caption = "Mean Absolute Error by Model, Q3 (Test Set)",
col.names = c("Model", "MAE (trips)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
# Create day of week factor with treatment (dummy) coding
test_q1 <- test_q1 %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(test_q1$dotw_simple) <- contr.treatment(7)
test_q1 <- test_q1 %>%
mutate(
pred1 = predict(model1_q1, newdata = test_q1),
pred2 = predict(model2_q1, newdata = test_q1),
pred3 = predict(model3_q1, newdata = test_q1),
pred4 = predict(model4_q1, newdata = test_q1),
pred5 = predict(model5_q1, newdata = test_q1)
)
# Calculate MAE for each model
mae_results_q1 <- data.frame(
Model = c(
"1. Time + Weather",
"2. + Temporal Lags",
"3. + Demographics",
"4. + Station FE",
"5. + Rush Hour Interaction"
),
MAE = c(
mean(abs(test_q1$Trip_Count - test_q1$pred1), na.rm = TRUE),
mean(abs(test_q1$Trip_Count - test_q1$pred2), na.rm = TRUE),
mean(abs(test_q1$Trip_Count - test_q1$pred3), na.rm = TRUE),
mean(abs(test_q1$Trip_Count - test_q1$pred4), na.rm = TRUE),
mean(abs(test_q1$Trip_Count - test_q1$pred5), na.rm = TRUE)
)
)
kable(mae_results_q1,
digits = 2,
caption = "Mean Absolute Error by Model, q1 (Test Set)",
col.names = c("Model", "MAE (trips)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
#MAE is higher for Q3 than Q1. Trends such as weather/temperature and weekend/weekday divide are trend more clearly in Q1, whereas more outliers exist in the Q3 dataset. This makes the models predict significantly better in Q1 than Q3. For Q1, the model perfoms best at stage 4 after adding fixed effects. For Q3, stage 4 and 5 half equal MAE.
```
```{r visualize comparisons}
ggplot(mae_results_q3, aes(x = reorder(Model, -MAE), y = MAE)) +
geom_col(fill = "#3182bd", alpha = 0.8) +
geom_text(aes(label = round(MAE, 2)), vjust = -0.5) +
labs(
title = "Model Performance Comparison Q3",
subtitle = "Lower MAE = Better Predictions",
x = "Model",
y = "Mean Absolute Error (trips)"
) +
plotTheme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(mae_results_q1, aes(x = reorder(Model, -MAE), y = MAE)) +
geom_col(fill = "#3182bd", alpha = 0.8) +
geom_text(aes(label = round(MAE, 2)), vjust = -0.5) +
labs(
title = "Model Performance Comparison Q1",
subtitle = "Lower MAE = Better Predictions",
x = "Model",
y = "Mean Absolute Error (trips)"
) +
plotTheme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
```{r observations vs predictions}
#| message: false
#| warning: false
test_q3 <- test_q3 %>%
mutate(
error = Trip_Count - pred2,
abs_error = abs(error),
time_of_day = case_when(
hour < 7 ~ "Overnight",
hour >= 7 & hour < 10 ~ "AM Rush",
hour >= 10 & hour < 15 ~ "Mid-Day",
hour >= 15 & hour <= 18 ~ "PM Rush",
hour > 18 ~ "Evening"
)
)
# Scatter plot by time and day type
ggplot(test_q3, aes(x = Trip_Count, y = pred2)) +
geom_point(alpha = 0.2, color = "#3182bd") +
geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
facet_grid(weekend ~ time_of_day) +
labs(
title = "Observed vs. Predicted Bike Trips Q3",
subtitle = "Model 2 performance by time period Q3",
x = "Observed Trips",
y = "Predicted Trips",
caption = "Red line = perfect predictions; Green line = actual model fit"
) +
plotTheme
test_q1 <- test_q1 %>%
mutate(
error = Trip_Count - pred2,
abs_error = abs(error),
time_of_day = case_when(
hour < 7 ~ "Overnight",
hour >= 7 & hour < 10 ~ "AM Rush",
hour >= 10 & hour < 15 ~ "Mid-Day",
hour >= 15 & hour <= 18 ~ "PM Rush",
hour > 18 ~ "Evening"
)
)
# Scatter plot by time and day type
ggplot(test_q1, aes(x = Trip_Count, y = pred2)) +
geom_point(alpha = 0.2, color = "#3182bd") +
geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) +
geom_smooth(method = "lm", se = FALSE, color = "darkgreen") +
facet_grid(weekend ~ time_of_day) +
labs(
title = "Observed vs. Predicted Bike Trips q1",
subtitle = "Model 2 performance by time period q1",
x = "Observed Trips",
y = "Predicted Trips",
caption = "Red line = perfect predictions; Green line = actual model fit"
) +
plotTheme
#There is less variation in the errors in Q1. The Q1 model does well at overnight and decently well with mid-day frames. Q3 shows more variation in error overall, especially struggling with the PM rush period.
```
```{r spatial errors}
# Calculate MAE by station
station_errors_q3 <- test_q3 %>%
group_by(start_station, start_lat, start_lon) %>%
summarize(
MAE = mean(abs_error, na.rm = TRUE),
avg_demand = mean(Trip_Count, na.rm = TRUE),
.groups = "drop"
) %>%
filter(!is.na(start_lat), !is.na(start_lon))
station_errors_q1 <- test_q1 %>%
group_by(start_station, start_lat, start_lon) %>%
summarize(
MAE = mean(abs_error, na.rm = TRUE),
avg_demand = mean(Trip_Count, na.rm = TRUE),
.groups = "drop"
) %>%
filter(!is.na(start_lat), !is.na(start_lon))
## Create Two Maps Side-by-Side with Proper Legends (sorry these maps are ugly)
# Calculate station errors
station_errors_q3 <- test_q3 %>%
filter(!is.na(pred2)) %>%
group_by(start_station, start_lat, start_lon) %>%
summarize(
MAE = mean(abs(Trip_Count - pred2), na.rm = TRUE),
avg_demand = mean(Trip_Count, na.rm = TRUE),
.groups = "drop"
) %>%
filter(!is.na(start_lat), !is.na(start_lon))
station_errors_q1 <- test_q1 %>%
filter(!is.na(pred2)) %>%
group_by(start_station, start_lat, start_lon) %>%
summarize(
MAE = mean(abs(Trip_Count - pred2), na.rm = TRUE),
avg_demand = mean(Trip_Count, na.rm = TRUE),
.groups = "drop"
) %>%
filter(!is.na(start_lat), !is.na(start_lon))
# Map 1: Prediction Errors
p1_q3 <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
geom_point(
data = station_errors_q3,
aes(x = start_lon, y = start_lat, color = MAE),
size = 3.5,
alpha = 0.7
) +
scale_color_viridis(
option = "plasma",
name = "MAE\n(trips)",
direction = -1,
breaks = c(0.5, 1.0, 1.5), # Fewer, cleaner breaks
labels = c("0.5", "1.0", "1.5")
) +
labs(title = "Prediction Errors Q3",
subtitle = "Higher in Center City") +
mapTheme +
theme(
legend.position = "right",
legend.title = element_text(size = 10, face = "bold"),
legend.text = element_text(size = 9),
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10)
) +
guides(color = guide_colorbar(
barwidth = 1.5,
barheight = 12,
title.position = "top",
title.hjust = 0.5
))
p1_q1 <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
geom_point(
data = station_errors_q1,
aes(x = start_lon, y = start_lat, color = MAE),
size = 3.5,
alpha = 0.7
) +
scale_color_viridis(
option = "plasma",
name = "MAE\n(trips)",
direction = -1,
breaks = c(0.5, 1.0, 1.5), # Fewer, cleaner breaks
labels = c("0.5", "1.0", "1.5")
) +
labs(title = "Prediction Errors Q1",
subtitle = "Higher in Center City") +
mapTheme +
theme(
legend.position = "right",
legend.title = element_text(size = 10, face = "bold"),
legend.text = element_text(size = 9),
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10)
) +
guides(color = guide_colorbar(
barwidth = 1.5,
barheight = 12,
title.position = "top",
title.hjust = 0.5
))
# Map 2: Average Demand
p2_q3 <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
geom_point(
data = station_errors_q3,
aes(x = start_lon, y = start_lat, color = avg_demand),
size = 3.5,
alpha = 0.7
) +
scale_color_viridis(
option = "viridis",
name = "Avg\nDemand",
direction = -1,
breaks = c(0.5, 1.0, 1.5, 2.0, 2.5), # Clear breaks
labels = c("0.5", "1.0", "1.5", "2.0", "2.5")
) +
labs(title = "Average Demand Q3",
subtitle = "Trips per station-hour") +
mapTheme +
theme(
legend.position = "right",
legend.title = element_text(size = 10, face = "bold"),
legend.text = element_text(size = 9),
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10)
) +
guides(color = guide_colorbar(
barwidth = 1.5,
barheight = 12,
title.position = "top",
title.hjust = 0.5
))
p2_q1 <- ggplot() +
geom_sf(data = philly_census, fill = "grey95", color = "white", size = 0.2) +
geom_point(
data = station_errors_q1,
aes(x = start_lon, y = start_lat, color = avg_demand),
size = 3.5,
alpha = 0.7
) +
scale_color_viridis(
option = "viridis",
name = "Avg\nDemand",
direction = -1,
breaks = c(0.5, 1.0, 1.5, 2.0, 2.5), # Clear breaks
labels = c("0.5", "1.0", "1.5", "2.0", "2.5")
) +
labs(title = "Average Demand q1",
subtitle = "Trips per station-hour") +
mapTheme +
theme(
legend.position = "right",
legend.title = element_text(size = 10, face = "bold"),
legend.text = element_text(size = 9),
plot.title = element_text(size = 14, face = "bold"),
plot.subtitle = element_text(size = 10)
) +
guides(color = guide_colorbar(
barwidth = 1.5,
barheight = 12,
title.position = "top",
title.hjust = 0.5
))
# Combine
(p1_q3 | p2_q3)
(p1_q1 | p2_q1)
#Stations with higher volumes have higher MAE, and Q3 has higher MAE overall than Q1. This shows us the model predicts better on stations further from Center City where there is a lower volume of trips.
```
```{r temporal errors}
# MAE by time of day and day type
temporal_errors_q3 <- test_q3 %>%
group_by(time_of_day, weekend) %>%
summarize(
MAE = mean(abs_error, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))
ggplot(temporal_errors_q3, aes(x = time_of_day, y = MAE, fill = day_type)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
labs(
title = "Prediction Errors by Time Period Q3",
subtitle = "When is the model struggling most?",
x = "Time of Day",
y = "Mean Absolute Error (trips)",
fill = "Day Type"
) +
plotTheme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
temporal_errors_q1 <- test_q1 %>%
group_by(time_of_day, weekend) %>%
summarize(
MAE = mean(abs_error, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(day_type = ifelse(weekend == 1, "Weekend", "Weekday"))
ggplot(temporal_errors_q1, aes(x = time_of_day, y = MAE, fill = day_type)) +
geom_col(position = "dodge") +
scale_fill_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
labs(
title = "Prediction Errors by Time Period Q1",
subtitle = "When is the model struggling most?",
x = "Time of Day",
y = "Mean Absolute Error (trips)",
fill = "Day Type"
) +
plotTheme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#Following with the trends observed so far, both models struggle more with high volume times, especially the PM rush. Both models excel during the overnight time period.
```
```{r demographic errors}
#| message: false
#| warning: false
# Join demographic data to station errors
station_errors_demo_q3 <- station_errors_q3 %>%
mutate(start_station = as.character(start_station)) %>%
left_join(
station_attributes_q3 %>%
mutate(start_station = as.character(start_station)) %>%
select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White),
by = "start_station"
) %>%
filter(!is.na(Med_Inc))
station_errors_demo_q1 <- station_errors_q1 %>%
mutate(start_station = as.character(start_station)) %>%
left_join(
station_attributes_q1 %>%
mutate(start_station = as.character(start_station)) %>%
select(start_station, Med_Inc, Percent_Taking_Transit, Percent_White),
by = "start_station"
) %>%
filter(!is.na(Med_Inc))
# Create plots
p1_q3 <- ggplot(station_errors_demo_q3, aes(x = Med_Inc, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
scale_x_continuous(labels = scales::dollar) +
labs(title = "Errors vs. Median Income Q3", x = "Median Income", y = "MAE") +
plotTheme
p2_q3 <- ggplot(station_errors_demo_q3, aes(x = Percent_Taking_Transit, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Transit Usage Q3", x = "% Taking Transit", y = "MAE") +
plotTheme
p3_q3 <- ggplot(station_errors_demo_q3, aes(x = Percent_White, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Race Q3", x = "% White", y = "MAE") +
plotTheme
(p1_q3 | p2_q3) / (p3_q3 | plot_spacer())
p1_q1 <- ggplot(station_errors_demo_q1, aes(x = Med_Inc, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
scale_x_continuous(labels = scales::dollar) +
labs(title = "Errors vs. Median Income q1", x = "Median Income", y = "MAE") +
plotTheme
p2_q1 <- ggplot(station_errors_demo_q1, aes(x = Percent_Taking_Transit, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Transit Usage q1", x = "% Taking Transit", y = "MAE") +
plotTheme
p3_q1 <- ggplot(station_errors_demo_q1, aes(x = Percent_White, y = MAE)) +
geom_point(alpha = 0.5, color = "#3182bd") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Errors vs. Race q1", x = "% White", y = "MAE") +
plotTheme
(p1_q1 | p2_q1) / (p3_q1 | plot_spacer())
#Q1 and Q3models have errors distributed fairly randomly around their trend lines, which show clear trends that differ based on the demographic measure. While the scales of error and volume of trips varies widely, the slopes of the fitted error lines are nearly identical. The models have more errors in higher income areas and areas with a higher percentage of white people. Both models see a decrease in error as the % of transit ridership increases.
```
```{r new models}
# Model 6: Best previous model (5) + new features
# Adding dist_to_center and volume_tier to model 5's formula
model6_q3 <- lm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day + rush_hour + as.factor(month) +
Med_Inc + Percent_Taking_Transit + Percent_White +
start_station +
rush_hour * weekend +
dist_to_center + volume_tier,
data = train_q3
)
cat("Model 6 Q3 R-squared:", summary(model6_q3)$r.squared, "\n")
cat("Model 6 Q3 Adj R-squared:", summary(model6_q3)$adj.r.squared, "\n")
```
```{r poisson model}
# Poisson is appropriate for count data like trip counts
# It assumes variance increases with the mean (unlike OLS which assumes constant variance)
model_poisson_q3 <- glm(
Trip_Count ~ as.factor(hour) + dotw_simple + Temperature + Precipitation +
lag1Hour + lag3Hours + lag1day + rush_hour + as.factor(month) +
Med_Inc + Percent_Taking_Transit + Percent_White +
start_station +
rush_hour * weekend +
dist_to_center + volume_tier,
data = train_q3,
family = poisson(link = "log")
)
cat("Poisson Q3 AIC:", AIC(model_poisson_q3), "\n")
cat("OLS Model 6 Q3 AIC:", AIC(model6_q3), "\n")
```
```{r mae on new models}
# Get predictions on test set
# Create day of week factor with treatment (dummy) coding
test_q3 <- test_q3 %>%
mutate(dotw_simple = factor(dotw, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")))
# Set contrasts to treatment coding (dummy variables)
contrasts(test_q3$dotw_simple) <- contr.treatment(7)
test_q3$start_station <- factor(
test_q3$start_station,
levels = model4_q3$xlevels$start_station
)
test_q3 <- test_q3 %>% droplevels()
test_q3 <- test_q3 %>%
mutate(
pred1 = predict(model1_q3, newdata = test_q3),
pred2 = predict(model2_q3, newdata = test_q3),
pred3 = predict(model3_q3, newdata = test_q3),
pred4 = predict(model4_q3, newdata = test_q3),
pred5 = predict(model5_q3, newdata = test_q3),
pred6 = predict(model6_q3, newdata = test_q3),
pred_poisson = predict(model_poisson_q3, newdata = test_q3, type = "response") )
# Calculate MAE for each model
mae_results_newmodels <- data.frame(
Model = c(
"1. Time + Weather",
"2. + Temporal Lags",
"3. + Demographics",
"4. + Station FE",
"5. + Rush Hour Interaction",
"6. + Station Volume + Distance to Center City",
"7. Poisson fit"
),
MAE = c(
mean(abs(test_q3$Trip_Count - test_q3$pred1), na.rm = TRUE),
mean(abs(test_q3$Trip_Count - test_q3$pred2), na.rm = TRUE),
mean(abs(test_q3$Trip_Count - test_q3$pred3), na.rm = TRUE),
mean(abs(test_q3$Trip_Count - test_q3$pred4), na.rm = TRUE),
mean(abs(test_q3$Trip_Count - test_q3$pred5), na.rm = TRUE),
mean(abs(test_q3$Trip_Count - test_q3$pred6), na.rm = TRUE),
mean(abs(test_q3$Trip_Count - test_q3$pred_poisson), na.rm = TRUE)
)
)
kable(mae_results_newmodels,
digits = 2,
caption = "Mean Absolute Error by Model, Q3 (Test Set)",
col.names = c("Model", "MAE (trips)")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
#Station volume and distance to center city had no effect on the MAE, which I was surprised by. By creating volume tiers, I thought that the model would improve because the volume of trips seemed to have a large effect on the errors, as seen in the maps showing trip volume vs. error. Again, because Center City seemed to be a hotspot of errors, I thought that distance to City Hall might improve the model. In the future, perhaps a different engineering method of these two features would result in a better model. However, the Poisson fit improved the model greatly. This makes sense; the trip volumes are not continuous, so using a regression model that allows for a longer tail (like Poisson) improved the model.
```
```{r error by time of day}
# MAE by time of day and day type
# Calculate errors for new models
test_q3 <- test_q3 %>%
mutate(
abs_error_m6 = abs(Trip_Count - pred6),
abs_error_poisson = abs(Trip_Count - pred_poisson)
)
# Pivot longer so both models can be faceted/compared
temporal_errors_newmodels <- test_q3 %>%
select(time_of_day, weekend, abs_error_m6, abs_error_poisson) %>%
pivot_longer(
cols = c(abs_error_m6, abs_error_poisson),
names_to = "model",
values_to = "abs_error"
) %>%
mutate(
model = recode(model,
"abs_error_m6" = "Model 6: + Distance + Volume Tier",
"abs_error_poisson" = "Poisson"),
day_type = ifelse(weekend == 1, "Weekend", "Weekday")
) %>%
group_by(time_of_day, weekend, day_type, model) %>%
summarize(MAE = mean(abs_error, na.rm = TRUE), .groups = "drop")
ggplot(temporal_errors_newmodels, aes(x = time_of_day, y = MAE, fill = day_type)) +
geom_col(position = "dodge") +
facet_wrap(~ model) +
scale_fill_manual(values = c("Weekday" = "#08519c", "Weekend" = "#6baed6")) +
labs(
title = "Prediction Errors by Time Period - New Models",
subtitle = "Model 6 vs. Poisson",
x = "Time of Day",
y = "Mean Absolute Error (trips)",
fill = "Day Type"
) +
plotTheme +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#The model 6 MAE by hour is similar to that of the first five models, but the Poisson model reduces the overall MAE. It still follows the same temporal error patterns, showing that while the Poisson model improves the model overall, it does not account for some of the temporal error.
```
```{r reflection}
#Operational Limitations
#The best OLS models (4 and 5) achieve MAE around 0.74 trips per station-hour in Q3, with the Poisson model improving on this further (MAE = 0.69). For low-volume stations averaging 0.3-0.5 trips per hour in Q1, an MAE of 0.8 represents substantial error; this shows the the model is often off by more than the average demand itself. For high-volume Center City stations, the same absolute error is more tolerable. The model really begins to struggle PM rush on weekdays, which is likely when rebalancing demand is highest. Incorrect predictions could leave commuters, especially at lower-volume stations without a bike. I would recommend deploying this system only for high-volume stations where the model performs more reliably until more features can improve the model predictions.
#Equity Considerations
#The demographic error analysis reveals a consistent pattern across both Q1 and Q3: prediction errors are higher in higher-income, majority-white neighborhoods, and lower in higher-transit-usage areas. At first glance this seems to favor lower-income neighborhoods, but higher errors in wealthier Center City areas likely reflect those stations' higher volumes, not better service for lower-income communities. We can see this as the high median income areas that also have stations are in Center City, where population density and trip volume are higher. One possible negative implication is that if Indego uses this model to allocate rebalancing , lower-income and minority neighborhoods that the model predicts more "accurately" (because they have lower and more stable demand) may receive less attention, entrenching existing patterns of underservice. A system that performs well where demand is already high and poorly where demand risks creating a circular logic and perpetuating service flaws for minority communities or lower income areas. I recommend reevaluating decisions by neighborhood demographics, setting a floor on service level for all stations, and not using prediction accuracy as a sole proxy for evaluating service levels.
#Model Limitations
#The most significant pattern the model is missing is event-driven demand — outlier events (such as the spike in Feb 2025 on the day the Eagles won the Super Bowl) that temporal lag can anticipate because the model is not trained to factor in such irregularities. This could be one reason that Q1 performs so much better than Q3 overall; summer brings more irregular events, tourism, and changing school schedules. Further, using the temporal train/test split (training on July through early September and testing on late September) is important to not create a circular model through a different validation (such as k-fold), but if demand changes heavily based on holidays and summer patterns, this means the model is trained on summer demand patterns, which may differ from fall in ways the month fixed effect only shows a part of. With more time, the most valuable additions would be special event calendars and a cross-validation strategy that tests across multiple time windows while still paying attention to creating a logically sound temporal validation.
```