library(tidyverse)
library(readr)
library(haven)
library(lubridate)
library(tidymodels)
library(tidycensus)
library(janitor)
library(sf)
library(plotly)
library(themis)
library(vip)
library(ranger)
library(xgboost)
library(rlang)
library(bestNormalize)
1 Literature Review
Cities characterized by high population density, diverse socioeconomic landscapes, and intricate social interactions often present environments where crime rates are elevated compared to less densely populated areas (Cheng & Chen, 2021). The District of Columbia, a city marked by pronounced neighborhood disparities and shifting crime dynamics, presents a timely case for predictive crime modeling (Golash-Boza & Oh, 2021). Our project investigates three key questions: To what extent can crime characteristics and spatial context predict whether an incident is a felony? How accurately can we classify the type of felony? And can we estimate the felony severity score associated with a newly reported crime in DC? These questions not only reflect academic interest in machine learning for social applications, but also offer practical insights for public-sector decision-making on law enforcement and urban resource allocation.
A robust body of literature underscores the importance of both spatial and socioeconomic factors in shaping crime patterns. Environmental criminology highlights how the built environment—liquor outlets, transit hubs, vending zones, and other public facilities—can serve as vectors of accessibility that increase the likelihood of certain offenses (Bowers & Johnson, 2011). Risk-terrain modeling shows that crimes often cluster around these spatial features. In parallel, socioeconomic disadvantage at the neighborhood level has long been associated with elevated rates of both violent and property crime. Multilevel studies confirm that factors such as poverty, unemployment, and educational attainment are closely linked to victimization risk (Van et al., 2006). By incorporating ACS 5-year estimates alongside geospatial indicators from D.C. Open Data, our model reflects a multidimensional understanding of crime exposure and allows for fine-grained prediction at the incident level.
Felony crimes represent the most severe and socially costly offenses in the criminal justice system. Legally defined in D.C. as offenses punishable by more than one year of imprisonment (Council of the District of Columbia, 2025), felonies include acts such as assault with a dangerous weapon, robbery, and homicide. These crimes disproportionately consume public safety resources and contribute to long-term social harm. The societal cost of violent felonies can reach tens of thousands of dollars per incident, factoring in medical care, lost productivity, and incarceration (McCollister et al., 2010). Prioritizing felony classification enables the development of targeted interventions and more efficient policing. In our study, the felony severity score serves as the core outcome in a regression task aimed at quantifying the harm associated with newly reported incidents.
While spatial and socioeconomic features help explain where and why crimes occur, effective crime prevention also depends on how institutions respond to these risks. Problem-oriented policing (POP) emphasizes proactive, context-sensitive interventions based on the SARA model: Scanning, Analysis, Response, and Assessment (Weisburd et al., 2010). Meanwhile, hot-spots policing strategies that concentrate patrols at high-risk micro-locations have been shown to reduce crime not only in targeted areas but also in adjacent zones through diffusion effects (Braga et al., 2014). These insights align closely with our project’s emphasis on prediction-informed intervention. Our modeling approach provides a practical tool for supporting targeted, evidence-based law enforcement strategies in DC.
2 Set up
3 Prepare Data Set
In this project, we use a government and policy-relevant dataset to develop three models to predict:
the likelihood that a crime in Washington, D.C. is a felony (binary classification task);
which felony category a newly reported crime in DC falls under (multi-class classification task);
the felony severity score of a newly reported crime in DC (regression task).
The data are compiled primarily from the DC Open Data Portal. The main source is a series of datasets titled Crime Incidents, released annually from 2008 to 2025, comprising eighteen datasets in total. These include metadata for each crime incident, such as the exact date and time of occurrence, geographic coordinates (latitude and longitude), offense classification (e.g., theft, robbery), and police shift (e.g., evening, midnight).
To enrich the crime records with contextual and spatial information, we merged several additional geospatial datasets also provided by DC Open Data. These include:
Liquor license moratorium zones
Locations and proximity of liquor stores, grocery stores, and banks
Public WiFi zones and vending zones
Low food access areas
Police sector boundaries and military base zones
We also integrated tract-level socioeconomic indicators from the American Community Survey (ACS5) using the tidycensus package. These variables include poverty rate, income, renter percentage, race/ethnicity composition, vehicle ownership, and age structure. Notably, since ACS5 provides rolling estimates across five-year spans, we aligned each crime observation with the most recent ACS data available for that year. For example, the 2005–2009 ACS estimates represent conditions in 2009. While not a perfect snapshot, this method offers a consistent way to approximate local context year by year.
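To make that alignment rule concrete, here is a minimal sketch; the helper name and the clamping to available vintages are illustrative assumptions, not code from our pipeline:

# Sketch: map a crime's calendar year to the ACS 5-year vintage whose end year
# matches it (e.g., 2009 -> the 2005-2009 release), clamping to vintages that
# exist. Helper name and bounds are illustrative assumptions.
acs_vintage_for <- function(crime_year, latest_release = 2023) {
  pmin(pmax(crime_year, 2009), latest_release) # ACS5 end-years run from 2009 onward
}
acs_vintage_for(c(2009, 2015, 2024)) # 2009 2015 2023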
The final merged dataset (crime_enriched_acs_nona.csv) is available via this shared link.
The outcome variable, felony_flag, is derived from the reported offense type, indicating whether a reported crime is classified as a felony. This classification task is critical, as felonies typically represent more serious crimes and require distinct responses in terms of police allocation. Our goal is to build a predictive model that helps law enforcement deploy patrol resources more efficiently by forecasting felony likelihood using supervised machine learning techniques.
To ensure real-world model evaluation, we split the dataset by time: observations from 2020 to 2022 are used for training, while 2023 data is reserved for implementation testing. The detailed data processing and cleaning steps are documented in the Appendix section. The raw data sources are listed in the README.
# Load the cleaned and pre-merged dataset combining DC crime data and ACS indicators
crime_enriched_no_na <- read_csv("data/crime_enriched_acs_nona.csv")

# Select variables of interest for modeling
keep_vars <- c(
  "felony_flag", "felony_severity", # outcome variables
  "x", "y", "latitude", "longitude", "year",
  "shift", "method", "tract", "report_date_parsed",
  # GIS contextual features
  "in_liquor_moratorium_zone", "nearest_liquorstore_dist", "near_liquorstore_200m",
  "near_wifi_100m", "nearest_wifi_dist",
  "in_lowfood_zone",
  "nearest_grocery_dist", "near_grocery_300m",
  "nearest_bank_dist",
  "near_bank_250m",
  "in_vending_zone",
  "police_sector",
  "in_military_base",
  # ACS socioeconomic features
  "poverty_rate", "unemployed_rate",
  "black_pct", "hispanic_pct", "foreign_born_pct",
  "hsplus_pct",
  "under18_pct", "age65plus_pct"
)

# Keep only selected variables
crime_enriched_small <- crime_enriched_no_na |>
  select(all_of(keep_vars))

# Define training data (2020-2022)
crime_enriched_mod <- crime_enriched_small |>
  filter(year >= 2020, year < 2023)

# Define implementation/test data (2023)
crime_enriched_implement <- crime_enriched_small |>
  filter(year == 2023)

# Adjust variable data types for training data
crime_enriched_mod <- crime_enriched_mod |>
  mutate(
    report_date_parsed = as.Date(report_date_parsed),
    felony_flag = factor(felony_flag)
  )

# Adjust variable data types for implementation data
crime_enriched_implement <- crime_enriched_implement |>
  mutate(
    report_date_parsed = as.Date(report_date_parsed),
    felony_flag = factor(felony_flag)
  )

# Fit a Yeo-Johnson transformation to felony_severity on the training data
yj_obj <- yeojohnson(crime_enriched_mod$felony_severity)

# Add transformed column to the training dataset
crime_enriched_mod$felony_severity_trans <- predict(yj_obj)

# Fit a Yeo-Johnson transformation to felony_severity on the implementation data
yj_obj_impl <- yeojohnson(crime_enriched_implement$felony_severity)

# Add transformed column to the implementation dataset
crime_enriched_implement$felony_severity_trans <- predict(yj_obj_impl)
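Because the implementation data get their own Yeo-Johnson fit above, the two transformed columns sit on slightly different scales. A minimal sketch of the alternative, reusing the training-set fit (bestNormalize's predict() method accepts a newdata argument):

# Sketch: score the 2023 data with the training-set transformation instead,
# so felony_severity_trans is on one common scale across both datasets
crime_enriched_implement$felony_severity_trans <-
  predict(yj_obj, newdata = crime_enriched_implement$felony_severity)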
4 Binary Classification
4.1 Split Data
set.seed(123)

crime_split <- initial_split(crime_enriched_mod, prop = 0.8, strata = felony_flag)

crime_train <- training(crime_split)

crime_test <- testing(crime_split)
Note: This train-test split is performed only on data from 2020 to 2022. The year 2023 is held out separately for final implementation testing, simulating real-world deployment.
4.2 EDA on Training Set
First, let’s examine the distribution of the outcome variable felony_flag for the binary classification task.
# Check class imbalance in the target variable
crime_train |>
  count(felony_flag) |>
  mutate(pct = n / sum(n))
felony_flag | n | pct |
---|---|---|
0 | 45804 | 0.6867887 |
1 | 20889 | 0.3132113 |
According to the table, about 30% of crimes in the training set are felonies, while about 70% are non-felonies. This class imbalance could affect how the model learns. We’ll apply a downsampling method to help the model treat both classes fairly during training.
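As a quick sanity check on that plan, a minimal sketch (using the same recipes/themis functions as the models below) of what downsampling does to the training counts:

# Sketch: prep and bake a downsampling-only recipe to confirm the classes balance
recipe(felony_flag ~ ., data = crime_train) |>
  step_downsample(felony_flag) |>
  prep() |>
  bake(new_data = NULL) |>
  count(felony_flag)
# Expect roughly equal counts of 0 and 1 (about 20889 each)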
Second, we explore whether felony crimes are more likely on specific days of the week by plotting the proportion of crimes that are felonies for each weekday.
crime_train |>
  mutate(dow = lubridate::wday(report_date_parsed, label = TRUE)) |>
  group_by(dow) |>
  summarize(felony_rate = mean(as.numeric(as.character(felony_flag)))) |>
  ggplot(aes(x = dow, y = felony_rate, fill = dow)) +
  geom_col() +
  scale_fill_manual(values = rep("steelblue", 7)) +
  labs(
    title = "Felony Rate by Weekday",
    x = "Day of Week",
    y = "Proportion of Felony"
  ) +
  theme_classic() +
  theme(
    legend.position = "none",
    text = element_text(family = "serif")
  )
The plot shows that the felony rate is slightly higher on weekends than on weekdays.
Third, we assess whether the ACS5 predictors we selected relate meaningfully to one of our outcome variables, felony_flag. Here we focus on poverty_rate.
ggplot(crime_train, aes(x = poverty_rate, y = as.numeric(as.character(felony_flag)))) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = TRUE, color = "steelblue") +
  labs(
    title = "Probability of Felony by Poverty Rate",
    x = "Poverty Rate",
    y = "Estimated Probability of Felony"
  ) +
  theme_classic() +
  theme(
    text = element_text(family = "serif")
  )
Using a logistic smoother to examine the relationship between poverty rate and the likelihood of felony classification, the plot shows a clear positive association: as poverty increases, the probability of a crime being classified as a felony also rises.
Furthermore, we calculate average felony rates by census tract and display them on a choropleth map in order to examine geographic variation in felony risk.
dc_tract_sf <- st_read("data/tl_2023_11_tract/tl_2023_11_tract.shp") # load tract shapefile

tract_felrate <- crime_train |>
  group_by(tract) |>
  summarize(
    felony_rate = mean(as.numeric(as.character(felony_flag)))
  )

tract_sf_joined <- dc_tract_sf |>
  left_join(tract_felrate, by = c("TRACTCE" = "tract"))

ggplot(tract_sf_joined) +
  geom_sf(aes(fill = felony_rate), color = "white", size = 0.2) +
  scale_fill_gradient(
    low = "#d6e4f0",
    high = "#274472",
    na.value = "grey90"
  ) +
  labs(title = "Felony Rate by Census Tract") +
  theme_classic() +
  theme(
    text = element_text(family = "serif")
  )
As shown in the map, felony rates are unevenly distributed across D.C. The southern and eastern tracts report substantially higher proportions of felony incidents, while felony rates are lower in northern areas.
Lastly, we plot each crime by its location to spot patterns across the city. Because there are too many points for a static map, we use an interactive one. Red dots are felonies. Blue dots are non-felonies.
plot_ly(
  data = crime_train,
  x = ~longitude,
  y = ~latitude,
  type = "scattergl",
  mode = "markers",
  color = ~felony_flag,
  colors = c("blue", "red"),
  text = ~paste("Felony:", felony_flag),
  marker = list(size = 2, opacity = 0.5)
) |>
  layout(
    title = list(
      text = "Interactive Felony Crime Map (DC)",
      font = list(family = "serif", size = 18)
    ),
    xaxis = list(
      title = "Longitude",
      zeroline = FALSE,
      showgrid = FALSE,
      font = list(family = "serif"),
      titlefont = list(family = "serif")
    ),
    yaxis = list(
      title = "Latitude",
      zeroline = FALSE,
      showgrid = FALSE,
      scaleanchor = "x",
      font = list(family = "serif"),
      titlefont = list(family = "serif")
    ),
    margin = list(l = 40, r = 40, b = 40, t = 60),
    showlegend = FALSE
  )
4.3 Select Error Metric
In this binary classification task, the outcome variable is felony_flag, indicating whether a reported crime is classified as a felony. We aim to build a predictive model that helps law enforcement allocate patrol resources more efficiently by identifying potential felony incidents in advance.
For this binary classification problem, we propose to use 1 - Recall as our main evaluation metric, which intuitively represents the share of felonies the model fails to detect:
\[ 1 - \text{Recall} = 1 - \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{\text{FN}}{\text{TP} + \text{FN}} \]
This metric directly reflects how often the model misses serious crimes. Since missing a felony is riskier than flagging a non-felony, our goal is to keep 1 - Recall as low as possible. In practice, we aim for a 1 - Recall below 0.2, meaning the model correctly identifies at least 80% of actual felony cases.
Although metrics like accuracy and ROC AUC provide an overall picture, they may be misleading under class imbalance. Here, felonies represent about 30% of cases, so a model could achieve high accuracy while failing to catch most felonies.
In this context, a false negative (a missed felony) is more costly than a false positive (flagging a non-felony). A false positive might result in unnecessary police allocation, which is manageable, while a false negative may mean failing to prevent or respond to a serious crime, posing higher social and safety costs.
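To make the metric concrete, a minimal sketch of computing it with yardstick; the helper name is ours, and note that yardstick treats the first factor level as the event by default:

# Sketch: 1 - Recall from a data frame of truths and predicted classes.
# `preds` is a placeholder for the prediction data frames built later.
one_minus_recall <- function(preds, truth, estimate) {
  # with felony_flag levels c("0", "1"), event_level = "second" makes
  # felony (1) the positive class, matching the formula above
  1 - yardstick::recall(preds,
                        truth = {{ truth }},
                        estimate = {{ estimate }},
                        event_level = "second")$.estimate
}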
4.4 Come up with Models
We test three model specifications, described below. Each uses a different combination of feature set, preprocessing recipe, and algorithm, so that we can compare their effectiveness and usefulness in the evaluation section.
4.4.1 Model 1: Logistic Regression with LASSO
This model uses logistic regression with L1 regularization to predict felony likelihood, prioritizing interpretability and simplicity. It focuses on demographic and temporal predictors that reflect structural and situational risk. The L1 penalty helps reduce noise by shrinking less relevant variables toward zero.
We include eight ACS-derived variables: poverty rate, unemployment rate, Black and Hispanic population percentages, foreign-born share, the share with at least a high school education, and the proportions of residents under 18 and 65 or older. We also include two crime-level factors: police shift and method of offense.
Preprocessing steps include:
- Extracting weekday and day-of-year features from the report date
- One-hot encoding categorical variables
- Standardizing numeric variables
- Downsampling to address class imbalance
The recipe is specified below:

set.seed(123)

crime_rec_log <- recipe(felony_flag ~ poverty_rate + unemployed_rate +
                          black_pct + hispanic_pct + foreign_born_pct +
                          hsplus_pct +
                          under18_pct + age65plus_pct +
                          shift + method + report_date_parsed,
                        data = crime_train) |>
  step_downsample(felony_flag) |>
  step_date(report_date_parsed, features = "dow", label = TRUE) |>
  step_mutate(
    yday = lubridate::yday(report_date_parsed),
    yday_cos = cos(2 * pi * yday / 365),
    yday_sin = sin(2 * pi * yday / 365)
  ) |>
  step_rm(report_date_parsed) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())
The specification and workflow are as follows:

log_spec <- logistic_reg(
  mode = "classification",
  penalty = tune(), # L1 penalty strength (lambda)
  mixture = 1       # 1 means LASSO (L1 only)
) |>
  set_engine("glmnet") |>
  set_mode("classification")

log_wf <- workflow() |>
  add_model(log_spec) |>
  add_recipe(crime_rec_log)
4.4.2 Model 2: Random Forest with Spatial Features
This model focuses on spatial context and aims to detect felony patterns based on location, proximity to public infrastructure and tract-level zoning conditions. Unlike Model 1, we intentionally exclude demographic variables to isolate the predictive power of GIS features.
The predictors include tract, police shift, and method, plus proximity to liquor stores, grocery stores, banks, public WiFi zones, vending zones, military bases, and low food access zones. These variables capture the built environment and local regulations that may influence criminal activity.
Preprocessing includes removing highly correlated variables, converting categorical variables to dummies, normalizing numeric variables, extracting temporal signals, and downsampling to reduce class imbalance. We retain tract as a spatial unit, based on earlier EDA results showing geographic variation in felony rates. This model uses a random forest from the ranger package, which handles nonlinear relationships and complex interactions among predictors well. We tune the number of variables sampled at each split (mtry) and the minimum node size (min_n) to improve accuracy.
The recipe is specified below:

set.seed(123)

crime_rec_rf <- recipe(felony_flag ~ tract + shift + method +
                         report_date_parsed +
                         in_liquor_moratorium_zone +
                         nearest_liquorstore_dist + near_liquorstore_200m +
                         near_wifi_100m + nearest_wifi_dist +
                         in_lowfood_zone +
                         nearest_grocery_dist + near_grocery_300m +
                         nearest_bank_dist + near_bank_250m +
                         in_vending_zone + police_sector + in_military_base,
                       data = crime_train) |>
  step_downsample(felony_flag) |>
  step_date(report_date_parsed, features = "dow", label = TRUE) |>
  step_mutate(
    yday = lubridate::yday(report_date_parsed),
    yday_cos = cos(2 * pi * yday / 365),
    yday_sin = sin(2 * pi * yday / 365)
  ) |>
  step_rm(report_date_parsed) |>
  step_corr(all_numeric_predictors(), threshold = 0.9) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())
The specification and workflow are as follows:

rf_spec <- rand_forest(
  mtry = tune(),  # number of variables sampled at each split (to be tuned)
  trees = 100,    # total number of trees (fixed for now)
  min_n = tune()  # minimum number of observations in a node (to be tuned)
) |>
  set_engine("ranger", importance = "impurity") |>
  set_mode("classification")

rf_wf <- workflow() |>
  add_model(rf_spec) |>
  add_recipe(crime_rec_rf)
4.4.3 Model 3: XGBoost with All Features
This model maximizes predictive power by incorporating the full set of spatial, socioeconomic, and temporal predictors available in the dataset. Unlike Models 1 and 2, which focus on interpretability or spatial patterns, this model includes features broadly and lets the boosting algorithm handle interactions and variable selection.
The predictors include:
- Spatial and proximity features: in_liquor_moratorium_zone, nearest_liquorstore_dist, near_liquorstore_200m, near_wifi_100m, nearest_wifi_dist, in_lowfood_zone, nearest_grocery_dist, near_grocery_300m, nearest_bank_dist, near_bank_250m, in_vending_zone, police_sector, in_military_base.
- Demographic and socioeconomic features from ACS: poverty_rate, unemployed_rate, black_pct, hispanic_pct, foreign_born_pct, hsplus_pct, under18_pct, age65plus_pct.
- Time and categorical contextual features: shift, method, tract, report_date_parsed.
Preprocessing includes removing highly correlated variables, converting categorical variables to dummies, normalizing numeric variables, extracting temporal signals, and downsampling to reduce class imbalance. We use the XGBoost algorithm, which can capture complex nonlinear interactions and works well with tabular data. It supports regularization and tree-based boosting, making it a strong candidate when overfitting is a concern.
The recipe is specified below:

set.seed(123)

crime_rec_xgb <- recipe(felony_flag ~ ., data = crime_train) |>
  step_rm(felony_severity, felony_severity_trans) |>
  step_downsample(felony_flag) |>
  step_date(report_date_parsed, features = "dow", label = TRUE) |>
  step_mutate(
    yday = lubridate::yday(report_date_parsed),
    yday_cos = cos(2 * pi * yday / 365),
    yday_sin = sin(2 * pi * yday / 365)
  ) |>
  step_rm(
    x, y, latitude, longitude, year, report_date_parsed
  ) |>
  step_nzv(all_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.9) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())
The specification and workflow are as follows:

xgb_spec <- boost_tree(
  mode = "classification",
  trees = 100,             # total number of trees (fixed)
  tree_depth = tune(),     # max depth of each tree (to be tuned)
  learn_rate = tune(),     # learning rate (to be tuned)
  loss_reduction = tune()  # min loss reduction to split further (to be tuned)
) |>
  set_engine("xgboost") |>
  set_mode("classification")

xgb_wf <- workflow() |>
  add_model(xgb_spec) |>
  add_recipe(crime_rec_xgb)
4.5 Estimation
We use 5-fold cross-validation and tune the L1 penalty (penalty) to find the best balance between model complexity and predictive performance. Our tuning metric is 1 - Recall, measuring model performance on minimizing false negatives.
# Create a metric set (1 - Recall is derived from recall)
custom_metrics <- metric_set(recall, roc_auc, accuracy)

# Set up 5-fold cross-validation
set.seed(123)
crime_folds <- vfold_cv(crime_train, v = 5, strata = felony_flag)

# Define the tuning grid for the L1 penalty
lambda_grid <- grid_regular(penalty(range = c(-4, 0)), levels = 30)

# Tune the logistic regression model
set.seed(123)
log_res <- tune_grid(
  object = log_wf,
  resamples = crime_folds,
  grid = lambda_grid,
  metrics = custom_metrics,
  control = control_grid(save_pred = TRUE)
)

# Show top results by 1 - recall
log_res |>
  collect_metrics() |>
  filter(.metric == "recall") |>
  mutate(one_minus_recall = 1 - mean) |>
  arrange(one_minus_recall)
penalty | .metric | .estimator | mean | n | std_err | .config | one_minus_recall |
---|---|---|---|---|---|---|---|
0.0573615 | recall | binary | 0.9981006 | 5 | 0.0001489 | Preprocessor1_Model21 | 0.0018994 |
0.0788046 | recall | binary | 0.9981006 | 5 | 0.0001489 | Preprocessor1_Model22 | 0.0018994 |
0.1082637 | recall | binary | 0.9981006 | 5 | 0.0001489 | Preprocessor1_Model23 | 0.0018994 |
0.1487352 | recall | binary | 0.9981006 | 5 | 0.0001489 | Preprocessor1_Model24 | 0.0018994 |
0.2043360 | recall | binary | 0.9981006 | 5 | 0.0001489 | Preprocessor1_Model25 | 0.0018994 |
0.0417532 | recall | binary | 0.9526023 | 5 | 0.0023107 | Preprocessor1_Model20 | 0.0473977 |
0.0303920 | recall | binary | 0.9393720 | 5 | 0.0023572 | Preprocessor1_Model19 | 0.0606280 |
0.0221222 | recall | binary | 0.9319927 | 5 | 0.0025707 | Preprocessor1_Model18 | 0.0680073 |
0.0161026 | recall | binary | 0.9200069 | 5 | 0.0034577 | Preprocessor1_Model17 | 0.0799931 |
0.0117210 | recall | binary | 0.8991791 | 5 | 0.0045617 | Preprocessor1_Model16 | 0.1008209 |
0.0085317 | recall | binary | 0.8824776 | 5 | 0.0048504 | Preprocessor1_Model15 | 0.1175224 |
0.0062102 | recall | binary | 0.8717362 | 5 | 0.0034181 | Preprocessor1_Model14 | 0.1282638 |
0.0045204 | recall | binary | 0.8645534 | 5 | 0.0028163 | Preprocessor1_Model13 | 0.1354466 |
0.0032903 | recall | binary | 0.8594229 | 5 | 0.0026934 | Preprocessor1_Model12 | 0.1405771 |
0.0023950 | recall | binary | 0.8560607 | 5 | 0.0027278 | Preprocessor1_Model11 | 0.1439393 |
0.0017433 | recall | binary | 0.8535282 | 5 | 0.0026044 | Preprocessor1_Model10 | 0.1464718 |
0.0012690 | recall | binary | 0.8515633 | 5 | 0.0025004 | Preprocessor1_Model09 | 0.1484367 |
0.0009237 | recall | binary | 0.8499477 | 5 | 0.0023134 | Preprocessor1_Model08 | 0.1500523 |
0.0006723 | recall | binary | 0.8491835 | 5 | 0.0024442 | Preprocessor1_Model07 | 0.1508165 |
0.0004894 | recall | binary | 0.8486377 | 5 | 0.0024858 | Preprocessor1_Model06 | 0.1513623 |
0.0003562 | recall | binary | 0.8482666 | 5 | 0.0024964 | Preprocessor1_Model05 | 0.1517334 |
0.0002593 | recall | binary | 0.8480919 | 5 | 0.0024355 | Preprocessor1_Model04 | 0.1519081 |
0.0001000 | recall | binary | 0.8480482 | 5 | 0.0024448 | Preprocessor1_Model01 | 0.1519518 |
0.0001374 | recall | binary | 0.8480482 | 5 | 0.0024448 | Preprocessor1_Model02 | 0.1519518 |
0.0001887 | recall | binary | 0.8480482 | 5 | 0.0024448 | Preprocessor1_Model03 | 0.1519518 |
0.2807216 | recall | binary | 0.2000000 | 5 | 0.2000000 | Preprocessor1_Model26 | 0.8000000 |
0.3856620 | recall | binary | 0.2000000 | 5 | 0.2000000 | Preprocessor1_Model27 | 0.8000000 |
0.5298317 | recall | binary | 0.2000000 | 5 | 0.2000000 | Preprocessor1_Model28 | 0.8000000 |
0.7278954 | recall | binary | 0.2000000 | 5 | 0.2000000 | Preprocessor1_Model29 | 0.8000000 |
1.0000000 | recall | binary | 0.2000000 | 5 | 0.2000000 | Preprocessor1_Model30 | 0.8000000 |
# Select best bundle of hyperparameters
best_log <- select_best(log_res, metric = "recall")
best_log
penalty | .config |
---|---|
0.0573615 | Preprocessor1_Model21 |
# Last fit with best hyperparameter bundle
final_log_wf <- finalize_workflow(log_wf, best_log)

set.seed(123)
final_log_fit <- fit(final_log_wf, data = crime_train)
4.6 Interpretation
4.6.1 Interpret the results
Variable Importance
First, let’s take a look at the variable importance.
# Extract fitted model from final workflow
log_model <- extract_fit_parsnip(final_log_fit)

# Plot variable importance
vip(log_model, num_features = 20)
The plot shows that offense method and police shift are the most important predictors. Features like method_OTHERS, method_KNIFE, shift_MIDNIGHT, and shift_EVENING contribute strongly to felony predictions. Most temporal and demographic variables, such as report_date_parsed_dow_Wed, black_pct, hsplus_pct, and foreign_born_pct, have much lower influence. This shows that situational factors, namely how and when a crime happened, carry more predictive weight than background demographic characteristics in this model.
Next, let’s assess our tuned logistic regression model using both the test set (from 2020–2022) and the implementation set (from 2023). We focus on the model’s ability to identify felony incidents, particularly minimizing false negatives.
In-sample Performance on Test Set
We apply the trained model to the hold-out test set and calculate standard classification metrics.
# Apply the recipe and model to crime_test
crime_test_results <- predict(final_log_fit, new_data = crime_test, type = "prob") |>
  bind_cols(predict(final_log_fit, new_data = crime_test)) |>
  bind_cols(crime_test |> select(felony_flag))

# Evaluate on test set
metrics_test <- crime_test_results |>
  yardstick::metrics(truth = felony_flag, estimate = .pred_class)

# Recall and 1 - Recall
recall_val <- recall(data = crime_test_results, truth = felony_flag, estimate = .pred_class)$.estimate
one_minus_recall_val <- 1 - recall_val

# Print metrics
metrics_test
.metric | .estimator | .estimate |
---|---|---|
accuracy | binary | 0.7837481 |
kap | binary | 0.3825064 |
recall_val
[1] 0.9985155
one_minus_recall_val
[1] 0.001484457
The model achieves an accuracy of 78.4% and a 1 - Recall of only 0.0015 on the test set, meaning fewer than 0.2% of true felonies are misclassified as non-felonies. This matters in a public safety context, where failing to detect a felony could result in an inadequate response.
Out-of-sample Generalization to 2023 Implementation Set
We then test the model on unseen 2023 data to evaluate its generalization under real-world conditions.
# Apply the recipe and model to crime_enriched_implement
crime_implement_results <- predict(final_log_fit, new_data = crime_enriched_implement, type = "prob") |>
  bind_cols(predict(final_log_fit, new_data = crime_enriched_implement)) |>
  bind_cols(crime_enriched_implement |> select(felony_flag))

# Evaluate on implementation set
metrics_implement <- crime_implement_results |>
  metrics(truth = felony_flag, estimate = .pred_class)

# Recall and 1 - Recall
recall_val_impl <- recall(data = crime_implement_results, truth = felony_flag, estimate = .pred_class)$.estimate
one_minus_recall_val_impl <- 1 - recall_val_impl

# Print metrics
metrics_implement
.metric | .estimator | .estimate |
---|---|---|
accuracy | binary | 0.7241279 |
kap | binary | 0.3309441 |
recall_val_impl
[1] 0.9957171
one_minus_recall_val_impl
[1] 0.004282859
On the implementation set, the model maintains a recall of 99.6% and an accuracy of 72.4%, with a 1 - Recall of 0.0043. This confirms the model generalizes well and continues to flag nearly all felony cases, even on new data.
In conclusion, the model is effective because it captures the majority of serious crimes with very low false negatives. In policing strategy, this is far more important than overall accuracy. The model’s simplicity and high recall make it practical for daily patrol planning and hotspot targeting.
4.6.2 Improve Model with Different Features and Modeling Options
We explore two different combinations of feature engineering and modeling options to improve our initial regularized logistic regression model, which focused on demographic and temporal features.
First, we test Model 2, which replaces demographics with spatial features such as proximity to liquor stores, grocery stores, and banks, as well as tract and special zones like military bases and vending areas. These geographic indicators may capture high-felony-risk hot spots that Model 1's demographic variables miss. The model also shifts from logistic regression to a random forest, which handles nonlinear patterns and variable interactions better, requires less preprocessing, and includes tuning of mtry and min_n to boost performance.
Second, we try Model 3, which keeps all features from Models 1 and 2, including demographics, spatial context, and timing. It lets the model learn the best combinations without us limiting scope, which tests whether more context leads to better predictions. We use XGBoost for its ability to learn complex patterns with regularization. It improves on Model 1 by tuning tree depth, learning rate, and loss reduction, allowing a better balance between fit and generalization.
Model 2 Estimation with Cross-Validation
# Define a metric set
custom_metrics <- metric_set(recall, accuracy, roc_auc)

# Create resampling folds
set.seed(123)
crime_folds <- vfold_cv(crime_train, v = 5, strata = felony_flag)

# Define a random grid of hyperparameters for tuning
set.seed(123)
rf_grid <- grid_random(
  mtry(range = c(3, 10)),
  min_n(range = c(2, 10)),
  size = 5
)

# Tune the model
rf_res <- tune_grid(
  rf_wf,
  resamples = crime_folds,
  grid = rf_grid,
  metrics = custom_metrics,
  control = control_grid(save_pred = TRUE)
)
# Select best hyperparameter combination by recall
best_rf <- select_best(rf_res, metric = "recall")

# Finalize the model
final_rf_wf <- finalize_workflow(rf_wf, best_rf)
final_rf_fit <- fit(final_rf_wf, data = crime_train)
Model 2 Evaluate on Test and Implementation Sets
# Extract fitted model from final workflow
rf_model <- extract_fit_parsnip(final_rf_fit)

# Plot top 20 important variables
vip(rf_model, num_features = 20)
The random forest shows a different importance pattern than the logistic model. As seen in the plot above, crime-related features like method_OTHERS dominate, followed by temporal seasonality features (yday_sin, yday_cos, yday) and spatial proximity variables (nearest_bank_dist, nearest_liquorstore_dist, nearest_grocery_dist, nearest_wifi_dist). This suggests that environmental context and time of year carry more predictive weight than demographics like black_pct in this model.
# Predict on test set
rf_test_results <- predict(final_rf_fit, new_data = crime_test, type = "prob") |>
  bind_cols(predict(final_rf_fit, new_data = crime_test)) |>
  bind_cols(crime_test |> select(felony_flag))

# Metrics on test set
metrics_test_rf <- rf_test_results |>
  metrics(truth = felony_flag, estimate = .pred_class)

recall_rf <- recall(rf_test_results, truth = felony_flag, estimate = .pred_class)$.estimate

one_minus_recall_rf <- 1 - recall_rf

# Print metrics
metrics_test_rf
.metric | .estimator | .estimate |
---|---|---|
accuracy | binary | 0.7358921 |
kap | binary | 0.3933248 |
recall_rf
[1] 0.7977646
one_minus_recall_rf
[1] 0.2022354
On the test set, this model reaches an accuracy of 73.6% and a recall of 79.8%, so roughly 20% of actual felonies go undetected according to 1 - Recall. Missing about a fifth of actual felonies is a substantial decline in sensitivity compared to Model 1. While Model 2 may capture some useful spatial patterns, it underperforms the baseline at flagging serious crimes.
# Predict on implementation set
rf_impl_results <- predict(final_rf_fit, new_data = crime_enriched_implement, type = "prob") |>
  bind_cols(predict(final_rf_fit, new_data = crime_enriched_implement)) |>
  bind_cols(crime_enriched_implement |> select(felony_flag))

# Convert felony_flag to factor and calculate metrics
metrics_impl_rf <- rf_impl_results |>
  mutate(felony_flag = factor(felony_flag)) |>
  metrics(truth = felony_flag, estimate = .pred_class)

# Calculate recall
recall_rf_impl <- rf_impl_results |>
  mutate(felony_flag = factor(felony_flag)) |>
  recall(truth = felony_flag, estimate = .pred_class) |>
  pull(.estimate)

# 1 - recall
one_minus_recall_rf_impl <- 1 - recall_rf_impl

# Print metrics
metrics_impl_rf
.metric | .estimator | .estimate |
---|---|---|
accuracy | binary | 0.7094022 |
kap | binary | 0.3765945 |
recall_rf_impl
[1] 0.7914724
one_minus_recall_rf_impl
[1] 0.2085276
On the implementation set, Model 2 reaches an accuracy of 70.9% and a recall of 79.1%, with 1 - Recall at 0.209, essentially unchanged from its test-set results. Performance is therefore fairly stable across time, but the miss rate remains roughly fifty times Model 1's (0.209 versus 0.004). A larger tuning search or better feature engineering may be needed to close that gap; one hedged option is sketched below.
Model 2 holds its accuracy when generalizing but struggles with sensitivity. For real-world use, especially in crime forecasting, a lower miss rate is needed.
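As one direction for that larger search, a sketch of a wider, space-filling grid for the random forest; the ranges and size here are illustrative assumptions, not settings we have run:

# Sketch: a larger Latin hypercube search in place of the 5-point random grid
set.seed(123)
rf_grid_wide <- grid_latin_hypercube(
  mtry(range = c(3, 15)),   # wider range of variables per split
  min_n(range = c(2, 40)),  # wider range of minimum node sizes
  size = 25                 # 25 candidate combinations instead of 5
)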
Model 3 Estimation with Cross-Validation
# Define a metric set
custom_metrics <- metric_set(recall, accuracy, roc_auc)

# Create 5-fold resampling folds
set.seed(123)
crime_folds <- vfold_cv(crime_train, v = 5, strata = felony_flag)

# Define a random grid for XGBoost hyperparameters
set.seed(123)
xgb_grid <- grid_random(
  tree_depth(range = c(3, 10)),
  learn_rate(range = c(0.01, 0.3)),
  loss_reduction(range = c(0, 1)),
  size = 5
)

# Tune the model
set.seed(123)
xgb_res <- tune_grid(
  xgb_wf,
  resamples = crime_folds,
  grid = xgb_grid,
  metrics = custom_metrics,
  control = control_grid(save_pred = TRUE)
)
# Select best hyperparameters based on recall
best_xgb <- select_best(xgb_res, metric = "recall")

# Finalize workflow
final_xgb_wf <- finalize_workflow(xgb_wf, best_xgb)

set.seed(123)
final_xgb_fit <- fit(final_xgb_wf, data = crime_train)
Model 3 Evaluate on Test and Implementation Sets
# Extract and visualize variable importance
xgb_model <- extract_fit_parsnip(final_xgb_fit)

vip(xgb_model, num_features = 20)
The variable importance plot shows black_pct as the top driver, followed by proximity and temporal measures such as nearest_grocery_dist, yday_cos, nearest_wifi_dist, and yday.
# Apply XGBoost model to test set
xgb_test_results <- predict(final_xgb_fit, new_data = crime_test, type = "prob") |>
  bind_cols(predict(final_xgb_fit, new_data = crime_test)) |>
  bind_cols(crime_test |> select(felony_flag))

# Evaluate metrics on test set
metrics_xgb_test <- xgb_test_results |>
  yardstick::metrics(truth = felony_flag, estimate = .pred_class)

# Compute recall and 1 - recall
xgb_recall_test <- recall(xgb_test_results, truth = felony_flag, estimate = .pred_class)$.estimate

xgb_one_minus_recall_test <- 1 - xgb_recall_test

# Print metrics
metrics_xgb_test
.metric | .estimator | .estimate |
---|---|---|
accuracy | binary | 0.6666267 |
kap | binary | 0.2837879 |
xgb_recall_test
[1] 0.6886133
xgb_one_minus_recall_test
[1] 0.3113867
On the test set, Model 3 achieves an accuracy of 66.7% and a recall of 68.9%, yielding a 1 - Recall of 0.311.
# Convert datetime to date to match the training recipe
crime_enriched_implement <- crime_enriched_implement |>
  mutate(report_date_parsed = as.Date(report_date_parsed))

# Apply the model
xgb_implement_results <- predict(final_xgb_fit, new_data = crime_enriched_implement, type = "prob") |>
  bind_cols(predict(final_xgb_fit, new_data = crime_enriched_implement)) |>
  bind_cols(crime_enriched_implement |> select(felony_flag))

# Evaluate metrics on implementation set
metrics_xgb_implement <- xgb_implement_results |>
  mutate(felony_flag = factor(felony_flag)) |>
  metrics(truth = felony_flag, estimate = .pred_class)

# Compute recall and 1 - recall
xgb_recall_implement <- xgb_implement_results |>
  mutate(felony_flag = factor(felony_flag)) |>
  recall(truth = felony_flag, estimate = .pred_class) |>
  pull(.estimate)

one_minus_recall_xgb_impl <- xgb_implement_results |>
  mutate(felony_flag = factor(felony_flag)) |>
  recall(truth = felony_flag, estimate = .pred_class) |>
  pull(.estimate) |>
  {\(x) 1 - x}()

# Print metrics
metrics_xgb_implement
.metric | .estimator | .estimate |
---|---|---|
accuracy | binary | 0.6140069 |
kap | binary | 0.2155158 |
xgb_recall_implement
[1] 0.6213001
one_minus_recall_xgb_impl
[1] 0.3786999
On the implementation set, accuracy drops to 61.4% while recall declines to 62.1%, giving a 1 - Recall of 0.379. This indicates modest sensitivity and a relatively high miss rate for felonies.
Compared to Model 1, Model 3 captures far fewer of the actual felonies and is more prone to false negatives. While Model 1 kept 1 - Recall under 0.005, Model 3's miss rate is nearly ninety times higher.
Compared to Model 2, Model 3 performs worse on both accuracy and recall across the test and implementation sets, and its recall drops more sharply between them. Adding both spatial and demographic features does not improve balance or sensitivity here: the combined model misses more felonies than either simpler specification.
Visualize 3 Models’ Performances
metric_plot_df <- tibble(
  model = rep(c("Log", "Rf", "XGBoost"), each = 2),
  dataset = rep(c("Test", "Implementation"), times = 3),
  accuracy = c(
    0.784, 0.724, # Model 1
    0.736, 0.709, # Model 2
    0.667, 0.614  # Model 3
  ),
  one_minus_recall = c(
    0.0015, 0.0043, # Model 1
    0.2022, 0.2085, # Model 2
    0.3114, 0.3787  # Model 3
  )
) |>
  mutate(dataset = factor(dataset, levels = c("Test", "Implementation")))
# Accuracy
ggplot(metric_plot_df, aes(x = model, y = accuracy, fill = dataset)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Test" = "#56B4E9", "Implementation" = "steelblue")) +
  labs(
    title = "Accuracy by Model and Dataset",
    x = "Model",
    y = "Accuracy"
  ) +
  theme_classic() +
  theme(
    text = element_text(family = "serif")
  )
# 1 - Recall
ggplot(metric_plot_df, aes(x = model, y = one_minus_recall, fill = dataset)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Test" = "#56B4E9", "Implementation" = "steelblue")) +
  labs(
    title = "1 - Recall by Model and Dataset",
    x = "Model",
    y = "1 - Recall"
  ) +
  theme_classic() +
  theme(
    text = element_text(family = "serif")
  )
5 Multi-classification Task
# read in the data
crime <- read_csv("data/crime_enriched_acs_nona2.csv")
Note: the dataset crime_enriched_acs_nona2.csv is identical to crime_enriched_acs_nona.csv; because we worked separately, we named our files and variables differently.
In addition to knowing whether a crime is a felony, it is also useful for law enforcement to know the severity of the felony, since several different crime categories are classified as felonies and each requires different resources and a different police response.
In this section, we build on the binary classification of crimes as felony or not felony by creating a multi-class classification model that classifies crimes by felony severity level. The outcome variable is felony_severity, which was derived from the offense variable in the Crime Incidents datasets (2008-2025) from DC Open Data. felony_severity takes values from 0 (not a felony) to 7 (homicide), giving seven felony levels plus a non-felony level (see Appendix). However, our multi-class candidate models only include felonies, so felony severity level 0 is removed from the dataset.
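For reference, a minimal sketch of how such a mapping can be encoded; the offense strings below are assumptions matching the severity labels used later in this section, and the authoritative mapping lives in the Appendix:

# Sketch: derive felony_severity from the offense field (offense strings assumed)
crime <- crime |>
  mutate(felony_severity = case_when(
    offense == "HOMICIDE"                   ~ 7,
    offense == "SEX ABUSE"                  ~ 6,
    offense == "ASSAULT W/DANGEROUS WEAPON" ~ 5,
    offense == "ROBBERY"                    ~ 4,
    offense == "ARSON"                      ~ 3,
    offense == "BURGLARY"                   ~ 2,
    offense == "MOTOR VEHICLE THEFT"        ~ 1,
    .default = 0                            # all non-felony offenses
  ))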
The goal of this multi-classification model is to further enhance the ability of law enforcement in DC to effectively allocate resources by forecasting the relative severity of different felonies committed in the District.
Replicating the steps used above for the binary classification model, we split the data to use 2019-2022 for training and 2023 for testing and implementation. We selected the same predictor variables used above from the ‘crime_enriched_acs_nona’ merged dataset.
5.1 Prepare the Data
# Select variables of interest for modeling
multiclass_vars <- c(
  "felony_severity", # outcome variable
  "x", "y", "latitude", "longitude", "year",
  "shift", "method", "tract", "report_date_parsed",
  # GIS contextual features
  "in_liquor_moratorium_zone", "nearest_liquorstore_dist", "near_liquorstore_200m",
  "near_wifi_100m", "nearest_wifi_dist",
  "in_lowfood_zone",
  "nearest_grocery_dist", "near_grocery_300m",
  "nearest_bank_dist",
  "near_bank_250m",
  "in_vending_zone",
  "police_sector",
  "in_military_base",
  # ACS socioeconomic features
  "poverty_rate", "unemployed_rate",
  "black_pct", "hispanic_pct", "foreign_born_pct",
  "hsplus_pct",
  "under18_pct", "age65plus_pct"
)

# Keep only selected variables (felonies only)
crime_multiclass <- crime |>
  select(all_of(multiclass_vars)) |>
  filter(felony_severity > 0)

# Define training data (2019-2022)
crime_multiclass_small <- crime_multiclass |>
  filter(year >= 2019, year < 2023)

# Define implementation/test data (2023)
crime_multiclass_implement <- crime_multiclass |>
  filter(year == 2023)

# Adjust variable data types for training data
crime_multiclass_small <- crime_multiclass_small |>
  mutate(
    report_date_parsed = as.Date(report_date_parsed),
    felony_severity = factor(felony_severity)
  )

# Adjust variable data types for implementation data
crime_multiclass_implement <- crime_multiclass_implement |>
  mutate(
    report_date_parsed = as.Date(report_date_parsed),
    felony_severity = factor(felony_severity)
  )
5.2 Split Data
set.seed(123)

multiclass_split <- initial_split(crime_multiclass_small, prop = 0.8, strata = felony_severity)

multiclass_train <- training(multiclass_split)

multiclass_test <- testing(multiclass_split)
5.3 EDA on Training Set
First, we investigate the distribution of felony_severity, which is used in both the multi-class classification and regression tasks.
# Explore felony severity distribution
multiclass_train |>
  count(felony_severity) |>
  mutate(pct = n / sum(n)) |>
  arrange(desc(felony_severity))
felony_severity | n | pct |
---|---|---|
7 | 487 | 0.0233137 |
6 | 439 | 0.0210158 |
5 | 3737 | 0.1788980 |
4 | 4889 | 0.2340466 |
3 | 15 | 0.0007181 |
2 | 2935 | 0.1405046 |
1 | 8387 | 0.4015032 |
Based on the table above, felonies are categorized on an ordinal scale from 1 to 7, with 1 representing motor vehicle theft and 7 representing the most severe felony, homicide.
The distribution of felony_severity in the training set is highly skewed. The most frequent felony types are lower-level ones such as motor vehicle theft (severity = 1) and burglary (severity = 2). Mid-range felonies like robbery (severity = 4) and aggravated assault (severity = 5) occur less often, while severe crimes like sexual abuse (severity = 6) and homicide (severity = 7) are extremely rare, with only a few hundred instances each. The severity-3 category, arson, is the rarest, appearing only 15 times.
To visualize this distribution, we plot the number of cases per severity score:
multiclass_train |>
  ggplot(aes(x = factor(felony_severity))) +
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of Felony Severity Scores",
       x = "Felony Severity Score",
       y = "Count") +
  theme_classic() +
  theme(
    legend.position = "none",
    text = element_text(family = "serif")
  )
This long-tailed distribution shows that serious crimes like arson are rare in the data, which can make it harder for models to learn from them. For classification tasks, we can address this by reducing the number of common cases (downsampling), synthesizing more rare cases (SMOTE), or telling the model to pay more attention to rare classes (weighted loss); a sketch of the SMOTE option follows below. For the regression task, we can use methods that handle outliers better (such as robust regression) or transform the target variable (for example, with a log transform or binning) to make it easier for the model to learn.
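A minimal sketch of the SMOTE alternative, using themis::step_smote; this is an option we note but do not adopt, and since SMOTE needs all-numeric predictors, the dummy step comes first:

# Sketch: oversample rare severity classes instead of downsampling common ones
smote_recipe <- recipe(felony_severity ~ ., data = multiclass_train) |>
  step_rm(report_date_parsed) |>
  step_dummy(all_nominal_predictors()) |>
  step_smote(felony_severity, neighbors = 5) # synthesize minority-class rows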
The following map visualizes the distribution of felony incidents across the District.
dc_tract_sf <- st_read("data/tl_2023_11_tract/tl_2023_11_tract.shp") # load tract shapefile

crime_sf <- crime |>
  st_as_sf(coords = c("longitude", "latitude"))

ggplot(data = crime_sf) +
  geom_sf(aes(color = felony_severity), alpha = 0.5) +
  labs(title = "Distribution of Felony Severities Across DC",
       color = "Felony Severity") +
  theme_void() +
  theme(
    text = element_text(family = "serif")
  )
The more severe felonies (class 6 and 7) are concentrated in the eastern and southeastern parts of DC.
Since motor vehicle theft is the most common felony in DC, this map visualizes the percentage of felonies that are car thefts in each census tract.
tract_felrate <- multiclass_train |>
  group_by(tract) |>
  summarize(class_1 = sum(felony_severity == 1), total = n()) |>
  mutate(pct = class_1 / total)

tract_sf_joined <- dc_tract_sf |>
  left_join(tract_felrate, by = c("TRACTCE" = "tract"))

ggplot(tract_sf_joined) +
  geom_sf(aes(fill = pct), color = "white", size = 0.2) +
  scale_fill_gradient(
    low = "#d6e4f0",
    high = "#274472",
    na.value = "grey90"
  ) +
  labs(title = "Vehicle Theft Rate by Census Tract",
       fill = "Vehicle Theft Rate") +
  theme_classic() +
  theme(
    text = element_text(family = "serif")
  )
This choropleth aligns with the distribution of felony severities across DC: higher vehicle theft rates are concentrated in the northern and northwestern tracts, while southern and southeastern neighborhoods show lower rates, likely due to a higher prevalence of more severe felony types in those areas.
The following choropleth shows the felony rate of each felony severity class in each census tract.
facetwrap <- multiclass_train |>
  group_by(tract, felony_severity) |>
  summarise(crimecount = n()) |>
  mutate(felony_severity = case_when(
    felony_severity == 7 ~ "7. Homicide",
    felony_severity == 6 ~ "6. Sex Abuse",
    felony_severity == 5 ~ "5. Assault w/ Weapon",
    felony_severity == 4 ~ "4. Robbery",
    felony_severity == 3 ~ "3. Arson",
    felony_severity == 2 ~ "2. Burglary",
    felony_severity == 1 ~ "1. Vehicle Theft"
  ))

tract_sf_newjoined <- dc_tract_sf |>
  left_join(facetwrap, by = c("TRACTCE" = "tract"))

ggplot(tract_sf_newjoined) +
  geom_sf(aes(fill = crimecount)) +
  facet_wrap(~ felony_severity, ncol = 4) +
  scale_fill_gradient(
    low = "#FFF5B1",
    high = "#B2182B",
    na.value = "gray90",
    name = "Crime Count"
  ) +
  labs(
    title = "Crime Count by Tract and Offense Type",
    fill = "Crime Count"
  ) +
  theme_void() +
  theme(
    strip.text = element_text(family = "serif", size = 10, face = "bold"),
    plot.title = element_text(family = "serif", hjust = 0.5, size = 16, face = "bold"),
    legend.title = element_text(family = "serif"),
    plot.margin = margin(t = 10, r = 10, b = 10, l = 20)
  )
Each facet of the map shows the distribution of one felony severity class. Unsurprisingly, class 1 has the widest geographic distribution across the District, while class 3, which had only 15 incidents in the dataset, is quite sparse. Class 7, the most severe class, is highly concentrated in southern and southeastern DC, yet still with fewer than 50 incidents in each census tract.
Next, we examine the distribution of felony severity by day of the week, one of the explanatory variables used in our candidate models to predict felony severity class.
weekday <- multiclass_train |>
  mutate(weekday = wday(report_date_parsed, label = TRUE, abbr = FALSE))

ggplot(weekday, aes(x = weekday, fill = felony_severity)) +
  geom_bar() +
  labs(
    title = "Distribution of Felony Severity by Day of the Week",
    x = "Day of the Week",
    y = "Number of Crimes"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 15),
    axis.text = element_text(size = 8)
  )
The chart shows little variation in felony severity by day of the week.
5.4 Error Metrics
For our multi-class classification problem, we will calculate overall model accuracy (which, for single-label multi-class problems, equals micro-averaged precision and recall), as well as precision and recall within each class.
5.4.1 Accuracy
Accuracy measures how often the model is correct and is defined as:

\[ \text{Accuracy} = \frac{\text{correct predictions}}{\text{total predictions}} \]

Correct predictions are the cases where the model assigns the true felony severity class (1 through 7), and total predictions is the total number of predictions made.
5.4.2 Precision and Recall by Class
Precision and recall calculated by class focuses on each class individually. Each individual class is treated as a “positive” while all of the others are treated as a “negative.”
Precision for an individual class is the fraction of instances correctly predicted to belong to a specific class among all of the instances the model predicted as belonging to that class. It is true positives over all positives (both true and false):
\[ \text{Precision}_{\text{class}} = \frac{\text{TP}_{\text{class}}}{\text{TP}_{\text{class}} + \text{FP}_{\text{class}}} \]
Recall for an individual class is the fraction of instances in a class that the model correctly identified out of all instances in that class. It is true positives over actual positives (true positives and false negatives):
\[ \text{Recall}_{\text{class}} = \frac{\text{TP}_{\text{class}}}{\text{TP}_{\text{class}} + \text{FN}_{\text{class}}} \]
5.4.3 Custom Function for Calculating Precision and Recall by Class
We have created a custom function to calculate within-class precision and recall:
#' Calculate within-class precision and recall
#'
#' @param x The dataframe containing the true class values and their predictions
#' @param truth The bare (unquoted) name of the column containing the true class values
#' @param class The class within the truth column used to calculate within-class precision and recall
#' @param estimate The bare (unquoted) name of the column containing the predictions
#'
#' @returns Precision and recall for the class specified by the class argument
#'
#' @examples
#' calc_class_metrics(x = random_pred, truth = felony_severity, class = 3, estimate = .pred_class)
calc_class_metrics <- function(x, truth, class, estimate) {

  totals <- x |>
    summarize(
      true_positives = sum({{ truth }} == class & {{ estimate }} == class),
      false_positives = sum({{ estimate }} == class & {{ truth }} != class),
      false_negatives = sum({{ truth }} == class & {{ estimate }} != class)
    ) |>
    mutate(
      precision = true_positives / (true_positives + false_positives),
      recall = true_positives / (true_positives + false_negatives)
    )

  prec_rec <- totals |>
    select(precision, recall)

  return(prec_rec)
}
Within-class metrics are appropriate for our purposes because we have significant class imbalance, with 40% of felonies classified as class 1. We are interested in how the model performs on the largest classes for practical purposes. A model that performs well at predicting the largest classes will have the greatest usefulness for police departments and law enforcement in the District.
Therefore, we will pay particular attention to recall. We are especially interested in evaluating when the model misclassifies a specific class as a negative. For law enforcement, a false negative potentially means ignoring or missing a felony incident, which has ramifications for the safety of the community and the overall success of the police department in responding to the needs of the community.
5.5 Candidate Model 1: Random Forest Model
Our first model is a random forest model using all of the predictors included in the dataset to predict the felony severity. Because felony severity level 1 is disproportionately represented in the dataset, we will downsample felony_severity.
set.seed(123)

# recipe
random_newrecipe <- recipe(formula = felony_severity ~ ., data = multiclass_train) |>
  step_downsample(felony_severity, seed = 123) |>
  step_rm(report_date_parsed) |>
  step_mutate(method = factor(method)) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

# model
random_model <- rand_forest(
  trees = 200,
  mtry = 2,
  min_n = 5) |>
  set_mode("classification") |>
  set_engine("ranger",
             importance = "impurity",
             num.threads = 4)

# workflow
random_wf <- workflow() |>
  add_recipe(random_newrecipe) |>
  add_model(random_model)

set.seed(123)

# fit the model
random_fit <- random_wf |>
  fit(data = multiclass_train)

# generate predictions
random_pred <- predict(random_fit, multiclass_test, type = "class") |>
  bind_cols(multiclass_test)
5.5.1 Metrics for Random Forest Model
# calculate the confusion matrix on the test set
conf_mat(data = random_pred,
         truth = felony_severity,
         estimate = .pred_class)
Truth
Prediction 1 2 3 4 5 6 7
1 283 71 0 158 146 12 23
2 774 292 0 289 135 32 11
3 249 104 4 108 65 14 7
4 94 23 0 96 54 4 10
5 119 38 0 103 177 12 9
6 362 100 2 161 112 20 18
7 211 102 0 304 241 16 58
accuracy(data = random_pred, truth = felony_severity, estimate = .pred_class)
.metric | .estimator | .estimate |
---|---|---|
accuracy | multiclass | 0.1780586 |
The overall accuracy rate for our random forest model is 17.8%, showing that the model has a weak ability to predict the correct felony severity category.
Calculating precision and recall for each class provides some insight into how the model performs by class:
# Class 1
calc_class_metrics(x = random_pred, truth = felony_severity, class = 1, estimate = .pred_class)
precision | recall |
---|---|
0.4083694 | 0.1352772 |
# Class 2
calc_class_metrics(x = random_pred, truth = felony_severity, class = 2, estimate = .pred_class)
precision | recall |
---|---|
0.1904762 | 0.4 |
# Class 3
calc_class_metrics(x = random_pred, truth = felony_severity, class = 3, estimate = .pred_class)
precision | recall |
---|---|
0.0072595 | 0.6666667 |
# Class 4
calc_class_metrics(x = random_pred, truth = felony_severity, class = 4, estimate = .pred_class)
precision | recall |
---|---|
0.341637 | 0.0787531 |
# Class 5
calc_class_metrics(x = random_pred, truth = felony_severity, class = 5, estimate = .pred_class)
precision | recall |
---|---|
0.3864629 | 0.1903226 |
# Class 6
calc_class_metrics(x = random_pred, truth = felony_severity, class = 6, estimate = .pred_class)
precision | recall |
---|---|
0.0258065 | 0.1818182 |
# Class 7
calc_class_metrics(x = random_pred, truth = felony_severity, class = 7, estimate = .pred_class)
precision | recall |
---|---|
0.0622318 | 0.4264706 |
The strongest overall performance is on class 1, the largest class (precision = 40.8%, recall = 13.5%). Class 4 is the second-largest class, and while its precision is relatively good (34%), its recall is only 8%: the model produces a large number of false negatives for class 4. Class 3 is the smallest class, and surprisingly its recall is 66.7%, reflecting relatively few false negatives. However, class 3's precision is only 0.7%, so the model frequently mislabels other classes as class 3.
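Rather than calling the helper once per class, the per-class metrics can also be collected into a single tibble; a sketch using purrr:
# map the helper over all seven classes and stack the results
per_class <- purrr::map_dfr(1:7, function(k) {
  calc_class_metrics(x = random_pred, truth = felony_severity,
                     class = k, estimate = .pred_class) |>
    mutate(class = k, .before = 1)
})
per_class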
5.5.2 Random Forest Variable Importance
# Variable importance for random forest
set.seed(123)
random_fit |>
  extract_fit_parsnip() |>
  vip(num_features = 20) %>%  # magrittr pipe so the `.` placeholder works on the next line
  .$data |>
  mutate(
    importance = Importance / max(Importance),
    variable = fct_reorder(Variable, importance)
  ) |>
  ggplot(aes(x = importance, y = variable, fill = importance)) +
  geom_col() +
  labs(
    title = "Variable Importance of Random Forest",
    x = "Normalized Importance",
    y = "Variable"
  )
Interestingly, ‘under18_pct’ is the most important factor, with a normalized importance of 1. This value indicates the percentage of the population that is under the age of 18 in each tract. The following visualization shows the average percentage of the census tract that is under 18 within each felony severity category.
under_18 <- multiclass_train |>
  group_by(felony_severity) |>
  summarise(avg_under18 = mean(under18_pct, na.rm = TRUE))

ggplot(under_18, aes(x = felony_severity, y = avg_under18, group = 1)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Average Percent of Community Under Age 18 by Felony Severity",
    x = "Felony Severity",
    y = "Average Percent Under 18") +
  theme_minimal()
The highest felony severity category, 7, occurs at higher rates in communities where a larger share of residents are under the age of 18. Classes 1 and 2, the lowest severity levels, occur more frequently in communities with relatively few residents under 18. The relationship between felony severity and a community's age makeup is not strictly linear, but the model has picked up on a relationship between age and felony severity level.
‘hsplus_pct’ was the second most important variable. This variable is the percentage of people in a census tract with greater than a high school level education.
hs <- multiclass_train |>
  group_by(felony_severity) |>
  summarise(avg_hs = mean(hsplus_pct, na.rm = TRUE))

ggplot(hs, aes(x = felony_severity, y = avg_hs, group = 1)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Community Education Level by Felony Severity",
    x = "Felony Severity",
    y = "Percent with High School Education or Higher"
  ) +
  theme_minimal() +
  theme(
    axis.title.y = element_text(size = 11),
    axis.title.x = element_text(size = 11)
  )
Neighborhoods where a larger share of residents have more than a high school education tend to experience more of the lower felony severity classes. Classes 5 and 7 tend to occur in neighborhoods with lower educational attainment, indicating a downward trend; class 6, however, is an exception to this trend.
‘foreign_born_pct’ was the third most important variable. The following visualization shows the average percentage of a census tract that is foreign-born within each felony severity category.
foreign <- multiclass_train |>
  group_by(felony_severity) |>
  summarise(average_foreign = mean(foreign_born_pct, na.rm = TRUE))

ggplot(foreign, aes(x = felony_severity, y = average_foreign, group = 1)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Average Foreign Born Population by Felony Severity",
    x = "Felony Severity",
    y = "Average Percent of Foreign-Born in a Community"
  ) +
  theme_minimal()
Interestingly, the relationship between the foreign-born share of a community and felony severity follows a pattern very similar to the one for education. Neighborhoods with higher foreign-born percentages tend to experience more of the lower felony severity levels and fewer of the higher ones, with class 6 again the exception to this trend.
5.6 Candidate Model 2: CART Model
Our second candidate model is a CART (classification and regression tree) model. Because the felony_severity classes are imbalanced, we stratify the cross-validation folds on felony_severity. We will use five-fold cross-validation and tune tree depth, cost complexity, and minimum node size to optimize the model.
# cross-validation
set.seed(123)
cross_validation <- vfold_cv(multiclass_train, v = 5, strata = felony_severity)

# recipe
cart_newrecipe <- recipe(formula = felony_severity ~ ., data = multiclass_train) |>
  step_downsample(felony_severity, seed = 123) |>
  step_rm(report_date_parsed) |>
  step_mutate(method = factor(method)) |>
  step_zv(all_predictors()) |>
  step_novel(all_factor_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

# cart model object
cart_model <- decision_tree(
  tree_depth = tune(),
  cost_complexity = tune(),
  min_n = tune()
) |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "classification")

set.seed(123)
# set up tuning grid
# note: cost_complexity() interprets its range on the log10 scale, so
# c(0.0001, 0.1) yields candidate penalties of roughly 1.00-1.26
# (visible in the show_best() output below)
cart_grid <- grid_regular(
  cost_complexity(range = c(0.0001, 0.1)),
  tree_depth(range = c(3, 15)),
  min_n(),
  levels = 5
)

# cart workflow
cart_wf <- workflow() |>
  add_recipe(cart_newrecipe) |>
  add_model(cart_model)

set.seed(123)
# test and choose best hyperparameters
cart_test <- cart_wf |>
  tune_grid(resamples = cross_validation,
            grid = cart_grid,
            metrics = metric_set(accuracy))

cart_test |>
  show_best()
cost_complexity | tree_depth | min_n | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|---|---|
1.000230 | 3 | 2 | accuracy | multiclass | 0.401503 | 5 | 0.0007154 | Preprocessor1_Model001 |
1.059437 | 3 | 2 | accuracy | multiclass | 0.401503 | 5 | 0.0007154 | Preprocessor1_Model002 |
1.122148 | 3 | 2 | accuracy | multiclass | 0.401503 | 5 | 0.0007154 | Preprocessor1_Model003 |
1.188571 | 3 | 2 | accuracy | multiclass | 0.401503 | 5 | 0.0007154 | Preprocessor1_Model004 |
1.258925 | 3 | 2 | accuracy | multiclass | 0.401503 | 5 | 0.0007154 | Preprocessor1_Model005 |
# finalize workflow
final_cart_wf <- cart_wf |>
  finalize_workflow(select_best(cart_test))

set.seed(123)
# fit
cart_fit <- final_cart_wf |>
  fit(data = multiclass_train)

# predictions
predictions <- bind_cols(
  multiclass_test,
  predict(object = cart_fit, new_data = multiclass_test),
  predict(object = cart_fit, new_data = multiclass_test, type = "prob")
)
5.6.1 Metrics for CART Model
# confusion matrix
conf_mat(data = predictions,
truth = felony_severity,
estimate = .pred_class)
Truth
Prediction 1 2 3 4 5 6 7
1 2092 730 6 1219 930 110 136
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0
# accuracy
accuracy(data = predictions,
truth = felony_severity,
estimate = .pred_class)
.metric | .estimator | .estimate |
---|---|---|
accuracy | multiclass | 0.4005361 |
Our CART model only predicts the majority class (felony_severity = 1) and therefore learns nothing about the minority classes. This is partly a consequence of class imbalance, even though we downsampled felony severity, and partly a consequence of the tuning grid: because cost_complexity() treats its range on the log10 scale, every candidate penalty was near 1, which prunes the tree down to a stump. To maximize accuracy, the resulting tree classifies every felony as class 1, so its accuracy rate equals the percentage of felonies that are class 1.
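A quick sanity check (sketch) makes the degenerate behavior explicit:
predictions |> count(.pred_class)            # every row is predicted as class 1
mean(multiclass_test$felony_severity == 1)   # matches the reported ~0.40 accuracy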
5.7 Candidate Model 3: KNN Model
Our third candidate model is K-nearest neighbors (KNN). We set the number of neighbors to 5, a relatively small value intended to keep the model from blurring class boundaries.
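We fix k here, but the number of neighbors could also be tuned like the CART hyperparameters; a sketch (not run in our pipeline):
# a tunable variant of the KNN specification
knn_tune_spec <- nearest_neighbor(neighbors = tune()) |>
  set_engine(engine = "kknn") |>
  set_mode(mode = "classification")
# this spec would then be passed to tune_grid() over a grid of k values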
# recipe
knn_newrecipe <- recipe(formula = felony_severity ~ ., data = multiclass_train) |>
  step_downsample(felony_severity, seed = 123) |>
  step_rm(report_date_parsed) |>
  step_mutate(method = factor(method)) |>
  step_zv(all_predictors()) |>
  step_novel(all_factor_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

# model
knn_model <- nearest_neighbor(neighbors = 5) |>
  set_engine(engine = "kknn") |>
  set_mode(mode = "classification")

# workflow
knn_wf <- workflow() |>
  add_recipe(knn_newrecipe) |>
  add_model(knn_model)

# finalize the workflow (a no-op here, since nothing was tuned)
final_knn_wf <- finalize_workflow(knn_wf, knn_model)

set.seed(123)
# fit the model
knn_fit <- fit(final_knn_wf, data = multiclass_train)

# generate predictions
knn_preds <- predict(knn_fit, new_data = multiclass_test, type = "class") |>
  bind_cols(multiclass_test)
5.7.1 Metrics for KNN Model
# calculate confusion matrix and accuracy
conf_mat(data = knn_preds, truth = felony_severity, estimate = .pred_class)
Truth
Prediction 1 2 3 4 5 6 7
1 183 52 0 85 74 8 9
2 198 77 0 92 62 4 9
3 758 237 5 310 172 27 13
4 467 201 0 374 243 24 41
5 175 60 0 106 130 21 18
6 199 60 1 124 97 14 19
7 112 43 0 128 152 12 27
accuracy(knn_preds, truth = felony_severity, estimate = .pred_class)
.metric | .estimator | .estimate |
---|---|---|
accuracy | multiclass | 0.1550833 |
The accuracy rate is 15.5% for the KNN model, which is lower than our random forest model.
Evaluating the within-class precision and recall will be useful for comparison with the random forest model.
# Class 1
calc_class_metrics(x = knn_preds, truth = felony_severity, class = 1, estimate = .pred_class)
precision | recall |
---|---|
0.4452555 | 0.0874761 |
# Class 2
calc_class_metrics(x = knn_preds, truth = felony_severity, class = 2, estimate = .pred_class)
precision | recall |
---|---|
0.1742081 | 0.1054795 |
# Class 3
calc_class_metrics(x = knn_preds, truth = felony_severity, class = 3, estimate = .pred_class)
precision | recall |
---|---|
0.0032852 | 0.8333333 |
# Class 4
calc_class_metrics(x = knn_preds, truth = felony_severity, class = 4, estimate = .pred_class)
precision | recall |
---|---|
0.277037 | 0.3068089 |
# Class 5
calc_class_metrics(x = knn_preds, truth = felony_severity, class = 5, estimate = .pred_class)
precision | recall |
---|---|
0.254902 | 0.1397849 |
# Class 6
calc_class_metrics(x = knn_preds, truth = felony_severity, class = 6, estimate = .pred_class)
precision | recall |
---|---|
0.0272374 | 0.1272727 |
# Class 7
calc_class_metrics(x = knn_preds, truth = felony_severity, class = 7, estimate = .pred_class)
precision | recall |
---|---|
0.056962 | 0.1985294 |
KNN has the best within-class precision for class 1, the largest class (44.5%), yet its recall for class 1 is only about 9%, meaning a high number of false negatives. It also performs fairly well in the other large classes (2, 4, and 5) in terms of precision. Recall, however, remains low for classes 2 (10.5%) and 5 (14.0%); class 4 is the exception at roughly 31%. Unsurprisingly, the model struggled to classify the smaller classes.
5.8 Final Multiclass Classification Model Selection
The CART model, despite having the highest accuracy rate of 40%, will be eliminated from consideration because it achieved this rate by predicting every incident as class 1.
This leaves random forest or KNN. Random forest’s accuracy rate was ~17.8%, while KNN’s was ~15.5%. Random forest is slightly more accurate overall, but we will also account for the significant class imbalance. To compare performance within each class, we focus on classes 1, 4, 5, and 2, the four largest classes in descending order of size.
Comparing against the random forest model’s within-class precision and recall, KNN has lower precision than random forest in three of the four largest classes (2, 4, and 5). In terms of within-class recall, random forest also outperforms KNN in three of the four largest classes (1, 2, and 5).
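These comparisons can be tabulated directly; a sketch that maps the helper over both finalists:
# side-by-side per-class precision/recall for the four largest classes
finalists <- list(rf = random_pred, knn = knn_preds)
comparison <- purrr::imap_dfr(finalists, function(preds, name) {
  purrr::map_dfr(c(1, 2, 4, 5), function(k) {
    calc_class_metrics(x = preds, truth = felony_severity,
                       class = k, estimate = .pred_class) |>
      mutate(model = name, class = k, .before = 1)
  })
})
comparison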
As discussed in our selection of error metrics, recall is of particular importance for model evaluation because the consequences of a false negative are higher for police and communities than a false positive.
Random forest has better within-class recall, as well as within-class precision and overall accuracy. Therefore, we will select random forest as the final model for implementation.
5.9 Implementation
We now apply the final fitted model to the 2023 implementation data to evaluate its real-world performance.
# Apply the recipe and model to 2023 data
multiclass_implement_results <- predict(random_fit,
                                        new_data = crime_multiclass_implement,
                                        type = "prob") |>
  bind_cols(predict(random_fit, new_data = crime_multiclass_implement)) |>
  bind_cols(crime_multiclass_implement |>
              select(felony_severity))
# final model's accuracy
accuracy(multiclass_implement_results, truth = felony_severity, estimate = .pred_class)
.metric | .estimator | .estimate |
---|---|---|
accuracy | multiclass | 0.163185 |
# Class 1
calc_class_metrics(x = multiclass_implement_results,
truth = felony_severity,
class = 1,
estimate = .pred_class)
precision | recall |
---|---|
0.5418233 | 0.1699838 |
#Class 2
calc_class_metrics(x = multiclass_implement_results,
truth = felony_severity,
class = 2,
estimate = .pred_class)
precision | recall |
---|---|
0.1128364 | 0.3992674 |
#Class 3
calc_class_metrics(x = multiclass_implement_results,
truth = felony_severity,
class = 3,
estimate = .pred_class)
precision | recall |
---|---|
0 | 0 |
#Class 4
calc_class_metrics(x = multiclass_implement_results,
truth = felony_severity,
class = 4,
estimate = .pred_class)
precision | recall |
---|---|
0.3415493 | 0.0561018 |
#Class 5
calc_class_metrics(x = multiclass_implement_results,
truth = felony_severity,
class = 5,
estimate = .pred_class)
precision | recall |
---|---|
0.3169985 | 0.1476462 |
#Class 6
calc_class_metrics(x = multiclass_implement_results,
truth = felony_severity,
class = 6,
estimate = .pred_class)
precision | recall |
---|---|
0.0117914 | 0.1354167 |
#Class 7
calc_class_metrics(x = multiclass_implement_results,
truth = felony_severity,
class = 7,
estimate = .pred_class)
precision | recall |
---|---|
0.0452635 | 0.5109489 |
5.9.1 Interpreting Final Model Results
The final model’s overall accuracy rate is 16.32%. The within-class recall and precision vary considerably between classes.
Class 1 had the best precision (54.2%), and class 7 had the best recall (51.1%). The model performed worst on class 3, with precision and recall of 0%.
The distribution of felony severities in the implementation data was very similar to the training data:
# Check class imbalance in the target variable
crime_multiclass_implement |>
  count(felony_severity) |>
  mutate(pct = n / sum(n))
felony_severity | n | pct |
---|---|---|
1 | 6783 | 0.5133969 |
2 | 1092 | 0.0826521 |
3 | 11 | 0.0008326 |
4 | 3458 | 0.2617318 |
5 | 1402 | 0.1061157 |
6 | 192 | 0.0145322 |
7 | 274 | 0.0207387 |
ggplot(crime_multiclass_implement, aes(x = felony_severity)) +
geom_bar() +
scale_y_continuous(n.breaks = 10) +
labs(title = "Number of Felonies in Each Category",
x = "Felony Severity Category",
y = "Count") +
theme_minimal()
The implementation data contained only 11 felonies of class 3, so it is unsurprising that the model failed to predict this class. Similarly, classes 6 and 7 have fewer than 300 observations each, explaining their poor recall and precision.
The final model is quite limited in its practical usefulness because of its relatively low accuracy rate and within-class precision and recall. The model could potentially be improved with different class specifications. Because the dataset is dominated by felony severity class 1, a different approach could be a binary classification model with motor vehicle theft defined as a positive outcome, and all other felonies defined as a negative outcome. This specification would have less class imbalance, which may enhance precision and recall.
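As a sketch of this suggested reframing (hypothetical, not part of our pipeline), motor vehicle theft could be recoded as the positive class:
# recode severity 1 (motor vehicle theft) as the positive outcome
binary_df <- crime_multiclass_implement |>
  mutate(mv_theft = factor(if_else(felony_severity == 1, "yes", "no"),
                           levels = c("yes", "no")))
binary_df |> count(mv_theft)   # roughly 51% vs 49%, far less imbalanced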
Overall, it is difficult to predict the severity of a felony due to the variety of features related to crime, including geographic factors, demographic variables, and temporal factors. The limitations of the model will be explored further in the discussion section below.
6 Regression Task
6.1 Split Data
set.seed(123)
crime_split_reg <- initial_split(crime_enriched_mod, prop = 0.8, strata = felony_severity_trans)

crime_train_reg <- training(crime_split_reg)
crime_test_reg <- testing(crime_split_reg)
6.2 Explanation on Outcome Variable
In this regression task, the outcome variable is felony_severity_trans, a transformed version of the original felony_severity score using the Yeo-Johnson transformation. The original score ranged from 0 to 7, reflecting the seriousness of the felony based on sentence length and social cost. For example, homicide is scored as 7, while motor vehicle theft is scored as 1. Although this score is technically ordinal (ranked categories), we treat it as approximately continuous for modeling purposes. This is a common approach in applied research when the levels are evenly spaced and reflect meaningful differences (Robitzsch, 2020). For instance, the gap between scores of 1 and 2 can be interpreted similarly to the gap between 5 and 6. This treatment allows us to use standard regression models.
However, the distribution of felony_severity is skewed due to the rarity of severe crimes like arson and homicide, which can distort model performance. To address this, we applied the Yeo-Johnson transformation to the outcome variable. This transformation preserves the relative severity ranking but reduces skewness and pulls in extreme values, improving model performance by better meeting assumptions of normality and homoscedasticity.
After transformation, the values are no longer on the original 0–7 scale. For example, the eight unique severity levels were transformed to:
sort(unique(crime_train_reg$felony_severity_trans))
[1] -0.6688794 1.1762487 1.5159447 1.6344214 1.6891198 1.7187730 1.7366234
[8] 1.7481926
These numbers are not directly interpretable in real-world terms, but they retain the ordering of crime severity and provide a statistically stable basis for modeling.
In practice, we use felony_severity_trans as the outcome variable during model training and prediction, and then apply the inverse Yeo-Johnson transformation (via predict(yj_obj, inverse = TRUE)) to convert predicted values back to the original 0–7 severity scale. This allows for more intuitive interpretation of the results while preserving the modeling benefits of the transformation.
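A minimal sketch of that round trip using bestNormalize (loaded at the top of this document); the vector here is illustrative, and yj_obj_demo stands in for the yj_obj object used later:
sev <- c(0, 1, 1, 2, 4, 7)                    # example raw severity scores
yj_obj_demo <- yeojohnson(sev)                # estimate the transformation
sev_trans <- predict(yj_obj_demo)             # transformed values used for modeling
predict(yj_obj_demo, newdata = sev_trans, inverse = TRUE)  # back on the 0-7 scale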
6.3 Select Error Metric
Our goal is to predict how severe a crime might be if it is a felony. This helps the police understand which crimes are more dangerous and need more attention.
Since some severe crimes are rare, we want to make sure the model does not make big mistakes when predicting them. We therefore use RMSE and R-squared as our main evaluation metrics. RMSE tells us how far the predictions are from the true values and penalizes large errors heavily. R-squared shows how much of the variation in severity scores our model can explain.
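For reference, with $y_i$ the observed severity, $\hat{y}_i$ the prediction, and $\bar{y}$ the mean observed severity, the two metrics are:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$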
6.4 Come up with Models
For this regression task, we again build three candidate models to test which performs best.
6.4.1 Model 1: Linear Regression with Yeo-Johnson Transformation
This baseline model uses linear regression to predict felony severity. It focuses on socioeconomic context and crime-level temporal features. While linear models may not capture complex interactions, they offer a strong baseline for comparison and clear insights into variable importance. We use the pre-transformed outcome felony_severity_trans to reduce skewness and stabilize variance.
crime_rec_lm <- recipe(felony_severity_trans ~ poverty_rate + unemployed_rate +
                         black_pct + hispanic_pct + foreign_born_pct +
                         hsplus_pct +
                         under18_pct + age65plus_pct +
                         shift + method + report_date_parsed,
                       data = crime_train_reg) |>
  step_date(report_date_parsed, features = "dow", label = TRUE) |>
  step_mutate(
    yday = lubridate::yday(report_date_parsed),
    yday_cos = cos(2 * pi * yday / 365),
    yday_sin = sin(2 * pi * yday / 365)
  ) |>
  step_rm(report_date_parsed) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())
The model specification and workflow are defined as follows:
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

lm_wf <- workflow() |>
  add_model(lm_spec) |>
  add_recipe(crime_rec_lm)
6.4.2 Model 2: Random Forest Regression with Spatial Features
This model uses random forest to predict felony_severity_trans, a transformed version of felony_severity, based on spatial and environmental context. It includes features such as tract, proximity to liquor stores, grocery stores, banks, and public infrastructure, as well as police sector and zoning indicators. Socioeconomic variables are excluded to focus on location-based risk factors.
crime_rec_rf_reg <- recipe(felony_severity_trans ~ tract + shift + method +
                             report_date_parsed +
                             in_liquor_moratorium_zone +
                             nearest_liquorstore_dist + near_liquorstore_200m +
                             near_wifi_100m + nearest_wifi_dist +
                             in_lowfood_zone +
                             nearest_grocery_dist + near_grocery_300m +
                             nearest_bank_dist + near_bank_250m +
                             in_vending_zone + police_sector + in_military_base,
                           data = crime_train_reg) |>
  step_date(report_date_parsed, features = "dow", label = TRUE) |>
  step_mutate(
    yday = lubridate::yday(report_date_parsed),
    yday_cos = cos(2 * pi * yday / 365),
    yday_sin = sin(2 * pi * yday / 365)
  ) |>
  step_rm(report_date_parsed) |>
  step_corr(all_numeric_predictors(), threshold = 0.9) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())
The model specification and workflow are defined as follows:
rf_spec_reg <- rand_forest(
  mtry = tune(),
  trees = 50,
  min_n = tune()
) |>
  set_engine("ranger", importance = "impurity") |>
  set_mode("regression")

rf_wf_reg <- workflow() |>
  add_model(rf_spec_reg) |>
  add_recipe(crime_rec_rf_reg)
6.4.3 Model 3: XGBoost Regression with All Features
This model leverages all available features to predict felony_severity_trans. It includes demographic, spatial, temporal, and contextual predictors. We apply normalization, correlation filtering, and dummy encoding to prepare predictors.
crime_rec_xgb_reg <- recipe(felony_severity_trans ~ ., data = crime_train_reg) |>
  step_rm(felony_flag, felony_severity) |> # remove original outcome columns
  step_date(report_date_parsed, features = "dow", label = TRUE) |>
  step_mutate(
    yday = lubridate::yday(report_date_parsed),
    yday_cos = cos(2 * pi * yday / 365),
    yday_sin = sin(2 * pi * yday / 365)
  ) |>
  step_rm(x, y, latitude, longitude, year, report_date_parsed) |>
  step_nzv(all_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.9) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())
The model specification and workflow are defined as follows:
xgb_spec_reg <- boost_tree(
  trees = 50,
  tree_depth = tune(),
  learn_rate = tune(),
  loss_reduction = tune()
) |>
  set_engine("xgboost") |>
  set_mode("regression")

xgb_wf_reg <- workflow() |>
  add_model(xgb_spec_reg) |>
  add_recipe(crime_rec_xgb_reg)
6.5 Estimation
We use 5-fold cross-validation and tune model hyperparameters for all three regression models. Our evaluation metrics are RMSE and R-squared, which reflect prediction accuracy and explained variance, respectively.
# Set up 5-fold cross-validation
set.seed(123)
crime_folds_reg <- vfold_cv(crime_train_reg, v = 5)

# Define metrics for regression
reg_metrics <- metric_set(rmse, rsq)
6.5.1 Model 1: Linear Regression (no tuning needed)
# Evaluate model via 5-fold CV
# Evaluate model via 5-fold CV
lm_res <- fit_resamples(
  lm_wf,
  resamples = crime_folds_reg,
  metrics = reg_metrics,
  control = control_resamples(save_pred = TRUE)
)

# Fit final model on full training set
lm_fit <- fit(lm_wf, data = crime_train_reg)
6.5.2 Model 2: Random Forest
# Define tuning grid for random forest
set.seed(123)
rf_grid_reg <- grid_random(
  mtry(range = c(3, 12)),
  min_n(range = c(2, 10)),
  size = 5
)

# Tune model
set.seed(123)
rf_res_reg <- tune_grid(
  rf_wf_reg,
  resamples = crime_folds_reg,
  grid = rf_grid_reg,
  metrics = reg_metrics,
  control = control_grid(save_pred = TRUE, verbose = TRUE)  # verbose belongs in the control object
)

# Select best model based on RMSE
best_rf_reg <- select_best(rf_res_reg, metric = "rmse")

# Finalize and fit
final_rf_wf_reg <- finalize_workflow(rf_wf_reg, best_rf_reg)

set.seed(123)
final_rf_fit_reg <- fit(final_rf_wf_reg, data = crime_train_reg)
6.5.3 Model 3: XGBoost
# Create random tuning grid
# note: learn_rate() and loss_reduction() interpret their ranges on the
# log10 scale by default, so these bounds correspond to larger raw values
# than they may appear to
set.seed(123)
xgb_grid_reg <- grid_random(
  tree_depth(range = c(3, 10)),
  learn_rate(range = c(0.01, 0.3)),
  loss_reduction(range = c(0, 1)),
  size = 5
)

# Tune with cross-validation
set.seed(123)
xgb_res_reg <- tune_grid(
  xgb_wf_reg,
  resamples = crime_folds_reg,
  grid = xgb_grid_reg,
  metrics = reg_metrics,
  control = control_grid(save_pred = TRUE, verbose = TRUE)
)

# Select best hyperparameters
best_xgb_reg <- select_best(xgb_res_reg, metric = "rmse")

# Finalize and fit model
final_xgb_wf_reg <- finalize_workflow(xgb_wf_reg, best_xgb_reg)

set.seed(123)
final_xgb_fit_reg <- fit(final_xgb_wf_reg, data = crime_train_reg)
6.6 Interpretation
6.6.1 Variable Importance
Although linear regression does not provide built-in feature importance, we inspect standardized regression coefficients (β) to assess influence.
lm_fit |>
  extract_fit_parsnip() |>
  tidy() |>
  arrange(desc(abs(estimate)))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
method_OTHERS | -1.6923629 | 0.0123312 | -137.2421258 | 0.0000000 |
(Intercept) | 1.4667559 | 0.0153006 | 95.8623633 | 0.0000000 |
shift_MIDNIGHT | 0.3006732 | 0.0090803 | 33.1127293 | 0.0000000 |
report_date_parsed_dow_Wed | -0.0621854 | 0.0118883 | -5.2308252 | 0.0000002 |
shift_EVENING | 0.0543332 | 0.0070769 | 7.6775166 | 0.0000000 |
black_pct | 0.0495872 | 0.0068456 | 7.2436892 | 0.0000000 |
foreign_born_pct | -0.0373619 | 0.0057639 | -6.4819960 | 0.0000000 |
report_date_parsed_dow_Tue | -0.0342020 | 0.0118799 | -2.8789735 | 0.0039910 |
hsplus_pct | -0.0324247 | 0.0055568 | -5.8351139 | 0.0000000 |
report_date_parsed_dow_Thu | -0.0287529 | 0.0120975 | -2.3767618 | 0.0174682 |
age65plus_pct | -0.0270022 | 0.0033612 | -8.0334131 | 0.0000000 |
report_date_parsed_dow_Fri | -0.0217346 | 0.0122786 | -1.7701105 | 0.0767133 |
yday | 0.0192624 | 0.0050609 | 3.8061352 | 0.0001413 |
yday_cos | -0.0188528 | 0.0031652 | -5.9563380 | 0.0000000 |
report_date_parsed_dow_Mon | -0.0096846 | 0.0119442 | -0.8108209 | 0.4174714 |
under18_pct | 0.0089592 | 0.0044654 | 2.0063811 | 0.0448196 |
unemployed_rate | 0.0088240 | 0.0047450 | 1.8596534 | 0.0629390 |
yday_sin | 0.0049775 | 0.0050619 | 0.9833400 | 0.3254437 |
poverty_rate | 0.0026705 | 0.0048203 | 0.5540057 | 0.5795768 |
report_date_parsed_dow_Sat | -0.0025336 | 0.0122927 | -0.2061031 | 0.8367110 |
hispanic_pct | -0.0015230 | 0.0057131 | -0.2665806 | 0.7897929 |
method_KNIFE | 0.0008013 | 0.0248746 | 0.0322139 | 0.9743016 |
According to the table above, the strongest predictors of felony severity (on the transformed scale) include:
- method_OTHERS (β = -1.69): crimes with unspecified methods tend to have lower predicted severity.
- shift_MIDNIGHT (β = 0.30) and shift_EVENING (β = 0.05): crimes that happen around midnight and in the evening are predicted to be more severe.
- black_pct and foreign_born_pct have small but statistically significant effects on predicted severity. A higher percentage of Black residents slightly increases the predicted score, while a higher percentage of foreign-born residents slightly lowers it.
These results should be interpreted with caution, as they might be affected by other potential confounding factors.
Note: all predictors were standardized and the outcome was transformed, so coefficients represent relative, not direct, effects on severity.
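If a coefficient on a predictor's original units is needed, the standardized β can be rescaled by that predictor's training-set standard deviation; a sketch using the black_pct estimate from the table above:
beta_std <- 0.0496                                   # standardized coefficient
beta_raw <- beta_std / sd(crime_train_reg$black_pct, na.rm = TRUE)
beta_raw  # change in transformed severity per one-unit change in black_pct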
# Random Forest importance
rf_model_reg <- extract_fit_parsnip(final_rf_fit_reg)
vip(rf_model_reg, num_features = 20)
method_OTHERS and shift_MIDNIGHT are again the most influential features. Several distance-based features, such as distance to banks, grocery stores, liquor stores, and free WiFi, also play meaningful roles, suggesting that spatial proximity to certain infrastructure may be predictive of felony severity.
# XGBoost importance
xgb_model_reg <- extract_fit_parsnip(final_xgb_fit_reg)
vip(xgb_model_reg, num_features = 20)
The XGBoost importance values are more concentrated: method_OTHERS, shift_MIDNIGHT, and black_pct dominate, while demographic variables such as under18_pct, foreign_born_pct, and unemployed_rate make smaller but non-zero contributions.
6.6.2 Interpreting Model Predictions
All three models were trained using felony_severity_trans, the Yeo-Johnson transformed version of the original severity score. While this transformation helps improve model performance and satisfy statistical assumptions, the predicted values are no longer on the familiar 0–7 scale.
To make the predictions directly interpretable, we apply an inverse Yeo-Johnson transformation to the model outputs. This step converts .pred values back to the original felony_severity scale, allowing for meaningful evaluation using RMSE and R-squared. These evaluation metrics now reflect prediction error and explained variance in the original severity score, which ranges from 0 (least severe) to 7 (most severe).
Evaluate on Test Set
# Linear Regression - Test
lm_test_pred <- predict(lm_fit, new_data = crime_test_reg) |>
  mutate(.pred = predict(yj_obj, newdata = .pred, inverse = TRUE)) |> # inverse transformation
  bind_cols(crime_test_reg |> select(felony_severity))

metrics(lm_test_pred, truth = felony_severity, estimate = .pred)
.metric | .estimator | .estimate |
---|---|---|
rmse | standard | 1.7102975 |
rsq | standard | 0.1841820 |
mae | standard | 0.6646159 |
# Random Forest - Test
rf_test_pred <- predict(final_rf_fit_reg, new_data = crime_test_reg) |>
  mutate(.pred = predict(yj_obj, newdata = .pred, inverse = TRUE)) |>
  bind_cols(crime_test_reg |> select(felony_severity))

metrics(rf_test_pred, truth = felony_severity, estimate = .pred)
.metric | .estimator | .estimate |
---|---|---|
rmse | standard | 1.5117827 |
rsq | standard | 0.5192121 |
mae | standard | 0.7915837 |
# XGBoost - Test
xgb_test_pred <- predict(final_xgb_fit_reg, new_data = crime_test_reg) |>
  mutate(.pred = predict(yj_obj, newdata = .pred, inverse = TRUE)) |>
  bind_cols(crime_test_reg |> select(felony_severity))

metrics(xgb_test_pred, truth = felony_severity, estimate = .pred)
.metric | .estimator | .estimate |
---|---|---|
rmse | standard | 1.8906491 |
rsq | standard | 0.2047492 |
mae | standard | 0.6817682 |
We evaluated model performance on the held-out test set using RMSE, R-squared, and MAE after applying the inverse transformation. Linear regression showed weak performance, with an RMSE of 1.71 and R² of 0.18, capturing little of the variance. Random forest performed best, with an RMSE of 1.51 and R² of 0.52. XGBoost had the highest RMSE (1.89) and an R² of 0.20, only slightly better than linear regression.
Evaluate on Implementation Set
# Linear Regression - Implementation
lm_impl_pred <- predict(lm_fit, new_data = crime_enriched_implement) |>
  mutate(.pred = predict(yj_obj_impl, newdata = .pred, inverse = TRUE)) |>
  bind_cols(crime_enriched_implement |> select(felony_severity))

metrics(lm_impl_pred, truth = felony_severity, estimate = .pred)
.metric | .estimator | .estimate |
---|---|---|
rmse | standard | 8.1388952 |
rsq | standard | 0.0406479 |
mae | standard | 0.9702572 |
# Random Forest - Implementation
rf_impl_pred <- predict(final_rf_fit_reg, new_data = crime_enriched_implement) |>
  mutate(.pred = predict(yj_obj_impl, newdata = .pred, inverse = TRUE)) |>
  bind_cols(crime_enriched_implement |> select(felony_severity))

metrics(rf_impl_pred, truth = felony_severity, estimate = .pred)
.metric | .estimator | .estimate |
---|---|---|
rmse | standard | 1.3626525 |
rsq | standard | 0.5149314 |
mae | standard | 0.7959087 |
# XGBoost - Implementation
xgb_impl_pred <- predict(final_xgb_fit_reg, new_data = crime_enriched_implement) |>
  mutate(.pred = predict(yj_obj_impl, newdata = .pred, inverse = TRUE)) |>
  bind_cols(crime_enriched_implement |> select(felony_severity))

metrics(xgb_impl_pred, truth = felony_severity, estimate = .pred)
.metric | .estimator | .estimate |
---|---|---|
rmse | standard | 12.3225165 |
rsq | standard | 0.0218831 |
mae | standard | 0.9982794 |
On the 2023 implementation data, linear regression and XGBoost failed to generalize, with RMSEs of 8.14 and 12.32 and R² values near zero. In contrast, random forest maintained strong performance, with an RMSE of 1.36 and R² of 0.51.
6.6.3 Collect all metrics into one table
all_metrics_reg <- bind_rows(
  mutate(metrics(lm_test_pred, truth = felony_severity, estimate = .pred), model = "Linear Regression", set = "Test"),
  mutate(metrics(rf_test_pred, truth = felony_severity, estimate = .pred), model = "Random Forest", set = "Test"),
  mutate(metrics(xgb_test_pred, truth = felony_severity, estimate = .pred), model = "XGBoost", set = "Test"),
  mutate(metrics(lm_impl_pred, truth = felony_severity, estimate = .pred), model = "Linear Regression", set = "Implementation"),
  mutate(metrics(rf_impl_pred, truth = felony_severity, estimate = .pred), model = "Random Forest", set = "Implementation"),
  mutate(metrics(xgb_impl_pred, truth = felony_severity, estimate = .pred), model = "XGBoost", set = "Implementation")
)
6.6.4 Visualize 3 Models’ Performances
all_metrics_reg <- all_metrics_reg |>
  mutate(set = factor(set, levels = c("Test", "Implementation")))

# Plot RMSE Comparison
all_metrics_reg |>
  filter(.metric == "rmse") |>
  ggplot(aes(x = model, y = .estimate, fill = set)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("Test" = "#56B4E9", "Implementation" = "steelblue")) +
  labs(title = "RMSE Comparison across Models and Datasets",
       x = "Model",
       y = "RMSE") +
  theme_classic() +
  theme(
    text = element_text(family = "serif")
  )

# Plot R-squared Comparison
all_metrics_reg |>
  filter(.metric == "rsq") |>
  ggplot(aes(x = model, y = .estimate, fill = set)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("Test" = "#56B4E9", "Implementation" = "steelblue")) +
  labs(title = "R-squared Comparison across Models and Datasets",
       x = "Model",
       y = "R-squared") +
  theme_classic() +
  theme(
    text = element_text(family = "serif")
  )
After comparing model performance across the test and implementation sets using RMSE and R-squared, the results show that random forest outperforms both linear regression and XGBoost in predictive accuracy and generalizability. It achieves the lowest RMSE and highest R² in both datasets, suggesting that it captures stable and generalizable patterns in the data. In contrast, linear regression and XGBoost perform relatively well on the test set but fail to generalize to the 2023 data, indicating overfitting.
Importantly, the random forest model used in this task relies only on spatial and situational predictors, such as proximity to liquor stores, grocery outlets, public Wi-Fi zones, and timing features like shift and day of week. It excludes all socioeconomic indicators. That this model still achieves high predictive accuracy suggests that spatial context alone provides meaningful signals about felony severity risk across tracts.
From a policy perspective, this finding supports efforts to incorporate environmental exposure and situational context into crime prevention strategies. Rather than relying only on historical crime counts or demographic profiling, city agencies could use this model to identify high-severity risk areas based on public infrastructure and land use patterns, and prioritize patrols or intervention accordingly. This aligns with the broader goal of reducing the social costs of crime by focusing on where severe offenses are more likely to occur, even in the absence of full socioeconomic data.
7 Discussion of Results
Our project investigated three key questions: To what extent can crime characteristics and spatial context predict whether an incident is a felony? How accurately can we classify the type of felony? And can we estimate the felony severity score associated with a newly reported crime in DC? The answers to these questions offer practical insights for public-sector decision-making in law enforcement, community engagement, and urban resource allocation.
We investigated our first research question through a binary classification model, developing three candidates based on geospatial and demographic features. The best-performing model was a logistic regression with lasso regularization, using only socioeconomic variables from the ACS and excluding all GIS features. Despite this, it achieved a recall of 99.6% on the implementation dataset, with a 1 − Recall of just 0.0043, indicating a very low miss rate for felonies. This suggests that law enforcement could potentially confidently use this model to predict felonies at the DC tract level, using this information to direct resources and enhance public safety. These findings also demonstrate that more features do not necessarily lead to better performance, and that focused feature selection can improve model robustness. To build on this foundation, future researchers should examine the model’s performance on other metrics, such as false positive rates, given the limitations of relying solely on recall.
We investigated our second research question through a multiclass classification model, also using machine learning techniques to develop three candidate models using the same geospatial features and demographic characteristics. The best performing model was random forest, with an overall accuracy rate on the implementation dataset of 16.32%. This low accuracy rate suggests future room for improvement, specifically through different feature engineering. Our models utilized both demographic and geospatial features, presenting a limitation of our approach to multi-classification, given that the binary classification and regression models both performed quite well, and they each used more specific feature engineering. Future researchers can focus on geospatial features for predicting felony severity class, and compare the performance of this model to one that uses only socioeconomic features to predict felony severity class. This approach can provide insight into the most salient predictors for felony severity.
Finally, we explored our third research question through a regression model, utilizing machine learning methods to estimate a transformed version of the felony severity score. Because of the skewed nature of the dataset, we applied a Yeo-Johnson transformation to reduce skewness. Random forest was the highest-performing model, with an R² of 0.51 and RMSE of 1.36 on the implementation dataset. The regression task revealed important insights about the importance of spatial and situational predictors, such as proximity to liquor stores, grocery outlets, public Wi-Fi zones, and timing features like shift and day of week, rather than demographic and socioeconomic indicators. Future researchers may build on our regression analysis by adding socioeconomic and demographic variables to our existing regression model to determine which ones meaningfully enhance predictive power.
It is difficult for law enforcement to balance public safety with community relations. It is worth highlighting that our regression findings suggest that crime prevention and targeting strategies that rely on environmental and geographic factors can identify high-severity risk areas without relying on demographic profiling. Our analysis adds valuable insights to the ongoing discussion on how to strike the right balance between community engagement and policing for public safety.
8 References
Bowers, K. J., & Johnson, S. D. (2011). Risk terrain modelling: Brokering criminological theory and GIS methods for crime forecasting. Crime Science, 1, 5–17. https://doi.org/10.1080/07418825.2010.486037
Braga, A. A., Papachristos, A. V., & Hureau, D. M. (2014). The effects of hot spots policing on crime: An updated systematic review and meta‑analysis. Justice Quarterly, 31(4), 633–663. https://doi.org/10.1080/07418825.2012.673632
Cheng, T., & Chen, T. (2021). Urban crime and security. In W. Shi, M. F. Goodchild, M. Batty, M.-P. Kwan, & A. Zhang (Eds.), Urban informatics (pp. 213–228). Springer. https://doi.org/10.1007/978-981-15-8983-6_14
Code of the District of Columbia. (2025). Title 22: Criminal offenses and penalties. https://code.dccouncil.gov/us/dc/council/code/titles/22
Golash-Boza, T., & Oh, H. (2021). Crime and neighborhood change in the nation’s capital: From disinvestment to gentrification. Crime & Delinquency, 67(9), 1267–1294. https://doi.org/10.1177/00111287211005394
McCollister, K. E., French, M. T., & Fang, H. (2010). The cost of crime to society: New crime‑specific estimates for policy and program evaluation. Drug and Alcohol Dependence, 108(1–2), 98–109. https://doi.org/10.1016/j.drugalcdep.2009.12.002
Van Wilsem, J., Wittebrood, K., & de Graaf, N. D. (2006). Socioeconomic dynamics of neighborhoods and the risk of crime victimization: A multilevel study of improving, declining, and stable areas in the Netherlands. Social Problems, 53(2), 226–247. https://doi.org/10.1525/sp.2006.53.2.226
Weisburd, D., Telep, C. W., Hinkle, J. C., & Eck, J. E. (2010). Is problem‑oriented policing effective in reducing crime and disorder? Findings from a Campbell systematic review. Criminology & Public Policy, 9(1), 139–172. https://onlinelibrary.wiley.com/doi/10.1111/j.1745-9133.2010.00617.x
9 Appendix
9.1 First Stage for Data Processing (DC Open Data)
9.1.1 Load crime incident data (2008–2025) from DC Open Data portal
# Read the base data
crime_incidents <- read_csv("data/DC/crime_incidents/Crime_Incidents_in_2008 - 2025.csv")

# Check different categories of crime
crime_incidents |>
  count(OFFENSE, sort = TRUE)
sort(unique(crime_incidents$OFFENSE))

# Standardize column names for easier downstream referencing
crime_incidents <- janitor::clean_names(crime_incidents)
9.1.2 Create outcome variables
# Attach felony/not felony flag
crime_incidents <- crime_incidents |>
  mutate(felony_flag = case_when(
    offense %in% c("HOMICIDE", "SEX ABUSE", "ASSAULT W/DANGEROUS WEAPON",
                   "ROBBERY", "BURGLARY", "ARSON", "MOTOR VEHICLE THEFT") ~ 1,
    TRUE ~ 0
  ))

# Assign a numerical severity score to each felony type for ranking
crime_incidents <- crime_incidents |>
  mutate(felony_severity = case_when(
    offense == "HOMICIDE" ~ 7,
    offense == "SEX ABUSE" ~ 6,
    offense == "ASSAULT W/DANGEROUS WEAPON" ~ 5,
    offense == "ROBBERY" ~ 4,
    offense == "ARSON" ~ 3,
    offense == "BURGLARY" ~ 2,
    offense == "MOTOR VEHICLE THEFT" ~ 1,
    TRUE ~ 0 # not felony
  ))
Here we assign a felony_severity score to each major offense type, using the Cambridge Crime Harm Index (CHI) and per-case cost data as reference. CHI sentence days reflect relative harm severity as determined by sentencing guidelines. While monetary cost gives an economic impact perspective, the severity ranking primarily follows the CHI framework.
Felony Offense | felony_severity Score | CHI Sentence Days (per case) | Total Cost per Case (USD, 2008) |
---|---|---|---|
HOMICIDE | 7 | 5,475 | $8,982,907 |
SEX ABUSE (Rape/Sexual Assault) | 6 | 1,825 / 365 | $240,776 |
ASSAULT W/DANGEROUS WEAPON | 5 | 1,460 (GBH) | $107,020 (Aggravated Assault) |
ROBBERY | 4 | 365 | $42,310 |
ARSON | 3 | 33 | $21,103 |
BURGLARY | 2 | 20 | $6,462 (Household) |
MOTOR VEHICLE THEFT | 1 | 20 | $10,772 |
THEFT F/AUTO / THEFT/OTHER | 0 | 20 / 2 | $3,532 |
Notes:
ASSAULT W/DANGEROUS WEAPON is aligned with GBH (Grievous Bodily Harm, Intent) rather than ABH (Actual Bodily Harm), due to its higher violence and harm level.
MOTOR VEHICLE THEFT has a higher monetary cost than BURGLARY, but both offenses carry similar CHI sentence days (20). However, BURGLARY often involves unlawful entry into private residences, violating a victim’s sense of safety and security. This intrusion into personal space is considered more psychologically harmful and socially disruptive than property theft alone, which is why burglary is ranked higher in severity.
THEFT F/AUTO and THEFT/OTHER are not classified as felonies in this model; they receive a score of 0 despite having minor CHI and cost estimates. While D.C. law does consider theft exceeding $1,000 a felony, we exclude these categories because the offense labels in the dataset do not reliably distinguish petty theft from felony-level theft. Without clear thresholds or legal classifications in the offense variable, treating all theft as non-felony ensures greater consistency in modeling.
9.1.3 Add Features
crime_sf <- st_as_sf(crime_incidents,
                     coords = c("longitude", "latitude"),
                     crs = 4326,
                     remove = FALSE)
Liquor
# Liquor Moratorium Zone (polygon)
# Read shapefile: alcohol sale restriction areas (moratorium zones)
liquor_zone_sf <- st_read("data/DC/liquor/Alcoholic_Beverage_and_Cannabis_Administration_Moratorium_Zones/Alcoholic_Beverage_and_Cannabis_Administration_Moratorium_Zones.shp")

# Check geometry type
# st_geometry_type(liquor_zone_sf)
table(st_geometry_type(liquor_zone_sf))

# Keep only polygon type geometries
liquor_zone_sf <- liquor_zone_sf |>
  filter(st_geometry_type(liquor_zone_sf) %in% c("POLYGON", "MULTIPOLYGON"))

# Convert to metric CRS (EPSG:3857) to align with `crime_m` for subsequent spatial operations
crime_m <- st_transform(crime_sf, crs = 3857)
liquor_zone_m <- st_transform(liquor_zone_sf, 3857)

# Spatial intersection: check whether each crime falls within any alcohol restriction zone polygon
crime_m$in_liquor_moratorium_zone <- lengths(st_intersects(crime_m, liquor_zone_m)) > 0

# Convert to integer (0/1) for modeling purposes
crime_m$in_liquor_moratorium_zone <- as.integer(crime_m$in_liquor_moratorium_zone)

# Inspect distribution to assess data validity
table(crime_m$in_liquor_moratorium_zone)

# Drop geometry and convert to data.frame for further analysis and modeling
crime_enriched <- crime_m |>
  st_drop_geometry() |>
  select(everything())
# Liquor Store Location
# Read CSV file containing liquor store information
liquor_csv <- read_csv("data/DC/liquor/Liquor_Licenses.csv")

# Check data structure
# glimpse(liquor_csv)

# Check coordinate fields
summary(liquor_csv$LONGITUDE)
summary(liquor_csv$LATITUDE)
table(liquor_csv$STATUS)

# Keep active licenses and filter out records with missing latitude or longitude
liquor_csv_clean <- liquor_csv |>
  filter(STATUS == "Active") |>
  filter(!is.na(LONGITUDE), !is.na(LATITUDE)) # Handle missing values in the raw data

# Convert to sf object with geographic CRS (EPSG:4326)
liquor_store_sf <- st_as_sf(liquor_csv_clean, coords = c("LONGITUDE", "LATITUDE"), crs = 4326)

# st_geometry_type(liquor_store_sf)

# Convert to metric CRS (EPSG:3857) to align with `crime_m`
liquor_store_m <- st_transform(liquor_store_sf, 3857)

# Compute distance matrix: distances from each crime to all liquor stores
dist_matrix_liquorstore <- st_distance(crime_m, liquor_store_m)

# Generate feature columns
# Distance to the nearest liquor store (in meters)
crime_m$nearest_liquorstore_dist <- apply(dist_matrix_liquorstore, 1, min)

# Flag indicating if there is a liquor store within 200 meters
crime_m$near_liquorstore_200m <- apply(dist_matrix_liquorstore, 1, function(x) any(x < 200))
crime_m$near_liquorstore_200m <- as.integer(crime_m$near_liquorstore_200m)

# Inspect distribution
summary(crime_m$nearest_liquorstore_dist)
table(crime_m$near_liquorstore_200m)
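Note that st_distance() materializes a full crimes-by-stores matrix, which can be memory-intensive for large incident counts. A lighter-weight alternative (a sketch producing the same nearest-distance feature) uses st_nearest_feature():
# index of the nearest store for each crime, then element-wise distances
idx <- st_nearest_feature(crime_m, liquor_store_m)
d   <- st_distance(crime_m, liquor_store_m[idx, ], by_element = TRUE)
crime_m$nearest_liquorstore_dist_alt <- as.numeric(d)  # meters under EPSG:3857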
WiFi
# WiFi spots (point)
# Read WiFi shapefile
wifi_sf <- st_read("data/DC/wireless_hotspots/Wireless_Hotspots_from_DC_Government/Wireless_Hotspots_from_DC_Government.shp")

# Convert to WGS84 geographic CRS (EPSG:4326)
wifi_sf <- st_transform(wifi_sf, crs = 4326)

# Check geometry type
# st_geometry_type(wifi_sf)

# Convert to metric CRS (EPSG:3857) for spatial calculations
wifi_m <- st_transform(wifi_sf, 3857)

# Compute distance matrix (crime x wifi)
dist_matrix_wifi <- st_distance(crime_m, wifi_m)

# Flag whether each crime has a WiFi point within 100 meters
crime_m$near_wifi_100m <- apply(dist_matrix_wifi, 1, function(x) any(x < 100))
crime_m$near_wifi_100m <- as.integer(crime_m$near_wifi_100m)

# Optional: compute actual distance to nearest WiFi point (for visualization)
crime_m$nearest_wifi_dist <- apply(dist_matrix_wifi, 1, min)

# Inspect distribution
table(crime_m$near_wifi_100m)
summary(crime_m$nearest_wifi_dist)
Low Food Access
# Low food access (polygon)
# Load low food access shapefile
lowfood_sf <- st_read("data/DC/low_food_access/Low_Food_Access_Areas/Low_Food_Access_Areas.shp")

# Convert to geographic CRS (EPSG:4326)
lowfood_sf <- st_transform(lowfood_sf, 4326)

# st_geometry_type(lowfood_sf)

# Convert to metric CRS (EPSG:3857) for spatial calculations (polygons)
lowfood_m <- st_transform(lowfood_sf, 3857)
# st_geometry_type(lowfood_m)

# Spatial intersection: determine if within low-food-access polygon area
crime_m$in_lowfood_zone <- lengths(st_intersects(crime_m, lowfood_m)) > 0

# Convert to integer (0/1) for modeling
crime_m$in_lowfood_zone <- as.integer(crime_m$in_lowfood_zone)

# Inspect distribution
table(crime_m$in_lowfood_zone)
Grocery Store
# Grocery Store (point)
# Load and convert
grocery_sf <- st_read("data/DC/grocery_store/Grocery_Store_Locations/Grocery_Store_Locations.shp")
grocery_sf <- st_transform(grocery_sf, 4326)
# st_geometry_type(grocery_sf)
grocery_m <- st_transform(grocery_sf, 3857)

# Calculate distance
dist_matrix_grocery <- st_distance(crime_m, grocery_m)

# Add features
crime_m$nearest_grocery_dist <- apply(dist_matrix_grocery, 1, min)
crime_m$near_grocery_300m <- apply(dist_matrix_grocery, 1, function(x) any(x < 300))
crime_m$near_grocery_300m <- as.integer(crime_m$near_grocery_300m)

# Inspect distribution
table(crime_m$near_grocery_300m)
summary(crime_m$nearest_grocery_dist)
Bank
# Bank Location (point)
bank_sf <- st_read("data/DC/bank_location/Bank_Locations/Bank_Locations.shp")
bank_sf <- st_transform(bank_sf, 4326)

# st_geometry_type(bank_sf)
bank_m <- st_transform(bank_sf, 3857)

dist_matrix_bank <- st_distance(crime_m, bank_m)

crime_m$nearest_bank_dist <- apply(dist_matrix_bank, 1, min)
crime_m$near_bank_250m <- apply(dist_matrix_bank, 1, function(x) any(x < 250))
crime_m$near_bank_250m <- as.integer(crime_m$near_bank_250m)

# Inspect distribution
table(crime_m$near_bank_250m)
summary(crime_m$nearest_bank_dist)
Vending Zone
# Vending Zone (polygon)
vending_sf <- st_read("data/DC/street_vending/Street_Vending_Zones/Street_Vending_Zones.shp")

# st_geometry_type(vending_sf)
vending_sf <- st_transform(vending_sf, 3857)
# Polygons only, so only match against crime_m
crime_m$in_vending_zone <- lengths(st_intersects(crime_m, vending_sf)) > 0
crime_m$in_vending_zone <- as.integer(crime_m$in_vending_zone)

# Inspect distribution
table(crime_m$in_vending_zone)
Police Sector
# Read the shapefile of police sectors
police_sector_sf <- st_read("data/DC/police_sector_division/Police_Sectors/Police_Sectors.shp")

# Check geometry type (usually POLYGON or MULTIPOLYGON)
# table(st_geometry_type(police_sector_sf))

# Convert projection to EPSG:3857 (meters) to match crime_m
police_sector_m <- st_transform(police_sector_sf, 3857)

# Check if each crime falls within a police sector polygon
# This will assign the sector name/code to each crime (if available)
crime_m <- st_join(crime_m, police_sector_m |>
                     select(police_sector = SECTOR), left = TRUE)

# Check how many NA (missing sectors)
# table(!is.na(crime_m$police_sector)) # No NA
table(crime_m$police_sector)

# Optional: Convert to factor (if categorical)
crime_m$police_sector <- as.factor(crime_m$police_sector)
Military Base
# Read shapefile of military bases
military_sf <- st_read("data/DC/military_base/Military_Bases/Military_Bases.shp")

# table(st_geometry_type(military_sf))

# Check and transform CRS to EPSG:3857
military_m <- st_transform(military_sf, 3857)

# Check if each crime falls inside a military base polygon
crime_m$in_military_base <- lengths(st_intersects(crime_m, military_m)) > 0

# Convert to 0/1
crime_m$in_military_base <- as.integer(crime_m$in_military_base)

# Optional summary check
table(crime_m$in_military_base)
9.1.4 Enriched crime_incidents Data
# Retain original sf object for spatial analysis
crime_sf_final <- crime_m

# Create modeling dataframe: drop geometry
crime_enriched <- crime_m |>
  st_drop_geometry()

# Export modeling data
write_csv(crime_enriched, "data/crime_enriched.csv")
crime_enriched <- read_csv("data/crime_enriched.csv")
9.2 Second Stage for Data Processing (ACS Data)
Here we are going to use the tidycensus package and ACS data to further enrich our data frame. To do so, you need a Census API key first.
9.2.1 Set up API for ACS data
If you have already installed a Census API key in your R environment, use this code to check:
Sys.getenv("CENSUS_API_KEY")
If not, open “https://api.census.gov/data/key_signup.html” to get one, then:
if (!file.exists("~/.Renviron") || !any(grepl("CENSUS_API_KEY", readLines("~/.Renviron")))) {
  census_api_key("<YOUR_API_KEY_HERE>", install = TRUE)
  message("API key added. Please restart your R session.")
} else {
  message("Census API key already installed.")
}
Check each ACS5 variable using the URL pattern “https://api.census.gov/data/year/acs/acs5/variables.html”, where year can be any value from 2009 to 2023. For example, the link for ACS5 (2005–2009) is: https://api.census.gov/data/2009/acs/acs5/variables.html
# Extract predictors of interest
acs_vars_named <- c(
  # econ and employment
median_income = "B19013_001",
poverty_total = "B17001_001",
poverty_count = "B17001_002",
unemployed = "B23025_005",
labor_force = "B23025_003",
# race and ethnicity
black_count = "B02001_003",
total_race = "B02001_001",
hispanic_count = "B03003_003",
total_ethnicity = "B03003_001",
foreign_born = "B05002_013",
total_pop_fb = "B05002_001",
# household resources
renter = "B25003_003",
total_tenure = "B25003_001",
no_vehicle = "B08201_002",
total_vehicle = "B08201_001",
# family structure
single_mom = "B11003_010",
total_family = "B11003_001",
# education:high school +
edu_hsgrad = "B15003_017",
edu_somecollege = "B15003_018",
edu_associate = "B15003_019",
edu_bachelor = "B15003_020",
edu_master = "B15003_021",
edu_professional = "B15003_022",
edu_doctorate = "B15003_023",
edu_other1 = "B15003_024",
edu_other2 = "B15003_025",
total_edu = "B15003_001",
# Age:< 18 & 65 +
# male < 18
under18_m1 = "B01001_003",
under18_m2 = "B01001_004",
under18_m3 = "B01001_005",
under18_m4 = "B01001_006",
# female <18
under18_f1 = "B01001_027",
under18_f2 = "B01001_028",
under18_f3 = "B01001_029",
under18_f4 = "B01001_030",
# male 65+
age65plus_m1 = "B01001_020",
age65plus_m2 = "B01001_021",
age65plus_m3 = "B01001_022",
age65plus_m4 = "B01001_023",
age65plus_m5 = "B01001_024",
age65plus_m6 = "B01001_025",
# female 65+
age65plus_f1 = "B01001_044",
age65plus_f2 = "B01001_045",
age65plus_f3 = "B01001_046",
age65plus_f4 = "B01001_047",
age65plus_f5 = "B01001_048",
age65plus_f6 = "B01001_049",
# total population
total_age = "B01001_001" # total population of different age
)
9.2.2 Variable Availability Check (2009-2023)
Before running the data extraction, we first check whether these variables are available for every year from 2009 to 2023, then adjust the year range accordingly.
# Function: check if variables are valid for a given year
check_vars <- function(year, var_vector) {
  cat("Checking year:", year, "...\n")
  available <- load_variables(year, "acs5", cache = TRUE)
  var_names <- names(var_vector)
  matched <- var_vector %in% available$name
  return(data.frame(
    year = year,
    var_label = var_names,
    var_code = var_vector,
    exists = matched
  ))
}

# Check each year from 2009 to 2023
var_check_df <- purrr::map_dfr(2009:2023, ~check_vars(.x, acs_vars_named))

# Identify which variables are missing in which years
missing_table <- var_check_df |>
  filter(!exists) |>
  arrange(var_label, year)
9.2.3 Starting ACS Data Retrieval from 2012
Our checks show that data from 2009 to 2011 is incomplete. Since incomplete data cannot yield reliable predictions and we aim to forecast for the most recent year (2023), we will begin retrieving data from 2012 onward.
years <- 2012:2023 # 5-year; available from 2009-2023

acs_raw <- map_dfr(years, function(y) {
  get_acs(
    geography = "tract",
    state = "11",
    year = y,
    variables = acs_vars_named,
    survey = "acs5",
    geometry = FALSE
  ) |>
    mutate(year = y)
})

# Export acs raw data
write_csv(acs_raw, "data/acs_raw.csv")
acs_raw <- read_csv("data/acs_raw.csv")
Prepare ACS data for merging by pivoting and creating derived features.
acs_wide <- acs_raw |>
  select(GEOID, year, variable, estimate) |>
pivot_wider(
names_from = variable,
values_from = estimate
)
acs_processed <- acs_wide |>
  mutate(
# Proportion variables
poverty_rate = poverty_count / poverty_total,
black_pct = black_count / total_race,
hispanic_pct = hispanic_count / total_ethnicity,
foreign_born_pct = foreign_born / total_pop_fb,
renter_pct = renter / total_tenure,
no_vehicle_pct = no_vehicle / total_vehicle,
singlemom_pct = single_mom / total_family,
unemployed_rate = unemployed / labor_force,
# Education proportion (sum of high school and above)
hsplus_total = edu_hsgrad + edu_somecollege + edu_associate + edu_bachelor + edu_master + edu_professional + edu_doctorate + edu_other1 + edu_other2,
hsplus_pct = hsplus_total / total_edu,
# Age structure proportions
under18_total = under18_m1 + under18_m2 + under18_m3 + under18_m4 + under18_f1 + under18_f2 + under18_f3 + under18_f4,
under18_pct = under18_total / total_age,
age65plus_total = age65plus_m1 + age65plus_m2 + age65plus_m3 + age65plus_m4 + age65plus_f1 + age65plus_f2 + age65plus_f3 + age65plus_f4 + age65plus_m5 + age65plus_m6 + age65plus_f5 + age65plus_f6,
age65plus_pct = age65plus_total / total_age,
# Create tract col for join
tract = stringr::str_sub(GEOID, -6)
)
# Export acs processed data
write_csv(acs_processed, "data/acs_processed.csv")
<- read_csv("data/acs_processed.csv") acs_processed
9.2.4 Handling Missing Values in Tract Column
Because we want to retain the maximum number of valid tract-level geospatial features, we need to verify that the DC Open Data, ACS5 data, and official DC tract data are consistent before merging them. We first compare the tract IDs in crime_enriched and acs_processed with the official 2023 shapefile to identify missing or obsolete tracts.
# Load 2023 official DC tract shapefile
dc_tract_sf <- st_read("data/tl_2023_11_tract/tl_2023_11_tract.shp")

# Extract tract lists
official_tract6 <- unique(dc_tract_sf$TRACTCE)
official_tract11 <- unique(dc_tract_sf$GEOID)

# Compare crime_enriched (census_tract) with dc_tract_sf$TRACTCE
crime_tracts <- unique(na.omit(crime_enriched$census_tract))
missing_in_crime <- setdiff(official_tract6, crime_tracts)
unexpected_in_crime <- setdiff(crime_tracts, official_tract6)

# Compare acs_processed GEOID with official dc_tract_sf$GEOID
acs_tracts <- unique(na.omit(acs_processed$GEOID))
missing_in_acs <- setdiff(official_tract11, acs_tracts)
unexpected_in_acs <- setdiff(acs_tracts, official_tract11)
The official DC shapefile (dc_tract_sf) contains 206 tracts, while both crime_enriched$census_tract and acs_processed$GEOID contain 231 unique tract identifiers after dropping NAs. This means 25 tracts are present in the datasets but missing from the 2023 TIGER shapefile.
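These counts can be confirmed directly from the objects created above (a quick sketch):
# Tract counts behind the statement above
length(official_tract6)      # 206 tracts in the 2023 shapefile
length(crime_tracts)         # 231 unique tract codes in the crime data
length(unexpected_in_crime)  # 25 codes absent from the shapefile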
Note: Only crime incidents from DC Open Data have NAs in the tract column. There are no NAs in the ACS5 tract column, meaning every tract had its demographic features calculated in the ACS survey. The few crime incidents without a recorded tract likely arise from law enforcement operations in which detailed location data was not captured.
Since the crime_enriched data spans 2008–2025 and the acs_processed data covers 2012–2023, it is likely that tract definitions changed over the years through splits, merges, or renumbering. Such changes can leave older tract codes in historical data even though they no longer appear in the most recent shapefile. To confirm this, we compare the unexpected tracts in crime_enriched and acs_processed against the official 2023 shapefile to see whether they are the same.
# Convert 11-digit ACS GEOIDs to 6-digit tract codes
unexpected_in_acs_tract6 <- substr(unexpected_in_acs, 6, 11)

# Check if both datasets contain the same extra tracts
setequal(unexpected_in_crime, unexpected_in_acs_tract6)

# Inspect overlap and differences
intersect(unexpected_in_crime, unexpected_in_acs_tract6)
setdiff(unexpected_in_crime, unexpected_in_acs_tract6)
setdiff(unexpected_in_acs_tract6, unexpected_in_crime)
The results show that the two datasets share the exact same set of 25 unexpected tract codes. To investigate whether these are legacy codes that have since been phased out, we now filter both datasets to observations from 2019 to 2023, the window we plan to use for our predictive models, and compare the resulting tract coverage to the official 2023 DC shapefile.
# Filter crime data to 2019–2023
crime_filtered <- crime_enriched |>
  mutate(report_date = lubridate::ymd_hms(str_sub(report_dat, 1, 19))) |>
  filter(year(report_date) >= 2019 & year(report_date) <= 2023)

# Filter ACS data to 2019–2023
acs_filtered <- acs_processed |>
  filter(year >= 2019 & year <= 2023)

# Extract unique tract identifiers
crime_tracts_filtered <- unique(na.omit(crime_filtered$census_tract))
acs_tracts_filtered <- unique(na.omit(acs_filtered$GEOID))
We then identify any tracts that appear in the filtered datasets (2019-2023) but are not included in the 2023 TIGER shapefile.
# Compare against 2023 official tracts
missing_in_crime_2019_2023 <- setdiff(official_tract6, crime_tracts_filtered)
unexpected_in_crime_2019_2023 <- setdiff(crime_tracts_filtered, official_tract6)
missing_in_acs_2019_2023 <- setdiff(official_tract11, acs_tracts_filtered)
unexpected_in_acs_2019_2023 <- setdiff(acs_tracts_filtered, official_tract11)
The crime data from 2019–2023 aligns perfectly with the 2023 TIGER shapefile, showing no extra tracts. The ACS data, however, still contains 25 unexpected tracts, likely because ACS 5-year estimates span earlier years and retain older tract definitions. To ensure consistency across all datasets, we will restrict our analysis to the 2020–2023 period.
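A quick count verifies this asymmetry (a sketch over the objects above):
# Crime tracts now match the shapefile; ACS still carries legacy codes
length(unexpected_in_crime_2019_2023) # expected: 0
length(unexpected_in_acs_2019_2023)   # expected: 25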
To investigate further, we check whether these 25 tracts are the same as the unexpected tracts found in the full datasets from 2008–2025 (crime incidents) and 2012–2023 (ACS5).
# Check if the 25 unexpected tracts in 2019–2023 ACS match those in the full ACS data
setequal(unexpected_in_acs_2019_2023, unexpected_in_acs)

# Trim ACS GEOIDs to 6-digit tract codes for comparison with crime incidents data
unexpected_in_acs_tract6_filtered <- substr(unexpected_in_acs_2019_2023, 6, 11)

# Check if these match the unexpected tracts in the full crime incidents dataset
setequal(unexpected_in_acs_tract6_filtered, unexpected_in_crime)
In the full datasets:
- Both crime_enriched (2008–2025) and acs_processed (2012–2023) contain 25 extra tracts not present in the 2023 TIGER shapefile.
- These 25 tracts are identical across both datasets, indicating they are likely legacy tracts.
In the 2019–2023 subset:
- The unexpected tracts are gone from the crime incidents data.
- However, they still exist in the ACS data.
This suggests that the outdated tracts persist in ACS because of the rolling nature of 5-year estimates.
To test this further, we narrow the ACS data to 2020–2023. Because the earliest release in this window (the 2016–2020 5-year estimates) only includes data from 2016 onward, we expect the outdated tracts to drop out if the definitions changed before or around 2016.
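The underlying arithmetic: an ACS 5-year release labeled with end year y pools data from y - 4 through y. A one-line illustration:
# Earliest data year covered by an ACS 5-year release
acs5_start <- function(end_year) end_year - 4
acs5_start(2019) # 2015: the 2019 release still reflects pre-2016 tracts
acs5_start(2020) # 2016: the 2020 release does not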
# Filter ACS to 2020–2023
acs_filtered_2020 <- acs_processed |>
  filter(year >= 2020 & year <= 2023)

acs_tracts_filtered_2020 <- unique(acs_filtered_2020$GEOID)

# Compare with official 2023 tract list
missing_in_acs_2020_2023 <- setdiff(official_tract11, acs_tracts_filtered_2020)
unexpected_in_acs_2020_2023 <- setdiff(acs_tracts_filtered_2020, official_tract11)

# Compare with 2019-2023 ACS tract list
setdiff(acs_tracts_filtered, acs_tracts_filtered_2020)
setdiff(acs_tracts_filtered_2020, acs_tracts_filtered)

# Check if all unexpected tracts are gone
acs_tract6_filtered_2020 <- substr(acs_tracts_filtered_2020, 6, 11)
setequal(crime_tracts_filtered, acs_tract6_filtered_2020)
We find that the 2020–2023 ACS data now matches the 2023 TIGER shapefile and the 2019–2023 crime data exactly. The outdated tract codes appeared in the 2019–2023 results but disappeared in the 2020–2023 results. This is likely because ACS 5-year estimates include data from earlier years: the 2019 release covers back to 2015, while the 2020 release covers back to 2016, so the tract changes most plausibly occurred between 2015 and 2016. Next, we compare the TIGER shapefiles from 2015 and 2016 to test this hypothesis and confirm when the boundary changes were officially implemented.
# Load 2015 and 2016 DC tract shapefiles
tract_2015 <- st_read("data/tl_2015_11_tract/tl_2015_11_tract.shp")
tract_2016 <- st_read("data/tl_2016_11_tract/tl_2016_11_tract.shp")

# Extract unique GEOIDs
geoids_2015 <- unique(tract_2015$GEOID)
geoids_2016 <- unique(tract_2016$GEOID)

# Compare
setdiff(geoids_2015, geoids_2016)
setdiff(geoids_2016, geoids_2015)
The results show our assumption was incorrect: no tract changes occurred between 2015 and 2016, and the GEOIDs are identical for both years.
To dig further into when the tract definitions changed, we consulted the U.S. Census Bureau's 2010 guide, which states that DC had 179 census tracts as of 2010. However, the 2023 DC TIGER shapefile includes 206 tracts, and the same count appears in the 2020–2023 ACS 5-year data and the 2019–2023 crime incident data.
This led to a second hypothesis: DC's total number of tracts increased between 2010 and 2023.
# Load 2010 DC tracts
dc_tract_2010 <- st_read("data/tl_2010_11001_tract10/tl_2010_11001_tract10.shp")

# Extract tract GEOIDs (2010 TIGER files suffix field names with "10")
dc_tracts_2010_ids <- unique(dc_tract_2010$GEOID10)
dc_tracts_2023_ids <- unique(dc_tract_sf$GEOID)

# Compare tract counts and differences
length(dc_tracts_2010_ids) # 179
length(dc_tracts_2023_ids) # 206
setdiff(dc_tracts_2023_ids, dc_tracts_2010_ids) # 52 added
setdiff(dc_tracts_2010_ids, dc_tracts_2023_ids) # 25 removed
From this comparison, we see that 52 tracts were added and 25 removed between 2010 and 2023, a net increase of 27 that confirms DC's tract count grew from 179 to 206 over this period. While this change is not explicitly documented in a single source, it can be deduced by comparing shapefiles. To pinpoint the year of the change, we repeat the comparison year by year. Since both 2015 and 2016 still had the same 179 tracts as 2010, we begin checking from 2017.
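The pairwise checks below can also be expressed as a compact loop; compare_years is a hypothetical helper that assumes the TIGER file layout used throughout this section:
# Hypothetical helper: count tract additions/removals between consecutive years
compare_years <- function(y1, y2) {
  path <- function(y) sprintf("data/tl_%d_11_tract/tl_%d_11_tract.shp", y, y)
  g1 <- unique(st_read(path(y1), quiet = TRUE)$GEOID)
  g2 <- unique(st_read(path(y2), quiet = TRUE)$GEOID)
  tibble(from = y1, to = y2,
         removed = length(setdiff(g1, g2)),
         added = length(setdiff(g2, g1)))
}
purrr::map2_dfr(2017:2019, 2018:2020, compare_years)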
# Load 2017-2020 DC tract shapefiles
tract_2017 <- st_read("data/tl_2017_11_tract/tl_2017_11_tract.shp")
tract_2018 <- st_read("data/tl_2018_11_tract/tl_2018_11_tract.shp")
tract_2019 <- st_read("data/tl_2019_11_tract/tl_2019_11_tract.shp")
tract_2020 <- st_read("data/tl_2020_11_tract/tl_2020_11_tract.shp")

# Extract unique GEOIDs
geoids_2017 <- unique(tract_2017$GEOID)
geoids_2018 <- unique(tract_2018$GEOID)
geoids_2019 <- unique(tract_2019$GEOID)
geoids_2020 <- unique(tract_2020$GEOID)

# Compare year by year
setdiff(geoids_2017, geoids_2018)
setdiff(geoids_2018, geoids_2017)
setdiff(geoids_2018, geoids_2019)
setdiff(geoids_2019, geoids_2018)
setdiff(geoids_2019, geoids_2020)
setdiff(geoids_2020, geoids_2019)
The results confirm that the major tract update occurred between 2019 and 2020. During this transition, 25 tracts were removed and 52 were added, increasing the total from 179 to 206.
While ACS and crime data each follow their own update cycles, we have verified that for the years 2020–2023, the tracts in both ACS 5-year estimates and crime incidents data fully match the 2023 TIGER shapefile.
Therefore, to ensure consistency across all datasets, we revise our modeling window to cover only the four years from 2020 to 2023, instead of the originally planned 2019–2023 period.
9.2.5 Merging crime_enriched and ACS5 Data
Clean and prepare the crime data to match the ACS structure.
# Select required fields from `crime_enriched`
crime_enriched <- crime_enriched |>
  mutate(
    # `report_dat` is a string like "2008/12/12 22:00:00+00"
    # Parse with `ymd_hms()` to automatically detect timezone
    report_date_parsed = ymd_hms(report_dat),
    year = year(report_date_parsed)
  ) |>
  rename(tract = census_tract) # rename census_tract for join
Now we merge the cleaned crime data with the ACS data by matching on tract and year.
# Perform left join by tract and year
crime_enriched <- left_join(
  crime_enriched,
  acs_processed,
  by = c("tract" = "tract", "year" = "year")
)

write_csv(crime_enriched, "data/crime_enriched_with_acs.csv")
crime_enriched <- read_csv("data/crime_enriched_with_acs.csv")
As a final step to keep only valid data, we restrict to the revised modeling window and inspect missingness:
crime_enriched_filtered <- crime_enriched |>
  filter(year >= 2020, year <= 2023)

# Inspect which variables have missing values
colSums(is.na(crime_enriched_filtered))
9.2.6 Handling Missing Values from the Merge
During the merge of DC Open Data and ACS datasets, several variables introduced a high proportion of missing values. Rather than dropping every row with any NA, which would remove some tracts entirely, we proceed in stages: we first drop columns whose overall missing rate exceeds 30%, drop rows with no recorded tract, then drop variables whose missing rate exceeds 95% within any (year, tract) group, and finally remove the remaining NA-bearing columns that are not used for modeling.
# Calculate missing value proportion for each column
na_ratio <- colSums(is.na(crime_enriched_filtered)) / nrow(crime_enriched_filtered)

# Retain columns with missing rate < 30%
crime_enriched_filtered <- crime_enriched_filtered |>
  select(which(na_ratio < 0.3))

# Drop rows where `tract` is NA
crime_enriched_filtered <- crime_enriched_filtered |>
  filter(!is.na(tract))

# Identify columns to check (excluding grouping columns)
cols_to_check <- setdiff(colnames(crime_enriched_filtered), c("year", "tract"))

# For each (year, tract) group, calculate the NA rate of each variable
na_summary <- crime_enriched_filtered |>
  group_by(year, tract) |>
  summarise(across(all_of(cols_to_check), ~ mean(is.na(.)), .names = "na_{col}")) |>
  ungroup()

# Drop variables with a missing rate above 95% in any (year, tract) group
vars_with_high_na <- na_summary |>
  pivot_longer(cols = starts_with("na_"), names_to = "var", values_to = "na_rate") |>
  filter(na_rate > 0.95) |>
  mutate(var = sub("^na_", "", var)) |>
  distinct(var) |>
  pull(var)

crime_enriched_filtered <- crime_enriched_filtered |>
  select(-all_of(vars_with_high_na))

# Check how many NA values remain per column
colSums(is.na(crime_enriched_filtered))

crime_enriched_filtered |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(cols = everything(), names_to = "column", values_to = "na_count") |>
  filter(na_count > 0) |>
  arrange(desc(na_count))

# Drop known NA columns that are not used for modeling
crime_enriched_no_na <- crime_enriched_filtered |>
  select(-c(end_date, psa, district, start_date, block_group, voting_precinct))

# Confirm no NA values remain in crime_enriched_no_na
sum(is.na(crime_enriched_no_na))

write_csv(crime_enriched_no_na, "data/crime_enriched_acs_nona.csv")
crime_enriched_no_na <- read_csv("data/crime_enriched_acs_nona.csv")

# Check the tract column
unique(crime_enriched_no_na$tract)
9.3 Final Stage for Data Processing (NIBRS Data)
We examined the UCR Program data from Harvard Dataverse, including its administrative, offense, offender, arrestee, and victim segments. Although these tables include a proper year column we could merge on, they lack a tract-number column; in fact, these datasets cover the entire US, so they would be more useful for a state-level predictive model than for a tract-level model within DC. We therefore decided not to join the NIBRS data into our DC crime-prediction model.
<- readRDS("data/DC/nibrs/admin/nibrs_administrative_segment_2021.rds") |>
admin_2023 filter(state_abb == "dc")
<- readRDS("data/DC/nibrs/offense/nibrs_offense_segment_2023.rds") |>
offense_2023 filter(state_abb == "dc")
<- readRDS("data/DC/nibrs/offender/nibrs_offender_segment_2023.rds") |>
offender_2023 filter(state_abb == "dc")
<- readRDS("data/DC/nibrs/arrestee/nibrs_arrestee_segment_2023.rds") |>
arrestee_2023 filter(state_abb == "dc")
<- readRDS("data/DC/nibrs/victim/nibrs_victim_segment_2023.rds") |>
victim_2023 filter(state_abb == "dc")
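As a quick confirmation of the missing-geography point above (a sketch over the loaded segments), a search for any tract-level field should come back empty:
# Look for a tract column in the offense segment (none is expected)
grep("tract", names(offense_2023), value = TRUE, ignore.case = TRUE)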