Data Science for Public Policy

Policing by Prediction: Modeling Felony Risk and Severity Using Supervised Machine Learning

Author
Affiliation

Troy Cheng, Ziqiao Shan, Minji Kang, Sarah Krause

Georgetown University

Published

March 25, 2025

Modified

May 7, 2025

1 Literature Review

Cities, characterized by high population density, diverse socioeconomic landscapes, and intricate social interactions, often present environments where crime rates are elevated compared to less densely populated areas (Cheng & Chen, 2021). The District of Columbia, a city marked by pronounced neighborhood disparities and shifting crime dynamics, presents a timely case for predictive crime modeling (Golash-Boza & Oh, 2021). Our project investigates three key questions: To what extent can crime characteristics and spatial context predict whether an incident is a felony? How accurately can we classify the type of felony? And can we estimate the felony severity score associated with a newly reported crime in DC? These questions not only reflect academic interest in machine learning for social applications, but also offer practical insights for public-sector decision-making on law enforcement and urban resource allocation.

A robust body of literature underscores the importance of both spatial and socioeconomic factors in shaping crime patterns. Environmental criminology highlights how the built environment—liquor outlets, transit hubs, vending zones, and other public facilities—can serve as vectors of accessibility that increase the likelihood of certain offenses (Bowers & Johnson, 2011). Risk-terrain modeling shows that crimes often cluster around these spatial features. In parallel, socioeconomic disadvantage at the neighborhood level has long been associated with elevated rates of both violent and property crime. Multilevel studies confirm that factors such as poverty, unemployment, and educational attainment are closely linked to victimization risk (Van et al., 2006). By incorporating ACS 5-year estimates alongside geospatial indicators from D.C. Open Data, our model reflects a multidimensional understanding of crime exposure and allows for fine-grained prediction at the incident level.

Felony crimes represent the most severe and socially costly offenses in the criminal justice system. Legally defined in D.C. as offenses punishable by more than one year of imprisonment (Council of the District of Columbia, 2025), felonies include acts such as assault with a dangerous weapon, robbery, and homicide. These crimes disproportionately consume public safety resources and contribute to long-term social harm. The societal cost of violent felonies can reach tens of thousands of dollars per incident, factoring in medical care, lost productivity, and incarceration (McCollister et al., 2010). Prioritizing felony classification enables the development of targeted interventions and more efficient policing. In our study, the felony severity score serves as the core outcome in a regression task aimed at quantifying the harm associated with newly reported incidents.

While spatial and socioeconomic features help explain where and why crimes occur, effective crime prevention also depends on how institutions respond to these risks. Problem-oriented policing (POP) emphasizes proactive, context-sensitive interventions based on the SARA model—Scanning, Analysis, Response, and Assessment (Weisburd et al., 2010). Meanwhile, hot-spots policing strategies that concentrate patrols at high-risk micro-locations have been shown to reduce crime not only in targeted areas but also in adjacent zones through diffusion effects (Braga et al., 2014). These insights align closely with our project’s emphasis on prediction-informed intervention. Our modeling approach provides a practical tool for supporting targeted, evidence-based law enforcement strategies in DC.

2 Setup

library(tidyverse)
library(readr)
library(haven)
library(lubridate)
library(tidymodels)
library(tidycensus)
library(janitor)
library(sf)
library(plotly)
library(themis)
library(vip)
library(ranger)
library(xgboost)
library(rlang)
library(bestNormalize)

3 Prepare Data Set

In this project, we use a government and policy-relevant dataset to develop three types of models to predict:

  1. the likelihood of a crime being a felony in Washington, D.C. (binary classification task);

  2. which felony category a newly reported crime in DC falls under (multi-class classification task);

  3. the felony severity score of a newly reported crime in DC (regression task).

The data is compiled primarily from the DC Open Data Portal. The main source is a series of datasets titled Crime Incidents, released annually from 2008 to 2025, comprising a total of eighteen datasets. These include metadata for each crime incident, such as the exact date and time of occurrence, geographic coordinates (latitude and longitude), offense classification (e.g., theft, robbery), and police shift (e.g., evening, midnight).

To enrich the crime records with contextual and spatial information, we merged several additional geospatial datasets also provided by DC Open Data. These include:

  • Liquor license moratorium zones

  • Locations and proximity of liquor stores, grocery stores, and banks

  • Public WiFi zones and vending zones

  • Low food access areas

  • Police sector boundaries and military base zones

We also integrated tract-level socioeconomic indicators from the American Community Survey (ACS5) using the tidycensus package. These variables include poverty rate, income, renter percentage, race/ethnicity composition, vehicle ownership, and age structure. Notably, since ACS5 provides rolling estimates across five-year spans, we aligned each crime observation with the most recent ACS data available for that year. For example, the 2005–2009 ACS estimates are used to represent conditions in 2009. While not a perfect snapshot, this method offers a consistent way to approximate local context year by year. A sketch of one such pull appears below.
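The following sketch shows how one such indicator could be pulled with tidycensus. The variable codes and the poverty-rate construction are illustrative assumptions (our full variable list is documented in the Appendix), and a Census API key is assumed to be configured via census_api_key().

# Sketch: pull 2005-2009 ACS 5-year tract estimates for DC with tidycensus.
# Variable codes are illustrative; a Census API key must be set beforehand.
acs_2009 <- get_acs(
  geography = "tract",
  state = "DC",
  year = 2009,               # endyear of the 2005-2009 five-year span
  survey = "acs5",
  variables = c(
    poverty_num = "B17001_002",  # persons below the poverty level
    poverty_den = "B17001_001"   # persons for whom poverty status is determined
  ),
  output = "wide"
) |>
  mutate(poverty_rate = poverty_numE / poverty_denE)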

The final merged dataset (crime_enriched_acs_nona.csv) is available via this shared link.

The outcome variable, felony_flag, is derived from the reported offense type, indicating whether a reported crime is classified as a felony. This classification task is critical, as felonies typically represent more serious crimes and require distinct responses in terms of police allocation. Our goal is to build a predictive model that helps law enforcement deploy patrol resources more efficiently by forecasting felony likelihood using supervised machine learning techniques.

To ensure real-world model evaluation, we split the dataset by time: observations from 2020 to 2022 are used for training, while 2023 data is reserved for implementation testing. The detailed data processing and cleaning steps are documented in the Appendix section. The raw data sources are listed in the README.

# Load the cleaned and pre-merged dataset combining DC crime data and ACS indicators
crime_enriched_no_na <- read_csv("data/crime_enriched_acs_nona.csv")

# Select variables of interest for modeling
keep_vars <- c(
  "felony_flag", "felony_severity", # outcome variables
  "x", "y", "latitude", "longitude", "year",
  "shift", "method", "tract", "report_date_parsed",

  # GIS contextual features
  "in_liquor_moratorium_zone", "nearest_liquorstore_dist", "near_liquorstore_200m",
  "near_wifi_100m", "nearest_wifi_dist",
  "in_lowfood_zone",
  "nearest_grocery_dist", "near_grocery_300m",
  "nearest_bank_dist",
  "near_bank_250m", 
  "in_vending_zone", 
  "police_sector", 
  "in_military_base",

  # ACS socioeconomic features
  "poverty_rate", "unemployed_rate", 
  "black_pct", "hispanic_pct","foreign_born_pct", 
  "hsplus_pct", 
  "under18_pct", "age65plus_pct"
)

# Keep only selected variables
crime_enriched_small <- crime_enriched_no_na |> 
  select(all_of(keep_vars))

# Define training data (2020-2022)
crime_enriched_mod <- crime_enriched_small |>
  filter(year >= 2020, year < 2023)

# Define implementation/test data (2023)
crime_enriched_implement <- crime_enriched_small |>
  filter(year == 2023)

# Adjust variable data types for training data
crime_enriched_mod <- crime_enriched_mod |> 
  mutate(
    report_date_parsed = as.Date(report_date_parsed),
         felony_flag = factor(felony_flag)
         )

# Adjust variable data types for implementation data
crime_enriched_implement <- crime_enriched_implement |> 
  mutate(
    report_date_parsed = as.Date(report_date_parsed),
         felony_flag = factor(felony_flag)
         )

# Apply Yeo-Johnson transformation to the felony_severity column
yj_obj <- yeojohnson(crime_enriched_mod$felony_severity)

# Add transformed column to the training dataset
crime_enriched_mod$felony_severity_trans <- predict(yj_obj)

# Apply Yeo-Johnson transformation to the felony_severity column
yj_obj_impl <- yeojohnson(crime_enriched_implement$felony_severity)

# Add transformed column to the implement dataset
crime_enriched_implement$felony_severity_trans <- predict(yj_obj_impl)
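Because later regression predictions will live on the transformed scale, it helps to map them back for interpretation. A minimal sketch using bestNormalize's inverse option; trans_values here is just an illustrative vector of transformed scores:

# Sketch: invert the Yeo-Johnson transformation back to raw severity scores.
trans_values <- head(crime_enriched_mod$felony_severity_trans)
predict(yj_obj, newdata = trans_values, inverse = TRUE)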

4 Binary Classification

4.1 Split Data

set.seed(123)
crime_split <- initial_split(crime_enriched_mod, prop = 0.8, strata = felony_flag)

crime_train <- training(crime_split)

crime_test <- testing(crime_split)

Note: This train-test split is performed only on data from 2020 to 2022. The year 2023 is held out separately for final implementation testing, simulating real-world deployment.

4.2 EDA on Training Set

First, let’s examine the distribution of the outcome variable felony_flag for the binary classification task.

# Check class imbalance in the target variable
crime_train |> 
  count(felony_flag) |> 
  mutate(pct = n / sum(n))
felony_flag n pct
0 45804 0.6867887
1 20889 0.3132113

According to the table, about 30% of crimes in the training set are felonies, while about 70% are non-felonies. This class imbalance could affect how the model learns. We’ll apply a downsampling method to help the model treat both classes fairly during training.
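As a quick sanity check, we can prep a recipe with only the downsampling step and confirm the class balance the model will actually see during training. This is a minimal sketch; step_downsample is applied when the recipe is prepped and skipped when baking new data.

# Sketch: verify that downsampling balances the classes in the training data
# (themis::step_downsample keeps all minority cases and samples the majority
# class down to match).
recipe(felony_flag ~ ., data = crime_train) |>
  step_downsample(felony_flag) |>
  prep() |>
  bake(new_data = NULL) |>
  count(felony_flag)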

Second, we explore whether felony crimes are more likely on specific days of the week by plotting the proportion of crimes that are felonies for each weekday.

crime_train |> 
  mutate(dow = lubridate::wday(report_date_parsed, label = TRUE)) |> 
  group_by(dow) |> 
  summarize(felony_rate = mean(as.numeric(as.character(felony_flag)))) |> 
  ggplot(aes(x = dow, y = felony_rate, fill = dow)) +
  geom_col() +
  scale_fill_manual(values = rep("steelblue", 7)) + 
  labs(
    title = "Felony Rate by Weekday",
    x = "Day of Week",
    y = "Proportion of Felony"
  ) +
  theme_classic() +
  theme(
    legend.position = "none",
    text = element_text(family = "serif") 
  )

The plot shows felonies are slightly more frequent on weekends compared to weekdays.

Third, we assess whether the ACS5 predictors we selected relate meaningfully to one of our outcome variables, felony_flag. Here we focus on poverty_rate.

ggplot(crime_train, aes(x = poverty_rate, y = as.numeric(as.character(felony_flag)))) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = TRUE, color = "steelblue") +
  labs(
    title = "Probability of Felony by Poverty Rate",
    x = "Poverty Rate",
    y = "Estimated Probability of Felony"
  ) +
  theme_classic() +
  theme(
    text = element_text(family = "serif") 
  )

Using a logistic smoother to examine the relationship between poverty rate and the likelihood of felony classification, the plot shows a clear positive association: as poverty increases, the probability of a crime being classified as a felony also rises.

Furthermore, we calculate average felony rates by census tract and display them on a choropleth map in order to examine geographic variation in felony risk.

dc_tract_sf <- st_read("data/tl_2023_11_tract/tl_2023_11_tract.shp")  # load tract shp
tract_felrate <- crime_train |>
  group_by(tract) |>
  summarize(
    felony_rate = mean(as.numeric(as.character(felony_flag)))
  )

tract_sf_joined <- dc_tract_sf |> 
  left_join(tract_felrate, by = c("TRACTCE" = "tract"))

ggplot(tract_sf_joined) +
  geom_sf(aes(fill = felony_rate), color = "white", size = 0.2) +
    scale_fill_gradient(
    low = "#d6e4f0",   
    high = "#274472",
    na.value = "grey90"
  ) +
  labs(title = "Felony Rate by Census Tract") +
  theme_classic() +
  theme(
    text = element_text(family ="serif")
  )


As shown in the map, felony rates are unevenly distributed across D.C. The southern and eastern tracts report substantially higher proportions of felony incidents, while felony rates are lower in northern areas.

Lastly, we plot each crime by its location to spot patterns across the city. Because there are too many points for a static map, we use an interactive one. Red dots are felonies. Blue dots are non-felonies.

plot_ly(
  data = crime_train,
  x = ~longitude,
  y = ~latitude,
  type = "scattergl",
  mode = "markers",
  color = ~felony_flag,
  colors = c("blue", "red"),
  text = ~paste("Felony:", felony_flag),
  marker = list(size = 2, opacity = 0.5)
) |>
  layout(
    title = list(
      text = "Interactive Felony Crime Map (DC)",
      font = list(family = "serif", size = 18)
    ),
    xaxis = list(
      title = "Longitude",
      zeroline = FALSE,
      showgrid = FALSE,
      font = list(family = "serif"),
      titlefont = list(family = "serif")
    ),
    yaxis = list(
      title = "Latitude",
      zeroline = FALSE,
      showgrid = FALSE,
      scaleanchor = "x",
      font = list(family = "serif"),
      titlefont = list(family = "serif")
    ),
    margin = list(l = 40, r = 40, b = 40, t = 60),
    showlegend = FALSE
  )

4.3 Select Error Metric

In this binary classification task, the outcome variable is felony_flag, indicating whether a reported crime is classified as a felony. We aim to build a predictive model that helps law enforcement allocate patrol resources more efficiently by identifying potential felony incidents in advance.

For this binary classification problem, we propose to use 1 − Recall as our main evaluation metric, which intuitively represents the share of felonies the model fails to detect:

\[ 1 - \text{Recall} = 1 - \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{\text{FN}}{\text{TP} + \text{FN}} \]

This metric directly reflects how often the model misses serious crimes. Since missing a felony is riskier than flagging a non-felony, our goal is to keep 1 − Recall as low as possible. In practice, we aim for a 1 − Recall below 0.2, meaning the model correctly identifies at least 80% of actual felony cases.

Although metrics like accuracy and ROC AUC provide an overall picture, they may be misleading under class imbalance. Here, felonies represent about 30% of cases, so a model could achieve high accuracy while failing to catch most felonies.

In this context, a false negative (a missed felony) is more costly than a false positive (flagging a non-felony). A false positive might result in unnecessary police allocation, which is manageable, while a false negative may mean failing to prevent or respond to a serious crime, posing higher social and safety costs.
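Because yardstick reports recall rather than 1 − Recall directly, a small helper keeps the evaluation code consistent. This is a minimal sketch; crime_test_results refers to the predictions tibble built later in Section 4.6.

# Sketch: share of felonies the model misses, computed from yardstick's recall.
one_minus_recall <- function(preds, truth, estimate) {
  1 - yardstick::recall(preds, truth = {{ truth }}, estimate = {{ estimate }})$.estimate
}

# Example usage once predictions are available:
# one_minus_recall(crime_test_results, felony_flag, .pred_class)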

4.4 Come up with Models

We test three model specifications below. Each model uses a different combination of feature sets, pre-processing (recipe), and algorithm, so that we can compare their effectiveness and usefulness in the evaluation section.

4.4.1 Model 1: Logistic Regression with LASSO

This model uses logistic regression with L1 regularization to predict felony likelihood, prioritizing interpretability and simplicity. It focuses on demographic and temporal predictors that reflect structural and situational risk. The L1 penalty helps reduce noise by shrinking less relevant variables toward zero.

We include seven ACS-derived variables: poverty rate, Black and Hispanic population percentages, foreign-born share, and the proportions of residents under 18 and over 65. We also include two crime-level factors: police shift and method of offense.

Preprocessing steps include:

- Converting weekday and day-of-year into features from the report date

- One-hot encoding categorical variables

- Standardizing numeric variables

- Downsampling to address class imbalance

The specific recipe is as below:

set.seed(123)
crime_rec_log <- recipe(felony_flag ~ poverty_rate + unemployed_rate +
                          black_pct + hispanic_pct + foreign_born_pct + 
                          hsplus_pct +
                          under18_pct + age65plus_pct +
                          shift + method + report_date_parsed,
                        data = crime_train) |>
  step_downsample(felony_flag) |>
  step_date(report_date_parsed, features = "dow", label = TRUE) |>
  step_mutate(
    yday = lubridate::yday(report_date_parsed),
    yday_cos = cos(2 * pi * yday / 365),
    yday_sin = sin(2 * pi * yday / 365)
  ) |>
  step_rm(report_date_parsed) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) 

The specification and workflow is as below:

log_spec <- logistic_reg(
  penalty = tune(), # L1 penalty strength (lambda)
  mixture = 1 # 1 means LASSO (L1 only)
) |> 
  set_engine("glmnet") |>
  set_mode("classification")

log_wf <- workflow() |> 
  add_model(log_spec) |> 
  add_recipe(crime_rec_log)

4.4.2 Model 2: Random Forest with Spatial Features

This model focuses on spatial context and aims to detect felony patterns based on location, proximity to public infrastructure and tract-level zoning conditions. Unlike Model 1, we intentionally exclude demographic variables to isolate the predictive power of GIS features.

The predictors include: tract, police shift and method, proximity to liquor stores, grocery stores, banks, public WiFi zones, vending zones, military bases, and food access zones. These variables capture the built environment and local regulations that may influence criminal activity.

Preprocessing includes removing highly correlated variables, converting categorical variables to dummies, normalizing numeric variables, extracting temporal signals, and downsampling to reduce class imbalance. We retain tract as a spatial unit, based on earlier EDA results showing geographic variation in felony rates. This model uses a random forest from the ranger package, which handles nonlinear relationships and complex interactions among predictors well. We tune the number of variables sampled at each split (mtry) and the minimum node size (min_n) to improve accuracy.

The specific recipe is as below:

set.seed(123)
crime_rec_rf <- recipe(felony_flag ~ tract + shift + method + 
                         report_date_parsed +
                         in_liquor_moratorium_zone + 
                         nearest_liquorstore_dist + near_liquorstore_200m +
                         near_wifi_100m + nearest_wifi_dist +
                         in_lowfood_zone + 
                         nearest_grocery_dist + near_grocery_300m +
                         nearest_bank_dist + near_bank_250m +
                         in_vending_zone + police_sector + in_military_base,
                       data = crime_train) |>
  step_downsample(felony_flag) |>
  step_date(report_date_parsed, features = "dow", label = TRUE) |>
  step_mutate(
    yday = lubridate::yday(report_date_parsed),
    yday_cos = cos(2 * pi * yday / 365),
    yday_sin = sin(2 * pi * yday / 365)
  ) |>
  step_rm(report_date_parsed) |>
  step_corr(all_numeric_predictors(), threshold = 0.9) |>
  step_normalize(all_numeric_predictors()) |> 
  step_dummy(all_nominal_predictors()) 

The specification and workflow is as below:

rf_spec <- rand_forest(
  mtry = tune(), # number of variables sampled at each split (to be tuned)
  trees = 100, # total number of trees (fixed for now)
  min_n = tune() # minimum number of observations in a node (to be tuned)
) |>
  set_engine("ranger", importance = "impurity") |>
  set_mode("classification")

rf_wf <- workflow() |>
  add_model(rf_spec) |>
  add_recipe(crime_rec_rf)

4.4.3 Model 3: XGBoost with All Features

This model maximizes predictive power by incorporating the full set of spatial, socioeconomic, and temporal predictors available in the dataset. Unlike Models 1 and 2, which focus on interpretability or spatial patterns, this model includes features broadly and lets the boosting algorithm handle interactions and variable selection.

The predictors include:

  1. Spatial and Proximity Features: in_liquor_moratorium_zone, nearest_liquorstore_dist, near_liquorstore_200m, near_wifi_100m, nearest_wifi_dist, in_lowfood_zone, nearest_grocery_dist, near_grocery_300m, nearest_bank_dist, near_bank_250m, in_vending_zone, police_sector, in_military_base.

  2. Demographic and Socioeconomic Features from ACS: poverty_rate, unemployed_rate, black_pct, hispanic_pct, foreign_born_pct, hsplus_pct, under18_pct, age65plus_pct.

  3. Time and Categorical Contextual Features: shift, method, tract, report_date_parsed.

Preprocessing includes removing highly correlated variables, converting categorical variables to dummies, normalizing numeric variables, extracting temporal signals, and downsampling to reduce class imbalance. We use the XGBoost algorithm, which can capture complex nonlinear interactions and works well with tabular data. It supports regularization and tree-based boosting, making it a strong candidate when overfitting is a concern.

The specific recipe is as below:

set.seed(123)
crime_rec_xgb <- recipe(felony_flag ~ ., data = crime_train) |>
  step_rm(felony_severity, felony_severity_trans) |>
  step_downsample(felony_flag) |>
  step_date(report_date_parsed, features = "dow", label = TRUE) |>
  step_mutate(
    yday = lubridate::yday(report_date_parsed),
    yday_cos = cos(2 * pi * yday / 365),
    yday_sin = sin(2 * pi * yday / 365)
  ) |>
  step_rm(
    x, y, latitude, longitude, year, report_date_parsed
  ) |>
  step_nzv(all_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.9) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) 

The specification and workflow is as below:

xgb_spec <- boost_tree(
  trees = 100, # total number of trees (fixed)
  tree_depth = tune(), # max depth of each tree (to be tuned)
  learn_rate = tune(), # learning rate (to be tuned)
  loss_reduction = tune() # min loss reduction to split further (to be tuned)
) |>
  set_engine("xgboost") |>
  set_mode("classification")

xgb_wf <- workflow() |>
  add_model(xgb_spec) |>
  add_recipe(crime_rec_xgb)

4.5 Estimation

We use 5-fold cross-validation and tune the L1 penalty (penalty) to find the best balance between model complexity and predictive performance. Our tuning metric is 1 − Recall, which measures how well the model minimizes false negatives.

# Create a metric set (recall is collected and converted to 1 - Recall below)
custom_metrics <- metric_set(recall, roc_auc, accuracy)

# Set up 5-fold cross-validation
set.seed(123)
crime_folds <- vfold_cv(crime_train, v = 5, strata = felony_flag)

# Define the tuning grid for L1 penalty
lambda_grid <- grid_regular(penalty(range = c(-4, 0)), levels = 30)

# Tune the logistic regression model
set.seed(123)
log_res <- tune_grid(
  object = log_wf,
  resamples = crime_folds,
  grid = lambda_grid,
  metrics = custom_metrics,
  control = control_grid(save_pred = TRUE)
)

# Show top results by 1-recall
log_res |> 
  collect_metrics() |> 
  filter(.metric == "recall") |> 
  mutate(one_minus_recall = 1 - mean) |> 
  arrange(one_minus_recall)
penalty .metric .estimator mean n std_err .config one_minus_recall
0.0573615 recall binary 0.9981006 5 0.0001489 Preprocessor1_Model21 0.0018994
0.0788046 recall binary 0.9981006 5 0.0001489 Preprocessor1_Model22 0.0018994
0.1082637 recall binary 0.9981006 5 0.0001489 Preprocessor1_Model23 0.0018994
0.1487352 recall binary 0.9981006 5 0.0001489 Preprocessor1_Model24 0.0018994
0.2043360 recall binary 0.9981006 5 0.0001489 Preprocessor1_Model25 0.0018994
0.0417532 recall binary 0.9526023 5 0.0023107 Preprocessor1_Model20 0.0473977
0.0303920 recall binary 0.9393720 5 0.0023572 Preprocessor1_Model19 0.0606280
0.0221222 recall binary 0.9319927 5 0.0025707 Preprocessor1_Model18 0.0680073
0.0161026 recall binary 0.9200069 5 0.0034577 Preprocessor1_Model17 0.0799931
0.0117210 recall binary 0.8991791 5 0.0045617 Preprocessor1_Model16 0.1008209
0.0085317 recall binary 0.8824776 5 0.0048504 Preprocessor1_Model15 0.1175224
0.0062102 recall binary 0.8717362 5 0.0034181 Preprocessor1_Model14 0.1282638
0.0045204 recall binary 0.8645534 5 0.0028163 Preprocessor1_Model13 0.1354466
0.0032903 recall binary 0.8594229 5 0.0026934 Preprocessor1_Model12 0.1405771
0.0023950 recall binary 0.8560607 5 0.0027278 Preprocessor1_Model11 0.1439393
0.0017433 recall binary 0.8535282 5 0.0026044 Preprocessor1_Model10 0.1464718
0.0012690 recall binary 0.8515633 5 0.0025004 Preprocessor1_Model09 0.1484367
0.0009237 recall binary 0.8499477 5 0.0023134 Preprocessor1_Model08 0.1500523
0.0006723 recall binary 0.8491835 5 0.0024442 Preprocessor1_Model07 0.1508165
0.0004894 recall binary 0.8486377 5 0.0024858 Preprocessor1_Model06 0.1513623
0.0003562 recall binary 0.8482666 5 0.0024964 Preprocessor1_Model05 0.1517334
0.0002593 recall binary 0.8480919 5 0.0024355 Preprocessor1_Model04 0.1519081
0.0001000 recall binary 0.8480482 5 0.0024448 Preprocessor1_Model01 0.1519518
0.0001374 recall binary 0.8480482 5 0.0024448 Preprocessor1_Model02 0.1519518
0.0001887 recall binary 0.8480482 5 0.0024448 Preprocessor1_Model03 0.1519518
0.2807216 recall binary 0.2000000 5 0.2000000 Preprocessor1_Model26 0.8000000
0.3856620 recall binary 0.2000000 5 0.2000000 Preprocessor1_Model27 0.8000000
0.5298317 recall binary 0.2000000 5 0.2000000 Preprocessor1_Model28 0.8000000
0.7278954 recall binary 0.2000000 5 0.2000000 Preprocessor1_Model29 0.8000000
1.0000000 recall binary 0.2000000 5 0.2000000 Preprocessor1_Model30 0.8000000
# Select best bundle of hyperparameters
best_log <- select_best(log_res, metric = "recall")
best_log
penalty .config
0.0573615 Preprocessor1_Model21
# Last fit with best hyperparameters bundle
final_log_wf <- finalize_workflow(log_wf, best_log)

set.seed(123)
final_log_fit <- fit(final_log_wf, data = crime_train)

4.6 Interpretation

4.6.1 Interpret the results

Variable Importance

First, let’s take a look at the variable importance.

# Extract fitted model from final workflow
log_model <- extract_fit_parsnip(final_log_fit)

# Plot variable importance
vip(log_model, num_features = 20)

The plot shows that offense method and police shift are the most important predictors: features like method_OTHERS and method_KNIFE, along with shift_MIDNIGHT and shift_EVENING, contribute strongly to felony predictions. Most temporal and demographic variables, such as report_date_parsed_dow_Wed, black_pct, hsplus_pct, and foreign_born_pct, have much lower influence. This suggests that situational factors, namely how and when the crime happened, carry more predictive weight than background demographic characteristics in this model.
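Because LASSO shrinks weak predictors exactly to zero, we can also read off the surviving coefficients directly. A minimal sketch using the tidy() method that tidymodels (via broom) provides for glmnet fits:

# Sketch: list non-zero LASSO coefficients at the selected penalty.
extract_fit_parsnip(final_log_fit) |>
  tidy() |>
  filter(estimate != 0) |>
  arrange(desc(abs(estimate)))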

Next, let’s assess our tuned logistic regression model using both the test set (from 2020–2022) and the implementation set (from 2023). We focus on the model’s ability to identify felony incidents, particularly minimizing false negatives.

Hold-out Performance on Test Set (2020–2022)

We apply the trained model to the hold-out test set and calculate standard classification metrics.

# Apply the recipe and model to crime_test
crime_test_results <- predict(final_log_fit, new_data = crime_test, type = "prob") |> 
  bind_cols(predict(final_log_fit, new_data = crime_test)) |>
  bind_cols(crime_test |> select(felony_flag))

# Evaluate on test set
metrics_test <- crime_test_results |>
  yardstick::metrics(truth = felony_flag, estimate = .pred_class)

# Recall and 1 - Recall
recall_val <- recall(data = crime_test_results, truth = felony_flag, estimate = .pred_class)$.estimate
one_minus_recall_val <- 1 - recall_val

# Print metrics
metrics_test
.metric .estimator .estimate
accuracy binary 0.7837481
kap binary 0.3825064
recall_val
[1] 0.9985155
one_minus_recall_val
[1] 0.001484457

The model achieves an accuracy of 78.4% and a 1 − Recall of only 0.0015 on the test set, meaning fewer than 0.2% of true felonies are misclassified as non-felonies. This is crucial in a public safety context, where failing to detect a felony could result in an inadequate response.
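A confusion matrix makes this trade-off explicit: nearly all felonies are caught, while most errors are false positives among non-felonies. A minimal sketch with yardstick:

# Sketch: cross-tabulate predicted vs. true classes on the test set.
conf_mat(crime_test_results, truth = felony_flag, estimate = .pred_class) |>
  autoplot(type = "heatmap")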

Out-of-sample Generalization to 2023 Implementation Set

We then test the model on unseen 2023 data to evaluate how well it generalizes to real-world conditions.

# Apply the recipe and model to crime_enriched_implement
crime_implement_results <- predict(final_log_fit, new_data = crime_enriched_implement, type = "prob") |> 
  bind_cols(predict(final_log_fit, new_data = crime_enriched_implement)) |>
  bind_cols(crime_enriched_implement |> select(felony_flag))

# Evaluate on implementation set
metrics_implement <- crime_implement_results |>
  metrics(truth = felony_flag, estimate = .pred_class)

# Recall and 1 - Recall
recall_val_impl <- recall(data = crime_implement_results, truth = felony_flag, estimate = .pred_class)$.estimate
one_minus_recall_val_impl <- 1 - recall_val_impl

# Print metrics
metrics_implement
.metric .estimator .estimate
accuracy binary 0.7241279
kap binary 0.3309441
recall_val_impl
[1] 0.9957171
one_minus_recall_val_impl
[1] 0.004282859

On the implementation set, the model maintains a recall of 99.6% and an accuracy of 72.4%, with a 1 − Recall of 0.0043. This confirms the model generalizes well and continues to flag nearly all felony cases, even in new data.

In conclusion, the model is effective because it captures the majority of serious crimes with very low false negatives. In policing strategy, this is far more important than overall accuracy. The model’s simplicity and high recall make it practical for daily patrol planning and hotspot targeting.

4.6.2 Improve Model with Different Features and Modeling Options

We explore two different combinations of feature engineering and modeling options to improve our initial regularized logistic regression model, which focused on demographic and temporal features.

First, we test Model 2, which replaces demographics with spatial features such as proximity to liquor stores, grocery stores, and banks, along with tract and special zones like military bases and vending areas. These geographical indicators may capture high-felony-risk hot spots that the demographic variables in Model 1 miss. The model also shifts from logistic regression to random forest, which handles nonlinear patterns and variable interactions better, requires less preprocessing, and includes tuning for mtry and min_n to boost performance.

Second, we try out Model 3, which keeps all features from Model 1 and Model 2, including demographics, spatial context and timing. It lets the model learn the best combinations without us limiting scope. This helps test if more context leads to better predictions. We use XGBoost for its ability to learn complex patterns with regularization. It improves on Model 1 by tuning tree depth, learning rate, and loss reduction, allowing better balance between fit and generalization.

Model 2 Estimation with Cross-Validation

# Define a metric set
custom_metrics <- metric_set(recall, accuracy, roc_auc)

# Create resampling folds
set.seed(123)
crime_folds <- vfold_cv(crime_train, v = 5, strata = felony_flag)

# Define a random grid of hyperparameters for tuning 
set.seed(123)
rf_grid <- grid_random(
  mtry(range = c(3, 10)),
  min_n(range = c(2, 10)),
  size = 5
)

# Tune the model
rf_res <- tune_grid(
  rf_wf,
  resamples = crime_folds,
  grid = rf_grid,
  metrics = custom_metrics,
  control = control_grid(save_pred = TRUE)
)

# Select best hyperparameters combinations by recall
best_rf <- select_best(rf_res, metric = "recall")

# Finalize the model
final_rf_wf <- finalize_workflow(rf_wf, best_rf)
final_rf_fit <- fit(final_rf_wf, data = crime_train)

Model 2 Evaluate on Test and Implementation Sets

# Extract fitted model from final workflow
rf_model <- extract_fit_parsnip(final_rf_fit)

# Plot top 20 important variables
vip(rf_model, num_features = 20)

The Random Forest model shows a different importance pattern than the logistic model. As seen in the plot above, crime-related features like method_OTHERS dominate, followed by temporal seasonality features (yday_sin, yday_cos, yday) and spatial proximity variables (nearest_bank_dist, nearest_liquorstore_dist, nearest_grocery_dist, nearest_wifi_dist). This suggests that environmental context and time-of-year carry more predictive weight than demographics like black_pct in this model.

# Predict on test set
rf_test_results <- predict(final_rf_fit, new_data = crime_test, type = "prob") |> 
  bind_cols(predict(final_rf_fit, new_data = crime_test)) |>
  bind_cols(crime_test |> select(felony_flag))

# Metrics on test set
metrics_test_rf <- rf_test_results |> 
  metrics(truth = felony_flag, estimate = .pred_class)

recall_rf <- recall(rf_test_results, truth = felony_flag, estimate = .pred_class)$.estimate

one_minus_recall_rf <- 1 - recall_rf

# Print metrics
metrics_test_rf
.metric .estimator .estimate
accuracy binary 0.7358921
kap binary 0.3933248
recall_rf
[1] 0.7977646
one_minus_recall_rf
[1] 0.2022354

On the test set, this model reaches an accuracy of 73.6% and a recall of 79.8%, so roughly 20% of actual felonies go undetected according to 1 − Recall. Missing about one in five actual felonies is a substantial decline in sensitivity compared to Model 1. While Model 2 may capture some useful spatial patterns, it underperforms the baseline in its ability to flag serious crimes.

# Predict on implementation set
rf_impl_results <- predict(final_rf_fit, new_data = crime_enriched_implement, type = "prob") |> 
  bind_cols(predict(final_rf_fit, new_data = crime_enriched_implement)) |>
  bind_cols(crime_enriched_implement |> select(felony_flag))

# Convert felony_flag to factor and calculate metrics
metrics_impl_rf <- rf_impl_results |>
  mutate(felony_flag = factor(felony_flag)) |>
  metrics(truth = felony_flag, estimate = .pred_class)

# Calculate recall
recall_rf_impl <- rf_impl_results |>
  mutate(felony_flag = factor(felony_flag)) |>
  recall(truth = felony_flag, estimate = .pred_class) |>
  pull(.estimate)

# 1 - recall
one_minus_recall_rf_impl <- 1 - recall_rf_impl

# Print metrics
metrics_impl_rf
.metric .estimator .estimate
accuracy binary 0.7094022
kap binary 0.3765945
recall_rf_impl
[1] 0.7914724
one_minus_recall_rf_impl
[1] 0.2085276

On the implementation set, Model 2 posts an accuracy of 70.9% and a recall of 79.1%, with a 1 − Recall of 0.209, both close to its test-set results. The model is therefore reasonably stable over time, but it remains far less sensitive than Model 1, missing roughly one in five felonies. A larger tuning search or richer feature engineering may be needed to close this gap.

Model 2 generalizes stably in terms of accuracy but struggles with sensitivity. For real-world use, especially in crime forecasting, a more reliable recall rate is needed.

Model 3 Estimation with Cross-Validation

# Define a metric set
custom_metrics <- metric_set(recall, accuracy, roc_auc)

# Create 5-fold resampling folds
set.seed(123)
crime_folds <- vfold_cv(crime_train, v = 5, strata = felony_flag)

# Define a random grid for XGBoost hyperparameters 
set.seed(123)
xgb_grid <- grid_random(
  tree_depth(range = c(3, 10)),
  learn_rate(range = c(0.01, 0.3), trans = NULL), # raw scale (dials defaults to log10)
  loss_reduction(range = c(0, 1), trans = NULL), # raw scale (dials defaults to log10)
  size = 5
)

# Tune the model
set.seed(123)
xgb_res <- tune_grid(
  xgb_wf,
  resamples = crime_folds,
  grid = xgb_grid,
  metrics = custom_metrics,
  control = control_grid(save_pred = TRUE)
)

# Select best hyperparameters based on recall
best_xgb <- select_best(xgb_res, metric = "recall")

# Finalize workflow
final_xgb_wf <- finalize_workflow(xgb_wf, best_xgb)

set.seed(123)
final_xgb_fit <- fit(final_xgb_wf, data = crime_train)

Model 3 Evaluate on Test and Implementation Sets

# Extract and visualize variable importance
xgb_model <- extract_fit_parsnip(final_xgb_fit)

vip(xgb_model, num_features = 20)

The variable importance plot shows black_pct as the top driver, followed by proximity and temporal measures such as nearest_grocery_dist, yday_cos, nearest_wifi_dist, and yday.

# Apply XGBoost model to test set
xgb_test_results <- predict(final_xgb_fit, new_data = crime_test, type = "prob") |> 
  bind_cols(predict(final_xgb_fit, new_data = crime_test)) |>
  bind_cols(crime_test |> select(felony_flag))

# Evaluate metrics on test set
metrics_xgb_test <- xgb_test_results |>
  yardstick::metrics(truth = felony_flag, estimate = .pred_class)

# Compute recall and 1 - recall
xgb_recall_test <- recall(xgb_test_results, truth = felony_flag, estimate = .pred_class)$.estimate

xgb_one_minus_recall_test <- 1 - xgb_recall_test

# Print metrics
metrics_xgb_test
.metric .estimator .estimate
accuracy binary 0.6666267
kap binary 0.2837879
xgb_recall_test
[1] 0.6886133
xgb_one_minus_recall_test
[1] 0.3113867

On the test set, Model 3 achieves an accuracy of 66.7% and a recall of 68.9%, yielding a 1 − Recall of 0.3114.

# Convert datetime to date to match the training recipe
crime_enriched_implement <- crime_enriched_implement |> 
  mutate(report_date_parsed = as.Date(report_date_parsed))

# Apply the model
xgb_implement_results <- predict(final_xgb_fit, new_data = crime_enriched_implement, type = "prob") |> 
  bind_cols(predict(final_xgb_fit, new_data = crime_enriched_implement)) |>
  bind_cols(crime_enriched_implement |> select(felony_flag))

# Evaluate metrics on implementation set
metrics_xgb_implement <- xgb_implement_results |>
  mutate(felony_flag = factor(felony_flag)) |>
  metrics(truth = felony_flag, estimate = .pred_class)

# Compute recall and 1 - recall
xgb_recall_implement <- xgb_implement_results |>
  mutate(felony_flag = factor(felony_flag)) |>
  recall(truth = felony_flag, estimate = .pred_class) |>
  pull(.estimate)

one_minus_recall_xgb_impl <- xgb_implement_results |>
  mutate(felony_flag = factor(felony_flag)) |>
  recall(truth = felony_flag, estimate = .pred_class) |>
  pull(.estimate) |>
  {\(x) 1 - x}()

# Print metrics
metrics_xgb_implement
.metric .estimator .estimate
accuracy binary 0.6140069
kap binary 0.2155158
xgb_recall_implement
[1] 0.6213001
one_minus_recall_xgb_impl
[1] 0.3786999

On the implementation set, accuracy drops to 61.4%, while recall declines to 62.1% with a 1 − Recall of 0.3787. The model shows both weaker sensitivity and weaker overall accuracy than it did on the test set, with a high miss rate for felonies.

Compared to Model 1, Model 3 captures far fewer of the actual felonies and is more prone to false negatives. While Model 1 maintained a 1 − Recall under 0.005, Model 3’s miss rate is nearly 90 times higher.

Compared to Model 2, Model 3 performs worse on both accuracy and recall across the test and implementation sets. Adding both spatial and demographic features does not improve sensitivity: the combined model misses far more felonies than either Model 1 or Model 2.

Visualize 3 Models’ Performances

metric_plot_df <- tibble(
  model = rep(c("Log", "Rf", "XGBoost"), each = 2),
  dataset = rep(c("Test", "Implementation"), times = 3),
  accuracy = c(
    0.784, 0.724, # Model 1
    0.736, 0.709, # Model 2
    0.667, 0.614 # Model 3
    ),
  one_minus_recall = c(
    0.0015, 0.0043, # Model 1
    0.2022, 0.2085, # Model 2
    0.3114, 0.3787 # Model 3
    )
  ) |>
    mutate(dataset = factor(dataset, levels = c("Test", "Implementation")))

# Accuracy
ggplot(metric_plot_df, aes(x = model, y = accuracy, fill = dataset)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Test" = "#56B4E9", "Implementation" = "steelblue")) +
  labs(
    title = "Accuracy by Model and Dataset",
    x = "Model",
    y = "Accuracy"
  ) +
  theme_classic() +
  theme(
    text = element_text(family = "serif") 
  )

# 1 - Recall
ggplot(metric_plot_df, aes(x = model, y = one_minus_recall, fill = dataset)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("Test" = "#56B4E9", "Implementation" = "steelblue")) +
  labs(
    title = "1 - Recall by Model and Dataset",
    x = "Model",
    y = "1 - Recall"
  ) +
  theme_classic() +
  theme(
    text = element_text(family = "serif") 
  )

5 Multi-classification Task

# read in the data
crime <- read_csv("data/crime_enriched_acs_nona2.csv")

Note: the dataset crime_enriched_acs_nona2.csv is the same as dataset crime_enriched_acs_nona.csv. As we worked separately, we named our variables differently.

In addition to knowing whether a crime is a felony, it is also useful for law enforcement to know the severity of the felony, since several different crime categories are classified as felonies and each requires different resources and police response.

In this section, we build on the binary classification of crimes as felony or not felony by creating a multi-classification model to classify crimes by felony severity level. The outcome variable is ‘felony_severity,’ which was derived from the ‘offense’ variable in the Crime Incidents datasets (2008-2025) from DC Open Data. Felony_severity ranges from 0 (not a felony) to 7 (homicide) (see Appendix). However, our multiclass classification candidate models include only felonies, so severity level 0 is removed from the dataset, leaving seven felony classes (1–7). A hypothetical sketch of the derivation appears below.
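The sketch below illustrates how the derivation could look; the authoritative offense-to-severity mapping is documented in the Appendix, and the offense labels here follow the categories described in Section 5.3.

# Hypothetical sketch: derive felony_severity from the MPD offense field
# (the exact mapping lives in the Appendix).
crime <- crime |>
  mutate(felony_severity = case_when(
    offense == "MOTOR VEHICLE THEFT"        ~ 1,
    offense == "BURGLARY"                   ~ 2,
    offense == "ARSON"                      ~ 3,
    offense == "ROBBERY"                    ~ 4,
    offense == "ASSAULT W/DANGEROUS WEAPON" ~ 5,
    offense == "SEX ABUSE"                  ~ 6,
    offense == "HOMICIDE"                   ~ 7,
    TRUE                                    ~ 0  # non-felony offense categories
  ))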

The goal of this multi-classification model is to further enhance the ability of law enforcement in DC to effectively allocate resources by forecasting the relative severity of different felonies committed in the District.

Replicating the steps used above for the binary classification model, we use observations from 2019–2022 for training and testing, holding out 2023 for implementation. We selected the same predictor variables used above from the ‘crime_enriched_acs_nona’ merged dataset.

5.1 Prepare the Data

# Select variables of interest for modeling
multiclass_vars <- c(
  "felony_severity", # outcome variable
  "x", "y", "latitude", "longitude", "year",
  "shift", "method", "tract", "report_date_parsed",

  # GIS contextual features
  "in_liquor_moratorium_zone", "nearest_liquorstore_dist", "near_liquorstore_200m",
  "near_wifi_100m", "nearest_wifi_dist",
  "in_lowfood_zone",
  "nearest_grocery_dist", "near_grocery_300m",
  "nearest_bank_dist",
  "near_bank_250m", 
  "in_vending_zone", 
  "police_sector", 
  "in_military_base",

  # ACS socioeconomic features
  "poverty_rate", "unemployed_rate",
  "black_pct", "hispanic_pct","foreign_born_pct", 
  "hsplus_pct", 
  "under18_pct", "age65plus_pct"
)

# Keep only selected variables
crime_multiclass <- crime |> 
  select(all_of(multiclass_vars)) |>
  filter(felony_severity > 0)

# Define training data (2019-2022)
crime_multiclass_small <- crime_multiclass |>
  filter(year >= 2019, year < 2023)

# Define implementation/test data (2023)
crime_multiclass_implement <- crime_multiclass |>
  filter(year == 2023)

# Adjust variable data types for training data
crime_multiclass_small <- crime_multiclass_small |> 
  mutate(
    report_date_parsed = as.Date(report_date_parsed),
         felony_severity = factor(felony_severity)
         )

# Adjust variable data types for implementation data
crime_multiclass_implement <- crime_multiclass_implement |> 
  mutate(
    report_date_parsed = as.Date(report_date_parsed),
         felony_severity = factor(felony_severity)
         )

5.2 Split Data

set.seed(123)
multiclass_split <- initial_split(crime_multiclass_small, prop = 0.8, strata = felony_severity)

multiclass_train <- training(multiclass_split)

multiclass_test <- testing(multiclass_split)

5.3 EDA on Training Set

First, we investigate the distribution of felony_severity, which is used in both the multi-class classification and regression tasks.

# Explore felony severity distribution
multiclass_train |>
  count(felony_severity) |>
  mutate(pct = n / sum(n)) |>
  arrange(desc(felony_severity))
felony_severity n pct
7 487 0.0233137
6 439 0.0210158
5 3737 0.1788980
4 4889 0.2340466
3 15 0.0007181
2 2935 0.1405046
1 8387 0.4015032

Based on the table above, crimes were categorized into an ordinal scale from 1 to 7, with 1 representing motor vehicle theft and 7 representing the most severe felony homicide.

The distribution of felony_severity in the training set is highly skewed. Motor vehicle theft (severity = 1) dominates at about 40% of felonies, followed by robbery (severity = 4, 23%) and assault with a dangerous weapon (severity = 5, 18%). Burglary (severity = 2) accounts for about 14%, while severe crimes like sex abuse (severity = 6) and homicide (severity = 7) are extremely rare, with only a few hundred instances each. The severity-3 category, arson, is the rarest, appearing only 15 times.

To visualize this distribution, we plot the number of cases per severity score:

multiclass_train |>
  ggplot(aes(x = factor(felony_severity))) +
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of Felony Severity Scores",
       x = "Felony Severity Score",
       y = "Count")+
  theme_classic() +
  theme(
    legend.position = "none",
    text = element_text(family = "serif") 
  )

This long-tailed distribution shows that serious crimes like arson are rare in the data, which can make it harder for models to learn from them. For classification tasks, we can address this by reducing the number of common cases (downsampling), synthesizing more rare cases (SMOTE), or telling the model to pay more attention to rare classes (weighted loss); a sketch of the SMOTE option follows below. For the regression task, we can use methods that handle outliers better (like robust regression) or transform the target variable (e.g., a log transformation or binning) to make it easier for the model to learn.
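As one concrete example of the oversampling option, themis provides step_smote(). The recipe below is a minimal sketch, not one of our candidate models; SMOTE requires all-numeric predictors, so dummy encoding comes first.

# Sketch: a SMOTE-based alternative to downsampling for rare severity classes.
# step_smote() interpolates synthetic minority cases from nearest neighbors.
rec_smote <- recipe(felony_severity ~ ., data = multiclass_train) |>
  step_rm(report_date_parsed) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_smote(felony_severity, neighbors = 5)  # 5 nearest neighbors per synthetic case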

The following map visualizes the distribution of felony incidents across the District.

dc_tract_sf <- st_read("data/tl_2023_11_tract/tl_2023_11_tract.shp")  # load tract shp
crime_sf <- crime |>
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)  # WGS84 lon/lat coordinates

ggplot(data = crime_sf) +
  geom_sf(aes(color = felony_severity), alpha = 0.5) +
  labs(title = "Distribution of Felony Severities Across DC",
       color = "Felony Severity") +
  theme_void() +
  theme(
    text = element_text(family = "serif") 
  )

The more severe felonies (class 6 and 7) are concentrated in the eastern and southeastern parts of DC.

Since motor vehicle theft is the most common felony in DC, this map visualizes the percentage of felonies that are car thefts in each census tract.

tract_felrate <- multiclass_train |>
  group_by(tract) |>
  #filter(felony_severity == 1) |>
  summarize(class_1 = sum(felony_severity ==1), total = n()) |> 
  mutate(pct = class_1 / total)

tract_sf_joined <- dc_tract_sf |> 
  left_join(tract_felrate, by = c("TRACTCE" = "tract"))

ggplot(tract_sf_joined) +
  geom_sf(aes(fill = pct), color = "white", size = 0.2) +
    scale_fill_gradient(
    low = "#d6e4f0",   
    high = "#274472",
    na.value = "grey90"
  ) +
  labs(title = "Vehicle Theft Rate by Census Tract",
       fill = "Vehicle Theft Rate") +
  theme_classic() +
  theme(
    text = element_text(family ="serif")
  )

This choropleth aligns with the distribution of felony severities across DC: higher vehicle theft rates are concentrated in the northern and northwestern tracts, while southern and southeastern neighborhoods show lower rates, likely due to a higher prevalence of more severe felony types in those areas.

The following choropleth shows the felony rate of each felony severity class in each census tract.

facetwrap <- multiclass_train |>
  group_by(tract, felony_severity) |>
  summarise(crimecount = n()) |>
  mutate(felony_severity = case_when(
    felony_severity == 7 ~ "7. Homicide",
    felony_severity == 6 ~ "6. Sex Abuse",
    felony_severity == 5 ~ "5. Assault w/ Weapon",
    felony_severity == 4 ~ "4. Robbery",
    felony_severity == 3 ~ "3. Arson",
    felony_severity == 2 ~ "2. Burglary",
    felony_severity == 1 ~ "1. Vehicle Theft"
  ))

tract_sf_newjoined <- dc_tract_sf |> 
  left_join(facetwrap, by = c("TRACTCE" = "tract"))

ggplot(tract_sf_newjoined) +
  geom_sf(aes(fill = crimecount)) +
  facet_wrap(~ felony_severity, ncol = 4) + 
    scale_fill_gradient(
    low = "#FFF5B1",
    high = "#B2182B",   
    na.value = "gray90",
    name = "Crime Count"
  ) +
  labs(
    title = "Crime Count by Tract and Offense Type",
    fill = "Crime Count"
  ) +
  theme_void() +
  theme(
  strip.text = element_text(family = "serif", size = 10, face = "bold"),
  plot.title = element_text(family = "serif", hjust = 0.5, size = 16, face = "bold"),
  legend.title = element_text(family = "serif"),
  plot.margin = margin(t = 10, r = 10, b = 10, l = 20)  # top 20 points of space
)

Each facet of the map shows the distribution of one felony severity class. Unsurprisingly, class 1 has the widest geographical distribution across the District, while class 3, which had only 15 incidents in the dataset, is quite sparse. Class 7, the most severe class, is highly concentrated in southern and southeastern DC, yet still with fewer than 50 incidents in each census tract.

Next, we examine the distribution of felony severity by day of the week, one of the explanatory variables used in our candidate models to predict felony severity class.

weekday <- multiclass_train |>
  mutate(weekday = wday(report_date_parsed, label = TRUE, abbr = FALSE ))

ggplot(weekday, aes(x = weekday, fill = felony_severity)) +
  geom_bar() +
  labs(
    title = "Distribution of Felony Severity by Day of the Week",
    x = "Day of the Week", 
    y = "Number of Crimes"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 15),
    axis.text = element_text(size = 8)
  )

The chart shows that there is little variation in felony severity by day of the week.

5.4 Error Metrics

For our multi-classification problem, we will calculate overall model accuracy and micro-averaged precision/recall, as well as precision and recall within each class.

5.4.1 Accuracy

Accuracy measures how often the model is correct and is defined as:

\[ \frac{\text{TP}}{\text{total}} \]

True positives are defined as the cases where the model correctly classifies a crime into its actual severity class (1 through 7), and total is the total number of predictions.

5.4.2 Precision and Recall by Class

Precision and recall calculated by class focuses on each class individually. Each individual class is treated as a “positive” while all of the others are treated as a “negative.”

Precision for an individual class is the fraction of instances correctly predicted to belong to a specific class among all of the instances the model predicted as belonging to that class. It is true positives over all predicted positives (both true and false), for a given class \(k\):

\[ \text{Precision}_k = \frac{\text{TP}_k}{\text{TP}_k + \text{FP}_k} \]

Recall for an individual class is the fraction of instances in a class that the model correctly identified out of all instances in that class. It is true positives over actual positives (true positives and false negatives):

\[ \text{Recall}_k = \frac{\text{TP}_k}{\text{TP}_k + \text{FN}_k} \]
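yardstick can compute the averaged variants directly through its estimator argument. A minimal sketch of the micro-averaged versions mentioned above, applied to a predictions tibble such as the random_pred object built in Section 5.5:

# Sketch: micro-averaged precision and recall pool the per-class counts
# before dividing, so large classes dominate the average.
precision(random_pred, truth = felony_severity, estimate = .pred_class,
          estimator = "micro")
recall(random_pred, truth = felony_severity, estimate = .pred_class,
       estimator = "micro")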

5.4.3 Custom Function for Calculating Precision and Recall by Class

We have created a custom function to calculate within-class precision and recall:

#' Calculate within-class precision and recall
#'
#' @param x The bare (unquoted) name of the dataframe containing the true class values and their predictions
#' @param truth The bare (unquoted) name of the column within the dataframe containing the true class values 
#' @param class The class within the truth column used to calculate within-class precision and recall 
#' @param estimate The column containing the predictions 
#'
#' @returns Precision and recall for the class specified by the category argument
#'
#' @examples
#' calc_class_metrics(x = random_pred, truth = felony_severity, class = 3, estimate = .pred_class)
calc_class_metrics <- function(x, truth, class, estimate) {
  
  totals <- x |>
    summarize(
      true_positives = sum({{ truth }} == class & {{ estimate }} == class),
      false_positives = sum({{ estimate }} == class & {{ truth }} != class),
      false_negatives = sum({{ truth }} == class & {{ estimate }} != class)
    ) |>
    mutate(
      precision = true_positives / (true_positives + false_positives),
      recall = true_positives / (true_positives + false_negatives)
    ) 
  
  prec_rec <- totals |>
    select(precision, recall)
  
  return(prec_rec)

}

Within-class metrics are appropriate for our purposes because we have significant class imbalance, with 40% of felonies classified as class 1. We are interested in how the model performs on the largest classes for practical purposes. A model that performs well at predicting the largest classes will have the greatest usefulness for police departments and law enforcement in the District.

Therefore, we will pay particular attention to recall. We are especially interested in cases where the model misclassifies instances of a given class as other classes (false negatives for that class). For law enforcement, a false negative potentially means ignoring or missing a felony incident, which has ramifications for the safety of the community and the overall success of the police department in responding to the needs of the community.

5.5 Candidate Model 1: Random Forest Model

Our first model is a random forest model using all of the predictors included in the dataset to predict the felony severity. Because felony severity level 1 is disproportionately represented in the dataset, we will downsample felony_severity.

set.seed(123)
# recipe
random_newrecipe <- recipe(formula = felony_severity ~ ., data = multiclass_train) |>
  step_downsample(felony_severity, seed = 123) |>
  step_rm(report_date_parsed) |>
  step_mutate(method = factor(method)) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

# model
random_model <- rand_forest(
  trees = 200, 
  mtry = 2,
  min_n = 5) |>
  set_mode("classification") |>
  set_engine("ranger",
             importance = "impurity",
             num.threads = 4)

# workflow 
random_wf <- workflow() |>
  add_recipe(random_newrecipe) |>
  add_model(random_model)

set.seed(123)
# fit the model
random_fit <- random_wf |>
  fit(data = multiclass_train)

# generate predictions 
random_pred <- predict(random_fit, multiclass_test, type = "class") |>
  bind_cols(multiclass_test)

5.5.1 Metrics for Random Forest Model

# calculate accuracy and confusion matrix
conf_mat(data = random_pred,
         truth = felony_severity,
         estimate = .pred_class)
          Truth
Prediction   1   2   3   4   5   6   7
         1 283  71   0 158 146  12  23
         2 774 292   0 289 135  32  11
         3 249 104   4 108  65  14   7
         4  94  23   0  96  54   4  10
         5 119  38   0 103 177  12   9
         6 362 100   2 161 112  20  18
         7 211 102   0 304 241  16  58
accuracy(data = random_pred, truth = felony_severity, estimate = .pred_class)
.metric .estimator .estimate
accuracy multiclass 0.1780586

The overall accuracy rate for our random forest model is 17.8%, showing that overall, the model has a somewhat weak ability to predict the correct felony severity category.

Calculating precision and recall for each class provides some insight into how the model performs by class:

# Class 1
calc_class_metrics(x = random_pred, truth = felony_severity, class = 1, estimate = .pred_class)
precision recall
0.4083694 0.1352772
# Class 2
calc_class_metrics(x = random_pred, truth = felony_severity, class = 2, estimate = .pred_class)
precision recall
0.1904762 0.4
# Class 3
calc_class_metrics(x = random_pred, truth = felony_severity, class = 3, estimate = .pred_class)
precision recall
0.0072595 0.6666667
# Class 4
calc_class_metrics(x = random_pred, truth = felony_severity, class = 4, estimate = .pred_class)
precision recall
0.341637 0.0787531
# Class 5
calc_class_metrics(x = random_pred, truth = felony_severity, class = 5, estimate = .pred_class)
precision recall
0.3864629 0.1903226
# Class 6
calc_class_metrics(x = random_pred, truth = felony_severity, class = 6, estimate = .pred_class)
precision recall
0.0258065 0.1818182
# Class 7
calc_class_metrics(x = random_pred, truth = felony_severity, class = 7, estimate = .pred_class)
precision recall
0.0622318 0.4264706

Class 1, the largest class, has the strongest precision (40.8%), but its recall is only 13.5%. Class 4 is the second largest class, and while its precision is relatively good (34.2%), its recall is only 7.9%: the model produces a large number of false negatives for class 4. Class 3 is the smallest class, and surprisingly its recall is 66.7%, reflecting relatively few false negatives. However, class 3's precision is only 0.7%, so nearly all of the incidents the model labels as class 3 actually belong to other classes.

5.5.2 Random Forest Variable Importance

# Variable importance for random forest
set.seed(123)
random_fit |>
  extract_fit_parsnip() |>
  vip(num_features = 20) %>% # magrittr pipe so we can pull the plot data with .$data
  .$data |>
  mutate(
    importance = Importance / max(Importance),
    variable = fct_reorder(Variable, importance)
  ) |>
  ggplot(aes(x = importance, y = variable, fill = importance)) +
  geom_col() +
  labs(
    title = "Variable Importance of Random Forest",
    x = "Normalized Importance",
    y = "Variable"
  )

Interestingly, ‘under18_pct’ is the most important factor, with a normalized importance of 1. This value indicates the percentage of the population that is under the age of 18 in each tract. The following visualization shows the average percentage of the census tract that is under 18 within each felony severity category.

under_18 <- multiclass_train |>
  group_by(felony_severity) |>
  summarise(avg_under18 = mean(under18_pct, na.rm = TRUE))

ggplot(under_18, aes(x = felony_severity, y = avg_under18, group = 1)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Average Percent of Community Under Age 18 by Felony Severity",
    x = "Felony Severity",
    y = "Average Percent Under 18") +
  theme_minimal() 

The highest felony severity category, 7, occurs at higher rates in communities where a larger share of the population is under the age of 18. Classes 1 and 2, the lowest severity levels, occur more frequently in communities with a relatively small share of residents under 18. The relationship between felony severity and a community's age makeup is not strictly linear, but the model has picked up on a relationship between age and felony severity level.

‘hsplus_pct’ was the second most important variable. This variable is the percentage of people in a census tract with a high school education or higher.

hs <- multiclass_train |>
  group_by(felony_severity) |>
  summarise(avg_hs = mean(hsplus_pct, na.rm = TRUE))

ggplot(hs, aes(x = felony_severity, y = avg_hs, group = 1)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Community Education Level by Felony Severity",
    x = "Felony Severity",
    y = "Percent with High School Education or Higher"
  ) +
  theme_minimal() +
  theme(
    axis.title.y = element_text(size = 11),   
    axis.title.x = element_text(size = 11)                      
  )

Lower felony severity classes tend to occur in neighborhoods with higher educational attainment, while classes 5 and 7 tend to occur in neighborhoods where fewer residents have at least a high school education, indicating a downward trend. Class 6 is the exception to this trend.

‘foreign_born_pct’ was the third most important variable. The following visualization shows the average percentage of a census tract that is foreign-born within each felony severity category.

foreign <- multiclass_train |>
  group_by(felony_severity) |>
  summarise(average_foreign = mean(foreign_born_pct, na.rm = TRUE))

ggplot(foreign, aes(x = felony_severity, y = average_foreign, group = 1)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Average Foreign Born Population by Felony Severity",
    x = "Felony Severity",
    y = "Average Percent of Foreign-Born in a Community"
  ) +
  theme_minimal()

Interestingly, the relationship between the foreign-born share of a community and felony severity category follows a very similar pattern to that of educational attainment: neighborhoods with higher foreign-born percentages tend to experience more of the lower felony severity levels than the higher ones, again with class 6 as the exception.

5.6 Candidate Model 2: CART Model

Our second candidate model is a CART (classification and regression tree) model. Because the felony_severity classes are imbalanced, we stratify the cross-validation folds on felony_severity. We will use five-fold cross-validation and tune tree depth, cost complexity, and the minimum node size (min_n) to optimize the model.

# cross-validation
set.seed(123)
cross_validation <- vfold_cv(multiclass_train, v = 5, strata = felony_severity)

# recipe
cart_newrecipe <- recipe(formula = felony_severity ~ ., data = multiclass_train) |>
  step_downsample(felony_severity, seed = 123) |>
  step_rm(report_date_parsed) |>
  step_mutate(method = factor(method)) |>
  step_zv(all_predictors()) |>
  step_novel(all_factor_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())
  

# cart model object
cart_model <- decision_tree(
  tree_depth = tune(),
  cost_complexity = tune(),
  min_n = tune()
) |>
  set_engine(engine = "rpart") |>
  set_mode(mode = "classification")

set.seed(123)
# set up tuning grid
cart_grid <- grid_regular(
  cost_complexity(range = c(0.0001, 0.1)),
  tree_depth(range = c(3, 15)),
  min_n(),
  levels = 5
)

# cart workflow
cart_wf <- workflow() |>
  add_recipe(cart_newrecipe) |>
  add_model(cart_model)

set.seed(123)
# test and choose best hyperparameters
cart_test <- cart_wf |>
  tune_grid(resamples = cross_validation,
            grid = cart_grid,
            metrics = metric_set(accuracy))

cart_test |>
  show_best()
cost_complexity tree_depth min_n .metric .estimator mean n std_err .config
1.000230 3 2 accuracy multiclass 0.401503 5 0.0007154 Preprocessor1_Model001
1.059437 3 2 accuracy multiclass 0.401503 5 0.0007154 Preprocessor1_Model002
1.122148 3 2 accuracy multiclass 0.401503 5 0.0007154 Preprocessor1_Model003
1.188571 3 2 accuracy multiclass 0.401503 5 0.0007154 Preprocessor1_Model004
1.258925 3 2 accuracy multiclass 0.401503 5 0.0007154 Preprocessor1_Model005
# finalize workflow
final_cart_wf <- cart_wf |>
  finalize_workflow(select_best(cart_test))

set.seed(123)
# fit
cart_fit <- final_cart_wf |>
  fit(data = multiclass_train)

# predictions
predictions <- bind_cols(
  multiclass_test,
  predict(object = cart_fit, new_data = multiclass_test),
  predict(object = cart_fit, new_data = multiclass_test, type = "prob")
)

5.6.1 Metrics for CART Model

# confusion matrix
conf_mat(data = predictions,
         truth = felony_severity,
         estimate = .pred_class)
          Truth
Prediction    1    2    3    4    5    6    7
         1 2092  730    6 1219  930  110  136
         2    0    0    0    0    0    0    0
         3    0    0    0    0    0    0    0
         4    0    0    0    0    0    0    0
         5    0    0    0    0    0    0    0
         6    0    0    0    0    0    0    0
         7    0    0    0    0    0    0    0
# accuracy
accuracy(data = predictions,
         truth = felony_severity,
         estimate = .pred_class)
.metric .estimator .estimate
accuracy multiclass 0.4005361

Our CART model only ever predicts the majority class (felony_severity = 1) and never learns the minority classes, so its accuracy simply equals the share of felonies in class 1. Class imbalance likely contributes, even though we downsampled felony severity, but the tuning grid itself also appears to be at fault, as explained below.
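
Specifically, dials::cost_complexity() interprets its range argument on the log10 scale, so range = c(0.0001, 0.1) requests penalties between 10^0.0001 ≈ 1.0002 and 10^0.1 ≈ 1.259, which are exactly the cost_complexity values in the show_best() output above, and large enough to prune the tree back to a single root node. A corrected grid (a sketch; we did not re-run the tuning with it) would pass log10 bounds instead:

# cost_complexity() treats `range` as log10 bounds: -4 to -1 spans the
# intended raw penalties of 0.0001 to 0.1
cart_grid_fixed <- grid_regular(
  cost_complexity(range = c(-4, -1)),
  tree_depth(range = c(3, 15)),
  min_n(),
  levels = 5
)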

5.7 Candidate Model 3: KNN Model

Our third candidate model is K-nearest neighbors (KNN). We set the number of neighbors to 5: a small k keeps the decision boundary local, which helps avoid blurring the boundaries between classes, while still averaging over enough neighbors to dampen noise.

# recipe
knn_newrecipe <- recipe(formula = felony_severity ~ ., data = multiclass_train) |>
  step_downsample(felony_severity, seed = 123) |>
  step_rm(report_date_parsed) |>
  step_mutate(method = factor(method)) |>
  step_zv(all_predictors()) |>
  step_novel(all_factor_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) 
  
# model
knn_model <- nearest_neighbor(neighbors = 5) |>
  set_engine(engine = "kknn") |>
  set_mode(mode = "classification")

# workflow
knn_wf <- workflow() |>
  add_recipe(knn_newrecipe) |>
  add_model(knn_model)

# fit the model (no parameters are tuned, so we fit the workflow directly)
set.seed(123)
knn_fit <- fit(knn_wf, data = multiclass_train)

# generate predictions
knn_preds <- predict(knn_fit, new_data = multiclass_test, type = "class") |>
  bind_cols(multiclass_test)
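
Before evaluating the model, note that the choice of k = 5 could also be validated empirically rather than fixed a priori. A minimal sketch (not run here) that tunes neighbors, reusing the recipe and the five-fold cross_validation object defined earlier:

knn_tune_spec <- nearest_neighbor(neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_tune_res <- workflow() |>
  add_recipe(knn_newrecipe) |>
  add_model(knn_tune_spec) |>
  tune_grid(
    resamples = cross_validation,
    grid = tibble(neighbors = c(3, 5, 7, 11, 15)),
    metrics = metric_set(accuracy)
  )

show_best(knn_tune_res, metric = "accuracy")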

5.7.1 Metrics for KNN Model

# calculate confusion matrix and accuracy
conf_mat(data = knn_preds, truth = felony_severity, estimate = .pred_class)
          Truth
Prediction   1   2   3   4   5   6   7
         1 183  52   0  85  74   8   9
         2 198  77   0  92  62   4   9
         3 758 237   5 310 172  27  13
         4 467 201   0 374 243  24  41
         5 175  60   0 106 130  21  18
         6 199  60   1 124  97  14  19
         7 112  43   0 128 152  12  27
accuracy(knn_preds, truth = felony_severity, estimate = .pred_class)
.metric .estimator .estimate
accuracy multiclass 0.1550833

The accuracy rate is 15.5% for the KNN model, lower than our random forest model's 17.8%.

Evaluating within-class precision and recall will be useful for comparing against the performance of the random forest model.

# Class 1
calc_class_metrics(x = knn_preds, truth = felony_severity, class = 1, estimate = .pred_class)
precision recall
0.4452555 0.0874761
# Class 2
calc_class_metrics(x = knn_preds, truth = felony_severity, class = 2, estimate = .pred_class)
precision recall
0.1742081 0.1054795
# Class 3
calc_class_metrics(x = knn_preds, truth = felony_severity, class = 3, estimate = .pred_class)
precision recall
0.0032852 0.8333333
# Class 4
calc_class_metrics(x = knn_preds, truth = felony_severity, class = 4, estimate = .pred_class)
precision recall
0.277037 0.3068089
# Class 5
calc_class_metrics(x = knn_preds, truth = felony_severity, class = 5, estimate = .pred_class)
precision recall
0.254902 0.1397849
# Class 6
calc_class_metrics(x = knn_preds, truth = felony_severity, class = 6, estimate = .pred_class)
precision recall
0.0272374 0.1272727
# Class 7
calc_class_metrics(x = knn_preds, truth = felony_severity, class = 7, estimate = .pred_class)
precision recall
0.056962 0.1985294

KNN has the best within-class precision for class 1 (44.5%), the largest class, yet its recall for class 1 is only 8.7%, meaning a high number of false negatives. Its precision is also fair for the other large classes (2, 4, and 5), but recall stays low for classes 2 (10.5%) and 5 (14.0%); class 4 is the exception, with recall of about 31%. Unsurprisingly, the model struggled to classify the smaller classes.

5.8 Final Multiclass Classification Model Selection

The CART model, despite having the highest accuracy rate of 40%, will be eliminated from consideration because it achieved this rate by predicting every incident as class 1.

This leaves random forest or KNN. Random forest's accuracy rate was ~17.8%, while KNN's was ~15.5%. Random forest is slightly more accurate overall, but given the significant class imbalance we also compare performance within each class, focusing on classes 1, 4, 5, and 2, the largest classes in descending order of size.

Comparing precision and recall within each class, KNN has lower precision than random forest in three of the four largest classes (2, 4, and 5). In terms of within-class recall, random forest also performs better than KNN in three of the four largest classes (1, 2, and 5); the comparison is collected in the sketch below.
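
A compact way to line the two models up is to gather the within-class metrics into a single table (a sketch reusing calc_class_metrics and the prediction tibbles above):

# Within-class precision and recall for both models, largest classes only
compare_classes <- map_dfr(c(1, 2, 4, 5), function(cl) {
  bind_rows(
    calc_class_metrics(x = random_pred, truth = felony_severity,
                       class = cl, estimate = .pred_class) |>
      mutate(model = "random forest", class = cl),
    calc_class_metrics(x = knn_preds, truth = felony_severity,
                       class = cl, estimate = .pred_class) |>
      mutate(model = "knn", class = cl)
  )
})

compare_classes |> arrange(class, model)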

As discussed in our selection of error metrics, recall is of particular importance for model evaluation because the consequences of a false negative are higher for police and communities than a false positive.

Random forest has better within-class recall, as well as within-class precision and overall accuracy. Therefore, we will select random forest as the final model for implementation.

5.9 Implementation

We now apply the final fitted model to the 2023 implementation data to evaluate its real-world performance.

# Apply the recipe and model to 2023 data
multiclass_implement_results <- predict(random_fit, 
                                        new_data = crime_multiclass_implement, 
                                        type = "prob") |> 
  bind_cols(predict(random_fit, new_data = crime_multiclass_implement)) |>
  bind_cols(crime_multiclass_implement |> 
  select(felony_severity))
# final model's accuracy
accuracy(multiclass_implement_results, truth = felony_severity, estimate = .pred_class)
.metric .estimator .estimate
accuracy multiclass 0.163185
# Class 1
calc_class_metrics(x = multiclass_implement_results,
                   truth = felony_severity,
                   class = 1,
                   estimate = .pred_class)
precision recall
0.5418233 0.1699838
#Class 2
calc_class_metrics(x = multiclass_implement_results,
                   truth = felony_severity,
                   class = 2,
                   estimate = .pred_class)
precision recall
0.1128364 0.3992674
#Class 3
calc_class_metrics(x = multiclass_implement_results,
                   truth = felony_severity,
                   class = 3,
                   estimate = .pred_class)
precision recall
0 0
#Class 4
calc_class_metrics(x = multiclass_implement_results,
                   truth = felony_severity,
                   class = 4,
                   estimate = .pred_class)
precision recall
0.3415493 0.0561018
#Class 5
calc_class_metrics(x = multiclass_implement_results,
                   truth = felony_severity,
                   class = 5,
                   estimate = .pred_class)
precision recall
0.3169985 0.1476462
#Class 6
calc_class_metrics(x = multiclass_implement_results,
                   truth = felony_severity,
                   class = 6,
                   estimate = .pred_class)
precision recall
0.0117914 0.1354167
#Class 7
calc_class_metrics(x = multiclass_implement_results,
                   truth = felony_severity,
                   class = 7,
                   estimate = .pred_class)
precision recall
0.0452635 0.5109489

5.9.1 Interpreting Final Model Results

The final model's overall accuracy rate is 16.3%, so the model makes a correct prediction about 16% of the time. Within-class recall and precision vary across classes.

Class 1 had the best precision (54.2%), and class 7 had the best recall (51.1%). The model performed the worst on class 3, with precision and recall of 0%.

The distribution of felony severities in the implementation data was very similar to the training data:

# Check class imbalance in the target variable
crime_multiclass_implement |> 
  count(felony_severity) |> 
  mutate(pct = n / sum(n))
felony_severity n pct
1 6783 0.5133969
2 1092 0.0826521
3 11 0.0008326
4 3458 0.2617318
5 1402 0.1061157
6 192 0.0145322
7 274 0.0207387
ggplot(crime_multiclass_implement, aes(x = felony_severity)) +
  geom_bar() +
  scale_y_continuous(n.breaks = 10) +
  labs(title = "Number of Felonies in Each Category",
       x = "Felony Severity Category",
       y = "Count") +
  theme_minimal()

The implementation data only had 11 felonies of class 3, so it is unsurprising that the model failed to predict this class. Similarly, classes 6 and 7 each have fewer than 300 observations, which helps explain their very low precision (class 7's recall, by contrast, is comparatively strong).

The final model is quite limited in its practical usefulness because of its relatively low accuracy rate and within-class precision and recall. The model could potentially be improved with different class specifications. Because the dataset is dominated by felony severity class 1, a different approach could be a binary classification model with motor vehicle theft defined as the positive outcome and all other felonies defined as the negative outcome, as sketched below. This specification would have less class imbalance, which may enhance precision and recall.
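
A minimal sketch of that re-specification (the variable name mvt_flag is hypothetical; we did not fit this model):

# Recode: motor vehicle theft (severity 1) as the positive class, all other felonies as negative
multiclass_train_binary <- multiclass_train |>
  mutate(mvt_flag = factor(if_else(felony_severity == 1, "mvt", "other"),
                           levels = c("mvt", "other")))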

Overall, it is difficult to predict the severity of a felony due to the variety of features related to crime, including geographic factors, demographic variables, and temporal factors. The limitations of the model will be explored further in the discussion section below.

6 Regression Task

6.1 Split Data

set.seed(123)
crime_split_reg <- initial_split(crime_enriched_mod, prop = 0.8, strata = felony_severity_trans)

crime_train_reg <- training(crime_split_reg)

crime_test_reg <- testing(crime_split_reg)

6.2 Explanation on Outcome Variable

In this regression task, the outcome variable is felony_severity_trans, which is a transformed version of the original felony_severity score using the Yeo-Johnson transformation. The original score ranged from 0 to 7, reflecting the seriousness of the felony based on sentence length and social cost. For example, homicide is scored as 7, while motor vehicle theft is scored as 1. Although this score is technically ordinal (ranked categories), we treat it as approximately continuous for modeling purposes. This is a common approach in applied research when the levels are evenly spaced and reflect meaningful differences (Robitzsch, 2020). For instance, the gap between scores of 1 and 2 can be interpreted similarly to the gap between 5 and 6. This treatment allows us to conduct standard regression models.

However, the distribution of felony_severity is skewed due to the rarity of severe crimes like arson and homicide, which can distort model performance. To address this, we applied the Yeo-Johnson transformation to the outcome variable. This transformation preserves the relative severity ranking, but reduces skewness and pulls in extreme values, improving model performance by better meeting assumptions of normality and homoscedasticity.

After transformation, the values are no longer on the original 0–7 scale. For example, the eight unique severity levels were transformed to:

sort(unique(crime_train_reg$felony_severity_trans))
[1] -0.6688794  1.1762487  1.5159447  1.6344214  1.6891198  1.7187730  1.7366234
[8]  1.7481926

These numbers are not directly interpretable in real-world terms, but they retain the ordering of crime severity and provide a statistically stable basis for modeling.

In practice, we use felony_severity_trans as the outcome variable during model training and prediction, and then apply the inverse Yeo-Johnson transformation (via predict(yj_obj, inverse = TRUE)) to convert predicted values back to the original 0–7 severity scale. This allows for more intuitive interpretation of the results while preserving the modeling benefits of the transformation.
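
A minimal sketch of that round trip, assuming yj_obj was fit with bestNormalize::yeojohnson() on the training outcome (as our naming suggests):

library(bestNormalize)

# Assumption: yj_obj was created from the original 0-7 severity score
yj_obj <- yeojohnson(crime_train_reg$felony_severity)

# Forward: original scale -> transformed scale used for modeling
head(predict(yj_obj, newdata = crime_train_reg$felony_severity))

# Inverse: model predictions back to the ~0-7 severity scale
predict(yj_obj, newdata = c(-0.67, 1.75), inverse = TRUE)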

6.3 Select Error Metric

Our goal is to predict how severe a crime might be if it is a felony. This helps the police understand which crimes are more dangerous and need more attention.

Since some severe crimes are rare, we want to make sure the model does not make big mistakes when predicting them, so we use RMSE and R-squared as our main evaluation metrics. RMSE measures how far predictions fall from the true values and penalizes large errors most heavily; R-squared shows how much of the variation in severity scores the model can explain.
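
Formally, writing the true value as y_i, the prediction as ŷ_i, and the mean outcome as ȳ:

\[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \]

Because the errors are squared before averaging, RMSE weights large mistakes most heavily, which matches our concern about rare, severe crimes.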

6.4 Come up with Models

For this regression task, we still build three candidate models to test which performs best.

6.4.1 Model 1: Linear Regression with Yeo-Johnson Transformation

This baseline model uses linear regression to predict felony severity. It focuses on socioeconomic context and crime-level temporal features. While linear models may not capture complex interactions, they offer a strong baseline for comparison and clear insights into variable importance. We use the pre-transformed outcome felony_severity_trans to reduce skewness and stabilize variance.

crime_rec_lm <- recipe(felony_severity_trans ~ poverty_rate + unemployed_rate + 
                         black_pct + hispanic_pct + foreign_born_pct +
                         hsplus_pct +
                         under18_pct + age65plus_pct +
                         shift + method + report_date_parsed,
                       data = crime_train_reg) |>
  step_date(report_date_parsed, features = "dow", label = TRUE) |>
  step_mutate(
    yday = lubridate::yday(report_date_parsed),
    yday_cos = cos(2 * pi * yday / 365),
    yday_sin = sin(2 * pi * yday / 365)
  ) |>
  step_rm(report_date_parsed) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())
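
One detail in the recipe worth flagging: the sine and cosine terms encode day of year on a circle, so day 365 and day 1 land next to each other instead of at opposite ends of a linear scale. A quick illustrative check:

# Days 1 and 365 map to nearly identical (cos, sin) pairs, unlike raw yday
yday <- c(1, 182, 365)
cbind(yday,
      yday_cos = cos(2 * pi * yday / 365),
      yday_sin = sin(2 * pi * yday / 365))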

The model specification and workflow are defined as follows:

lm_spec <- linear_reg() |> 
  set_engine("lm") |>
  set_mode("regression") 
  
lm_wf <- workflow() |> 
  add_model(lm_spec) |> 
  add_recipe(crime_rec_lm)

6.4.2 Model 2: Random Forest Regression with Spatial Features

This model uses Random Forest to predict felony_severity_trans, a transformed version of felony severity, based on spatial and environmental context. It includes features such as tract, proximity to liquor stores, grocery stores, banks, and public infrastructure, as well as police sector and zoning indicators. Socioeconomic variables are excluded to focus on location-based risk factors.

crime_rec_rf_reg <- recipe(felony_severity_trans ~ tract + shift + method + 
                             report_date_parsed +
                             in_liquor_moratorium_zone + 
                             nearest_liquorstore_dist + near_liquorstore_200m +
                             near_wifi_100m + nearest_wifi_dist +
                             in_lowfood_zone + 
                             nearest_grocery_dist + near_grocery_300m +
                             nearest_bank_dist + near_bank_250m +
                             in_vending_zone + police_sector + in_military_base,
                           data = crime_train_reg) |>
  step_date(report_date_parsed, features = "dow", label = TRUE) |>
  step_mutate(
    yday = lubridate::yday(report_date_parsed),
    yday_cos = cos(2 * pi * yday / 365),
    yday_sin = sin(2 * pi * yday / 365)
  ) |>
  step_rm(report_date_parsed) |>
  step_corr(all_numeric_predictors(), threshold = 0.9) |>
  step_normalize(all_numeric_predictors()) |> 
  step_dummy(all_nominal_predictors())

The model specification and workflow are defined as follows:

rf_spec_reg <- rand_forest(
  mtry = tune(),
  trees = 50,
  min_n = tune()
) |>
  set_engine("ranger", importance = "impurity") |>
  set_mode("regression")

rf_wf_reg <- workflow() |>
  add_model(rf_spec_reg) |>
  add_recipe(crime_rec_rf_reg)

6.4.3 Model 3: XGBoost Regression with All Features

This model leverages all available features to predict felony_severity_trans, a transformed version of felony severity. It includes demographic, spatial, temporal, and contextual predictors. We apply normalization, correlation filtering, and dummy encoding to prepare predictors.

crime_rec_xgb_reg <- recipe(felony_severity_trans ~ ., data = crime_train_reg) |>
  step_rm(felony_flag, felony_severity) |>  # remove original outcome columns
  step_date(report_date_parsed, features = "dow", label = TRUE) |>
  step_mutate(
    yday = lubridate::yday(report_date_parsed),
    yday_cos = cos(2 * pi * yday / 365),
    yday_sin = sin(2 * pi * yday / 365)
  ) |>
  step_rm(x, y, latitude, longitude, year, report_date_parsed) |>
  step_nzv(all_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.9) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

The model specification and workflow are defined as follows:

xgb_spec_reg <- boost_tree(
  trees = 50,
  tree_depth = tune(),
  learn_rate = tune(),
  loss_reduction = tune()
) |>
  set_engine("xgboost") |>
  set_mode("regression")

xgb_wf_reg <- workflow() |>
  add_model(xgb_spec_reg) |>
  add_recipe(crime_rec_xgb_reg)

6.5 Estimation

We use 5-fold cross-validation and tune model hyperparameters for all three regression models. Our evaluation metrics are RMSE and R-squared, which reflect prediction accuracy and explained variance, respectively.

# Set up 5-fold cross-validation
set.seed(123)
crime_folds_reg <- vfold_cv(crime_train_reg, v = 5)

# Define metrics for regression
reg_metrics <- metric_set(rmse, rsq)

6.5.1 Model 1: Linear Regression (no tuning needed)

# Evaluate model via 5-fold CV
lm_res <- fit_resamples(
  lm_wf,
  resamples = crime_folds_reg,
  metrics = reg_metrics,
  control = control_resamples(save_pred = TRUE)
)

# Fit final model on full training set
lm_fit <- fit(lm_wf, data = crime_train_reg)

6.5.2 Model 2: Random Forest

# Define tuning grid for random forest
set.seed(123)
rf_grid_reg <- grid_random(
  mtry(range = c(3, 12)),
  min_n(range = c(2, 10)),
  size = 5
)

# Tune model
set.seed(123)
rf_res_reg <- tune_grid(
  rf_wf_reg,
  resamples = crime_folds_reg,
  grid = rf_grid_reg,
  metrics = reg_metrics,
  control = control_grid(save_pred = TRUE),
  verbose = TRUE
)

# Select best model based on RMSE
best_rf_reg <- select_best(rf_res_reg, metric = "rmse")

# Finalize and fit
final_rf_wf_reg <- finalize_workflow(rf_wf_reg, best_rf_reg)

set.seed(123)
final_rf_fit_reg <- fit(final_rf_wf_reg, data = crime_train_reg)

6.5.3 Model 3: XGBoost

# Create random tuning grid
# Note: learn_rate() and loss_reduction() interpret `range` on the log10 scale,
# so these bounds correspond to learning rates of roughly 1.02 to 2.0 and loss
# reductions of 1 to 10, far larger than typical; this may help explain
# XGBoost's weak generalization reported below.
set.seed(123)
xgb_grid_reg <- grid_random(
  tree_depth(range = c(3, 10)),
  learn_rate(range = c(0.01, 0.3)),
  loss_reduction(range = c(0, 1)),
  size = 5
)

# Tune with cross-validation
set.seed(123)
xgb_res_reg <- tune_grid(
  xgb_wf_reg,
  resamples = crime_folds_reg,
  grid = xgb_grid_reg,
  metrics = reg_metrics,
  control = control_grid(save_pred = TRUE),
  verbose = TRUE
)

# Select best hyperparameters
best_xgb_reg <- select_best(xgb_res_reg, metric = "rmse")

# Finalize and fit model
final_xgb_wf_reg <- finalize_workflow(xgb_wf_reg, best_xgb_reg)

set.seed(123)
final_xgb_fit_reg <- fit(final_xgb_wf_reg, data = crime_train_reg)

6.6 Interpretation

6.6.1 Variable Importance

Although linear regression does not provide built-in feature importance, we inspect standardized regression coefficients (β) to assess influence.

lm_fit |> 
  extract_fit_parsnip() |>
  tidy() |>
  arrange(desc(abs(estimate)))
term estimate std.error statistic p.value
method_OTHERS -1.6923629 0.0123312 -137.2421258 0.0000000
(Intercept) 1.4667559 0.0153006 95.8623633 0.0000000
shift_MIDNIGHT 0.3006732 0.0090803 33.1127293 0.0000000
report_date_parsed_dow_Wed -0.0621854 0.0118883 -5.2308252 0.0000002
shift_EVENING 0.0543332 0.0070769 7.6775166 0.0000000
black_pct 0.0495872 0.0068456 7.2436892 0.0000000
foreign_born_pct -0.0373619 0.0057639 -6.4819960 0.0000000
report_date_parsed_dow_Tue -0.0342020 0.0118799 -2.8789735 0.0039910
hsplus_pct -0.0324247 0.0055568 -5.8351139 0.0000000
report_date_parsed_dow_Thu -0.0287529 0.0120975 -2.3767618 0.0174682
age65plus_pct -0.0270022 0.0033612 -8.0334131 0.0000000
report_date_parsed_dow_Fri -0.0217346 0.0122786 -1.7701105 0.0767133
yday 0.0192624 0.0050609 3.8061352 0.0001413
yday_cos -0.0188528 0.0031652 -5.9563380 0.0000000
report_date_parsed_dow_Mon -0.0096846 0.0119442 -0.8108209 0.4174714
under18_pct 0.0089592 0.0044654 2.0063811 0.0448196
unemployed_rate 0.0088240 0.0047450 1.8596534 0.0629390
yday_sin 0.0049775 0.0050619 0.9833400 0.3254437
poverty_rate 0.0026705 0.0048203 0.5540057 0.5795768
report_date_parsed_dow_Sat -0.0025336 0.0122927 -0.2061031 0.8367110
hispanic_pct -0.0015230 0.0057131 -0.2665806 0.7897929
method_KNIFE 0.0008013 0.0248746 0.0322139 0.9743016

According to the table above, the strongest predictors of felony severity (on the transformed scale) include:

  • method_OTHERS (β = -1.69): Crimes with unspecified methods tend to have lower predicted severity.

  • shift_MIDNIGHT (β = 0.30) and shift_EVENING (β = 0.05): Crimes that happen during the midnight and evening shifts are predicted to be more severe.

  • black_pct and foreign_born_pct have small but statistically significant effects on predicted severity. A higher percentage of Black residents slightly increases the predicted score, while a higher percentage of foreign-born residents slightly lowers it.

These results should be interpreted with caution, as they might be affected by other potential confounding factors.

Note: all predictors were standardized and the outcome was transformed, so the coefficients represent relative effects, not direct effects on the original 0–7 severity scale.

# Random Forest importance
rf_model_reg <- extract_fit_parsnip(final_rf_fit_reg)
vip(rf_model_reg, num_features = 20)

method_OTHERS and shift_MIDNIGHT are again the most influential features. Several distance-based features, such as distance to banks, grocery stores, liquor stores and free WiFi, also play meaningful roles, suggesting that spatial proximity to certain infrastructure may be predictive of felony severity.

# XGBoost importance
xgb_model_reg <- extract_fit_parsnip(final_xgb_fit_reg)
vip(xgb_model_reg, num_features = 20)

The importance values here are more concentrated: method_OTHERS, shift_MIDNIGHT, and black_pct dominate, while demographic variables such as under18_pct, foreign_born_pct, and unemployed_rate make smaller but non-zero contributions.

6.6.2 Interpreting Model Predictions

All three models were trained using felony_severity_trans, the Yeo-Johnson transformed version of the original severity score. While this transformation helps improve model performance and satisfy statistical assumptions, the predicted values are no longer on the familiar 0–7 scale.

To make the predictions directly interpretable, we apply an inverse Yeo-Johnson transformation to the model outputs. This step converts .pred values back to the original felony_severity scale, allowing for meaningful evaluation using RMSE and R-squared. These evaluation metrics now reflect prediction error and explained variance in the original severity score, which ranges from 0 (least severe) to 7 (most severe).

Evaluate on Test Set

# Linear Regression - Test
lm_test_pred <- predict(lm_fit, new_data = crime_test_reg) |> 
  mutate(.pred = predict(yj_obj, newdata = .pred, inverse = TRUE)) |>  # inverse transformation
  bind_cols(crime_test_reg |> select(felony_severity))

metrics(lm_test_pred, truth = felony_severity, estimate = .pred)
.metric .estimator .estimate
rmse standard 1.7102975
rsq standard 0.1841820
mae standard 0.6646159
# Random Forest - Test
rf_test_pred <- predict(final_rf_fit_reg, new_data = crime_test_reg) |> 
  mutate(.pred = predict(yj_obj, newdata = .pred, inverse = TRUE)) |>
  bind_cols(crime_test_reg |> select(felony_severity))

metrics(rf_test_pred, truth = felony_severity, estimate = .pred)
.metric .estimator .estimate
rmse standard 1.5117827
rsq standard 0.5192121
mae standard 0.7915837
# XGBoost - Test
xgb_test_pred <- predict(final_xgb_fit_reg, new_data = crime_test_reg) |> 
  mutate(.pred = predict(yj_obj, newdata = .pred, inverse = TRUE)) |>
  bind_cols(crime_test_reg |> select(felony_severity))

metrics(xgb_test_pred, truth = felony_severity, estimate = .pred)
.metric .estimator .estimate
rmse standard 1.8906491
rsq standard 0.2047492
mae standard 0.6817682

We evaluated model performance on the held-out test set using RMSE, R-squared, and MAE after applying the inverse transformation. Linear regression showed weak performance, with an RMSE of 1.71 and R² of 0.18, capturing little variance. Random forest performed best, with an RMSE of 1.51 and R² of 0.52. XGBoost had the highest RMSE (1.89) but a slightly better R² (0.20) than linear regression.

Evaluate on Implementation Set

# Linear Regression - Implementation
lm_impl_pred <- predict(lm_fit, new_data = crime_enriched_implement) |> 
  mutate(.pred = predict(yj_obj_impl, newdata = .pred, inverse = TRUE)) |> 
  bind_cols(crime_enriched_implement |> select(felony_severity))

metrics(lm_impl_pred, truth = felony_severity, estimate = .pred)
.metric .estimator .estimate
rmse standard 8.1388952
rsq standard 0.0406479
mae standard 0.9702572
# Random Forest - Implementation
rf_impl_pred <- predict(final_rf_fit_reg, new_data = crime_enriched_implement) |> 
  mutate(.pred = predict(yj_obj_impl, newdata = .pred, inverse = TRUE)) |>
  bind_cols(crime_enriched_implement |> select(felony_severity))

metrics(rf_impl_pred, truth = felony_severity, estimate = .pred)
.metric .estimator .estimate
rmse standard 1.3626525
rsq standard 0.5149314
mae standard 0.7959087
# XGBoost - Implementation
xgb_impl_pred <- predict(final_xgb_fit_reg, new_data = crime_enriched_implement) |> 
  mutate(.pred = predict(yj_obj_impl, newdata = .pred, inverse = TRUE)) |>
  bind_cols(crime_enriched_implement |> select(felony_severity))

metrics(xgb_impl_pred, truth = felony_severity, estimate = .pred)
.metric .estimator .estimate
rmse standard 12.3225165
rsq standard 0.0218831
mae standard 0.9982794

On the 2023 implementation data, linear regression and XGBoost failed to generalize, with RMSEs of 8.14 and 12.32 and R² values near zero. In contrast, random forest maintained strong performance, with an RMSE of 1.36 and R² of 0.51.

6.6.3 Collect all metrics into one table

all_metrics_reg <- bind_rows(
  mutate(metrics(lm_test_pred, truth = felony_severity, estimate = .pred), model = "Linear Regression", set = "Test"),
  mutate(metrics(rf_test_pred, truth = felony_severity, estimate = .pred), model = "Random Forest", set = "Test"),
  mutate(metrics(xgb_test_pred, truth = felony_severity, estimate = .pred), model = "XGBoost", set = "Test"),
  mutate(metrics(lm_impl_pred, truth = felony_severity, estimate = .pred), model = "Linear Regression", set = "Implementation"),
  mutate(metrics(rf_impl_pred, truth = felony_severity, estimate = .pred), model = "Random Forest", set = "Implementation"),
  mutate(metrics(xgb_impl_pred, truth = felony_severity, estimate = .pred), model = "XGBoost", set = "Implementation")
)

6.6.4 Visualize 3 Models’ Performances

all_metrics_reg <- all_metrics_reg |>
  mutate(set = factor(set, levels = c("Test", "Implementation")))

# Plot RMSE Comparison
all_metrics_reg |>
  filter(.metric == "rmse") |>
  ggplot(aes(x = model, y = .estimate, fill = set)) +
    scale_fill_manual(values = c("Test" = "#56B4E9", "Implementation" = "steelblue")) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "RMSE Comparison across Models and Datasets",
       x = "Model",
       y = "RMSE") +
  theme_classic() +
  theme(
    text = element_text(family = "serif") 
  )

# Plot R-squared Comparison
all_metrics_reg |>
  filter(.metric == "rsq") |>
  ggplot(aes(x = model, y = .estimate, fill = set)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = c("Test" = "#56B4E9", "Implementation" = "steelblue")) +
  labs(title = "R-squared Comparison across Models and Datasets",
       x = "Model",
       y = "R-squared") +
  theme_classic() +
  theme(
    text = element_text(family = "serif") 
  )

After comparing model performance across test and implementation sets using RMSE and R-squared, the results show that Random Forest outperforms both Linear Regression and XGBoost in predictive accuracy and generalizability. It achieves the lowest RMSE and highest R² in both datasets, suggesting that it captures stable and generalizable patterns in the data. In contrast, Linear Regression and XGBoost perform relatively well on the test set but fail to generalize to 2023 data, indicating overfitting.

Importantly, the random forest model used in this task relies only on spatial and situational predictors, such as proximity to liquor stores, grocery outlets, and public Wi-Fi, along with timing features like shift and day of week; it excludes all socioeconomic indicators. That it still achieves high predictive accuracy suggests that spatial context alone provides a meaningful signal about felony severity risk across tracts.

From a policy perspective, this finding supports efforts to incorporate environmental exposure and situational context into crime prevention strategies. Rather than relying only on historical crime counts or demographic profiling, city agencies could use this model to identify high-severity risk areas based on public infrastructure and land use patterns, and prioritize patrols or intervention accordingly. This aligns with the broader goal of reducing the social costs of crime by focusing on where severe offenses are more likely to occur, even in the absence of full socioeconomic data.

7 Discussion of Results

Our project investigated three key questions: To what extent can crime characteristics and spatial context predict whether an incident is a felony? How accurately can we classify the type of felony? And can we estimate the felony severity score associated with a newly reported crime in DC? The answers to these questions offer practical insights for public-sector decision-making in law enforcement, community engagement, and urban resource allocation.

We investigated our first research question through a binary classification model, developing three candidates based on geospatial and demographic features. The best-performing model was a logistic regression with lasso regularization, using only socioeconomic variables from the ACS and excluding all GIS features. Despite this, it achieved a recall of 99.6% on the implementation dataset, with a 1 − Recall of just 0.0043, indicating a very low miss rate for felonies. This suggests that law enforcement could use this model with reasonable confidence to predict felonies at the DC tract level, directing resources and enhancing public safety accordingly. These findings also demonstrate that more features do not necessarily lead to better performance, and that focused feature selection can improve model robustness. To build on this foundation, future researchers should examine the model's performance on other metrics, such as false positive rates, given the limitations of relying solely on recall.

We investigated our second research question through a multiclass classification model, again developing three candidate models from the same geospatial and demographic features. The best-performing model was random forest, with an overall accuracy rate of 16.3% on the implementation dataset. This low accuracy suggests room for improvement, particularly through different feature engineering. Our multiclass models used both demographic and geospatial features, which may itself be a limitation: the binary classification and regression models both performed well, and each used more focused feature engineering. Future researchers could build one model using only geospatial features and another using only socioeconomic features to predict felony severity class, and compare the two. This approach can provide insight into the most salient predictors of felony severity.

Finally, we explored our third research question through a regression model, utilizing machine learning methods to estimate a transformed version of the felony severity score. Because of the skewed nature of the dataset, we applied a Yeo-Johnson transformation to reduce skewness. Random forest was the highest-performing model, with an R² of 51% and an RMSE of 1.36 on the implementation dataset. The regression task revealed the importance of spatial and situational predictors, such as proximity to liquor stores, grocery outlets, public Wi-Fi zones, and timing features like shift and day of week, rather than demographic and socioeconomic indicators. Future researchers may build on our regression analysis by adding socioeconomic and demographic variables to our existing regression model to determine which ones meaningfully enhance predictive power.

It is difficult for law enforcement to balance public safety with community relations. It is worth highlighting that our regression findings suggest that crime prevention and targeting strategies relying on environmental and geographic factors can identify high-severity risk areas without resorting to demographic profiling. Our analysis adds valuable insights to the ongoing discussion of how to strike the right balance between community engagement and policing for public safety.

8 References

Bowers, K. J., & Johnson, S. D. (2011). Risk terrain modelling: Brokering criminological theory and GIS methods for crime forecasting. Crime Science, 1, 5–17. https://doi.org/10.1080/07418825.2010.486037

Braga, A. A., Papachristos, A. V., & Hureau, D. M. (2014). The effects of hot spots policing on crime: An updated systematic review and meta‑analysis. Justice Quarterly, 31(4), 633–663. https://doi.org/10.1080/07418825.2012.673632

Cheng, T., & Chen, T. (2021). Urban crime and security. In W. Shi, M. F. Goodchild, M. Batty, M.-P. Kwan, & A. Zhang (Eds.), Urban informatics (pp. 213–228). Springer. https://doi.org/10.1007/978-981-15-8983-6_14

Code of the District of Columbia. (2025). Title 22: Criminal offenses and penalties. https://code.dccouncil.gov/us/dc/council/code/titles/22

Golash-Boza, T., & Oh, H. (2021). Crime and neighborhood change in the nation’s capital: From disinvestment to gentrification. Crime & Delinquency, 67(9), 1267–1294. https://doi.org/10.1177/00111287211005394

McCollister, K. E., French, M. T., & Fang, H. (2010). The cost of crime to society: New crime‑specific estimates for policy and program evaluation. Drug and Alcohol Dependence, 108(1–2), 98–109. https://doi.org/10.1016/j.drugalcdep.2009.12.002

Robitzsch, A. (2020). Why ordinal variables can (almost) always be treated as continuous variables: Clarifying assumptions of robust continuous and ordinal factor analysis estimation methods. Frontiers in Education, 5, Article 589965. https://doi.org/10.3389/feduc.2020.589965

Van Wilsem, J., Wittebrood, K., & de Graaf, N. D. (2006). Socioeconomic dynamics of neighborhoods and the risk of crime victimization: A multilevel study of improving, declining, and stable areas in the Netherlands. Social Problems, 53(2), 226–247. https://doi.org/10.1525/sp.2006.53.2.226

Weisburd, D., Telep, C. W., Hinkle, J. C., & Eck, J. E. (2010). Is problem‑oriented policing effective in reducing crime and disorder? Findings from a Campbell systematic review. Criminology & Public Policy, 9(1), 139–172. https://onlinelibrary.wiley.com/doi/10.1111/j.1745-9133.2010.00617.x

9 Appendix

9.1 First Stage for Data Processing (DC Open Data)

9.1.1 Load crime incident data (2008–2025) from DC Open Data portal

# Read the base
crime_incidents <- read_csv("data/DC/crime_incidents/Crime_Incidents_in_2008 - 2025.csv")

# Check different categories of crime
crime_incidents |>
  count(OFFENSE, sort = TRUE)

sort(unique(crime_incidents$OFFENSE))

# Standardize column names for easier downstream referencing
crime_incidents <- janitor::clean_names(crime_incidents)

9.1.2 Create outcome variables

# Attach felony/not felony flag
crime_incidents <- crime_incidents |>
  mutate(felony_flag = case_when(
    offense %in% c("HOMICIDE", "SEX ABUSE", "ASSAULT W/DANGEROUS WEAPON",
                   "ROBBERY", "BURGLARY", "ARSON", "MOTOR VEHICLE THEFT") ~ 1,
    TRUE ~ 0
  ))

# Assign a numerical severity score to each felony type for ranking
crime_incidents <- crime_incidents |>
  mutate(felony_severity = case_when(
    offense == "HOMICIDE" ~ 7,
    offense == "SEX ABUSE" ~ 6,
    offense == "ASSAULT W/DANGEROUS WEAPON" ~ 5,
    offense == "ROBBERY" ~ 4,
    offense == "ARSON" ~ 3,
    offense == "BURGLARY" ~ 2,
    offense == "MOTOR VEHICLE THEFT" ~ 1,
    TRUE ~ 0  # not felony
  ))

Here we assign a felony_severity score to each major offense type, using the Cambridge Crime Harm Index (CHI) and per-case cost data as reference. CHI sentence days reflect the relative harm severity as determined by sentencing guidelines. While monetary cost gives an economic impact perspective, the severity ranking primarily follows the CHI framework.

Felony Offense                    felony_severity Score   CHI Sentence Days (per case)   Total Cost per Case (USD, 2008)
HOMICIDE                          7                       5,475                          $8,982,907
SEX ABUSE (Rape/Sexual Assault)   6                       1,825 / 365                    $240,776
ASSAULT W/DANGEROUS WEAPON        5                       1,460 (GBH)                    $107,020 (Aggravated Assault)
ROBBERY                           4                       365                            $42,310
ARSON                             3                       33                             $21,103
BURGLARY                          2                       20                             $6,462 (Household)
MOTOR VEHICLE THEFT               1                       20                             $10,772
THEFT F/AUTO / THEFT/OTHER        0                       20 / 2                         $3,532

Notes:

  • ASSAULT W/DANGEROUS WEAPON is aligned with GBH (Grievous Bodily Harm, Intent) rather than ABH (Actual Bodily Harm), due to its higher violence and harm level.

  • MOTOR VEHICLE THEFT has a higher monetary cost than BURGLARY, but both offenses carry similar CHI sentence days (20). However, BURGLARY often involves unlawful entry into private residences, violating a victim’s sense of safety and security. This intrusion into personal space is considered more psychologically harmful and socially disruptive than property theft alone, which is why burglary is ranked higher in severity.

  • THEFT F/AUTO and THEFT/OTHER are not classified as felonies in this model; they receive a score of 0 despite having minor CHI and cost estimates. While D.C. law does consider theft exceeding $1,000 as a felony, we exclude these categories because the offense labels in the dataset do not reliably distinguish petty theft from felony-level theft. Without clear thresholds or legal classifications in the offense variable, treating all theft as non-felony ensures greater consistency in modeling.

9.1.3 Add Features

crime_sf <- st_as_sf(crime_incidents,
                     coords = c("longitude", "latitude"),
                     crs = 4326,
                     remove = FALSE)

Liquor

# Liquor Moratorium Zone (polygon)

# Read shapefile:alcohol sale restriction areas(moratorium zones)
liquor_zone_sf <- st_read("data/DC/liquor/Alcoholic_Beverage_and_Cannabis_Administration_Moratorium_Zones/Alcoholic_Beverage_and_Cannabis_Administration_Moratorium_Zones.shp")

# Check geometry type
# st_geometry_type(liquor_zone_sf)
table(st_geometry_type(liquor_zone_sf))

# Keep only polygon type geometries
liquor_zone_sf <- liquor_zone_sf |>
  filter(st_geometry_type(liquor_zone_sf) %in% c("POLYGON", "MULTIPOLYGON"))

# Convert to metric CRS (EPSG:3857) to align with `crime_m` for subsequent spatial operations
crime_m <- st_transform(crime_sf, crs = 3857)
liquor_zone_m <- st_transform(liquor_zone_sf, 3857)

# Spatial intersection: check whether each crime falls within any alcohol restriction zone polygon

crime_m$in_liquor_moratorium_zone <- lengths(st_intersects(crime_m, liquor_zone_m)) > 0

# Convert to integer (0/1) for modeling purposes
crime_m$in_liquor_moratorium_zone <- as.integer(crime_m$in_liquor_moratorium_zone)

# Inspect distribution to assess data validity
table(crime_m$in_liquor_moratorium_zone)

# Drop geometry and convert to data.frame for further analysis and modeling
crime_enriched <- crime_m |>
  st_drop_geometry() |>
  select(everything())

# Liquor Store Location

# Read CSV file containing liquor store information
liquor_csv <- read_csv("data/DC/liquor/Liquor_Licenses.csv")

# Check data structure
# glimpse(liquor_csv)

# Check coordinate fields
summary(liquor_csv$LONGITUDE)
summary(liquor_csv$LATITUDE)
table(liquor_csv$STATUS)

# Filter out records with missing latitude or longitude
liquor_csv_clean <- liquor_csv |>
  filter(STATUS == "Active") |>
  filter(!is.na(LONGITUDE), !is.na(LATITUDE))  # Handle missing values in the raw data

# Convert to sf object with geographic CRS (EPSG:4326)
liquor_store_sf <- st_as_sf(liquor_csv_clean, coords = c("LONGITUDE", "LATITUDE"), crs = 4326)

# st_geometry_type(liquor_store_sf)

# Convert to metric CRS (EPSG:3857) to align with `crime_m`
liquor_store_m <- st_transform(liquor_store_sf, 3857)

# Compute distance matrix: distances from each crime to all liquor stores
dist_matrix_liquorstore <- st_distance(crime_m, liquor_store_m)

# Generate feature columns

# Distance to the nearest liquor store (in meters)
crime_m$nearest_liquorstore_dist <- apply(dist_matrix_liquorstore, 1, min)

# Flag indicating if there is a liquor store within 200 meters
crime_m$near_liquorstore_200m <- apply(dist_matrix_liquorstore, 1, function(x) any(x < 200))
crime_m$near_liquorstore_200m <- as.integer(crime_m$near_liquorstore_200m)

# Inspect Distribution
summary(crime_m$nearest_liquorstore_dist)
table(crime_m$near_liquorstore_200m)
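
A full pairwise distance matrix works at this scale but grows quickly with the number of crimes and stores. A lighter-weight alternative sketch uses sf's nearest-feature lookup, which yields the same nearest-distance feature under the same projected CRS:

# Find each crime's nearest liquor store, then compute only those pairwise distances
nearest_idx <- st_nearest_feature(crime_m, liquor_store_m)
nearest_dist <- st_distance(crime_m,
                            liquor_store_m[nearest_idx, ],
                            by_element = TRUE)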

WiFi

# WiFi spots (point)

# Read Wifi shapefile
wifi_sf <- st_read("data/DC/wireless_hotspots/Wireless_Hotspots_from_DC_Government/Wireless_Hotspots_from_DC_Government.shp")

# Convert to WGS84 geographic CRS (EPSG:4326)
wifi_sf <- st_transform(wifi_sf, crs = 4326)

# Check geometry type
# st_geometry_type(wifi_sf)  

# Convert to metric CRS(EPSG:3857)for spatial calculations
wifi_m <- st_transform(wifi_sf, 3857)

# Compute distance matrix(crime x wifi)
dist_matrix_wifi <- st_distance(crime_m, wifi_m)

# Flag whether each crime has a WiFi point within 100 meters
crime_m$near_wifi_100m <- apply(dist_matrix_wifi, 1, function(x) any(x < 100))
crime_m$near_wifi_100m <- as.integer(crime_m$near_wifi_100m) 

# Optional: compute actual distance to nearest WiFi point (for visualization)
crime_m$nearest_wifi_dist <- apply(dist_matrix_wifi, 1, min)

# Inspect Distribution
table(crime_m$near_wifi_100m)
summary(crime_m$nearest_wifi_dist)

Low Food Access

# Low food access (polygon)
# Load low food access shapefile
lowfood_sf <- st_read("data/DC/low_food_access/Low_Food_Access_Areas/Low_Food_Access_Areas.shp")

# Convert to geographic CRS (EPSG:4326)
lowfood_sf <- st_transform(lowfood_sf, 4326)

# st_geometry_type(lowfood_sf)

# Convert to metric CRS (EPSG:3857) for spatial calculations (polygons)
lowfood_m <- st_transform(lowfood_sf, 3857)
# st_geometry_type(lowfood_m)

# Spatial intersection: determine if within low-food-access polygon area
crime_m$in_lowfood_zone <- lengths(st_intersects(crime_m, lowfood_m)) > 0

# Convert to integer (0/1) for modeling
crime_m$in_lowfood_zone <- as.integer(crime_m$in_lowfood_zone)

# Inspect Distribution 
table(crime_m$in_lowfood_zone)

Grocery Store

# Grocery Store (point)
# Load and convert
grocery_sf <- st_read("data/DC/grocery_store/Grocery_Store_Locations/Grocery_Store_Locations.shp")
grocery_sf <- st_transform(grocery_sf, 4326)
# st_geometry_type(grocery_sf)
grocery_m <- st_transform(grocery_sf, 3857)

# Calculate distance
dist_matrix_grocery <- st_distance(crime_m, grocery_m)

# Add features
crime_m$nearest_grocery_dist <- apply(dist_matrix_grocery, 1, min)
crime_m$near_grocery_300m <- apply(dist_matrix_grocery, 1, function(x) any(x < 300))
crime_m$near_grocery_300m <- as.integer(crime_m$near_grocery_300m)

# Inspect distribution
table(crime_m$near_grocery_300m)
summary(crime_m$nearest_grocery_dist)

Bank

# Bank Location (point)
bank_sf <- st_read("data/DC/bank_location/Bank_Locations/Bank_Locations.shp")
bank_sf <- st_transform(bank_sf, 4326)

# st_geometry_type(bank_sf)
bank_m <- st_transform(bank_sf, 3857)

dist_matrix_bank <- st_distance(crime_m, bank_m)

crime_m$nearest_bank_dist <- apply(dist_matrix_bank, 1, min)
crime_m$near_bank_250m <- apply(dist_matrix_bank, 1, function(x) any(x < 250))
crime_m$near_bank_250m <- as.integer(crime_m$near_bank_250m)

# Inspect Distribution 
table(crime_m$near_bank_250m)
summary(crime_m$nearest_bank_dist)

Vending Zone

# Vending Zone (polygon)
vending_sf <- st_read("data/DC/street_vending/Street_Vending_Zones/Street_Vending_Zones.shp")

# st_geometry_type(vending_sf)
vending_sf <- st_transform(vending_sf, 3857)
# Polygons only, so we just test intersection with `crime_m`

crime_m$in_vending_zone <- lengths(st_intersects(crime_m, vending_sf)) > 0
crime_m$in_vending_zone <- as.integer(crime_m$in_vending_zone)

# Inspect distribution
table(crime_m$in_vending_zone)

Police Sector

# Read the shapefile of police sectors
police_sector_sf <- st_read("data/DC/police_sector_division/Police_Sectors/Police_Sectors.shp")

# Check geometry type (usually POLYGON or MULTIPOLYGON)
# table(st_geometry_type(police_sector_sf))

# Convert projection to EPSG:3857 (meters) to match crime_m
police_sector_m <- st_transform(police_sector_sf, 3857)

# Check if each crime falls within a police sector polygon
# This will assign the sector name/code to each crime (if available)
crime_m <- st_join(crime_m, police_sector_m |> 
                     select(police_sector = SECTOR), left = TRUE)

# Check how many NA (missing sectors)
# table(!is.na(crime_m$police_sector))  # No NA

table(crime_m$police_sector)

# Optional: Convert to factor (if categorical)
crime_m$police_sector <- as.factor(crime_m$police_sector)

Military Base

# Read shapefile of military bases
military_sf <- st_read("data/DC/military_base/Military_Bases/Military_Bases.shp")

# table(st_geometry_type(military_sf))

# Check and transform CRS to EPSG:3857
military_m <- st_transform(military_sf, 3857)

# Check if each crime falls inside a military base polygon
crime_m$in_military_base <- lengths(st_intersects(crime_m, military_m)) > 0

# Convert to 0/1
crime_m$in_military_base <- as.integer(crime_m$in_military_base)

# Optional summary check
table(crime_m$in_military_base)

9.1.4 Enriched crime_incidents Data

# Retain original sf object for spatial analysis
crime_sf_final <- crime_m

# Create modeling dataframe: drop geometry
crime_enriched <- crime_m |> 
  st_drop_geometry()

# Export modeling data
write_csv(crime_enriched, "data/crime_enriched.csv")

crime_enriched <- read_csv("data/crime_enriched.csv")

9.2 Second Stage for Data Processing (ACS Data)

Here we are going to use the tidycensus package and ACS data to further enrich our data frame. To do so, you need a Census API key first.

9.2.1 Set up API for ACS data

If you have already installed a Census API key in your R environment, use this code to check:

Sys.getenv("CENSUS_API_KEY") 

If not, open “https://api.census.gov/data/key_signup.html” to get one, then:

if (!file.exists("~/.Renviron") || !any(grepl("CENSUS_API_KEY", readLines("~/.Renviron")))) {
  census_api_key("<YOUR_API_KEY_HERE>", install = TRUE)
  message("API key added. Please restart your R session.")
} else {
  message("Census API key already installed.")
}

Check each ACS 5-year variable using the URL: https://api.census.gov/data/year/acs/acs5/variables.html (replace "year" with the release year).

You can fill in any year from 2009 to 2023. For example, the ACS5 (2005–2009) variable list is at: https://api.census.gov/data/2009/acs/acs5/variables.html

# Extract predictors of interest

acs_vars_named <- c(
  # econ and employment
  median_income = "B19013_001",
  poverty_total = "B17001_001",
  poverty_count = "B17001_002",
  unemployed = "B23025_005",
  labor_force = "B23025_003",

  # race and ethnicity
  black_count = "B02001_003",
  total_race = "B02001_001",
  hispanic_count = "B03003_003",
  total_ethnicity = "B03003_001",
  foreign_born = "B05002_013",
  total_pop_fb = "B05002_001",

  # household resources
  renter = "B25003_003",
  total_tenure = "B25003_001",
  no_vehicle = "B08201_002",
  total_vehicle = "B08201_001",

  # family structure
  single_mom = "B11003_010",
  total_family = "B11003_001",

  # education: high school and above
  # (B15003_017 through _025 run from "Regular high school diploma"
  # to "Doctorate degree"; all are summed below for the HS+ share)
  edu_hsgrad = "B15003_017",
  edu_ged = "B15003_018",
  edu_somecollege_lt1 = "B15003_019",
  edu_somecollege_1plus = "B15003_020",
  edu_associate = "B15003_021",
  edu_bachelor = "B15003_022",
  edu_master = "B15003_023",
  edu_professional = "B15003_024",
  edu_doctorate = "B15003_025",
  total_edu = "B15003_001",

  # Age:< 18 & 65 + 
  # male < 18
  under18_m1 = "B01001_003", 
  under18_m2 = "B01001_004",
  under18_m3 = "B01001_005", 
  under18_m4 = "B01001_006",
  # female <18
  under18_f1 = "B01001_027", 
  under18_f2 = "B01001_028",
  under18_f3 = "B01001_029", 
  under18_f4 = "B01001_030",
  # male 65+
  age65plus_m1 = "B01001_020", 
  age65plus_m2 = "B01001_021",
  age65plus_m3 = "B01001_022", 
  age65plus_m4 = "B01001_023",
  age65plus_m5 = "B01001_024", 
  age65plus_m6 = "B01001_025",
  # female 65+
  age65plus_f1 = "B01001_044", 
  age65plus_f2 = "B01001_045",
  age65plus_f3 = "B01001_046", 
  age65plus_f4 = "B01001_047",
  age65plus_f5 = "B01001_048", 
  age65plus_f6 = "B01001_049",

  # total population
  total_age = "B01001_001" # total population of different age
)

9.2.2 Variable Availability Check (2009-2023)

Before running the extraction, we first check whether each variable is available across 2009–2023 and adjust the query range accordingly.

# Function: check if variables are valid for a given year
check_vars <- function(year, var_vector) {
  cat("Checking year:", year, "...\n")
  available <- load_variables(year, "acs5", cache = TRUE)
  var_names <- names(var_vector)
  matched <- var_vector %in% available$name
  return(data.frame(
    year = year,
    var_label = var_names,
    var_code = var_vector,
    exists = matched
  ))
}

# Check each year from 2009 to 2023
var_check_df <- purrr::map_dfr(2009:2023, ~check_vars(.x, acs_vars_named))

# Identify which variables are missing in which years
missing_table <- var_check_df |> 
  filter(!exists) |> 
  arrange(var_label, year)
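A quick tally shows which years drive the problem (a small sketch using missing_table from above):

# Number of unavailable variables per year; early years should stand out
missing_table |> count(year)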

9.2.3 Starting ACS Data Retrieval from 2012

Our checks show that several of the selected variables are unavailable between 2009 and 2011. Since incomplete data cannot yield reliable predictions, and we aim to forecast for the most recent year (2023), we begin retrieving data from 2012 onward.

years <- 2012:2023 # 5-year; available from 2009-2023

acs_raw <- map_dfr(years, function(y) {
  get_acs(
    geography = "tract",
    state = "11",
    year = y,
    variables = acs_vars_named,
    survey = "acs5",
    geometry = FALSE
  ) |>
    mutate(year = y)
})

# Export acs raw data
write_csv(acs_raw, "data/acs_raw.csv")

# Re-read checkpoint; keep GEOID as character to avoid numeric parsing
acs_raw <- read_csv("data/acs_raw.csv", col_types = cols(GEOID = col_character()))

Prepare ACS data for merging by pivoting and creating derived features.

acs_wide <- acs_raw |>
  select(GEOID, year, variable, estimate) |>
  pivot_wider(
    names_from = variable,
    values_from = estimate
  )

acs_processed <- acs_wide |>
  mutate(
    # Proportion variables
    poverty_rate = poverty_count / poverty_total,
    black_pct = black_count / total_race,
    hispanic_pct = hispanic_count / total_ethnicity,
    foreign_born_pct = foreign_born / total_pop_fb,
    renter_pct = renter / total_tenure,
    no_vehicle_pct = no_vehicle / total_vehicle,
    singlemom_pct = single_mom / total_family,
    unemployed_rate = unemployed / labor_force,

    # Education proportion (share with high school diploma or above)
    hsplus_total = edu_hsgrad + edu_ged + edu_somecollege_lt1 +
      edu_somecollege_1plus + edu_associate + edu_bachelor +
      edu_master + edu_professional + edu_doctorate,
    hsplus_pct = hsplus_total / total_edu,

    # Age structure proportions
    under18_total = under18_m1 + under18_m2 + under18_m3 + under18_m4 +
      under18_f1 + under18_f2 + under18_f3 + under18_f4,
    under18_pct = under18_total / total_age,

    age65plus_total = age65plus_m1 + age65plus_m2 + age65plus_m3 +
      age65plus_m4 + age65plus_m5 + age65plus_m6 +
      age65plus_f1 + age65plus_f2 + age65plus_f3 +
      age65plus_f4 + age65plus_f5 + age65plus_f6,
    age65plus_pct = age65plus_total / total_age,
    
    # Create tract col for join
    tract = stringr::str_sub(GEOID, -6)
  )

# Export acs processed data
write_csv(acs_processed, "data/acs_processed.csv")

# Re-read checkpoint; GEOID and tract must stay character (leading zeros)
acs_processed <- read_csv(
  "data/acs_processed.csv",
  col_types = cols(GEOID = col_character(), tract = col_character())
)
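A light sanity check on the derived shares (a sketch; the maxima should not exceed 1 where denominators are nonzero):

# Largest value of each derived share; values above 1 would signal a
# mismatched numerator/denominator pair
acs_processed |>
  summarise(across(c(ends_with("_pct"), ends_with("_rate")),
                   ~ max(.x, na.rm = TRUE)))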

9.2.4 Handling Missing Values in Tract Column

Because we want to retain as many valid tract-level geospatial features as possible, we need to verify that the DC Open Data, the ACS5 data, and the official DC tract data are consistent before merging them. We first compare the tract IDs in crime_enriched and acs_processed with the official 2023 shapefile to identify missing or obsolete tracts.

# Load 2023 official DC tract shapefile 
dc_tract_sf <- st_read("data/tl_2023_11_tract/tl_2023_11_tract.shp") 

# Extract tract lists
official_tract6 <- unique(dc_tract_sf$TRACTCE)

official_tract11 <- unique(dc_tract_sf$GEOID)

# Compare crime_enriched (census_tract) with dc_tract_sf$TRACTCE
crime_tracts <- unique(na.omit(crime_enriched$census_tract))

missing_in_crime <- setdiff(official_tract6, crime_tracts)

unexpected_in_crime <- setdiff(crime_tracts, official_tract6)

# Compare acs_processed GEOID with official dc_tract_sf$GEOID
acs_tracts <- unique(na.omit(acs_processed$GEOID))

missing_in_acs <- setdiff(official_tract11, acs_tracts)

unexpected_in_acs <- setdiff(acs_tracts, official_tract11)

The official DC shapefile (dc_tract_sf) contains 206 tracts, while both crime_enriched$census_tract and acs_processed$GEOID contain 231 unique tract identifiers after dropping NAs. This means 25 tracts are present in the datasets but missing from the 2023 TIGER shapefile.
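These counts can be read directly off the objects created above:

# Confirm the counts: 206 official tracts, 231 in our data, 25 extra
length(official_tract6)
length(crime_tracts)
length(unexpected_in_crime)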

Note: only the crime incidents from DC Open Data have NAs in the tract column. The ACS5 tract column contains no NAs, meaning every tract has demographic features computed in the ACS survey. A few crime incidents lack a recorded tract; this can happen during law enforcement operations when detailed location data is not captured.
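Although we do not apply it here, one way to recover these missing tracts would be a spatial join against the 2023 tract polygons, using the sf object retained in Section 9.1.4. A sketch, assuming crime_sf_final is still in EPSG:3857 while the TIGER shapefile ships in NAD83:

# Spatially assign tracts to incidents whose census_tract is NA (sketch)
tract_m <- st_transform(dc_tract_sf, 3857)
recovered <- crime_sf_final |>
  filter(is.na(census_tract)) |>
  st_join(tract_m |> select(TRACTCE), left = TRUE)
table(is.na(recovered$TRACTCE))  # how many incidents remain unassigned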

Since the crime_enriched data spans 2008–2025 and the acs_processed data covers 2012–2023, it is likely that changes to tract definitions, such as splits, merges, or renumbering, have occurred over the years. These changes can result in older tract codes persisting in historical data but not appearing in the most recent shapefile.

To confirm this, we compare the unexpected tracts in crime_enriched and acs_processed against the official 2023 shapefile to see if they are the same.

# Convert 11-digit ACS GEOIDs to 6-digit tract codes
unexpected_in_acs_tract6 <- substr(unexpected_in_acs, 6, 11)

# Check if both datasets contain the same extra tracts
setequal(unexpected_in_crime, unexpected_in_acs_tract6)

# Inspection of overlap and differences
intersect(unexpected_in_crime, unexpected_in_acs_tract6)

setdiff(unexpected_in_crime, unexpected_in_acs_tract6)

setdiff(unexpected_in_acs_tract6, unexpected_in_crime)

The results show that the two datasets share the exact same set of 25 unexpected tract codes.

To investigate whether these tracts are legacy codes that have since been phased out, we now filter both datasets to observations from 2019 to 2023, the window we intend to use for our predictive models. We then compare the resulting tract coverage to the official 2023 DC shapefile.

# Filter crime data to 2019–2023
crime_filtered <- crime_enriched |>
  mutate(report_date = lubridate::ymd_hms(str_sub(report_dat, 1, 19))) |> 
  filter(year(report_date) >= 2019 & year(report_date) <= 2023)

# Filter ACS data to 2019–2023
acs_filtered <- acs_processed |>
  filter(year >= 2019 & year <= 2023)

# Extract unique tract identifiers
crime_tracts_filtered <- unique(na.omit(crime_filtered$census_tract)) 

acs_tracts_filtered <- unique(na.omit(acs_filtered$GEOID))

We then identify any tracts that appear in the filtered datasets (2019-2023) but are not included in the 2023 TIGER shapefile.

# Compare against 2023 official tracts
missing_in_crime_2019_2023 <- setdiff(official_tract6, crime_tracts_filtered)

unexpected_in_crime_2019_2023 <- setdiff(crime_tracts_filtered, official_tract6)

missing_in_acs_2019_2023 <- setdiff(official_tract11, acs_tracts_filtered)

unexpected_in_acs_2019_2023 <- setdiff(acs_tracts_filtered, official_tract11)

The crime data from 2019–2023 aligns perfectly with the 2023 TIGER shapefile, showing no extra tracts. However, the ACS data still contains 25 unexpected tracts, likely because ACS 5-year estimates pool earlier years and retain older tract definitions. To ensure consistency across all datasets, we will ultimately restrict our analysis to the 2020–2023 period.

To investigate further, we check whether these 25 tracts are the same as the unexpected tracts found in the full datasets from 2008–2025 (crime incidents) and 2012–2023 (ACS5).

# Check if the 25 unexpected tracts in 2019–2023 ACS match those in the full ACS data
setequal(unexpected_in_acs_2019_2023, unexpected_in_acs)

# Trim ACS GEOIDs to 6-digit tract codes for comparison with crime incidents data
unexpected_in_acs_tract6_filtered <- substr(unexpected_in_acs_2019_2023, 6, 11)

# Check if these match the unexpected tracts in the full crime incidents dataset
setequal(unexpected_in_acs_tract6_filtered, unexpected_in_crime)

In the full datasets:

  • Both crime_enriched (2008-2025) and acs_processed (2012-2023) contain 25 extra tracts not present in the 2023 TIGER shapefile.

  • These 25 tracts are identical across both datasets, indicating that they are likely legacy tract codes.

In the 2019–2023 subset:

  • The unexpected tracts have been removed from the crime incidents data.

  • However, they still exist in the ACS data.

This suggests that the outdated tracts persist in ACS due to the rolling nature of 5-year estimates.

To test this further, we narrow ACS to 2020–2023. Because this window only includes data from 2016 onward, we expect the outdated tracts to drop off if the change occurred before or around 2016.

# Filter ACS to 2020–2023
acs_filtered_2020 <- acs_processed |>
  filter(year >= 2020 & year <= 2023)

acs_tracts_filtered_2020 <- unique(acs_filtered_2020$GEOID)

# Compare with official 2023 tract list
missing_in_acs_2020_2023 <- setdiff(official_tract11, acs_tracts_filtered_2020)

unexpected_in_acs_2020_2023 <- setdiff(acs_tracts_filtered_2020, official_tract11)

# Compare with 2019-2023 ACS tract list

setdiff(acs_tracts_filtered, acs_tracts_filtered_2020)

setdiff(acs_tracts_filtered_2020, acs_tracts_filtered)

# Check if all unexpected tracts are gone
acs_tract6_filtered_2020 <- substr(acs_tracts_filtered_2020, 6, 11)

setequal(crime_tracts_filtered, acs_tract6_filtered_2020)

We find that the 2020–2023 ACS data now matches the 2023 TIGER shapefile and the 2019–2023 crime data exactly. The outdated tract codes still appeared in the 2019–2023 results but disappear once 2019 is excluded. This may be because ACS 5-year estimates pool data from earlier years: the 2019 release covers 2015–2019, while the 2020 release covers 2016–2020, so the earliest year pooled moves from 2015 to 2016. The tract changes therefore likely happened between 2015 and 2016.
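To make the window arithmetic concrete, a minimal sketch:

# An ACS 5-year release ending in year y pools data from (y - 4) through y
tibble(acs_endyear = c(2019, 2020)) |>
  mutate(earliest_year_pooled = acs_endyear - 4)  # 2015 and 2016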

Next, we will compare the TIGER shapefiles from 2015 and 2016 to confirm when the tract boundary changes were officially implemented to test this hypothesis.

# Load 2015 and 2016 DC tract shapefiles
tract_2015 <- st_read("data/tl_2015_11_tract/tl_2015_11_tract.shp")
tract_2016 <- st_read("data/tl_2016_11_tract/tl_2016_11_tract.shp")

# Extract unique GEOIDs
geoids_2015 <- unique(tract_2015$GEOID)
geoids_2016 <- unique(tract_2016$GEOID)

# Compare 
setdiff(geoids_2015, geoids_2016)
setdiff(geoids_2016, geoids_2015)

The results show our hypothesis was incorrect: no tract changes occurred between 2015 and 2016, and the GEOIDs are identical for both years.

To further investigate when the tract definitions changed, we referred to the U.S. Census Bureau’s 2010 guide, which states that DC had 179 census tracts as of 2010.

However, we observed that the 2023 DC TIGER shapefile includes 206 tracts. The same count appears in the ACS 5-year data from 2020–2023 and the crime incident data from 2019–2023.

This led to a second hypothesis: DC’s total number of tracts increased between 2010 and 2023.

# Load 2010 DC tracts
dc_tract_2010 <- st_read("data/tl_2010_11001_tract10/tl_2010_11001_tract10.shp")


# Extract tract GEOIDs
dc_tracts_2010_ids <- unique(dc_tract_2010$GEOID)
dc_tracts_2023_ids <- unique(dc_tract_sf$GEOID)

# Compare tract counts and differences
length(dc_tracts_2010_ids)  # 179
length(dc_tracts_2023_ids)  # 206

setdiff(dc_tracts_2023_ids, dc_tracts_2010_ids) # 52 added
setdiff(dc_tracts_2010_ids, dc_tracts_2023_ids) # 25 removed

From this comparison, we see that 52 tracts were added and 25 were removed between 2010 and 2023, a net increase of 27 that confirms DC’s tract count grew from 179 to 206 over this period. While this change isn’t explicitly documented in a single source, it can be deduced through shapefile comparison. To identify the specific year of the change, we can repeat the comparison year by year, as sketched below.
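A compact way to run the year-by-year comparison is to download each vintage programmatically; a sketch assuming the tigris package (not otherwise used in this project) is installed:

library(tigris)

# Count DC tracts for each TIGER vintage; the jump reveals the change year
tract_counts <- purrr::map_int(
  purrr::set_names(2015:2023),
  ~ nrow(tracts(state = "11", year = .x))
)
tract_counts  # expect 179 through 2019, 206 from 2020 onward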

Since both 2015 and 2016 still had 179 tracts, which are the same as in 2010, we began checking for changes starting from 2017.

# Load 2017-2020 DC tract shapefiles
tract_2017 <- st_read("data/tl_2017_11_tract/tl_2017_11_tract.shp")
tract_2018 <- st_read("data/tl_2018_11_tract/tl_2018_11_tract.shp")
tract_2019 <- st_read("data/tl_2019_11_tract/tl_2019_11_tract.shp")
tract_2020 <- st_read("data/tl_2020_11_tract/tl_2020_11_tract.shp")

# Extract unique GEOIDs
geoids_2017 <- unique(tract_2017$GEOID)
geoids_2018 <- unique(tract_2018$GEOID)
geoids_2019 <- unique(tract_2019$GEOID)
geoids_2020 <- unique(tract_2020$GEOID)

# Compare year by year
setdiff(geoids_2017, geoids_2018)
setdiff(geoids_2018, geoids_2017)

setdiff(geoids_2018, geoids_2019)
setdiff(geoids_2019, geoids_2018)

setdiff(geoids_2019, geoids_2020)
setdiff(geoids_2020, geoids_2019)

The results confirm that the major tract update occurred between 2019 and 2020. During this transition, 25 tracts were removed and 52 were added, increasing the total from 179 to 206.

While ACS and crime data each follow their own update cycles, we have verified that for the years 2020–2023, the tracts in both ACS 5-year estimates and crime incidents data fully match the 2023 TIGER shapefile.

Therefore, to ensure consistency across all datasets, we revise our modeling window to cover only the four years from 2020 to 2023, instead of the originally planned 2019–2023 period.

9.2.5 Merge crime_enriched and ACS5 data

Clean and prepare the crime data to match ACS structure.

# Parse report dates and align key names in `crime_enriched`
crime_enriched <- crime_enriched |>
  mutate(
    # `report_dat` is a string like "2008/12/12 22:00:00+00";
    # `ymd_hms()` parses it and handles the timezone offset
    report_date_parsed = ymd_hms(report_dat),
    year = year(report_date_parsed)
  ) |>
  rename(tract = census_tract)  # rename census_tract for the join

Now we merge the cleaned crime data with the ACS data by matching on tract and year.

# Perform left join by tract and year
crime_enriched <- left_join(
  crime_enriched,
  acs_processed,
  by = c("tract" = "tract", "year" = "year")
)
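Before saving, a light diagnostic confirms the join behaved as a many-to-one merge (a sketch):

# (tract, year) should uniquely identify ACS rows; expect zero rows here
acs_processed |> count(tract, year) |> filter(n > 1)

# Share of incidents matched to an ACS record (incidents outside
# 2012-2023 or with a missing tract will have NA here)
mean(!is.na(crime_enriched$median_income))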


write_csv(crime_enriched, "data/crime_enriched_with_acs.csv")

crime_enriched <- read_csv(
  "data/crime_enriched_with_acs.csv",
  col_types = cols(tract = col_character())  # preserve leading zeros
)

As a final step, we keep only the years with consistent tract coverage and inspect the remaining missingness:

crime_enriched_filtered <- crime_enriched |>
  filter(year >= 2020, year <= 2023)

# Inspect which variables for missing values
colSums(is.na(crime_enriched_filtered))

9.2.6 Data imputation on missing values

During the merge of the DC Open Data and ACS datasets, several variables introduced a high proportion of missing values. Instead of dropping every row with any NA, which would remove some tracts entirely, we proceed in stages: we first drop columns whose overall missing rate exceeds 30%, drop rows with a missing tract, then drop variables whose missing rate exceeds 95% within any (year, tract) group, and finally drop the remaining NA columns that are not used for modeling.

# Calculate missing value proportion for each column
na_ratio <- colSums(is.na(crime_enriched_filtered)) / nrow(crime_enriched_filtered)

# Retain columns with missing rate < 30%
crime_enriched_filtered <- crime_enriched_filtered |>
  select(which(na_ratio < 0.3))

# Drop rows where `tract` is NA
crime_enriched_filtered <- crime_enriched_filtered |>
  filter(!is.na(tract))

# Identify columns to check (excluding grouping columns)
cols_to_check <- setdiff(colnames(crime_enriched_filtered), c("year", "tract"))

# For each (year, tract) group, calculate the NA rate for each variable
na_summary <- crime_enriched_filtered |>
  group_by(year, tract) |>
  summarise(across(all_of(cols_to_check), ~ mean(is.na(.x)), .names = "na_{.col}")) |>
  ungroup()

# Drop variables whose NA rate exceeds 95% within any (year, tract) group
vars_with_high_na <- na_summary |>
  pivot_longer(cols = starts_with("na_"), names_to = "var", values_to = "na_rate") |>
  filter(na_rate > 0.95) |>
  mutate(var = sub("^na_", "", var)) |>
  distinct(var) |>
  pull(var)

crime_enriched_filtered <- crime_enriched_filtered |>
  select(-all_of(vars_with_high_na))
  
# Check how many NA values remain per column
colSums(is.na(crime_enriched_filtered))

crime_enriched_filtered |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(cols = everything(), names_to = "column", values_to = "na_count") |>
  filter(na_count > 0) |>
  arrange(desc(na_count))

# Drop known NA columns that are not used for modeling
crime_enriched_no_na <- crime_enriched_filtered |>
  select(-c(end_date, psa, district, start_date, block_group, voting_precinct))

# Check if there's NA in crime_enriched_no_na
sum(is.na(crime_enriched_no_na))

write_csv(crime_enriched_no_na, "data/crime_enriched_acs_nona.csv")

crime_enriched_no_na <- read_csv(
  "data/crime_enriched_acs_nona.csv",
  col_types = cols(tract = col_character())  # preserve leading zeros
)

# Check the tract column; the count should match the 206 post-2020 tracts
length(unique(crime_enriched_no_na$tract))
unique(crime_enriched_no_na$tract)

9.3 Final Stage for Data Processing (NIBRS Data)

We examined the NIBRS data (UCR Program data from the Harvard Dataverse), including its administrative, offense, offender, arrestee, and victim segments. Although these files include a year column we could merge on, they lack a census tract column: the datasets cover the entire US, so they are better suited to state-level models than to tract-level prediction within DC. We therefore decided not to join the NIBRS data into our DC crime prediction model.

admin_2023 <- readRDS("data/DC/nibrs/admin/nibrs_administrative_segment_2023.rds") |>
  filter(state_abb == "dc")

offense_2023 <- readRDS("data/DC/nibrs/offense/nibrs_offense_segment_2023.rds") |>
  filter(state_abb == "dc")

offender_2023 <- readRDS("data/DC/nibrs/offender/nibrs_offender_segment_2023.rds") |>
  filter(state_abb == "dc")

arrestee_2023 <- readRDS("data/DC/nibrs/arrestee/nibrs_arrestee_segment_2023.rds") |>
  filter(state_abb == "dc")

victim_2023 <- readRDS("data/DC/nibrs/victim/nibrs_victim_segment_2023.rds") |>
  filter(state_abb == "dc")