Crime Data: Creating a predictive model

Introduction

This model was completed as the final project deliverable for my graduate certificate in Data Analytics at Boston University's Metropolitan College. The course, Data Mining, was the last of the four in the program. I used a combination of R, Weka, and Excel.

Objective

The goal of this project was to test a number of different attribute selection methods and classifier algorithms and determine which combination resulted in the best predictive model.

Data

The dataset was provided by the course professor. It was originally taken from the UCI Machine Learning Repository and then edited to meet the professor's specifications.

Abstract: Communities in the US. Data combines socio-economic data from the 1990 Census, law enforcement data from the 1990 Law Enforcement Management and Admin Stats survey, and crime data from the 1995 FBI UCR.

Preprocessing

Some of the attributes are percentages (0-100%) while others are whole numbers. I was concerned that during attribute selection the attributes with larger values would be given more weight (and thus appear more important to the analysis). To compare attributes on an apples-to-apples basis, I converted some of them to an index. The index runs from 0 to 100, where the minimum value in the attribute is assigned 0 and the maximum value is assigned 100. All preprocessing was done in R.

An example of the R code I used to index the values:

data$pop <- (data$pop-min(data$pop)) / (max(data$pop) - min(data$pop)) * 100
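
If several columns need the same treatment, the rescaling can be wrapped in a small helper function and applied to each of them. This is a minimal sketch rather than my exact preprocessing script; the column names in cols are placeholders for whichever count-valued attributes need indexing:

# Rescale a numeric vector to a 0-100 index (min -> 0, max -> 100)
index100 <- function(x) {
  (x - min(x, na.rm=TRUE)) / (max(x, na.rm=TRUE) - min(x, na.rm=TRUE)) * 100
}

# Hypothetical column names; substitute the attributes that need indexing
cols <- c("pop", "numbUrban", "NumUnderPov")
data[cols] <- lapply(data[cols], index100)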

Create test and training datasets

In the full dataset, one attribute is used to identify the level of crime in each neighborhood. That attribute is "Class."

I used stratified random sampling in R to create the test dataset and checked that it maintained the same class proportions as the original dataset (roughly one-third of the rows are class 0 and two-thirds are class 1).

library(sampling)
set.seed(5000)

# strata() expects the data to be sorted by the stratification variable
data.order <- data[order(data$Class), ]

# Allocate the 601 test rows across the classes in proportion to their frequency
freq <- table(data.order$Class)
s1 <- 601 * freq / sum(freq)
s1 <- s1[s1 != 0]

# Stratified simple random sampling without replacement, then pull out the sampled rows
str1 <- strata(data.order, stratanames=c("Class"), size=s1, method="srswor")
test <- getdata(data.order, str1)

prop.table(table(test$Class))
prop.table(table(data$Class))

I then removed the test rows from the overall dataset to create the training dataset and checked that the training dataset also maintained the same class proportions.

training <- data[-c(as.numeric(rownames(test))),]
prop.table(table(training$Class))

I used a semi-join to check that no rows were shared between the training and test datasets: semi_join() returns the rows of training that have a match in test, so an empty result confirms there is no overlap.

library(dplyr)
semi_join(training,test)

I then wrote the files out in .arff format for use in Weka.

library(foreign)
write.arff(test,"filepath/test.arff", eol="\n", relation=deparse(substitute(test)))
write.arff(training,"filepath/training.arff", eol="\n", relation=deparse(substitute(training)))

Attribute selection

I used three attribute selection methods in Weka.

GainRatioAttributeEval (with the Ranker search method)
This method evaluates each attribute by measuring its gain ratio with respect to the class attribute and ranks the attributes accordingly. There is no obvious inflection point in the rankings, so I chose the first 10; when I compared 10 vs. 5 attributes on the training dataset, 10 performed better. The attributes selected using this method were as follows (an R sketch of reproducing the ranking appears after the list):

  • Population
  • Percentage of people living in areas classified as urban
  • Percentage of women who are divorced
  • Percentage of people in owner-occupied households
  • Percentage of kids born to parents who never married
  • Percentage of occupied housing units without a phone
  • Percentage of people under the poverty level
  • Number of kids born to parents who never married
  • Number of homeless people counted in the street
  • Number of people under the poverty level
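
For reference, the same gain-ratio ranking can also be produced from R through the RWeka package, which exposes Weka's attribute evaluators. This is a minimal sketch, not the exact workflow I used in the Weka GUI; it assumes RWeka is installed and that Class is stored as a factor in the training data frame:

library(RWeka)

# Gain ratio of each attribute with respect to Class
gr <- GainRatioAttributeEval(Class ~ ., data=training)

# Rank the attributes and keep the top 10
sort(gr, decreasing=TRUE)[1:10]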

CorrelationAttributeEval (with the Ranker search method)
This method evaluates each attribute's worth by measuring its Pearson correlation with the class attribute and ranks the attributes accordingly. There is an inflection point after 17 attributes, but since I needed to stay within 5-15 attributes, I chose the first 15; when I compared 10 vs. 15 attributes on the training dataset, 15 performed better. The attributes selected using this method were as follows (an R sketch of reproducing the ranking appears after the list):

  • Percentage of women who are divorced
  • Percentage of men who are divorced
  • Percentage of kids born to parents who never married
  • Percentage of people under the poverty level
  • Percentage of occupied housing units without a phone
  • Percentage of the population that is Caucasian
  • Percentage of people in owner-occupied households
  • Percentage of the population that are divorced
  • Percentage of households with public assistance income in 1989
  • Percentage of the population that is African American
  • Percentage of housing units with fewer than 3 bedrooms
  • Percentage of households which are owner-occupied
  • Median household income
  • Median family income
  • Percentage of people 16 and older who are in the labor force and are unemployed
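
The same idea can be approximated directly in base R by correlating each attribute with the class labels. A minimal sketch, assuming Class is coded 0/1 and the remaining attributes are numeric:

# Absolute Pearson correlation of each attribute with the 0/1 class
class.num <- as.numeric(as.character(training$Class))
predictors <- training[, setdiff(names(training), "Class")]
correlations <- sapply(predictors, function(x) abs(cor(x, class.num)))

# Rank the attributes and keep the top 15
sort(correlations, decreasing=TRUE)[1:15]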

CfsSubsetEval (with the BestFirst search method)
This method evaluates subsets of attributes by considering each attribute's individual predictive ability along with the degree of redundancy among them; it favors attributes that are highly correlated with the class attribute but not highly intercorrelated with one another. The method selected 26 attributes, and because it does not provide a ranking score the way the Ranker methods do, I took an educated guess and kept the ones I believed would be most relevant. The attributes I selected were as follows (a sketch of a comparable selection in R appears after the list):

  • Percentage of the population that is Caucasian
  • Percentage of the population that is African American
  • Percentage of households with wage or salary income in 1989
  • Percentage of households with public assistance income in 1989
  • Percentage of people under the poverty level
  • Percentage of kids born to parents who never married
  • Percentage of people in dense housing (more than 1 person per room)
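
Outside the Weka GUI, a comparable correlation-based feature subset selection can be run from R with the FSelector package, which likewise searches for attributes that are predictive of the class but not redundant with one another. This is a minimal sketch under the assumption that FSelector is installed and that Class is a factor in the training data frame:

library(FSelector)

# CFS: correlation-based feature subset selection with a best-first search
cfs(Class ~ ., data=training)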

For each of these attribute selections, I created new test and training datasets in R. For brevity's sake, I will use CfsSubsetEval as the example, but I followed the same process for each.

# Column indices of the selected attributes plus the Class attribute
attributes.cfs <- c(3,4,11,15,26,40,58,87)
training.cfs <- training[,attributes.cfs]
test.cfs <- test[,attributes.cfs]
write.arff(test.cfs,"filepath/cfssubseteval/test.arff", eol="\n", 
    relation=deparse(substitute(test.cfs)))
write.arff(training.cfs,"filepath/cfssubseteval/training.arff", eol="\n", 
    relation=deparse(substitute(training.cfs)))

Create the classifier models

I applied 10 different classifier algorithms to each of the three attribute-selection training sets, resulting in 30 models. I then applied each model to its corresponding test dataset to determine its effectiveness. A summary of the confusion matrices is listed below.

TP (true positives) - the number of positive instances correctly classified as positive by the model
FN (false negatives) - the number of positive instances incorrectly classified as negative
FP (false positives) - the number of negative instances incorrectly classified as positive
TN (true negatives) - the number of negative instances correctly classified as negative by the model
TPR (true positive rate, or sensitivity) - the percentage of positive instances that were correctly classified: TP / (TP + FN)
FPR (false positive rate) - the percentage of negative instances that were incorrectly classified as positive: FP / (FP + TN)
Prec. (precision) - the percentage of instances classified as positive that are actually positive: TP / (TP + FP)
Spec. (specificity) - the percentage of negative instances that were correctly classified: TN / (TN + FP)
Acc. (accuracy) - the percentage of all instances that were correctly classified: (TP + TN) / (TP + TN + FP + FN)
F-measure - combines sensitivity and precision with equal weight in a single metric: (2 × precision × sensitivity) / (precision + sensitivity)
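
To make the relationships between these metrics concrete, here is a short R sketch that computes them from a single confusion matrix; the cell counts are placeholders, not output from any of the 30 models:

# Placeholder confusion-matrix counts (not actual model results)
TP <- 180; FN <- 20; FP <- 30; TN <- 371

tpr  <- TP / (TP + FN)                   # true positive rate (sensitivity)
fpr  <- FP / (FP + TN)                   # false positive rate
prec <- TP / (TP + FP)                   # precision
spec <- TN / (TN + FP)                   # specificity
acc  <- (TP + TN) / (TP + TN + FP + FN)  # accuracy
f    <- 2 * prec * tpr / (prec + tpr)    # F-measure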

[Figure: summary of the confusion matrices for all 30 models (ConfusionMatrices.JPG)]

Selecting the best model

My first step was to look at the accuracies of the 30 models. This gave me a list of five with accuracies above 80%.

[Figure: the five finalist models with accuracies above 80% (Finalists.JPG)]

I checked the TP, FN, FP, and TN to see if any of them had obviously unbalanced ratios (they didn’t).

I then considered precision, FPR, and specificity. None of the finalists were vastly different, but the model that performed best across the board was Bagging with J48 using the CorrelationAttributeEval attribute selection method. It had nearly the highest precision and specificity, the highest accuracy, and the lowest FPR.

Learnings and observations

I learned that when building a model, it's best to try many algorithms and attribute selection methods. There's a reason we use computers for data processing: if we're looking for the objectively best model, relying only on hand-picked attributes can bias the analysis, or at a minimum result in poor performance.

One of the most interesting observations was how poorly the models turned out for the selection method where I hand-picked the attributes I thought made sense. This shouldn't be surprising, but in many cases we tend to overestimate our ability to predict what the data will show using "logic" or "judgment" that are really just subjective measures. The models based purely on Weka's rankings were the ones that performed best.