Create a Random Forest and a Support Vector Machine Classifier and analyse the models with a ROC Curve

We will create a Random Forest and a Support Vector Machine model for income prediction. We will also learn what a Receiver Operating Characteristic (ROC) curve is and how to interpret it.

  • Objectives: Model Creation, Random Forest, SVM, Prediction, ROC-Curve
  • Requirements: R Basics, R Data Mining

Data

The Adult data set is provided by the "Center for Machine Learning and Intelligent Systems". It includes roughly 32,000 records with information on age, workclass, education, …, and finally the income. Income is either "<=50K" or ">50K".

Data Preparation

We need some packages:

  • rio for data import
  • randomForest to create the Random Forest model
  • e1071 to create the SVM model
  • ROCR to create the receiver operating characteristic curve
  • ggplot2 for visualisation
  • knitr and dplyr for nice tables (kable) and filtering

url defines where the file is located. It can be downloaded with download.file and imported with import.

The column names need to be set, and factor levels need to be created, because all columns are plain characters after import. The target variable income is a factor, but needs to be numeric for further analysis. After this step, income is either 1 (meaning "<=50K") or 2 (">50K"). After deleting the columns that are not required, data preparation is done.

# load libraries
suppressPackageStartupMessages(library(rio))  # data import
suppressPackageStartupMessages(library(randomForest))  # random forest
suppressPackageStartupMessages(library(e1071))  # svm
suppressPackageStartupMessages(library(ROCR))  # ROC
suppressPackageStartupMessages(library(ggplot2))  # visualisation
suppressPackageStartupMessages(library(knitr))  # for nice table (kable)
suppressPackageStartupMessages(library(dplyr))  # for filtering


url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
#download.file(url = url, destfile = "./data/adult.txt")

adult <- import(file = "./data/adult.txt")
colnames(adult) <- c("age", "workclass", "fnlwgt", "education", "education_num", "marital_status",
                     "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss",
                     "hours_per_week", "native_country", "income")

# chars to factors
adult$workclass <- as.factor(adult$workclass)
adult$education <- as.factor(adult$education)
adult$marital_status <- as.factor(adult$marital_status)
adult$occupation <- as.factor(adult$occupation)
adult$relationship <- as.factor(adult$relationship)
adult$race <- as.factor(adult$race)
adult$sex <- as.factor(adult$sex)
adult$native_country <- as.factor(adult$native_country)
adult$income <- as.numeric(as.factor(adult$income))

# delete columns
adult$fnlwgt <- NULL
adult$education <- NULL

Random-Forest Model Creation

Now the interesting part starts. First, we need to split the data into a training set and a test set. The rule of thumb is 80 % training, 20 % testing. Since training the model takes very long, we use 20 % for training and 80 % for testing. To get reproducible results I set the seed, so that the sample function draws the same rows each time.

set.seed(1000)
adult_train <- sample(x = 1:nrow(adult), size = .2 * nrow(adult))
adult_test <- setdiff(1:nrow(adult), adult_train)  # remaining data

The model will be stored in the variable rf_fit. For this, the function randomForest is called. It requires a formula, which defines the target variable and which independent variables influence it. Here, the target variable is income, and the dot . means that all other variables are used as independent variables. The data needs to be passed as well as the training subset.

After a while the model is created, and predictions are made based on the test subset.

rf_fit <- randomForest(income ~ ., data = adult, subset = adult_train)
rf_pred <- predict(object = rf_fit, newdata = adult[adult_test, ] )

Let’s take a look at the predictions.

(Figure: distribution of the predicted values)
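The original plotting chunk is not shown in the post; a minimal sketch that reproduces such a plot from the rf_pred vector created above is a simple histogram:

# sketch (not the original chunk): histogram of the continuous predictions
ggplot(data.frame(pred = rf_pred), aes(x = pred)) +
  geom_histogram(bins = 50) +
  xlab("Predicted value [-]") +
  theme_bw()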

The predictions are continuous and range from 1 to 2. To get a classification, a threshold has to be chosen. E.g. you could define a threshold at 1.5: each value below it is classified as 1 and each value above as 2, as shown in the snippet below.
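In R this is a one-liner (the same pattern is used for the confusion matrix later on):

# classify the continuous predictions with a threshold of 1.5
head(ifelse(rf_pred > 1.5, 2, 1))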

The classifier results need to be evaluated. First a prediction object is created, which transforms the input data into a standardised format. performance is then used to create the data for the ROC curve. Here, the true positive rate tpr and the false positive rate fpr are used.

rf_prediction_object <- prediction(rf_pred, adult$income[adult_test])
rf_perf <- performance(rf_prediction_object, "tpr", "fpr")

Support Vector Machine Model Creation

The process is very similar to the previous model. svm is the function that creates the corresponding model. The parameters are the same as before. predict is used in the same way; only the object is now the svm_fit model. The creation of the prediction object and the performance data is identical.

svm_fit <- svm(income ~ ., data = adult, subset = adult_train)
svm_pred <- predict(object = svm_fit, newdata = adult[adult_test, ])
svm_prediction_object <- prediction(svm_pred, adult$income[adult_test])
svm_perf <- performance(svm_prediction_object, "tpr", "fpr")

ROC-Curve

A ROC curve is a diagram that shows the performance of a classifier across all possible thresholds. In our example the true positive rate (TPR) is plotted against the false positive rate (FPR). The better model has the larger area under the curve (AUC); see the snippet below.
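ROCR can compute this area directly; a small addition (not in the original post), using the prediction objects created above:

# area under the curve for both models
performance(rf_prediction_object, measure = "auc")@y.values[[1]]
performance(svm_prediction_object, measure = "auc")@y.values[[1]]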

We create a data frame from each performance object and extract the x (FPR), y (TPR), and alpha (threshold) values. This data is plotted with ggplot.

rf_perf_df <- data.frame(x = rf_perf@x.values[[1]],
                         y = rf_perf@y.values[[1]],
                         alpha = rf_perf@alpha.values[[1]])
colnames(rf_perf_df) <- c("x", "y", "alpha")
rf_perf_df$model <- "Random Forest"

svm_perf_df <- data.frame(x = svm_perf@x.values[[1]],
                          y = svm_perf@y.values[[1]],
                          alpha = svm_perf@alpha.values[[1]])
colnames(svm_perf_df) <- c("x", "y", "alpha")
svm_perf_df$model <- "SVM"

perf_combined <- rbind(rf_perf_df, svm_perf_df)


g <- ggplot(perf_combined, aes(x, y, color = model))
g <- g + geom_line(size = 1)
g <- g + theme_bw()
g <- g + xlab("False Positive Rate [-]")
g <- g + ylab("True Positive Rate [-]")
g <- g + ggtitle("ROC-Curve")
g <- g + geom_abline(intercept = 0)  # diagonal reference line (random guessing)
g <- g + theme(plot.margin = unit(c(0, 0, 0, 0), "cm"))
g

(Figure: ROC curves of the Random Forest and SVM models)

In our example the Random Forest provides a better model than the SVM. But how can the optimal threshold be found?

Optimal Threshold

The optimal threshold is where a 45 degree line touches the ROC curve at exactly one point, i.e. where the line is a tangent to the curve.
(Figure: ROC curve with the 45 degree tangent at the optimal threshold)
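Equivalently, the tangent point is where the vertical distance between the curve and the diagonal, TPR − FPR (Youden's J statistic), is maximal. A small sketch, not in the original post:

# find the point on the Random Forest ROC curve that maximises TPR - FPR
rf_perf_df[which.max(rf_perf_df$y - rf_perf_df$x), ]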

The optimal threshold can be read from the data frame rf_perf_df.

rf_perf_df %>% filter(x >= .1999 & x <= .2) %>% kable()
        x         y    alpha   model
0.1999089 0.8307937 1.237036   Random Forest
0.1999595 0.8307937 1.236964   Random Forest
0.1999595 0.8309524 1.236948   Random Forest
0.1999595 0.8311111 1.236685   Random Forest
0.1999595 0.8312698 1.236603   Random Forest

Confusion Matrix

A confusion matrix, or error matrix, is a table that represents the performance of a model at a specific threshold. Actual and predicted values are compared, and each cell has a name:

                    predicted_pos     predicted_neg
Condition Positive  True Positive     False Negative
Condition Negative  False Positive    True Negative

Many performance metrics can be calculated from it.

threshold <- 1.24
rf_pred_thres <- ifelse(rf_pred > threshold, 2, 1)
cm <- table(actual = adult$income[adult_test],
            predicted = rf_pred_thres)
cm
##       predicted
## actual     1     2
##      1 15855  3894
##      2  1081  5219

Here, accuracy is calculated. Accuracy is defined as the ratio of correct predictions (true positives and true negatives) to the total population. It can be calculated as the sum of the diagonal of the matrix divided by the total sum.

accuracy <- cm %>% diag() %>% sum() / cm %>% sum() * 100
accuracy
## [1] 80.90138
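Other metrics can be derived from cm in the same way; a small sketch, not in the original post:

# sensitivity (TPR): share of actual ">50K" (class 2) correctly identified
cm[2, 2] / sum(cm[2, ]) * 100
# specificity (TNR): share of actual "<=50K" (class 1) correctly identified
cm[1, 1] / sum(cm[1, ]) * 100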

Our model correctly predicts 80.9 % of the cases. That sounds good, but is it really good, or even bad? You can’t tell until you compare it to the naive estimator. The naive estimator looks at the class distribution of the data and simply always predicts the most frequent class.

Let’s take a look at our example. We need to calculate the class proportions in our test data.

sum(adult$income[adult_test] == 1) / length(adult$income[adult_test]) * 100
## [1] 75.81481

So 76 % of the persons earn "<=50K" and 24 % earn ">50K". A naive estimator that always predicts "<=50K" would therefore already be correct in about 76 % of the cases.

Our model increases the accuracy from 76 % (naive estimator) to nearly 81 %. Good for starters, but the model could be tuned with many parameters to produce better results.

Related Posts

In this article a simple split into training and test set is used. In the article on Cross Validation you can find a more sophisticated approach.

Bibliography

Adult Data http://archive.ics.uci.edu/ml/datasets/Adult

ROC Curve https://en.wikipedia.org/wiki/Receiver_operating_characteristic

Confusion Matrix https://en.wikipedia.org/wiki/Confusion_matrix
