Create a Random Forest and a Support Vector Machine Classifier and analyse the models with a ROC Curve

We will create a Random Forest and a Support Vector Machine model for income prediction. We will also learn what a Receiver Operating Characteristic (ROC) curve is and how to interpret it.

  • Objectives: Model Creation, Random Forest, SVM, Prediction, ROC-Curve
  • Requirements: R Basics, R Data Mining

Data

The Adult data set is provided by the "Center for Machine Learning and Intelligent Systems". It includes roughly 32,000 records with information on age, workclass, education, …, and finally the income. Income is either "<=50K" or ">50K".

Data Preparation

We need some packages:

  • rio for data import
  • randomForest to create the Random Forest model
  • e1071 to create the SVM model
  • ROCR to create the receiver operating characteristic curve
  • ggplot2 for visualisation
  • knitr and dplyr for nice tables (kable) and filtering

url defines where the file is located. It can be downloaded with download.file and imported with import.

The column names need to be set, and factor levels need to be created, because all columns are plain characters after import. The target variable income is a factor, but needs to be numeric for further analysis. After this step, income is either 1 (meaning "<=50K") or 2 (">50K"). After deleting the columns that are not required, data preparation is done.

# load libraries
suppressPackageStartupMessages(library(rio))  # data import
suppressPackageStartupMessages(library(randomForest))  # random forest
suppressPackageStartupMessages(library(e1071))  # svm
suppressPackageStartupMessages(library(ROCR))  # ROC
suppressPackageStartupMessages(library(ggplot2))  # visualisation
suppressPackageStartupMessages(library(knitr))  # for nice table (kable)
suppressPackageStartupMessages(library(dplyr))  # for filtering


url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
#download.file(url = url, destfile = "./data/adult.txt")

adult <- import(file = "./data/adult.txt")
colnames(adult) <- c("age", "workclass", "fnlwgt", "education", "education_num", "marital_status",
                     "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss",
                     "hours_per_week", "native_country", "income")

# chars to factors
adult$workclass <- as.factor(adult$workclass)
adult$education <- as.factor(adult$education)
adult$marital_status <- as.factor(adult$marital_status)
adult$occupation <- as.factor(adult$occupation)
adult$relationship <- as.factor(adult$relationship)
adult$race <- as.factor(adult$race)
adult$sex <- as.factor(adult$sex)
adult$native_country <- as.factor(adult$native_country)
adult$income <- as.numeric(as.factor(adult$income))

# delete columns
adult$fnlwgt <- NULL
adult$education <- NULL

Random-Forest Model Creation

Now the interesting part starts. First, we need to split the data into a training set and a test set. The rule of thumb is 80 % training, 20 % testing. Since training the model takes very long, we use 20 % for training and 80 % for testing. To get reproducible results I set the seed, so that the sample function draws the same rows each time.

set.seed(1000)
adult_train <- sample(x = 1:nrow(adult), size = .2 * nrow(adult))
adult_test <- setdiff(1:nrow(adult), adult_train)  # remaining data

The model will be stored in the variable rf_fit. For this, the function randomForest is called. It requires a formula, which defines the target variable and which independent variables influence it. Here, the target variable is income, and the dot . means that all other variables are used as independent variables. The data needs to be passed as well as the training subset.

After a while the model is created, and predictions are made based on the test subset.

rf_fit <- randomForest(income ~ ., data = adult, subset = adult_train)
rf_pred <- predict(object = rf_fit, newdata = adult[adult_test, ] )

Let’s take a look at the predictions.

(Figure: distribution of the predicted values)
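The original plotting chunk is not shown in the post; a minimal sketch that reproduces such a plot from the rf_pred vector created above is a simple histogram:

# sketch (not the original chunk): histogram of the continuous predictions
ggplot(data.frame(pred = rf_pred), aes(x = pred)) +
  geom_histogram(bins = 50) +
  xlab("Predicted value [-]") +
  theme_bw()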

The predictions are continuous and range from 1 to 2. To get a classification, a threshold has to be chosen. E.g. you could define a threshold at 1.5: each value below it is classified as 1 and each value above as 2, as shown in the snippet below.
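In R this is a one-liner (the same pattern is used for the confusion matrix later on):

# classify the continuous predictions with a threshold of 1.5
head(ifelse(rf_pred > 1.5, 2, 1))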

The classifier results need to be evaluated. First a prediction object is created, which transforms the input data into a standardised format. performance is then used to create the data for the ROC curve. Here, the true positive rate tpr and the false positive rate fpr are used.

rf_prediction_object <- prediction(rf_pred, adult$income[adult_test])
rf_perf <- performance(rf_prediction_object, "tpr", "fpr")

Support Vector Machine Model Creation

The process is very similar to the previous model. svm is the function that creates the corresponding model. The parameters are the same as before. predict is used in the same way; only the object is now the svm_fit model. The creation of the prediction object and the performance data is identical.

svm_fit <- svm(income ~ ., data = adult, subset = adult_train)
svm_pred <- predict(object = svm_fit, newdata = adult[adult_test, ])
svm_prediction_object <- prediction(svm_pred, adult$income[adult_test])
svm_perf <- performance(svm_prediction_object, "tpr", "fpr")

ROC-Curve

A ROC curve is a diagram that shows the performance of a classifier across all possible thresholds. In our example the true positive rate (TPR) is plotted against the false positive rate (FPR). The better model has the larger area under the curve (AUC); see the snippet below.
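ROCR can compute this area directly; a small addition (not in the original post), using the prediction objects created above:

# area under the curve for both models
performance(rf_prediction_object, measure = "auc")@y.values[[1]]
performance(svm_prediction_object, measure = "auc")@y.values[[1]]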

We create a data frame from each performance object and extract the x (FPR), y (TPR), and alpha (threshold) values. This data is plotted with ggplot.

rf_perf_df <- data.frame(x = rf_perf@x.values[[1]],
                         y = rf_perf@y.values[[1]],
                         alpha = rf_perf@alpha.values[[1]])
colnames(rf_perf_df) <- c("x", "y", "alpha")
rf_perf_df$model <- "Random Forest"

svm_perf_df <- data.frame(x = svm_perf@x.values[[1]],
                          y = svm_perf@y.values[[1]],
                          alpha = svm_perf@alpha.values[[1]])
colnames(svm_perf_df) <- c("x", "y", "alpha")
svm_perf_df$model <- "SVM"

perf_combined <- rbind(rf_perf_df, svm_perf_df)


g <- ggplot(perf_combined, aes(x, y, color = model))
g <- g + geom_line(size = 1)
g <- g + theme_bw()
g <- g + xlab("False Positive Rate [-]")
g <- g + ylab("True Positive Rate [-]")
g <- g + ggtitle("ROC-Curve")
g <- g + geom_abline(intercept = 0)  # diagonal reference line (random guessing)
g <- g + theme(plot.margin = unit(c(0, 0, 0, 0), "cm"))
g

(Figure: ROC curves of the Random Forest and SVM models)

In our example the Random Forest provides a better model than the SVM. But how can the optimal threshold be found?

Optimal Threshold

The optimal threshold is where a 45 degree line touches the ROC curve at exactly one point, i.e. where the line is a tangent to the curve.
(Figure: ROC curve with the 45 degree tangent at the optimal threshold)
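Equivalently, the tangent point is where the vertical distance between the curve and the diagonal, TPR − FPR (Youden's J statistic), is maximal. A small sketch, not in the original post:

# find the point on the Random Forest ROC curve that maximises TPR - FPR
rf_perf_df[which.max(rf_perf_df$y - rf_perf_df$x), ]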

The optimal threshold can be read from the data frame rf_perf_df.

rf_perf_df %>% filter(x >= .1999 & x <= .2) %>% kable()
        x         y    alpha   model
0.1999089 0.8307937 1.237036   Random Forest
0.1999595 0.8307937 1.236964   Random Forest
0.1999595 0.8309524 1.236948   Random Forest
0.1999595 0.8311111 1.236685   Random Forest
0.1999595 0.8312698 1.236603   Random Forest

Confusion Matrix

A confusion matrix, or error matrix, is a table that represents the performance of a model at a specific threshold. Actual and predicted values are compared, and each cell has a name:

                    predicted_pos     predicted_neg
Condition Positive  True Positive     False Negative
Condition Negative  False Positive    True Negative

Many performance metrics can be calculated from it.

threshold <- 1.24
rf_pred_thres <- ifelse(rf_pred > threshold, 2, 1)
cm <- table(actual = adult$income[adult_test],
            predicted = rf_pred_thres)
cm
##       predicted
## actual     1     2
##      1 15855  3894
##      2  1081  5219

Here, accuracy is calculated. Accuracy is defined as the ratio of correct predictions (true positives and true negatives) to the total population. It can be calculated as the sum of the diagonal of the matrix divided by the total sum.

accuracy <- cm %>% diag() %>% sum() / cm %>% sum() * 100
accuracy
## [1] 80.90138
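Other metrics can be derived from cm in the same way; a small sketch, not in the original post:

# sensitivity (TPR): share of actual ">50K" (class 2) correctly identified
cm[2, 2] / sum(cm[2, ]) * 100
# specificity (TNR): share of actual "<=50K" (class 1) correctly identified
cm[1, 1] / sum(cm[1, ]) * 100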

Our model correctly predicts 80.9 % of the cases. That sounds good, but is it really good, or even bad? You can’t tell until you compare it to the naive estimator. The naive estimator looks at the class distribution of the data and simply always predicts the most frequent class.

Let’s take a look at our example. We need to calculate the class proportions in our test data.

sum(adult$income[adult_test] == 1) / length(adult$income[adult_test]) * 100
## [1] 75.81481

So 76 % of the persons earn "<=50K" and 24 % earn ">50K". A naive estimator that always predicts "<=50K" would therefore already be correct in about 76 % of the cases.

Our model increases the accuracy from 76 % (naive estimator) to nearly 81 %. Good for starters, but the model could be tuned with many parameters to produce better results.

Related Posts

In this article a simple split into training and test set is used. In the article on Cross Validation you can find a more sophisticated approach.

Bibliography

Adult Data http://archive.ics.uci.edu/ml/datasets/Adult

ROC Curve https://en.wikipedia.org/wiki/Receiver_operating_characteristic

Confusion Matrix https://en.wikipedia.org/wiki/Confusion_matrix
