We will create a Random Forest and a Support Vector Machine model for income prediction. We will also learn what a Receiver Operating Characteristic (ROC) curve is and how to interpret it.
- Objectives: Model Creation, Random Forest, SVM, Prediction, ROC-Curve
- Requirements: R Basics, R Data Mining
Data
The Adult data set is provided by the “Center for Machine Learning and Intelligent Systems”. It includes roughly 32,000 records with information on age, workclass, education, …, and finally the income. Income is either “<=50K” or “>50K”.
Data Preparation
We need a couple of packages:
- rio for data import
- randomForest to create the Random Forest model
- e1071 for the Support Vector Machine
- ROCR to create the receiver operating characteristic (ROC) curve
- ggplot2 for visualisation
- knitr for nice tables (kable)
- dplyr for filtering
url defines the download location of the file. The file can be downloaded with download.file and imported with import.
The column names need to be set and factor levels need to be created, because the columns are plain characters after import. The target variable income is converted to a factor first, but needs to be numeric for the further analysis. After this, income is either 1 (meaning “<=50K”) or 2 (“>50K”). After finally deleting the columns that are not required, data preparation is done.
# load libraries
suppressPackageStartupMessages(library(rio)) # data import
suppressPackageStartupMessages(library(randomForest)) # random forest
suppressPackageStartupMessages(library(e1071)) # svm
suppressPackageStartupMessages(library(ROCR)) # ROC
suppressPackageStartupMessages(library(ggplot2)) # visualisation
suppressPackageStartupMessages(library(knitr)) # for nice table (kable)
suppressPackageStartupMessages(library(dplyr)) # for filtering
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
#download.file(url = url, destfile = "./data/adult.txt")
adult <- import(file = "./data/adult.txt")
colnames(adult) <- c("age", "workclass", "fnlwgt","education", "education_num", "marital_status",
"occupation", "relationship", "race", "sex", "capital_gain", "capital_loss",
"hours_per_week", "native_country", "income")
# chars to factors
adult$workclass <- as.factor(adult$workclass)
adult$education <- as.factor(adult$education)
adult$marital_status <- as.factor(adult$marital_status)
adult$occupation <- as.factor(adult$occupation)
adult$relationship <- as.factor(adult$relationship)
adult$race <- as.factor(adult$race)
adult$sex <- as.factor(adult$sex)
adult$native_country <- as.factor(adult$native_country)
adult$income <- as.numeric(as.factor(adult$income))
# delete columns
adult$fnlwgt <- NULL
adult$education <- NULL
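A quick check that the preparation worked as intended; a minimal sketch (the exact counts depend on the downloaded file):
str(adult$income)    # should now be numeric, with values 1 and 2
table(adult$income)  # class distribution, roughly 76 % vs. 24 %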
Random-Forest Model Creation
Now the interesting part starts. First, we need to split the data into a training set and a testing set. The rule of thumb is 80 % training and 20 % testing. Since training the models takes extremely long, we use 20 % for training and 80 % for testing instead. To get reproducible results I set the seed, so that the sample function selects the same rows each time.
set.seed(1000)
adult_train <- sample(x = 1:nrow(adult), size = .2 * nrow(adult))
adult_test <- setdiff(1:nrow(adult), adult_train) # remaining data
The model will be stored in the variable rf_fit. For this, the function randomForest is called. It requires a formula, which defines the target variable and the independent variables that influence it. Here, the target variable is income, and the dot . stands for all other variables as independent variables. The data needs to be passed, as well as the training subset.
After a while the model is created, and predictions are made on the test subset.
rf_fit <- randomForest(income ~ ., data = adult, subset = adult_train)
rf_pred <- predict(object = rf_fit, newdata = adult[adult_test, ] )
Let’s take a look at the predictions.
The predictions are continuous and range from 1 to 2. A threshold is needed to turn them into a classification. E.g. you could define a threshold at 1.5: each value below it is classified as 1 and each value above as 2.
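For illustration, a minimal sketch of such a manual classification; the threshold of 1.5 is an arbitrary choice, not yet the optimal one:
summary(rf_pred)                   # continuous predictions between 1 and 2
head(ifelse(rf_pred > 1.5, 2, 1))  # classify with an arbitrary threshold of 1.5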
The classifier results need to be evaluated. First a prediction object is created; this standardises the input data into the format ROCR expects. Then performance is used to create the data for the ROC curve. Here, the true positive rate tpr and the false positive rate fpr are used.
rf_prediction_object <- prediction(rf_pred, adult$income[adult_test])
rf_perf <- performance(rf_prediction_object, "tpr", "fpr")
Support Vector Machine Model Creation
The process is very similar to the previous model. svm is the function that creates the corresponding model; the parameters are the same as before. predict is called the same way, only the object is now the svm_fit model. The creation of the prediction object and the performance data is identical.
svm_fit <- svm(income ~ ., data = adult, subset = adult_train)
svm_pred <- predict(object = svm_fit, newdata = adult[adult_test, ])
svm_prediction_object <- prediction(svm_pred, adult$income[adult_test])
svm_perf <- performance(svm_prediction_object, "tpr", "fpr")
ROC-Curve
An ROC curve is a diagram that shows the performance of a classifier across all possible thresholds. In our example the true positive rate (TPR) is plotted against the false positive rate (FPR). The better the model, the larger the area under the curve (AUC).
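Before plotting, the AUC can also be computed numerically with ROCR’s “auc” measure; a short sketch using the prediction objects from above:
# AUC: 1.0 is a perfect classifier, 0.5 corresponds to random guessing
rf_auc <- performance(rf_prediction_object, measure = "auc")@y.values[[1]]
svm_auc <- performance(svm_prediction_object, measure = "auc")@y.values[[1]]
c(rf = rf_auc, svm = svm_auc)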
To plot the curves, we create a data frame from each performance object and extract the x (FPR), y (TPR), and alpha (threshold) values. This data is plotted with ggplot.
rf_perf_df <- data.frame(x = rf_perf@x.values,
y = rf_perf@y.values,
alpha = rf_perf@alpha.values)
colnames(rf_perf_df) <- c("x", "y", "alpha")
rf_perf_df$model <- "Random Forest"
svm_perf_df <- data.frame(x = svm_perf@x.values,
y = svm_perf@y.values,
alpha = svm_perf@alpha.values)
colnames(svm_perf_df) <- c("x", "y", "alpha")
svm_perf_df$model <- "SVM"
perf_combined <- rbind(rf_perf_df, svm_perf_df)
g <- ggplot(perf_combined, aes(x, y, color = model))
g <- g + geom_line(size = 1)
g <- g + theme_bw()
g <- g + xlab("False Positive Rate [-]")
g <- g + ylab("True Positive Rate [-]")
g <- g + ggtitle("ROC-Curve")
g <- g + geom_abline(intercept = 0)
g <- g + theme(plot.margin=unit(c(0,0,0,0),"cm"))
g
In our example the Random Forest provides a better model than the SVM. But how can the optimal threshold be defined?
Optimal Threshold
The optimal threshold is where a 45 degree line touches the ROC curve in exactly one point (a tangent). At this point a further gain in true positive rate would cost at least an equal increase in false positive rate. In the plot above this happens at an FPR of roughly 0.2, so the optimal threshold can be read from the data frame rf_perf_df.
rf_perf_df %>% filter(x >= .1999 & x <= .2) %>% kable()
x | y | alpha | model |
---|---|---|---|
0.1999089 | 0.8307937 | 1.237036 | Random Forest |
0.1999595 | 0.8307937 | 1.236964 | Random Forest |
0.1999595 | 0.8309524 | 1.236948 | Random Forest |
0.1999595 | 0.8311111 | 1.236685 | Random Forest |
0.1999595 | 0.8312698 | 1.236603 | Random Forest |
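The same point can also be found programmatically: the tangent condition corresponds to the threshold where TPR − FPR is maximal, known as Youden’s J statistic. A short sketch on the data frame built above:
j_best <- which.max(rf_perf_df$y - rf_perf_df$x)  # index of maximal TPR - FPR
rf_perf_df[j_best, ]                              # optimal point and threshold (alpha)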
Confusion Matrix
A confusion matrix, or error matrix, is a table that represents the performance of a model at a specific threshold. Actual and predicted values are tabulated against each other. Each cell has a name:
 | predicted_pos | predicted_neg |
---|---|---|
Condition Positive | True Positive | False Negative |
Condition Negative | False Positive | True Negative |
From this table, many performance metrics can be calculated.
threshold <- 1.24
rf_pred_thres <- ifelse(rf_pred > threshold, 2, 1)
cm <- table (actual = adult$income[adult_test],
predicted = rf_pred_thres)
cm
##       predicted
## actual     1     2
##      1 15855  3894
##      2  1081  5219
Here, accuracy is calculated. Accuracy is defined as the ratio of correct predictions (positive and negative) to the total population: accuracy = (TP + TN) / (TP + FP + FN + TN). It can be calculated as the sum of the diagonal of the matrix divided by the total sum.
accuracy <- (cm %>% diag() %>% sum()) / (cm %>% sum()) * 100
accuracy
## [1] 80.90138
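Other metrics named in the confusion-matrix table above follow directly from cm in the same way; a minimal sketch, treating class 2 (“>50K”) as the positive class:
tpr <- cm[2, 2] / sum(cm[2, ])  # sensitivity: true positives / all actual positives
fpr <- cm[1, 2] / sum(cm[1, ])  # fall-out: false positives / all actual negatives
c(tpr = tpr, fpr = fpr)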
Back to accuracy: our model correctly predicts 80.9 % of the test cases. That sounds good, but is it really good, or even bad? You can’t tell until you compare it to a naive estimator, i.e. one that ignores all features and only uses the class distribution of the data.
Let’s take a look at our example. We need to calculate the share of the classes in our test data.
sum(adult$income[adult_test] == 1) / length(adult$income[adult_test]) * 100
## [1] 75.81481
So 76 % of the persons earn “<=50K” and 24 % earn “>50K”. A naive estimator that simply predicts the majority class for every person would therefore already be correct about 76 % of the time.
Our model increases the accuracy from these 76 % to nearly 81 %. Good for starters, and the model could be tuned with many parameters to produce better results.
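As a starting point for such tuning, the randomForest package ships with tuneRF, which searches for an mtry value with lower out-of-bag error. A minimal sketch; the ntreeTry, stepFactor, and improve values are arbitrary examples, not recommendations:
train_data <- adult[adult_train, ]
tuned <- tuneRF(x = train_data[, setdiff(names(train_data), "income")],
                y = train_data$income,
                ntreeTry = 100, stepFactor = 1.5, improve = 0.01)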
Related Posts
In this article a simple split into training and testing data is used. In the article on Cross Validation you can find a more sophisticated approach.
Bibliography
Adult Data http://archive.ics.uci.edu/ml/datasets/Adult
ROC Curve https://en.wikipedia.org/wiki/Receiver_operating_characteristic
Confusion Matrix https://en.wikipedia.org/wiki/Confusion_matrix