Cross Validation

In this tutorial you will learn what the bias-variance tradeoff is and how cross validation helps you find a good tradeoff.

Bias-Variance Tradeoff

In machine learning, two sources of error have to be kept small at the same time: bias and variance.

  • Bias: an error caused by overly simplistic or erroneous assumptions in the learning algorithm. High bias can cause underfitting: the model misses relevant relations between the predictors and the target.
  • Variance: an error stemming from sensitivity to small fluctuations in the training data. High variance can cause overfitting: the noise in the training data is modeled rather than the underlying relationship.

In practice, reducing one of these errors tends to increase the other, so a reasonable tradeoff is required.
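The following minimal sketch illustrates the tradeoff on simulated data. It uses only base R; the sine curve, the noise level, and the polynomial degrees are illustrative assumptions, not part of the wine example below. Training error keeps falling as the model gets more flexible, while test error typically follows a U-shape: it first falls (less bias) and then rises again (more variance).

# Bias-variance illustration on simulated data (base R only)
set.seed(42)
x_train <- runif(50);  y_train <- sin(2 * pi * x_train) + rnorm(50, sd = 0.3)
x_test  <- runif(500); y_test  <- sin(2 * pi * x_test)  + rnorm(500, sd = 0.3)

for (degree in c(1, 3, 9, 15)) {
  fit       <- lm(y_train ~ poly(x_train, degree))
  train_mse <- mean(residuals(fit)^2)
  test_mse  <- mean((y_test - predict(fit, data.frame(x_train = x_test)))^2)
  cat(sprintf("degree %2d: train MSE = %.3f, test MSE = %.3f\n",
              degree, train_mse, test_mse))
}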

Resampling Methods

There are several resampling methods. In this post, the hold-out method, k-fold cross validation, and leave-one-out cross validation are presented.

Holdout Method

The easiest approach is to split the data at a fixed ratio, e.g. 80 % for training and the remaining 20 % for testing. This has two drawbacks: for small datasets it may not be affordable to hold back a subset just for validation, and the validation error can depend strongly on which observations end up in which split.
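A minimal sketch of such a split, using the built-in iris data as a stand-in:

# Hold-out split: 80 % training, 20 % testing
set.seed(123)
train_idx  <- sample(seq_len(nrow(iris)), size = floor(0.8 * nrow(iris)))
iris_train <- iris[train_idx, ]   # 80 % of the rows for training
iris_test  <- iris[-train_idx, ]  # remaining 20 % for testing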

Some methods have been developed to overcome these limitations.

k-Fold Cross Validation (CV)

This method is a generalisation of the hold-out method. The data is randomly split into k folds, typically 10. Let's assume 10 folds for now. Folds 2 to 10 are used for training the model, and the remaining first fold for validating it. Then the process is repeated, this time with the second fold held out for validation and folds 1 and 3 to 10 used for training. In total the process is repeated k times, so that every fold serves as the validation set exactly once.

The final performance estimate is the average over the k validation results.
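The fold structure can be made explicit with caret's createFolds(), here sketched again on iris as a stand-in:

# Create 10 validation index sets, stratified by the target
library(caret)
set.seed(123)
folds <- createFolds(iris$Species, k = 10)
lengths(folds)  # each fold holds roughly nrow(iris) / 10 observations
# In round i, iris[folds[[i]], ] validates the model
# that was trained on iris[-folds[[i]], ].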

Leave-One-Out Cross Validation (LOOCV)

This method uses n-1 observations for training and the single remaining observation for validation. The process is repeated n times, once per observation, so it is computationally very costly. In addition, because the n training sets are almost identical, the resulting error estimate can have a high variance.
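A manual LOOCV loop is easy to write in base R; the sketch below estimates the test MSE of a simple linear model on the built-in mtcars data:

# Manual LOOCV: each observation is held out exactly once
n <- nrow(mtcars)
sq_errors <- sapply(seq_len(n), function(i) {
  fit  <- lm(mpg ~ wt, data = mtcars[-i, ])                # train on n - 1 rows
  pred <- predict(fit, newdata = mtcars[i, , drop = FALSE])
  (mtcars$mpg[i] - pred)^2                                 # error on the held-out row
})
mean(sq_errors)  # LOOCV estimate of the test MSE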

Example: Wine Classification

We will apply these methods to a classification example. 178 wines were chemically analyzed. All were grown in the same Italian region but derived from three different cultivars, and 13 chemical attributes are available for each wine. We will use the random forest algorithm for the model.

First, load the required packages, caret and randomForest. The data is provided at the URL below by the UCI Machine Learning Repository and is imported with read.csv().

Data Preparation

# Load Packages
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(randomForest))

# Load Data
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
wine <- read.csv(url, header = FALSE)  # the file contains no header row

Column names are not part of the file, so they have to be set manually.

colnames(wine) <- c("class", "alcohol", "malic_acid", "ash", "alcalinity_ash", "magnesium",
                    "total_phenols", "flavanoids", "nonflavanoid_phenols", "proanthocyanins",
                    "color_intensity", "hue", "OD280_OD315", "proline")

Model with Cross Validation

trainControl() defines the resampling parameters for the train() function, which fits the model. First, 10-fold CV is set up. train() takes a formula naming the target variable (here: class) and the predictor variables (here: ., which means all other variables). The data (wine), the method ("rf" for random forest), and the trControl object are passed as further parameters.

train_control <- trainControl(method="cv", number=10)
model <- train(class ~ ., data = wine, 
           trControl=train_control, 
           method="rf")
model
## Random Forest 
## 
## 177 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 159, 159, 159, 160, 160, 159, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    2    0.1759535  0.9544400  0.12799178
##    7    0.1555891  0.9527028  0.08930017
##   13    0.1636014  0.9448971  0.08436650
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 7.

The model output shows caret's default tuning grid. For random forest, the parameter mtry is varied and its impact on RMSE, Rsquared, and MAE is shown. mtry defines the number of variables randomly sampled as split candidates at each node. mtry = 2 has the highest Rsquared, but mtry = 7 the lowest RMSE, so the two criteria point to different models. caret selects by RMSE and therefore concludes that mtry = 7 is optimal.

In general, both candidates provide very good predictions.
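
One caveat: because class was imported as a numeric column, caret treats this as a regression problem, which is why RMSE, Rsquared, and MAE are reported rather than classification metrics. To handle it as a genuine classification task, convert the target to a factor; train() then reports Accuracy and Kappa instead. A sketch (not used for the outputs shown in this post):

# Classification variant: a factor target switches the metrics
# from RMSE/Rsquared/MAE to Accuracy/Kappa
wine$class <- as.factor(wine$class)
model_clf  <- train(class ~ ., data = wine,
                    trControl = train_control,
                    method = "rf")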

Model with LOOCV

If LOOCV is to be used instead, the code changes only marginally: the method parameter in trainControl() is set to "LOOCV".

train_control <- trainControl(method="LOOCV")
model <- train(class ~ ., data = wine, 
           trControl=train_control, 
           method="rf")
model
## Random Forest 
## 
## 177 samples
##  13 predictor
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 176, 176, 176, 176, 176, 176, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE       
##    2    0.1806537  0.9551362  0.12518701
##    7    0.1720328  0.9526184  0.08821337
##   13    0.1849784  0.9436547  0.08422712
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 7.

Here you can feel the computational cost: the calculation takes noticeably longer than before. The resulting RMSE and Rsquared values are very comparable to the 10-fold CV results.

Summary

You learned about different resampling methods: the hold-out method, k-fold cross validation, and leave-one-out cross validation. Their purpose, advantages, and disadvantages were presented, and their application was demonstrated on the wine example.
