Train / Validation / Test Split

This interactive dashboard will help you to understand train / validation / test splits.

You can set the number of data points between 10 and 1000. By default, the training ratio is set to 60 %, which leaves 40 % for validation and testing. With the second slider you can set the validation ratio. The test ratio is set implicitly, because it is the remainder up to 100 %. To get the same results every time, you can specify the seed. This way you will see exactly the same split as I show in the lecture.

Selection Type

The selection type can be either linear or random. Linear is easier to understand: you take the first part of the data, sized according to the training ratio, and use it for training. The next partition is used for validation, and the last one for testing.
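The linear selection can be sketched in a few lines of Python. This is a minimal illustration, not the dashboard's actual code; the function name `linear_split` and its parameters are assumptions made for this example.

```python
def linear_split(data, train_ratio=0.6, val_ratio=0.2):
    """Split data in its existing order (hypothetical helper, not the dashboard's code)."""
    n = len(data)
    n_train = round(n * train_ratio)  # size of the first partition
    n_val = round(n * val_ratio)      # size of the second partition
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]     # the test set is whatever remains
    return train, val, test


train, val, test = linear_split(list(range(10)))
print(train)  # [0, 1, 2, 3, 4, 5]
print(val)    # [6, 7]
print(test)   # [8, 9]
```

Note that only the training and validation ratios are passed in; the test partition is the remainder, just as with the sliders.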

The problem with this approach is that it relies on an assumption: that the samples appear in random order in the dataset. Imagine you are developing a binary classifier, and the first half of the data is of class A and the second half of class B. Your training data might then contain only class A, so the classifier would be useless. For this reason, it is much better to select data randomly from the dataset. If your training ratio is 60 %, then 60 % of the data is randomly sampled from the dataset. The validation and testing data are sampled from the remaining data.
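The following sketch applies both selection types to a dataset that is sorted by class. It assumes a 50 % training ratio so that the linear selection ends up with class A only. The name `random_split` and the seed value 42 are illustrative choices, and shuffling once and then slicing is used here as an equivalent to sampling each partition in turn; the dashboard itself may be implemented differently.

```python
import random


def random_split(data, train_ratio=0.6, val_ratio=0.2, seed=42):
    """Shuffle a copy of the data with a fixed seed, then slice it (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    shuffled = list(data)      # copy, so the original order is left untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * train_ratio)
    n_val = round(n * val_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Sorted dataset: first half class A, second half class B.
labels = ["A"] * 50 + ["B"] * 50

# Linear selection with a 50 % training ratio sees only class A.
linear_train = labels[:50]
print(sorted(set(linear_train)))  # ['A']

# Random selection with the same ratio draws from both classes.
random_train, random_val, random_test = random_split(
    labels, train_ratio=0.5, val_ratio=0.25
)
print(sorted(set(random_train)))
```

Calling `random_split` twice with the same seed returns the same partitions, which is what the seed field in the dashboard is for.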

Split Ratio

How should you set the split ratio? That depends on the total number of samples and on the model itself. Some models require more training data than others. In general, the validation data should be big enough to detect differences between the models. Models with few hyperparameters will be easier to validate and thus need a smaller validation dataset. Models with many hyperparameters will be harder to validate and thus require a larger validation dataset. Let’s take an example. You have 1,000 samples in total. You want 200 samples each for validation and testing, so you reserve 20 % for validation and 20 % for testing. But if your dataset grows to 10,000 samples, you can reduce the ratio for validation and testing to 2 % each and still keep 200 samples for each.
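The arithmetic behind this example is easy to check. The sketch below assumes you start from a fixed number of validation and test samples and derive the ratios from it; the function name `ratios_for_fixed_holdout` is made up for this illustration.

```python
def ratios_for_fixed_holdout(n_total, n_val, n_test):
    """Return (train, validation, test) ratios for fixed hold-out counts (hypothetical helper)."""
    n_train = n_total - n_val - n_test  # everything not held out is used for training
    return n_train / n_total, n_val / n_total, n_test / n_total


for n_total in (1_000, 10_000):
    train_r, val_r, test_r = ratios_for_fixed_holdout(n_total, n_val=200, n_test=200)
    print(f"{n_total} samples: train {train_r:.0%}, validation {val_r:.0%}, test {test_r:.0%}")

# 1000 samples: train 60%, validation 20%, test 20%
# 10000 samples: train 96%, validation 2%, test 2%
```

With ten times as much data, the same 200 validation and 200 test samples leave 96 % of the data for training instead of 60 %.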

Go and find out for yourself by modifying the parameters.