We will learn to detect outliers in univariate distributions, using the “iris” dataset as an example.
- Learning Objective: Outlier Detection
- Requirements: ggplot2, plyr, dplyr, knitr (for kable())
Introduction
We will use the iris data set, so we first load it with data().
data(iris)
We want to detect outliers for one specific variable (here: “Petal.Length”). Let’s visualise the data with ggplot(). If you are not familiar with ggplot(), I recommend taking a look at this tutorial…
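The plotting code is not shown in this text. A minimal sketch of how such a boxplot could be produced follows. It assumes the default settings of geom_boxplot(), which draws points more than 1.5 times IQR beyond the box as dots. The names iris_data and p are only illustrative.

```r
library(ggplot2)

# Work on a copy of the built-in data set (illustrative name)
iris_data <- datasets::iris

# One boxplot of Petal.Length per Species; points beyond
# 1.5 * IQR from the box are drawn as dots by default
p <- ggplot(iris_data, aes(x = Species, y = Petal.Length)) +
  geom_boxplot()
p
```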
The graph shows a boxplot for each group (here: “Species”). A boxplot is a helpful visualisation for univariate distributions. If you need a refresher on how to read it, you will find a link at the end of this article.
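The dots represent points that lie more than 1.5 times the IQR (interquartile range) beyond the box. Two thresholds are commonly used:
- Outliers are defined as values that lie more than 3 times the IQR above the third quartile (Q3) or more than 3 times the IQR below the first quartile (Q1).
- Suspected outliers lie more than 1.5 times the IQR above the third quartile or more than 1.5 times the IQR below the first quartile.
Other sources use the names mild (1.5 IQR) and extreme (3 IQR) outliers. The calculation in this article uses the 1.5 IQR limits and, for brevity, labels the points beyond them simply as “outlier”.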
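To make the two thresholds concrete, here is a small sketch on made-up numbers. The helper classify_outliers() and the vector x are hypothetical and only illustrate the definitions above. Quartiles are computed with quantile() and its default settings.

```r
# Hypothetical helper: label each value according to the
# 1.5 * IQR (suspected) and 3 * IQR (outlier) rules
classify_outliers <- function(x) {
  q1 <- unname(quantile(x, probs = 0.25))
  q3 <- unname(quantile(x, probs = 0.75))
  iqr <- q3 - q1
  extreme <- x < q1 - 3 * iqr | x > q3 + 3 * iqr
  suspected <- x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr
  ifelse(extreme, "outlier",
         ifelse(suspected, "suspected outlier", "no outlier"))
}

# Made-up values: Q1 = 13, Q3 = 18, IQR = 5, so the inner limits
# are 5.5 and 25.5, and the outer limits are -2 and 33
x <- c(10, 12, 13, 14, 15, 16, 18, 30, 60)
classify_outliers(x)
```

With these numbers, 30 falls between the inner and outer upper limits (suspected outlier), while 60 lies beyond the outer limit (outlier).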
Outlier Calculation
The calculation is straightforward.
- We calculate the first quartile (Q1) and the third quartile (Q3).
- With these we can calculate the IQR (= Q3 – Q1).
- Now we calculate the upper and lower limits according to the definitions above.
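I use the ddply() function from the plyr package. Since piping is used, the dplyr package is loaded as well. Note that plyr is loaded before dplyr, so that dplyr's functions are not masked by plyr.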
library(plyr)
suppressPackageStartupMessages(library(dplyr))
# Compute quartiles, IQR and the 1.5 * IQR limits per species
outlier_limits <- iris %>%
  ddply(.(Species), summarise,
        Q1 = quantile(Petal.Length, probs = 0.25),
        Q3 = quantile(Petal.Length, probs = 0.75),
        IQR = Q3 - Q1,
        upper_inner_limit = Q3 + 1.5 * IQR,
        lower_inner_limit = Q1 - 1.5 * IQR)
outlier_limits
##      Species  Q1    Q3   IQR upper_inner_limit lower_inner_limit
## 1     setosa 1.4 1.575 0.175            1.8375            1.1375
## 2 versicolor 4.0 4.600 0.600            5.5000            3.1000
## 3  virginica 5.1 5.875 0.775            7.0375            3.9375
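As an aside, the same limits can also be computed with dplyr alone. This is only a sketch of an alternative and not the approach used in this article. The names iris_data and outlier_limits_dplyr are illustrative, and summarise() is called with an explicit dplyr:: prefix in case plyr is attached as well.

```r
library(dplyr)

# Work on a copy of the built-in data set (illustrative name)
iris_data <- datasets::iris

# Same quartiles and 1.5 * IQR limits, grouped with group_by()
outlier_limits_dplyr <- iris_data %>%
  group_by(Species) %>%
  dplyr::summarise(
    Q1 = unname(quantile(Petal.Length, probs = 0.25)),
    Q3 = unname(quantile(Petal.Length, probs = 0.75)),
    IQR = Q3 - Q1,
    upper_inner_limit = Q3 + 1.5 * IQR,
    lower_inner_limit = Q1 - 1.5 * IQR
  )
outlier_limits_dplyr
```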
This information is now applied to our dataset. We first join it with “iris”, and then perform some simple comparisons to find the outliers.
For the join we use left_join() from the dplyr package. For a nicer view, we remove the variables that are no longer needed with select(), and take a look at the head of this dataset.
If you need a refresher on the dplyr package, I warmly recommend tutorial … to you.
library(knitr)  # provides kable()
iris <- left_join(iris, outlier_limits, by = "Species") %>%
  select(-Q1, -Q3, -IQR)
iris %>% head() %>% kable()
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | upper_inner_limit | lower_inner_limit |
---|---|---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa | 1.8375 | 1.1375 |
4.9 | 3.0 | 1.4 | 0.2 | setosa | 1.8375 | 1.1375 |
4.7 | 3.2 | 1.3 | 0.2 | setosa | 1.8375 | 1.1375 |
4.6 | 3.1 | 1.5 | 0.2 | setosa | 1.8375 | 1.1375 |
5.0 | 3.6 | 1.4 | 0.2 | setosa | 1.8375 | 1.1375 |
5.4 | 3.9 | 1.7 | 0.4 | setosa | 1.8375 | 1.1375 |
We are nearly done. Now we create a new variable “Petal.Length.Outlier”, which takes the value “no outlier” if Petal.Length lies strictly between the lower and upper limits, and “outlier” otherwise.
iris$Petal.Length.Outlier <- ifelse(iris$Petal.Length > iris$lower_inner_limit &
iris$Petal.Length < iris$upper_inner_limit, "no outlier", "outlier")
Let’s take a look at our outliers.
iris %>% filter(Petal.Length.Outlier == "outlier") %>% kable()
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | upper_inner_limit | lower_inner_limit | Petal.Length.Outlier |
---|---|---|---|---|---|---|---|
4.3 | 3.0 | 1.1 | 0.1 | setosa | 1.8375 | 1.1375 | outlier |
4.6 | 3.6 | 1.0 | 0.2 | setosa | 1.8375 | 1.1375 | outlier |
4.8 | 3.4 | 1.9 | 0.2 | setosa | 1.8375 | 1.1375 | outlier |
5.1 | 3.8 | 1.9 | 0.4 | setosa | 1.8375 | 1.1375 | outlier |
5.1 | 2.5 | 3.0 | 1.1 | versicolor | 5.5000 | 3.1000 | outlier |
You can see that there are four outliers for species setosa and one outlier for versicolor. These are the same points that we have seen in our ggplot boxplot.
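As a quick cross-check of these counts, the flags can be tallied per species. The sketch below recomputes the limits in base R with ave() on a fresh copy of the data, so it does not depend on the objects created above. The names iris_data, q1, q3, iqr, flag and outlier_counts are illustrative.

```r
# Fresh copy of the built-in data set (illustrative name)
iris_data <- datasets::iris
x <- iris_data$Petal.Length

# Per-species quartiles, repeated for every row of the group
q1 <- ave(x, iris_data$Species, FUN = function(v) quantile(v, probs = 0.25))
q3 <- ave(x, iris_data$Species, FUN = function(v) quantile(v, probs = 0.75))
iqr <- q3 - q1

# Same comparison as above: strictly inside the 1.5 * IQR limits
flag <- ifelse(x > q1 - 1.5 * iqr & x < q3 + 1.5 * iqr,
               "no outlier", "outlier")

# Count flags per species
outlier_counts <- table(iris_data$Species, flag)
outlier_counts
```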
It is up to you to perform the calculation for extreme outliers (3 times IQR).
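If you want a starting point, one possible sketch is shown below. It uses dplyr only and works on a fresh copy of the data. The names iris_data, extreme_outliers, upper_outer_limit and lower_outer_limit are my own choices for illustration and do not appear elsewhere in this article.

```r
library(dplyr)

# Fresh copy of the built-in data set (illustrative name)
iris_data <- datasets::iris

extreme_outliers <- iris_data %>%
  group_by(Species) %>%
  # Per-species quartiles and 3 * IQR limits, added to every row
  dplyr::mutate(
    Q1 = unname(quantile(Petal.Length, probs = 0.25)),
    Q3 = unname(quantile(Petal.Length, probs = 0.75)),
    upper_outer_limit = Q3 + 3 * (Q3 - Q1),
    lower_outer_limit = Q1 - 3 * (Q3 - Q1)
  ) %>%
  ungroup() %>%
  # Keep only rows beyond the outer limits
  dplyr::filter(Petal.Length < lower_outer_limit |
                  Petal.Length > upper_outer_limit)

# Number of extreme outliers found
nrow(extreme_outliers)
```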