Outlier Detection – Gollnick Data

We will learn to detect outliers for univariate distributions. We will analyse the “iris” dataset for this purpose.

Learning Objective: Outlier Detection
Requirements: ggplot2, plyr, dplyr, knitr (for kable())

Introduction

We will use the iris data set, so we first load it with data().

data(iris)

We want to detect outliers for one specific variable (here: “Petal.Length”). Let’s visualise the data with ggplot(). If you are not familiar with ggplot(), I recommend taking a look at this tutorial…
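The plotting code itself is not shown here, so the following is only a minimal sketch of how such a boxplot could be drawn. It assumes a plain geom_boxplot() with default settings; the axis labels are illustrative choices.

```r
library(ggplot2)

data(iris)

# One box per species; points beyond the whiskers are drawn as dots
p <- ggplot(iris, aes(x = Species, y = Petal.Length)) +
    geom_boxplot() +
    labs(x = "Species", y = "Petal.Length")
print(p)
```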

The graph shows a boxplot for each grouping (here: “Species”). A boxplot is a helpful visualisation for univariate distributions. If you need a refresher on how to read it, you will find a link at the end of this article.

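The dots represent outliers: points that lie more than 1.5 times the IQR beyond the box, which is how a ggplot boxplot draws them by default.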

Outliers are defined as values that lie more than 3 times the IQR (interquartile range) above the third quartile or more than 3 times the IQR below the first quartile.
Suspected outliers lie more than 1.5 times the IQR above the third quartile or more than 1.5 times the IQR below the first quartile.

Other sources call these mild (1.5 IQR) and extreme (3 IQR) outliers.

Outlier Calculation

The calculation is straightforward.

We calculate the first quartile (Q1) and the third quartile (Q3).
With these we can calculate the IQR (= Q3 – Q1).
Now we calculate the upper and lower limits according to the definition above. Here we use the 1.5 times IQR limits (the inner limits).
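To make the three steps concrete, here is a small base R sketch for a single species (setosa). The variable names are my own; the resulting values should match the setosa row of the per-species table further down.

```r
# Petal lengths of a single species, used as a worked example
x <- datasets::iris$Petal.Length[datasets::iris$Species == "setosa"]

# Step 1: first and third quartile
q1 <- unname(quantile(x, probs = 0.25))
q3 <- unname(quantile(x, probs = 0.75))

# Step 2: interquartile range
iqr <- q3 - q1

# Step 3: inner limits (1.5 times IQR beyond the quartiles)
upper_inner_limit <- q3 + 1.5 * iqr
lower_inner_limit <- q1 - 1.5 * iqr

c(Q1 = q1, Q3 = q3, IQR = iqr,
  upper = upper_inner_limit, lower = lower_inner_limit)
```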

I use the ddply() function from the plyr package. Since piping is used, the dplyr package is loaded as well.

# Load plyr before dplyr so that dplyr's functions are not masked
library(plyr)
suppressPackageStartupMessages(library(dplyr))

# Compute quartiles, IQR and inner limits separately for each species
outlier_limits <- iris %>%
    ddply(.(Species), summarise,
          Q1 = quantile(Petal.Length, probs = 0.25),
          Q3 = quantile(Petal.Length, probs = 0.75),
          IQR = Q3 - Q1,
          upper_inner_limit = Q3 + 1.5 * IQR,
          lower_inner_limit = Q1 - 1.5 * IQR)
outlier_limits

##      Species  Q1    Q3   IQR upper_inner_limit lower_inner_limit
## 1     setosa 1.4 1.575 0.175            1.8375            1.1375
## 2 versicolor 4.0 4.600 0.600            5.5000            3.1000
## 3  virginica 5.1 5.875 0.775            7.0375            3.9375

This information is now applied to our dataset. We first join it with “iris” and then perform some simple comparisons to find outliers.

For the join we use left_join() from the dplyr package. For a nicer view we remove variables that are no longer needed with select(), and take a look at the head of this dataset.

If you need a refresher on the dplyr package, I warmly recommend tutorial … to you.

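# Attach the per-species limits to every row, then drop the helper columns
iris <- left_join(iris, outlier_limits, by = "Species") %>%
    select(-Q1, -Q3, -IQR)
# kable() comes from the knitr package
iris %>% head() %>% knitr::kable()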

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species	upper_inner_limit	lower_inner_limit
5.1	3.5	1.4	0.2	setosa	1.8375	1.1375
4.9	3.0	1.4	0.2	setosa	1.8375	1.1375
4.7	3.2	1.3	0.2	setosa	1.8375	1.1375
4.6	3.1	1.5	0.2	setosa	1.8375	1.1375
5.0	3.6	1.4	0.2	setosa	1.8375	1.1375
5.4	3.9	1.7	0.4	setosa	1.8375	1.1375

We are nearly done. Now we create a new variable “Petal.Length.Outlier”, which has the value “no outlier” if the petal length lies between the lower and upper limit, and “outlier” otherwise.

iris$Petal.Length.Outlier <- ifelse(iris$Petal.Length > iris$lower_inner_limit & 
    iris$Petal.Length < iris$upper_inner_limit, "no outlier", "outlier")

Let’s take a look at our outliers.

iris %>% filter(Petal.Length.Outlier == "outlier") %>% knitr::kable()

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species	upper_inner_limit	lower_inner_limit	Petal.Length.Outlier
4.3	3.0	1.1	0.1	setosa	1.8375	1.1375	outlier
4.6	3.6	1.0	0.2	setosa	1.8375	1.1375	outlier
4.8	3.4	1.9	0.2	setosa	1.8375	1.1375	outlier
5.1	3.8	1.9	0.4	setosa	1.8375	1.1375	outlier
5.1	2.5	3.0	1.1	versicolor	5.5000	3.1000	outlier

You see there are four outliers for species setosa and one outlier for versicolor. These are the same points that we have seen in our ggplot boxplot.
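As a quick cross-check of these counts, the following base R sketch recomputes the limits independently of plyr and dplyr. It works on a fresh copy of the data (the name flowers is my own) so that the augmented “iris” from above is left untouched.

```r
# Fresh copy of the original data, independent of the steps above
flowers <- datasets::iris

# ave() returns one value per row: the quartile of that row's species
q1 <- ave(flowers$Petal.Length, flowers$Species,
          FUN = function(v) quantile(v, probs = 0.25))
q3 <- ave(flowers$Petal.Length, flowers$Species,
          FUN = function(v) quantile(v, probs = 0.75))
iqr <- q3 - q1

# Same rule as above: anything not strictly inside the inner limits
inside <- flowers$Petal.Length > q1 - 1.5 * iqr &
    flowers$Petal.Length < q3 + 1.5 * iqr

# Number of outliers per species
outlier_counts <- table(flowers$Species[!inside])
outlier_counts
```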

It is up to you to perform the calculation for extreme outliers (3 times IQR).
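If you would like a starting point, here is one possible sketch in base R. The helper is_beyond() and the column name extreme_outlier are hypothetical names of my own, not part of any package; the factor k is 3 for extreme outliers and 1.5 for the limits used above.

```r
# Fresh copy of the original data
flowers <- datasets::iris

# Hypothetical helper: TRUE for values more than k * IQR beyond the quartiles
is_beyond <- function(v, k) {
    q <- quantile(v, probs = c(0.25, 0.75), names = FALSE)
    iqr <- q[2] - q[1]
    v < q[1] - k * iqr | v > q[2] + k * iqr
}

# Apply the rule separately within each species
flowers$extreme_outlier <- FALSE
for (sp in levels(flowers$Species)) {
    rows <- flowers$Species == sp
    flowers$extreme_outlier[rows] <- is_beyond(flowers$Petal.Length[rows], k = 3)
}

# Rows flagged as extreme outliers (may be empty)
extreme <- flowers[flowers$extreme_outlier, ]
extreme
```

Calling is_beyond() with k = 1.5 instead reproduces the rule for suspected outliers.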

More Information

Boxplot http://www.physics.csbsju.edu/stats/box2.html