Outlier Detection

We will learn to detect outliers in univariate distributions, using the “iris” dataset as an example.

  • Learning Objective: Outlier Detection
  • Requirements: ggplot2, plyr, dplyr, knitr (for kable())

Introduction

We will use the iris data set, so we first load it with data().

data(iris)

We want to detect outliers for one specific variable (here: “Petal.Length”). Let’s visualise the data with ggplot(). If you are not familiar with ggplot(), I recommend taking a look at this tutorial…
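A minimal sketch of such a plot, assuming ggplot2 is installed and that Species goes on the x-axis. The object name p is only illustrative, and datasets::iris is used so the example always starts from the untouched data.

```r
library(ggplot2)

# One boxplot of Petal.Length per species.
# geom_boxplot() draws points lying more than 1.5 * IQR beyond the box as dots.
p <- ggplot(datasets::iris, aes(x = Species, y = Petal.Length)) +
    geom_boxplot()
print(p)
```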

The graph shows a boxplot for each group (here: “Species”). A boxplot is a helpful visualisation for univariate distributions. If you need a refresher on how to read it, you will find a link at the end of this article.

[Figure: boxplots of Petal.Length for each Species]

The dots represent points that lie more than 1.5 times the IQR (interquartile range) beyond the box. Two thresholds are commonly used:

  • Outliers are values more than 3 times the IQR above the third quartile (Q3) or more than 3 times the IQR below the first quartile (Q1).
  • Suspected outliers are values more than 1.5 times the IQR above Q3 or more than 1.5 times the IQR below Q1.

In other sources you will find the names mild (1.5 IQR) and extreme (3 IQR) outliers.
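To make the two thresholds concrete, here is a small sketch on a made-up vector. The helper classify_outliers() and the data in x are hypothetical and only serve as an illustration; quantile() is used with its default settings.

```r
# Hypothetical helper: label each value according to the 1.5 * IQR and 3 * IQR rules
classify_outliers <- function(x) {
    q1  <- quantile(x, probs = 0.25, names = FALSE)
    q3  <- quantile(x, probs = 0.75, names = FALSE)
    iqr <- q3 - q1
    extreme <- x < q1 - 3 * iqr   | x > q3 + 3 * iqr
    mild    <- x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr
    ifelse(extreme, "outlier",
           ifelse(mild, "suspected outlier", "no outlier"))
}

# Made-up data: Q1 = 3.5, Q3 = 8.5, IQR = 5,
# so the 1.5 * IQR limits are -4 and 16, and the 3 * IQR limits are -11.5 and 23.5
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 30)
labels <- classify_outliers(x)
data.frame(x, labels)
```

Here 20 lies beyond the 1.5 IQR limit but within the 3 IQR limit, so it is a suspected outlier, while 30 lies beyond both.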

Outlier Calculation

The calculation is straightforward.

  1. We calculate the first quartile (Q1) and the third quartile (Q3).
  2. With these we can calculate the IQR (= Q3 – Q1).
  3. Now we calculate the upper and lower limits according to the definitions above.
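The three steps can be sketched in base R for a single species before applying them to every group. This is only an illustration: the variable names are made up, and setosa is picked arbitrarily.

```r
# Petal lengths of one species, taken from the untouched datasets::iris
x <- datasets::iris$Petal.Length[datasets::iris$Species == "setosa"]

# Step 1: first and third quartile
q1 <- quantile(x, probs = 0.25, names = FALSE)
q3 <- quantile(x, probs = 0.75, names = FALSE)

# Step 2: interquartile range
iqr <- q3 - q1

# Step 3: limits for suspected outliers (1.5 * IQR)
upper_inner_limit <- q3 + 1.5 * iqr
lower_inner_limit <- q1 - 1.5 * iqr

c(Q1 = q1, Q3 = q3, IQR = iqr,
  upper = upper_inner_limit, lower = lower_inner_limit)
```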

I use the ddply() function from the plyr package. Since piping (%>%) is used, the dplyr package is loaded as well.

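library(plyr)
# Load dplyr after plyr so that dplyr's functions are not masked by plyr
suppressPackageStartupMessages(library(dplyr))
# Per species: quartiles, IQR, and the limits for suspected outliers (1.5 * IQR)
outlier_limits <- iris %>%
    ddply(.(Species), summarise,
          Q1 = quantile(Petal.Length, probs = 0.25),
          Q3 = quantile(Petal.Length, probs = 0.75),
          IQR = Q3 - Q1,
          upper_inner_limit = Q3 + 1.5 * IQR,
          lower_inner_limit = Q1 - 1.5 * IQR)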
outlier_limits
##      Species  Q1    Q3   IQR upper_inner_limit lower_inner_limit
## 1     setosa 1.4 1.575 0.175            1.8375            1.1375
## 2 versicolor 4.0 4.600 0.600            5.5000            3.1000
## 3  virginica 5.1 5.875 0.775            7.0375            3.9375
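If you prefer to avoid plyr, the same summary can probably be obtained with dplyr alone. The following is an alternative sketch, not the approach used in this article. The name outlier_limits_dplyr is made up, and the dplyr:: prefixes only guard against masking in case plyr is attached as well.

```r
library(dplyr)

# Group by species, then compute the same quantities as above
outlier_limits_dplyr <- datasets::iris %>%
    dplyr::group_by(Species) %>%
    dplyr::summarise(
        Q1 = quantile(Petal.Length, probs = 0.25, names = FALSE),
        Q3 = quantile(Petal.Length, probs = 0.75, names = FALSE),
        IQR = Q3 - Q1,
        upper_inner_limit = Q3 + 1.5 * IQR,
        lower_inner_limit = Q1 - 1.5 * IQR
    )
outlier_limits_dplyr
```

The result is a tibble rather than a plain data frame, but it can be joined to the data in the same way.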

This information is now applied to our dataset. We first join it with “iris” and then perform some simple comparisons to find outliers.

For the join we use left_join() from the dplyr package. For a nicer view we remove the variables that are no longer needed with select() and take a look at the head of the dataset.

If you need a refresher on the dplyr package, I warmly recommend tutorial … to you.

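library(knitr)  # provides kable()
# Attach the limits of its species to every row, then drop the helper columns
iris <- left_join(iris, outlier_limits, by = "Species") %>%
    select(-Q1, -Q3, -IQR)
iris %>% head() %>% kable()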
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | upper_inner_limit | lower_inner_limit |
|---:|---:|---:|---:|:---|---:|---:|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1.8375 | 1.1375 |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1.8375 | 1.1375 |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1.8375 | 1.1375 |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1.8375 | 1.1375 |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1.8375 | 1.1375 |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa | 1.8375 | 1.1375 |

We are nearly done. Now we create a new variable “Petal.Length.Outlier”, which has the value “no outlier” if Petal.Length lies strictly between the lower and upper limits, and “outlier” otherwise.

iris$Petal.Length.Outlier <- ifelse(iris$Petal.Length > iris$lower_inner_limit & 
    iris$Petal.Length < iris$upper_inner_limit, "no outlier", "outlier")

Let’s take a look at our outliers.

iris %>% filter(Petal.Length.Outlier == "outlier") %>% kable()
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | upper_inner_limit | lower_inner_limit | Petal.Length.Outlier |
|---:|---:|---:|---:|:---|---:|---:|:---|
| 4.3 | 3.0 | 1.1 | 0.1 | setosa | 1.8375 | 1.1375 | outlier |
| 4.6 | 3.6 | 1.0 | 0.2 | setosa | 1.8375 | 1.1375 | outlier |
| 4.8 | 3.4 | 1.9 | 0.2 | setosa | 1.8375 | 1.1375 | outlier |
| 5.1 | 3.8 | 1.9 | 0.4 | setosa | 1.8375 | 1.1375 | outlier |
| 5.1 | 2.5 | 3.0 | 1.1 | versicolor | 5.5000 | 3.1000 | outlier |

You can see that there are four outliers for species setosa and one outlier for versicolor. These are the same points that we saw in our ggplot boxplot.

It is up to you to perform the calculation for extreme outliers (3 times IQR).
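If you want to check your own solution, here is one possible sketch. It assumes the same quartile definition as above and starts from the untouched datasets::iris. The names extreme_outliers, upper_outer_limit and lower_outer_limit are made up for this example.

```r
library(dplyr)

extreme_outliers <- datasets::iris %>%
    dplyr::group_by(Species) %>%
    # Per species: quartiles and the limits for extreme outliers (3 * IQR)
    dplyr::mutate(
        Q1 = quantile(Petal.Length, probs = 0.25, names = FALSE),
        Q3 = quantile(Petal.Length, probs = 0.75, names = FALSE),
        upper_outer_limit = Q3 + 3 * (Q3 - Q1),
        lower_outer_limit = Q1 - 3 * (Q3 - Q1)
    ) %>%
    dplyr::ungroup() %>%
    # Keep only rows beyond the outer limits
    dplyr::filter(Petal.Length < lower_outer_limit |
                  Petal.Length > upper_outer_limit)

nrow(extreme_outliers)
```

For Petal.Length this should return no rows, because none of the points lies that far from its group's box.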

More Information
