Data Manipulation with dplyr

dplyr is one of the most popular and useful packages in the R universe. It covers the core tasks of data manipulation: filtering, selecting, sorting and many more.

  • Objectives: data manipulation with dplyr
  • Requirements: none

In our example we will find the most popular girls' names of 2014 and then look at how their popularity developed from 1880 to 2014.

Packages

We will analyse baby names from the babynames package. Please make sure it is installed before you try to load it. The same holds for dplyr and ggplot2.

  • babynames: includes data for this tutorial
  • dplyr: includes all data manipulation
  • ggplot2: visualisation of results
library(babynames)  # data for this tutorial
library(dplyr)  # data manipulation
library(ggplot2)  # visualisation of results

Data

You can load the dataframe with data and get an overview of its structure with as_tibble. Older dplyr versions used tbl_df for this, which is now deprecated.

data("babynames")
as_tibble(babynames)
## # A tibble: 1,858,689 x 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # ... with 1,858,679 more rows

There are five columns: year, sex, name, the count n, and the corresponding proportion prop.

Piping

dplyr makes use of piping, a very helpful technique that I strongly recommend because it makes code more readable. It avoids nested calls: the piping operator %>% takes the result of the expression on its left side and passes it on to the function on its right side as the first argument.

It is best explained with an example. The example is not very complicated since it only includes three functions, but you can imagine how much harder it gets if you nest many more functions. To understand what nested code does you have to work from the inside to the outside. Here, you start with a sequence from 1 to 10000, then keep only the first 100 elements, and finally calculate their sum.

sum(head(seq(1, 10000), n = 100))
## [1] 5050

Here is a piped version, which is much easier to understand.

seq(1, 10000) %>% head(n = 100) %>% sum()
## [1] 5050
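To make the equivalence between the nested and the piped form concrete, here is a minimal sketch on a small made-up vector. The vector x and the choice of functions are assumptions for illustration only.

```r
library(dplyr)  # makes the %>% operator available

x <- c(4, 9, 16)  # made-up input

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped call: reads from left to right, each result becomes
# the first argument of the next function
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
## [1] TRUE
```

Both versions compute the same value. The pipe only changes the order in which the code is written and read.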

Filtering and Piping

We want to keep only data from 2014 and can get this with filter. First, take a look at the pipeless version. The first argument is the data, followed by the condition that rows must fulfil.

most_popular_female <- filter(babynames, year == 2014)

Here is how it works with piping.

most_popular_female <- babynames %>%
    filter(year == 2014)

We have a dataframe with all names, but only from 2014. For the remainder we will always use piping where applicable.

We also have to filter for sex, so add an additional filter.

most_popular_female <- babynames %>%
    filter(year == 2014) %>%
    filter(sex == "F")
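As a side note, filter also accepts several conditions in one call, separated by commas, and keeps only rows that fulfil all of them. The sketch below uses a small made-up tibble (toy is not part of babynames) to show that both spellings return the same rows.

```r
library(dplyr)

# Made-up data for illustration, not taken from babynames
toy <- tibble(
    year = c(2013, 2014, 2014, 2014),
    sex  = c("F", "F", "M", "F"),
    name = c("Ann", "Bea", "Cal", "Dee")
)

# Two chained filter calls
chained <- toy %>%
    filter(year == 2014) %>%
    filter(sex == "F")

# One filter call with two conditions (combined with logical AND)
combined <- toy %>%
    filter(year == 2014, sex == "F")

combined$name
## [1] "Bea" "Dee"
```

Which form to use is a matter of taste. The chained form is kept in this tutorial because it shows each step separately.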

Hint: some function names, e.g. filter, also exist in other packages and can mask each other. When two loaded packages export a function with the same name, a plain call to filter uses the version from the package that was loaded last. If that is not dplyr, the code in this tutorial will not work. Some workarounds: make sure that only dplyr is loaded, or make sure dplyr is loaded after the other packages with identical function names. Another option to avoid trouble is to explicitly request the function from the dplyr package by writing dplyr::filter().
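The namespace prefix can be sketched as follows. The tibble toy is made up for illustration, and find() from the utils package is used only to list which attached packages define a function called filter.

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(name = c("Ann", "Bea"), n = c(10, 20))

# The explicit prefix always selects dplyr's filter,
# independent of the order in which packages were loaded
big <- toy %>% dplyr::filter(n > 15)
big$name
## [1] "Bea"

# List the attached packages that define "filter";
# the first entry is the one a plain filter() call would use
find("filter")
```

In a fresh session with only dplyr loaded, the list typically contains dplyr and stats, since base R ships its own stats::filter for time series.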

Sorting

Data can be sorted with arrange, which accepts several sorting columns. The default order is ascending. To sort a column in descending order, wrap it in desc().

In our example the data is sorted by sex and then by descending proportion prop.

most_popular_female <- babynames %>%
    filter(year == 2014) %>%
    filter(sex == "F") %>%
    arrange(sex, desc(prop))
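Because the filtered data contains only one sex, the effect of the first sorting column is not visible there. The following sketch uses a small made-up tibble with both sexes to show how arrange combines an ascending and a descending column.

```r
library(dplyr)

# Made-up data for illustration, not taken from babynames
toy <- tibble(
    sex  = c("M", "F", "F", "M"),
    name = c("Cal", "Ann", "Bea", "Dan"),
    prop = c(0.2, 0.1, 0.3, 0.4)
)

# First by sex (ascending), then within each sex by prop (descending)
sorted <- toy %>% arrange(sex, desc(prop))
sorted$name
## [1] "Bea" "Ann" "Dan" "Cal"
```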

Select

Often, not all columns are needed for further analysis. select reduces the number of columns, either by naming the columns to keep or by naming the columns to exclude.

In our example only name is kept. An alternative would have been select(-year, -sex, -n, -prop).

most_popular_female <- babynames %>%
    filter(year == 2014) %>%
    filter(sex == "F") %>%
    arrange(sex, desc(prop)) %>%
    select(name)
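The two ways of using select can be compared on a one-row made-up tibble that has the same column names as babynames. This is only a sketch, and the values are invented.

```r
library(dplyr)

# Made-up row with the same column names as babynames
toy <- tibble(year = 2014, sex = "F", name = "Ann", n = 10L, prop = 0.1)

# Keep by naming the wanted column
kept <- toy %>% select(name)

# Keep by excluding all unwanted columns
dropped <- toy %>% select(-year, -sex, -n, -prop)

names(kept)
## [1] "name"
identical(names(kept), names(dropped))
## [1] TRUE
```

Naming the columns to keep is shorter here. Excluding columns is more convenient when only a few out of many columns should be removed.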

Only the top 5 names will be analysed, so head is applied at the end.

most_popular_female <- babynames %>%
    filter(year == 2014) %>%
    filter(sex == "F") %>%
    arrange(sex, desc(prop)) %>%
    select(name) %>%
    head(n = 5)
as.character(most_popular_female$name)
## [1] "Emma"     "Olivia"   "Sophia"   "Isabella" "Ava"

These are, in decreasing order, the most popular girls' names of 2014.
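As an aside, newer dplyr versions (1.0.0 and later) offer slice_max, which combines the sorting and the head step. The sketch below compares both approaches on a made-up tibble without ties. With tied values slice_max may return more rows than requested unless with_ties = FALSE is set.

```r
library(dplyr)

# Made-up data for illustration, no tied values
toy <- tibble(
    name = c("Ann", "Bea", "Cal", "Dee"),
    prop = c(0.1, 0.4, 0.3, 0.2)
)

# Approach used in this tutorial: sort, then take the first rows
top2_head <- toy %>%
    arrange(desc(prop)) %>%
    head(n = 2)

# Alternative: pick the rows with the largest prop directly
top2_slice <- toy %>% slice_max(prop, n = 2)

top2_head$name
## [1] "Bea" "Cal"
identical(top2_head$name, top2_slice$name)
## [1] TRUE
```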

Plot Results

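Now we want to show how the popularity of these names developed over time. The original dataframe is filtered for these names. The function mutate creates new variables or overwrites existing ones. Here it converts prop from a fraction to a percentage.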

babies_to_plot <- babynames %>%
    filter(sex == "F") %>%
    mutate(prop = prop * 100) %>%
    filter(name %in% most_popular_female$name)
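In the code above mutate overwrites the existing column prop. It can also add a new column and leave the original untouched, as in this sketch. The tibble toy and the column name prop_pct are made up for illustration.

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(name = c("Ann", "Bea"), prop = c(0.25, 0.5))

# Add a new column; prop itself stays unchanged
toy_pct <- toy %>% mutate(prop_pct = prop * 100)

names(toy_pct)
## [1] "name"     "prop"     "prop_pct"
toy_pct$prop_pct
## [1] 25 50
```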

The plot is created with ggplot. It shows the top 5 names of 2014 and their development since 1880.

ggplot(babies_to_plot, aes(year, prop, color = name)) +
    geom_line(linewidth = 2) +  # use size = 2 with ggplot2 versions before 3.4.0
    xlab("Year [-]") +
    ylab("Proportion [%]") +
    ggtitle("Most Popular Female Names in 2014, and Before") +
    theme_bw()

[Figure: line plot "Most Popular Female Names in 2014, and Before", x-axis Year, y-axis Proportion [%], one line per name]

Emma was very popular in 1880 and the following decades and only recently experienced a revival. The other top 5 names became popular only recently.

Conclusion

We got to know the dplyr package and some of its most important functions: filter, arrange, select and mutate. There is more to discover. For a quick overview take a look at the “Data Manipulation with dplyr, tidyr” cheatsheet (Help –> Cheatsheets).
