In this tutorial we will learn what wide and tidy data is, why it is used and how you can transform it from one format in the other. Transforming data from one format in the other is a very common task in data analysis.
What is wide and tidy data?
Wide data means that each row contains more than one observation, whereas with tidy data there is exactly one observation for each row. Let’s take a look at an example.
We have a small school class with three students (Stuart, Bob, Kevin), who got grades in three different subjects (math, sport, art). The shown table would be a typical representation of Excel, SPSS or other programs.
school_wide <- data.frame(name = c("Stuart", "Bob", "Kevin"),
math = c(2, 3, 4),
sport = c(3, 1, 2),
art = c(3, 2, 1))
school_wide
## name math sport art ## 1 Stuart 2 3 3 ## 2 Bob 3 1 2 ## 3 Kevin 4 2 1
From wide to tidy
If you want to transform data from wide to tidy you can use functions from package tidyr. The relevant function is gather(). You need to provide some parameters
- a key parameter name, which represents the columns. I will call the resulting column subject
- a value parameter, which will be the name of the columns holding the values
- Finally, you need to define which columns to use for transformation. Here this is 2:4.
library(tidyr)
school_tidy <- school_wide %>%
gather(key = "subject", value = "grade", 2:4)
school_tidy
## name subject grade ## 1 Stuart math 2 ## 2 Bob math 3 ## 3 Kevin math 4 ## 4 Stuart sport 3 ## 5 Bob sport 1 ## 6 Kevin sport 2 ## 7 Stuart art 3 ## 8 Bob art 2 ## 9 Kevin art 1
This is the same data, in a tidy-representation.
From tidy to wide
The function with the opposite effect is spread(). You need to pass the exact same parameters:
- key: here you define the column name of the key-column
- value: you pass the chosen name of the column with the values
school_tidy %>%
spread(key = "subject", value = "grade")
## name art math sport ## 1 Bob 2 3 1 ## 2 Kevin 1 4 2 ## 3 Stuart 3 2 3
You see, that you get the original wide-data.
Why are there two different formats?
Both have their right to exist. Wide data typically comes from other programs. Excel or SPSS usually work with wide-data.
For specific tasks it is better to transform wide-data (also called messy data) into tidy-data. I will show two examples.
Example 1: Calculating group averages
Assume you want to calculate average class-grades per subject. With wide data you would do something like this:
mean(school_wide$math)
## [1] 3
mean(school_wide$sport)
## [1] 2
mean(school_wide$art)
## [1] 2
You repeat your code several times, which should be avoided. If there are only three columns, that is ok, but imagine you want to calculate the average for 50 columns.
Here is the solution based on tidy-data. You need to load dplyr package and make use of group_by() and summarise(). The code is much cleaner.
dplyr and tidyr are part of the tidyverse, which is a bunch of different packages, which rely on tidy-data.
library(dplyr)
school_tidy %>%
group_by(subject) %>%
summarise(mean_grade = mean(grade))
## # A tibble: 3 x 2 ## subject mean_grade ## <chr> <dbl> ## 1 art 2 ## 2 math 3 ## 3 sport 2
Example 2: Plotting with ggplot2
Assume you want to plot the results with ggplot2. This is done with three lines of code. Trying to get a similar representation based on wide-data will result in much more complex code.
library(ggplot2)
g <- ggplot(school_tidy, aes(name, grade, fill = subject))
g <- g + geom_bar(stat = "identity", position = "dodge")
g
Conclusion
You have seen how easy it is to transform data from wide to tidy, and reverse. You also saw some examples in which tidy-data has advantages over wide-data.