Creating Heatmaps in R

Heatmaps are a visually appealing way to present three-dimensional information: two variables span the grid and a third is encoded as color. In this tutorial I show three different packages for creating heatmaps and explain how the data has to be formatted for each heatmap function.

I chose a dataset for income prediction. We will see how age and education impact income.

Data Preparation

First, I load the required packages. plyr, dplyr and tidyr are used for data manipulation; gplots, ggplot2 and pheatmap are the packages for heatmap creation.

library(pacman)
p_load(plyr, dplyr, tidyr, gplots, ggplot2, pheatmap)

The data is downloaded via URL. Information on the dataset can be found at the end of the article. The file does not include a header row, so the column names need to be defined.
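Before loading the full file, the effect of the missing header can be tried on a few lines. The sketch below is an illustration, not part of the original workflow: the two records are made-up, shortened stand-ins for the real 15-column rows, and strip.white = TRUE is an extra argument assumed here because the fields in adult.data are separated by a comma followed by a blank.

```r
# Two made-up records in the same ", "-separated layout as adult.data
sample_lines <- c("39, State-gov, <=50K",
                  "52, Private, >50K")

# header = FALSE: the first line is data, not column names
raw   <- read.delim(text = sample_lines, sep = ",", header = FALSE)
clean <- read.delim(text = sample_lines, sep = ",", header = FALSE,
                    strip.white = TRUE)

names(raw)              # default names V1, V2, V3 until colnames() is set
as.character(raw$V3)    # values keep the leading blank
as.character(clean$V3)  # leading blank removed
```

Without strip.white, a comparison such as income == ">50K" would never match, because every value starts with a blank.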

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
# The file has no header row; strip.white removes the blank after each comma
census <- read.delim(url, sep = ",", header = FALSE, strip.white = TRUE)
colnames(census) <- c("age", "workclass", "fnlwgt", "education", "education_num",
                      "marital_status", "occupation", "relationship", "race", "sex",
                      "capital_gain", "capital_loss", "hours_per_week",
                      "native_country", "income")

Income is either “<=50K” or “>50K”; this is transformed to 0 or 1, respectively. Age is binned into 5-year intervals. Only the columns “age_bin”, “education_num” and “income_num” are kept for further analysis.

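# 1 if income is ">50K", otherwise 0 (works for character and factor columns)
census$income_num <- as.numeric(trimws(as.character(census$income)) == ">50K")
# Integer division floors age to the lower edge of its 5-year bin
census$age_bin <- census$age %/% 5 * 5

census_filt <- census %>%
    select(age_bin, education_num, income_num)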
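
The two transformations can be checked by hand on a few values. The vectors below are made-up examples; the comparison-based income coding is one possible approach that does not depend on whether the column was read as character or factor.

```r
# Hypothetical ages to illustrate 5-year binning
ages <- c(17, 24, 25, 39, 52)
age_bin <- ages %/% 5 * 5   # integer division by 5, then scale back
age_bin                     # 15 20 25 35 50

# Hypothetical income labels, one with a leading blank as in the raw file
income <- c("<=50K", ">50K", " >50K")
income_num <- as.numeric(trimws(income) == ">50K")
income_num                  # 0 1 1
```

Each bin is labelled by its lower edge, so the bin 35 covers ages 35 to 39.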

Package ggplot2

ggplot2 is the all-rounder among graphics packages; I presented its basics in a separate tutorial. I start with the filtered data. ggplot2 requires tidy data, so a two-dimensional binning is created with the group_by() and summarise() functions. The result is piped to the ggplot() function. Age is used as the x-variable, education as the y-variable, and income for filling the cells.

The function for creating the heatmap is geom_tile(). geom_text() writes the numbers into the cells. The other function calls define the x-label, y-label, title and color code for the cells.

g <- census_filt %>%
    group_by(age_bin, education_num) %>%
    # Share of records with income >50K per cell, in percent
    summarise(income_num = mean(income_num) * 100) %>%
    ggplot(aes(x = age_bin,
               y = education_num,
               fill = income_num))
g <- g + geom_tile()
# Print the rounded percentage inside each tile
g <- g + geom_text(aes(label = round(income_num, 1)))
g <- g + scale_fill_gradientn(colours =
     c("blue", "green", "yellow", "orange", "red", "brown", "black"))
g <- g + xlab("Age [-]")
g <- g + ylab("Education Level [-]")
g <- g + ggtitle("Income vs Age and Education")
g

[Figure: ggplot2 heatmap "Income vs Age and Education"]
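
What "tidy" means for geom_tile() is one row per cell, with one column each for x, y and fill. As a rough sketch, the same aggregation can be reproduced on a toy table in base R with aggregate(), used here only to avoid package dependencies; the data frame toy and its values are invented for illustration.

```r
# Invented records: age bin, education level, income flag (1 = >50K)
toy <- data.frame(
    age_bin       = c(20, 20, 20, 40, 40, 40),
    education_num = c( 9,  9, 13,  9, 13, 13),
    income_num    = c( 0,  0,  1,  0,  1,  0)
)

# One row per (age_bin, education_num) cell, value = share of 1s in percent
cells <- aggregate(income_num ~ age_bin + education_num, data = toy,
                   FUN = function(x) mean(x) * 100)
cells
```

A data frame of this shape can be passed to ggplot() with aes(x = age_bin, y = education_num, fill = income_num).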

Package gplots

Another package for creating heatmaps is gplots. We first need to arrange the data, because the heatmap function requires a matrix as input. Again I bin the data. To change from the tidy to the wide data format I use spread() from the tidyr package.

The heatmap is created with the heatmap.2() function. Besides the matrix, many other parameters can be tuned.

census_wide <- census_filt %>%
    group_by(age_bin, education_num) %>%
    summarise(income_num2 = mean(income_num, na.rm = TRUE) * 100,
              count = n()) %>%
    ungroup() %>%
    # Drop sparsely populated cells
    filter(count > 5) %>%
    select(-count) %>%
    spread(key = "age_bin", value = "income_num2")
# Keep only the age-bin columns and label the rows by education level
census_spread <- as.matrix(census_wide[, -1])
rownames(census_spread) <- census_wide$education_num
col_pal <- colorRampPalette(c("blue", "green", "yellow", "orange", "red", "brown", "black"))

# Clustering and dendrograms are switched off so rows and columns keep their order
heatmap.2(x = census_spread, key = TRUE, keysize = 1.5, symkey = FALSE,
          col = col_pal,
          breaks = seq(0, 80, 0.01), Rowv = FALSE, Colv = FALSE,
          dendrogram = "none",
          ylab = "Education Level [-]", xlab = "Age [-]",
          cexRow = .8, cexCol = .8,
          cellnote = round(census_spread, 1),
          notecex = .8, notecol = "white",
          trace = "none", scale = "none")

[Figure: heatmap created with heatmap.2()]
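
The wide layout that heatmap.2() expects has one row per education level and one column per age bin. A minimal sketch on invented values, using base R tapply() instead of spread() purely to keep the example dependency-free:

```r
# Invented tidy table: one row per cell
tidy <- data.frame(
    age_bin       = c(20, 20, 40),
    education_num = c( 9, 13, 13),
    income_pct    = c( 0, 100, 50)
)

# Rows = education level, columns = age bin; missing cells become NA
wide <- tapply(tidy$income_pct,
               list(tidy$education_num, tidy$age_bin),
               mean)
wide
is.matrix(wide)   # TRUE
```

Combinations that do not occur in the tidy table, here education level 9 in age bin 40, show up as NA in the matrix.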

Package pheatmap

The pheatmap package provides only one function, which has the same name. Similar to gplots, it requires a matrix as input.

pheatmap(mat = census_spread, display_numbers = TRUE,
         breaks = 0:100, border_color = "black", drop_levels = TRUE, kmeans_k = NA,
         cluster_rows = FALSE, cluster_cols = FALSE, main = "Income vs. Age/Education")

[Figure: heatmap created with pheatmap()]
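
pheatmap() takes its colors as a vector (argument color) rather than as a palette function, and its documentation asks for breaks to be one element longer than that vector. A sketch of how the palette from the gplots example might be reused; the 2 x 2 matrix is invented, and the call is guarded so the snippet also runs where pheatmap is not installed.

```r
# Same color ramp as in the gplots example
col_pal <- colorRampPalette(c("blue", "green", "yellow", "orange",
                              "red", "brown", "black"))

breaks <- 0:100                        # 101 break points
colors <- col_pal(length(breaks) - 1)  # 100 colors, one per interval

# Invented matrix: rows = education level, columns = age bin
m <- matrix(c(5, 20, 40, 75), nrow = 2,
            dimnames = list(c("9", "13"), c("20", "40")))

if (requireNamespace("pheatmap", quietly = TRUE)) {
    pheatmap::pheatmap(mat = m, color = colors, breaks = breaks,
                       display_numbers = TRUE,
                       cluster_rows = FALSE, cluster_cols = FALSE)
}
```

Using the same ramp in both packages makes the two heatmaps easier to compare side by side.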

Conclusion

We learn, not surprisingly, that the highest proportion of incomes above $50K is found for the highest education levels and for ages around 45 to 50.

More importantly, we learned how to use three different packages for heatmap creation. Which one to use is up to you.

More Information
