Heatmaps are a visually appealing way to present information in three dimensions: two variables span the grid, and color encodes the third. In this tutorial I show three different packages for creating heatmaps, and how the data has to be formatted for each heatmap function.

I chose data for income prediction. We will see how age and education impact income.

Data Preparation

First, I load the required packages. plyr, dplyr and tidyr are used for data manipulation; gplots, ggplot2 and pheatmap are the packages for creating heatmaps.

library(pacman)
p_load(plyr, dplyr, tidyr, gplots, ggplot2, pheatmap)

The data is downloaded from a URL. Information on the dataset is given at the end of the article. The file does not include a header row, so the column names need to be defined.

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
census <- read.delim(url, sep = ",", header = FALSE, strip.white = TRUE)
colnames(census) <- c("age", "workclass", "fnlwgt", "education", "education_num", 
              "marital_status", "occupation", "relationship", "race", "sex", 
              "capital_gain", "capital_loss", "hours_per_week", 
              "native_country", "income")

Income is “<=50K” or “>50K”. This is transformed to 0 or 1. Age is binned in 5-year intervals. Only the columns “age_bin”, “education_num” and “income_num” are kept for further analysis.

# Compare the text directly: as.numeric() on a character column would return NA
census$income_num <- as.numeric(trimws(census$income) == ">50K")
census$age_bin <- census$age %/% 5 * 5

census_filt <- census %>% 
    select(age_bin, education_num, income_num)
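
To make the two transformations concrete, here is a minimal sketch on a few made-up values. The vectors ages and income are hypothetical and not taken from the dataset; the leading blank in the income strings is an assumption about how the raw comma-separated values may look after import.

```r
# Hypothetical sample values, for illustration only
ages   <- c(17, 23, 25, 49, 90)
income <- c(" <=50K", " >50K", " <=50K", " >50K", " >50K")

# %/% is integer division: 23 %/% 5 is 4, so the bin starts at 4 * 5 = 20
age_bin <- ages %/% 5 * 5
age_bin
# [1] 15 20 25 45 90

# trimws() removes surrounding blanks; the comparison gives TRUE/FALSE,
# which as.numeric() converts to 1/0
income_num <- as.numeric(trimws(income) == ">50K")
income_num
# [1] 0 1 0 1 1
```

Each age is rounded down to the lower edge of its 5-year bin, so the bin labelled 45 covers ages 45 to 49.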

Package ggplot2

ggplot2 is the all-rounder among graphics packages. I presented its basics in a separate tutorial. I start with the filtered data. ggplot2 requires tidy data, so a two-dimensional binning is created with the group_by() and summarise() functions, and the result is piped to the ggplot() function. Age is used as the x-variable, education as the y-variable, and income fills the cells.

The function that creates the heatmap is geom_tile(). geom_text() writes the numbers into the cells. The other function calls define the x-label, y-label, title and color scale for the cells.

g <- census_filt %>% 
    group_by(age_bin, education_num) %>% 
    summarise(income_num = mean(income_num) * 100) %>%  
    ggplot(., aes(x =age_bin, 
              y = education_num, 
              fill = income_num)) 
g <- g + geom_tile()
g <- g + geom_text(aes(label = round(income_num, 1)))
g <- g + scale_fill_gradientn(colours = 
     c("blue","green","yellow","orange","red","brown","black"))
g <- g + xlab("Age [-]")
g <- g + ylab("Education Level [-]")
g <- g + ggtitle("Income vs Age and Education")
g
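
What "tidy" means here can be seen on a toy example. The sketch below uses a hypothetical data frame toy with the same three columns as census_filt; the column name income_pct is also just illustrative. After group_by() and summarise(), there is exactly one row per cell of the heatmap, which is the layout geom_tile() works with.

```r
library(dplyr)

# Hypothetical toy data in the same shape as census_filt
toy <- data.frame(
  age_bin       = c(20, 20, 20, 25, 25),
  education_num = c(9, 9, 13, 9, 13),
  income_num    = c(0, 1, 1, 0, 1)
)

# One row per (age_bin, education_num) combination: the long ("tidy") layout
toy_cells <- toy %>%
  group_by(age_bin, education_num) %>%
  summarise(income_pct = mean(income_num) * 100, .groups = "drop")

toy_cells
```

The five input rows collapse into four cells; the cell for age bin 20 and education level 9 holds the mean of 0 and 1, that is 50 percent.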

Package gplots

Another package for creating a heatmap is gplots. We first need to rearrange the data, because the heatmap function requires a matrix as input. Again I bin the data. To change from the tidy to the wide data format I use spread() from the tidyr package.

The heatmap is created with the heatmap.2() function, which takes the matrix as its first argument. Many other parameters can be tuned, such as the color key, the breaks and the cell notes.

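census_spread <- census_filt %>%
    group_by(age_bin, education_num) %>% 
    summarise(income_num2 = mean(income_num, na.rm = TRUE) * 100,
          count = n()) %>%
    ungroup() %>%                  # drop the grouping before reshaping
    filter(count > 5) %>%          # skip sparsely populated cells
    select(-count) %>% 
    spread(key = "age_bin", value = "income_num2") %>% 
    as.matrix()
# Use the education level as row names, then drop that column
rownames(census_spread) <- census_spread[, "education_num"]
census_spread <- census_spread[, -1]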
col_pal <- colorRampPalette(c("blue","green","yellow","orange","red","brown","black"))

heatmap.2(x = census_spread, key=T, keysize=1.5, symkey=F, 
          col = col_pal, 
          breaks=c(seq(0, 80 ,0.01)), Rowv=F, Colv=F, 
          dendrogram="none", 
          ylab="Education Level [-]", xlab="Age [-]", 
          cexRow=.8, cexCol=.8,
          cellnote=round(census_spread, 1), 
          notecex=.8, notecol="white",
          trace="none",scale ="none")
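
The reshaping step from the tidy to the wide format is easier to follow on a small example. The following sketch uses a hypothetical data frame cells with four made-up values; it is meant as an illustration of what spread() does, not as part of the analysis.

```r
library(tidyr)

# Hypothetical long-format cells: one row per (education_num, age_bin) pair
cells <- data.frame(
  education_num = c(9, 9, 13, 13),
  age_bin       = c(20, 25, 20, 25),
  income_pct    = c(50, 0, 100, 100)
)

# spread() turns each distinct age_bin value into its own column
wide <- spread(cells, key = "age_bin", value = "income_pct")

# Keep the education levels as row names and drop that column
mat <- as.matrix(wide[, -1])
rownames(mat) <- wide$education_num

mat
dim(mat)
```

The result is a 2 x 2 numeric matrix with education levels as rows and age bins as columns. In recent tidyr versions, pivot_wider(names_from = age_bin, values_from = income_pct) should give the same wide layout.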

Package pheatmap

The pheatmap package provides a single function of the same name. Like heatmap.2() from gplots, it requires a matrix as input.

pheatmap(mat = census_spread, display_numbers = T, 
     breaks = 0:100, border_color = "black", drop_levels = T, kmeans_k = NA, 
     cluster_rows = F, cluster_cols = F, main = "Income vs. Age/Education")
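
The call above relies on the default color palette of pheatmap. If you prefer the same colors as in the other two plots, a possible variant is sketched below. It assumes the documented rule that breaks must be one element longer than the color vector; toy_mat is a hypothetical stand-in for census_spread.

```r
library(pheatmap)

# Same color ramp as in the other two plots; colorRampPalette() returns a function
col_pal <- colorRampPalette(c("blue", "green", "yellow", "orange", "red", "brown", "black"))

breaks  <- 0:100                        # 101 break points
colours <- col_pal(length(breaks) - 1)  # 100 colors, one per interval

# Hypothetical 2 x 2 matrix standing in for census_spread
toy_mat <- matrix(c(50, 100, 0, 100), nrow = 2,
                  dimnames = list(c("9", "13"), c("20", "25")))

pheatmap(mat = toy_mat, color = colours, breaks = breaks,
         display_numbers = TRUE, cluster_rows = FALSE, cluster_cols = FALSE)
```

Generating the colors from the number of breaks keeps the two in step if you later change the break sequence.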

Conclusion

We learn, not surprisingly, that the highest proportion of incomes above $50K is found at the highest education levels and at ages around 45 to 50.

More importantly, we learned how to use three different packages for creating heatmaps. Which one you use is up to you.

More Information

Dataset: Census Income (UCI Machine Learning Repository)