ggplot2 is a plotting environment for R. It delivers very appealing graphs, its code is compact and readable, and it is easy to learn. It only takes small changes to get from a simple plot to a complex visualisation. To me it is by far the best plotting environment in R.
Data Understanding and Preparation
We will use the “iris” dataset. It is a multivariate data set, published by Fisher in 1936. It consists of 50 samples from each of three Iris species, 150 samples in total. The measured features are the lengths and widths of sepals and petals, in centimeters.
First, we load the ggplot2 package. Please make sure you have installed it before loading. We also load dplyr, which is only needed to print the data as a tibble. The data is loaded with the data() function. “iris” is part of the datasets package, which is preloaded at R startup.
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
data(iris)
as_tibble(iris)
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>
##  1          5.1         3.5          1.4         0.2 setosa
##  2          4.9         3            1.4         0.2 setosa
##  3          4.7         3.2          1.3         0.2 setosa
##  4          4.6         3.1          1.5         0.2 setosa
##  5          5           3.6          1.4         0.2 setosa
##  6          5.4         3.9          1.7         0.4 setosa
##  7          4.6         3.4          1.4         0.3 setosa
##  8          5           3.4          1.5         0.2 setosa
##  9          4.4         2.9          1.4         0.2 setosa
## 10          4.9         3.1          1.5         0.1 setosa
## # ... with 140 more rows
ggplot2 Components
Each ggplot2 graph has the following components:
- data: a dataframe is used as input data
- aesthetics: define the axes (x, y), color, size, shape, text, fill, …
- geometry: type of plot (line, bar, histogram)
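The three components map directly onto code. The following is a minimal sketch using the built-in iris data; the variable name “p” and the choice of columns are only illustrative.

```r
library(ggplot2)

# data:       the data frame passed to ggplot()
# aesthetics: aes() maps columns to x and y
# geometry:   geom_point() draws one point per row
p <- ggplot(data = iris, aes(x = Sepal.Width, y = Sepal.Length)) +
  geom_point()

# Printing the object renders the plot
p
```

Swapping geom_point() for another geometry changes the type of plot while data and aesthetics stay the same.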
Bar Plot
We start with a bar plot that shows the number of observations per species. A ggplot is generally built up in steps. First, we create a variable “g” that holds all plot information: we call the ggplot() function and pass the data (here: “iris”) and the aesthetics, which map column “Species” to the x-axis. Second, we add the geometry (here: geom_bar()). Finally, we display the plot by calling variable “g”.
g <- ggplot (data = iris, aes(x = Species))
g <- g + geom_bar()
g
The plot shows that there are three species and that the data is balanced: each group has 50 elements.
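The counts behind the bars can also be checked numerically. This is a small sketch in base R; the variable names “species_counts” and “balanced” are illustrative.

```r
# Count rows per species; iris ships with the preloaded datasets package
species_counts <- table(iris$Species)
print(species_counts)

# TRUE when every species has the same number of rows
balanced <- length(unique(as.vector(species_counts))) == 1
print(balanced)
```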
Histogram
A histogram shows the distribution of a single numeric variable by counting how many values fall into each bin. It is created with geom_histogram().
g <- ggplot (data = iris, aes(x = Sepal.Length))
g <- g + geom_histogram()
g
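When no bin settings are given, geom_histogram() falls back to a default number of bins and prints a message suggesting to choose a value yourself. The sketch below sets the bin width explicitly; the value 0.25 and the name “p_hist” are illustrative assumptions, not a recommendation.

```r
library(ggplot2)

# binwidth is given in data units (centimeters here)
p_hist <- ggplot(data = iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.25)
p_hist

# layer_data() returns the computed bins; the counts add up to all rows
bins <- layer_data(p_hist)
sum(bins$count)
```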
Point Plot
We continue with a point plot (scatter plot). For this we need an “x” column and a “y” column. In addition, we color the points according to their group with “color”. If the graph is printed in black and white, colors may not be distinguishable, so we also vary the point shape by species with “shape”. All of this is defined in the aesthetics.
Since we want a point plot, we define the geometry with geom_point(). The default point size is rather small, so we increase it with “size = 2”.
As a bonus, a smoothed line is added with geom_smooth(). Setting the parameter method = "lm" draws a linear regression line. Because color and shape are mapped to “Species”, one line is fitted per species.
g <- ggplot (iris, aes(x = Sepal.Length, y = Petal.Length, color = Species, shape = Species))
g <- g + geom_point(size = 2)
g <- g + geom_smooth(method = "lm")
g
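The regression lines in the plot can be reproduced outside of ggplot2. The following sketch fits one linear model per species with base R's lm(); the names “fits” and “slopes” are illustrative.

```r
# Split the data by species and fit Petal.Length ~ Sepal.Length in each group
fits <- lapply(split(iris, iris$Species), function(d) {
  lm(Petal.Length ~ Sepal.Length, data = d)
})

# Extract the slope of each fitted line
slopes <- sapply(fits, function(f) coef(f)[["Sepal.Length"]])
print(slopes)
```

Each slope describes how much petal length changes, on average, per centimeter of sepal length within one species.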
Box-Plot
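A boxplot is useful to compare distribution properties between groups. The box spans the first to the third quartile, the line inside marks the median, and points beyond the whiskers are drawn as outliers.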
g <- ggplot (iris, aes(x = Species, y = Sepal.Length))
g <- g + geom_boxplot()
g
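The quartiles drawn in a boxplot can be computed directly. This is a sketch in base R with the illustrative name “five_num”; note that it lists minimum and maximum, while the whiskers in the plot stop short of any outliers.

```r
# Minimum, quartiles, and maximum of Sepal.Length for each species
five_num <- sapply(
  split(iris$Sepal.Length, iris$Species),
  quantile,
  probs = c(0, 0.25, 0.5, 0.75, 1)
)
print(five_num)
```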
Faceting
One of the most impressive features of ggplot2 is faceting: a separate subplot is created for each group. This is achieved with facet_grid(). The parameter is “. ~ Species”, which means that one panel per species is created and the panels are arranged horizontally, side by side.
g <- ggplot (iris, aes(x = Sepal.Length, y = Petal.Length))
g <- g + geom_point()
g <- g + geom_smooth(method = "lm")
g <- g + facet_grid(. ~ Species)
g
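The formula in facet_grid() reads “rows ~ columns”. As a sketch of the opposite layout, putting “Species” on the left side of the formula stacks the panels vertically; the name “p_rows” is illustrative.

```r
library(ggplot2)

# Species ~ . places one panel per species in rows instead of columns
p_rows <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point() +
  facet_grid(Species ~ .)
p_rows
```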
Axes and Scales
Axis labels and scales can be modified. We extend the previous plot and add an x-label and a y-label with xlab() and ylab(). The axis breaks are set with scale_x_continuous() and scale_y_continuous().
g <- ggplot (iris, aes(x = Sepal.Length, y = Petal.Length))
g <- g + geom_point()
g <- g + geom_smooth(method = "lm")
g <- g + facet_grid(. ~ Species)
g <- g + xlab ("Sepal Length [cm]")
g <- g + ylab ("Petal Length [cm]")
g <- g + scale_x_continuous(breaks = seq(4, 8, .5))
g <- g + scale_y_continuous(breaks = seq(0, 7, .5))
g
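The “breaks” argument takes a plain numeric vector, so it helps to look at what seq() produces. The sketch below also sets both labels in one call with labs(), as an alternative to xlab() and ylab(); the names “x_breaks” and “p_labs” are illustrative.

```r
library(ggplot2)

# seq(from, to, by) yields the tick positions: 4.0, 4.5, ..., 8.0
x_breaks <- seq(4, 8, by = 0.5)
print(x_breaks)

# labs() sets several labels at once
p_labs <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point() +
  labs(x = "Sepal Length [cm]", y = "Petal Length [cm]") +
  scale_x_continuous(breaks = x_breaks)
p_labs
```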
Themes
Themes define the general look of a plot. You can use a predefined theme, e.g. theme_bw(). You can also set individual components with theme(). Here, “legend.position” is changed from its default (right) to bottom.
g <- ggplot (iris, aes(x = Sepal.Length, y = Petal.Length, color = Species))
g <- g + geom_point()
g <- g + theme_bw()
g <- g + theme(legend.position = "bottom")
g
Saving a Plot
A ggplot can be saved with the ggsave() function. Many parameters can be set, e.g. height, width, dpi, or units. The file type is derived implicitly from the extension of “filename”.
ggsave(filename = "my_first_ggplot.png", plot = g, height = 20, width = 20, units = "cm", dpi = 300)
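Since the file type follows the extension, switching the output format only requires a different file name. The sketch below writes a PDF into the session's temporary directory; the file name “iris_scatter.pdf” and the names “p_save” and “out_file” are illustrative assumptions.

```r
library(ggplot2)

p_save <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point()

# The ".pdf" extension selects the PDF device; tempdir() avoids cluttering the project
out_file <- file.path(tempdir(), "iris_scatter.pdf")
ggsave(filename = out_file, plot = p_save, width = 20, height = 20, units = "cm")

# Check that the file was written
file.exists(out_file)
```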
More Information
For a quick overview you can use the “Data visualization with ggplot2” cheatsheet (RStudio > Help > Cheat Sheets).
- Iris data set: https://en.wikipedia.org/wiki/Iris_flower_data_set