- broom is a helpful package for turning fitted models into tidy dataframes for further analysis.
- modelr provides useful helper functions for working with models.
Objectives: learn to work with the packages broom and modelr, fit a randomForest model as well as linear, quadratic and cubic models, and use a bit of map() from the purrr package
Requirements: piping; experience with dplyr might help
Data Preparation
We will use the packages broom and modelr to inspect and enhance models, the randomForest package to create a random forest model, purrr to analyse several models at once, and ggplot2 for data visualisation. Please make sure you have installed these packages before loading them. The linear model comes with stats, which is preloaded.
We will use the dataset cars. It has only two variables: stopping distance dist (in ft) and speed (in mph).
suppressPackageStartupMessages(library(broom))
suppressPackageStartupMessages(library(modelr))
suppressPackageStartupMessages(library(randomForest))
suppressPackageStartupMessages(library(purrr))
suppressPackageStartupMessages(library(ggplot2))
data(cars)
We first take a look at the data. Distance and speed are correlated. Please don’t be surprised about the low speeds – the data is from the 1920s. The relationship might be linear, but maybe a quadratic or cubic fit describes it better.
Linear, quadratic and cubic models are created with lm(). For the linear model the formula is simply dist ~ speed. For the quadratic and cubic models the poly() function is used with the corresponding degree. For the random forest you only need to call randomForest() with the same formula.
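As a quick first check, we can inspect the structure of the data, compute the correlation and draw a scatter plot (a minimal sketch; the packages are loaded below):

```r
# Inspect the data: 50 observations, speed in mph, stopping distance in ft
str(cars)

# Strength of the linear relationship – about 0.81, a fairly strong positive correlation
cor(cars$speed, cars$dist)

# Scatter plot of distance vs. speed
library(ggplot2)
ggplot(cars, aes(x = speed, y = dist)) +
  geom_point()
```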
mod_lm <- lm(formula = dist ~ speed, data = cars)
mod_squared <- lm(formula = dist ~ poly(speed, 2), data = cars)
mod_cubic <- lm(formula = dist ~ poly(speed, 3), data = cars)
mod_rf <- randomForest(formula = dist ~ speed, data = cars)
Each model is significant. The cubic model has the highest R², but its information criteria (AIC, BIC) are slightly higher than those of the quadratic model, which by AIC fits best of the three.
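This comparison can be verified directly with the base functions AIC() and BIC(), applied to the three lm fits defined above (a sketch; lower values are better):

```r
# AIC: the quadratic model has the lowest value of the three
AIC(mod_lm, mod_squared, mod_cubic)

# BIC penalises the extra terms more strongly
BIC(mod_lm, mod_squared, mod_cubic)

# R squared grows with polynomial degree, as expected
summary(mod_cubic)$r.squared
```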
Package broom
We can take a look at model parameters with tidy(). The result is a dataframe.
mod_lm %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -17.6      6.76      -2.60 1.23e- 2
## 2 speed            3.93     0.416      9.46 1.49e-12
Model summary statistics can be extracted with glance(). Since we want to follow the DRY (don’t repeat yourself) principle, we don’t pipe each model separately into glance(). Instead I show you how to handle all models at once using purrr (please also see the purrr tutorial). All models are collected in a named list, which is piped into map() with glance as the function argument. Unfortunately, at this stage glance() does not accept a model of class “randomForest”. Still, with this approach we get a very good overview of all remaining models at once.
list(linear = mod_lm,
quadratic = mod_squared,
cubic = mod_cubic) %>%
map(.f = glance)
## $linear
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
## *     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.651         0.644  15.4      89.6 1.49e-12     2  -207.  419.  425.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
##
## $quadratic
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
## *     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.667         0.653  15.2      47.1 5.85e-12     3  -205.  419.  426.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
##
## $cubic
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
## *     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.673         0.652  15.2      31.6 3.07e-11     4  -205.  420.  429.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
Another useful function is augment(), which returns the raw data together with per-observation model results such as fitted values and residuals.
mod_lm %>%
augment() %>%
head()
## # A tibble: 6 x 9
##    dist speed .fitted .se.fit .resid   .hat .sigma  .cooksd .std.resid
##   <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
## 1     2     4   -1.85    5.21   3.85 0.115    15.5 0.00459       0.266
## 2    10     4   -1.85    5.21  11.8  0.115    15.4 0.0435        0.819
## 3     4     7    9.95    4.11  -5.95 0.0715   15.5 0.00620      -0.401
## 4    22     7    9.95    4.11  12.1  0.0715   15.4 0.0255        0.813
## 5    16     8   13.9     3.77   2.12 0.0600   15.5 0.000645      0.142
## 6    10     9   17.8     3.44  -7.81 0.0499   15.5 0.00713      -0.521
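Because augment() returns a plain dataframe, its output plugs straight into ggplot2 – for example a residuals-vs-fitted plot, reusing mod_lm from above (a sketch):

```r
# Residuals vs. fitted values; points should scatter around the zero line
mod_lm %>%
  augment() %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
```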
Package modelr
A very helpful function is data_grid(). It creates a dataframe with evenly spaced grid points. It is similar to the base function expand.grid() but less demanding, because it does not ask for details like the step size. The result is piped into spread_predictions(), which takes all models as arguments and adds one prediction column per model.
mod_grid_spread <- data_grid(cars, speed) %>%
spread_predictions(mod_lm, mod_squared, mod_cubic, mod_rf)
mod_grid_spread %>% head()
## # A tibble: 6 x 5
##   speed mod_lm mod_squared mod_cubic mod_rf
##   <dbl>  <dbl>       <dbl>     <dbl>  <dbl>
## 1     4  -1.85        7.72      2.76   7.75
## 2     7   9.95       13.8      14.5   12.5
## 3     8  13.9        16.2      17.8   13.9
## 4     9  17.8        18.8      20.9   14.6
## 5    10  21.7        21.6      23.8   25.1
## 6    11  25.7        24.6      26.6   23.0
Very similar is gather_predictions(). Like gather() from tidyr it reshapes columns into rows, which is sometimes easier for further processing, e.g. with ggplot().
mod_grid_gather <- data_grid(cars, speed, .model = mod_lm) %>%
gather_predictions(mod_lm, mod_squared, mod_cubic, mod_rf)
mod_grid_gather %>% head()
## # A tibble: 6 x 3
##   model  speed  pred
##   <chr>  <dbl> <dbl>
## 1 mod_lm     4 -1.85
## 2 mod_lm     7  9.95
## 3 mod_lm     8 13.9
## 4 mod_lm     9 17.8
## 5 mod_lm    10 21.7
## 6 mod_lm    11 25.7
Finally let’s take a look at the data in combination with model predictions.
g <- ggplot(data = cars, aes(x = speed, y = dist))
g <- g + geom_point()
g <- g + geom_line(data = mod_grid_gather, aes(x = speed,
y = pred,
color = as.factor(model)),
size = 2)
g <- g + xlab("Speed [miles/hour]")
g <- g + ylab("Distance [ft]")
g <- g + ggtitle("Model Predictions")
g <- g + coord_cartesian(ylim = c(0, 100))
g
Bibliography
- Package broom https://cran.r-project.org/web/packages/broom/index.html
- Package modelr https://cran.r-project.org/web/packages/modelr/index.html