Enhance your models with broom and modelr

  • broom is a helpful package for transforming models into dataframes for further analysis.
  • modelr has some useful functions for working with models.

Objectives: learn to work with the packages broom and modelr, fit linear, quadratic, cubic and randomForest models, and use a bit of map() from the purrr package

Requirements: familiarity with piping; experience with dplyr might help

Data Preparation

We will use the packages broom and modelr to enhance our models. The randomForest package is used to create one of the models, purrr is applied to analyse several models at once, and ggplot2 is used for data visualisation. Please make sure these packages are installed before loading them. The linear model is included in stats, which is preloaded.
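
If one of these packages is missing, it can be installed from CRAN first; a minimal one-liner would be:

# one-time installation from CRAN
install.packages(c("broom", "modelr", "randomForest", "purrr", "ggplot2"))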

We will use the cars data set. It has only two variables: the stopping distance dist and speed.

suppressPackageStartupMessages(library(broom))
suppressPackageStartupMessages(library(modelr))
suppressPackageStartupMessages(library(randomForest))
suppressPackageStartupMessages(library(purrr))
suppressPackageStartupMessages(library(ggplot2))

data(cars)
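
A quick look at the structure and the first rows confirms that cars really has only these two variables (a small sketch; output omitted):

# inspect the structure and the first rows of the data
str(cars)
head(cars)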

Next we take a look at the data graphically. Distance and speed are clearly correlated. Please don’t be surprised about the low speeds: the data is from the 1920s. The relationship might be linear, but maybe a quadratic or cubic fit describes it better.
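
A simple scatterplot is enough for this first look; a minimal sketch with ggplot2:

# scatterplot of stopping distance against speed
ggplot(data = cars, aes(x = speed, y = dist)) +
    geom_point() +
    xlab("Speed [miles/hour]") +
    ylab("Distance [ft]")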

[Figure: scatterplot of stopping distance dist against speed]

The linear, quadratic and cubic models are created with lm(). For the linear model the formula is just dist ~ speed; for the quadratic and cubic models the poly() function is used with the corresponding degree. For the random forest you only need to call the randomForest() function.

mod_lm <- lm(formula = dist ~ speed, data = cars)
mod_squared <- lm(formula = dist ~ poly(speed, 2), data = cars)
mod_cubic <- lm(formula = dist ~ poly(speed, 3), data = cars)
mod_rf <- randomForest(formula = dist ~ speed, data = cars)

Each model is significant. The cubic model has the highest r.squared, but its information criteria (AIC, BIC) are slightly higher than those of the quadratic model. Both outperform the linear model.
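
These figures are shown with glance() in the next section. For a quick check straight away, base R’s AIC() and BIC() also work directly on the lm objects; a minimal sketch:

# compare information criteria of the three lm fits (lower is better)
AIC(mod_lm, mod_squared, mod_cubic)
BIC(mod_lm, mod_squared, mod_cubic)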

Package broom

We can take a look at the model parameters with tidy(). The result is a dataframe.

mod_lm %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -17.6      6.76      -2.60 1.23e- 2
## 2 speed           3.93     0.416      9.46 1.49e-12
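
For lm objects tidy() also accepts further arguments; for example, confidence intervals for the coefficients can be requested with conf.int (output omitted):

# add 95 % confidence intervals for the coefficient estimates
mod_lm %>% tidy(conf.int = TRUE)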

Model summaries can be extracted with glance(). Since we want to follow the DRY (don’t repeat yourself) principle, we don’t pipe each model into glance() on its own. Instead I show you how to handle all models at once. For this you can use purrr (please also see the purrr tutorial). All models are collected in a list, which is piped into map() with glance as the function to apply. Unfortunately, at this stage glance() does not accept a model of class “randomForest”. Still, with this approach we get a very good overview of all the other models at once.

list(linear = mod_lm, 
     quadratic = mod_squared, 
     cubic = mod_cubic) %>% 
    map(glance)
## $linear
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
## *     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.651         0.644  15.4      89.6 1.49e-12     2  -207.  419.  425.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
## 
## $quadratic
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
## *     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.667         0.653  15.2      47.1 5.85e-12     3  -205.  419.  426.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
## 
## $cubic
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
## *     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
## 1     0.673         0.652  15.2      31.6 3.07e-11     4  -205.  420.  429.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
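
If you prefer a single dataframe instead of a list, the same call can be wrapped with map_dfr(), which binds the rows and stores the list names in an extra column (it needs dplyr installed for the row binding); a small sketch, with the same columns as above plus a model column:

# combine all glance() results into one dataframe, one row per model
list(linear = mod_lm, 
     quadratic = mod_squared, 
     cubic = mod_cubic) %>% 
    map_dfr(glance, .id = "model")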

Another useful function is augment(), which returns the raw data together with per-observation model results such as fitted values and residuals.

mod_lm %>% 
    augment() %>% 
    head()
## # A tibble: 6 x 9
##    dist speed .fitted .se.fit .resid   .hat .sigma  .cooksd .std.resid
##   <dbl> <dbl>   <dbl>   <dbl>  <dbl>  <dbl>  <dbl>    <dbl>      <dbl>
## 1     2     4   -1.85    5.21   3.85 0.115    15.5 0.00459       0.266
## 2    10     4   -1.85    5.21  11.8  0.115    15.4 0.0435        0.819
## 3     4     7    9.95    4.11  -5.95 0.0715   15.5 0.00620      -0.401
## 4    22     7    9.95    4.11  12.1  0.0715   15.4 0.0255        0.813
## 5    16     8   13.9     3.77   2.12 0.0600   15.5 0.000645      0.142
## 6    10     9   17.8     3.44  -7.81 0.0499   15.5 0.00713      -0.521
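
Because augment() returns residuals and fitted values alongside the raw data, a residual plot is only one pipe away; a minimal sketch:

# residuals vs. fitted values of the linear model
mod_lm %>% 
    augment() %>% 
    ggplot(aes(x = .fitted, y = .resid)) +
    geom_point() +
    geom_hline(yintercept = 0, linetype = "dashed")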

Package modelr

A very helpful function is data_grid(). It creates a dataframe with evenly spaced grid points. It is similar to the base function expand.grid(), but less demanding, because it does not ask for details like step size. The result is piped into spread_predictions(). This function takes all models as arguments and adds one column of predictions per model.

mod_grid_spread <- data_grid(cars, speed) %>% 
    spread_predictions(mod_lm, mod_squared, mod_cubic, mod_rf) 
mod_grid_spread %>% head()
## # A tibble: 6 x 5
##   speed mod_lm mod_squared mod_cubic mod_rf
##   <dbl>  <dbl>       <dbl>     <dbl>  <dbl>
## 1     4  -1.85        7.72      2.76   7.75
## 2     7   9.95       13.8      14.5   12.5 
## 3     8  13.9        16.2      17.8   13.9 
## 4     9  17.8        18.8      20.9   14.6 
## 5    10  21.7        21.6      23.8   25.1 
## 6    11  25.7        24.6      26.6   23.0

Very similar is gather_predictions(). Like gather() from tidyr, it reshapes columns into rows, which is sometimes easier for further processing, e.g. with ggplot().

mod_grid_gather <- data_grid(cars, speed, .model = mod_lm) %>% 
    gather_predictions(mod_lm, mod_squared, mod_cubic, mod_rf)
mod_grid_gather %>% head()
## # A tibble: 6 x 3
##   model  speed   pred
##   <chr>  <dbl>  <dbl>
## 1 mod_lm     4  -1.85
## 2 mod_lm     7   9.95
## 3 mod_lm     8  13.9 
## 4 mod_lm     9  17.8 
## 5 mod_lm    10  21.7 
## 6 mod_lm    11  25.7

Finally, let’s take a look at the data in combination with the model predictions.

g <- ggplot(data = cars, aes(x = speed, y = dist))
g <- g + geom_point()
g <- g + geom_line(data = mod_grid_gather, aes(x = speed, 
                           y = pred, 
                           color = as.factor(model)),
           size = 2)
g <- g + xlab("Speed [miles/hour]")
g <- g + ylab("Distance [ft]")
g <- g + ggtitle("Model Predictions")
g <- g + coord_cartesian(ylim = c(0, 100))
g

[Figure: Model Predictions, scatterplot of the data with one prediction line per model]
