H2O is an open-source platform for machine learning. It can be installed locally and offers an API for R. This lecture gives you an introduction on how to use it.
Objectives: Installation and Usage of H2O
Requirements: R Data-Mining
Data Preparation
If you want the most recent version, go to the H2O website and follow the instructions there. You can also install the package from CRAN, but that might be an older version.
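For a quick start the CRAN release is usually sufficient. A minimal sketch (run once, before loading the package; the website route described above may give you a newer build):

install.packages("h2o")  # CRAN release; may lag behind the version offered on the H2O website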
We will use the mushroom dataset. It is taken from the UCI Machine Learning Repository and covers 8124 observations of mushrooms and their properties. The target variable is “edibility”, and there are 22 attributes, all of them categorical. More details can be found on the UCI homepage (see link at the end of the article). Our task is to create a model that predicts edibility.
suppressPackageStartupMessages(library(h2o))

# read the raw data directly from the UCI repository; the file has no header row
url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
mushrooms <- read.csv(file = url, header = F)
colnames(mushrooms) <- c("edibility", "cap_shape", "cap_surface", "cap_color", "bruises",
                         "odor", "gill_att", "gill_spacing", "gill_size", "gill_color",
                         "stalk_shape", "stalk_root", "stalk_surf_above", "stalk_surf_below",
                         "stalk_color_above", "stalk_color_below", "veil_type", "veil_color",
                         "ring_nr", "ring_type", "spore_print_color", "population", "habitat")
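As a quick plausibility check before handing the data to H2O, we can confirm the number of observations and the class balance with plain base R (optional, no H2O involved yet):

dim(mushrooms)               # should be 8124 rows and 23 columns (target + 22 attributes)
table(mushrooms$edibility)   # counts of edible (e) and poisonous (p) mushrooms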
H2O Connection
We initialize H2O and start a connection to the local server. Then the data frame is uploaded to H2O.
h2o.init()
## Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         4 minutes 31 seconds 
##     H2O cluster timezone:       Europe/Berlin 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.22.1.1 
##     H2O cluster version age:    1 month and 23 days 
##     H2O cluster name:           H2O_started_from_R_bertg_cpz431 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.23 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.5.1 (2018-07-02)
h2o.getConnection()
##  IP Address: localhost 
##  Port      : 54321 
##  Name      : NA
##  Session ID: _sid_b55f 
##  Key Count : 0
mushroom_hex <- as.h2o(mushrooms)
##   |                                                                 |   0%
##   |=================================================================| 100%
Let’s take a look at the variable “mushroom_hex”. It is of type “H2OFrame”.
class(mushroom_hex)
## [1] "H2OFrame"
Model Preparation
We use the test set method. First, we create as many random numbers as there are rows; they range from 0 to 1. We reserve 40 % of the data for testing and take 60 % for training. With h2o.assign() the training and test data are assigned names in H2O. With h2o.table() the numbers of edible and poisonous mushrooms are shown. Both classes are similarly balanced in the training and test data.
random_nr <- h2o.runif(mushroom_hex, seed = 3456)
mushroom_train <- mushroom_hex[random_nr <= 0.6, ]
mushroom_train <- h2o.assign(mushroom_train, key = "train_data")
mushroom_test <- mushroom_hex[random_nr > 0.6, ]
mushroom_test <- h2o.assign(mushroom_test, key = "test_data")
h2o.table(mushroom_train[, 1])
##   edibility Count
## 1         e  2548
## 2         p  2339
## 
## [2 rows x 2 columns]
h2o.table(mushroom_test[, 1])
##   edibility Count
## 1         e  1660
## 2         p  1577
## 
## [2 rows x 2 columns]
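As an aside, the same split can also be done with h2o.splitFrame(), which saves the helper vector of random numbers. A minimal sketch, assuming the argument names of current h2o releases:

splits <- h2o.splitFrame(mushroom_hex, ratios = 0.6, seed = 3456)
mushroom_train2 <- splits[[1]]  # roughly 60 % of the rows
mushroom_test2  <- splits[[2]]  # remaining rows for testing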
Model Creation
Gradient boosted classification trees are created with h2o.gbm(). Its parameters are x (the attribute columns), y (the target variable column), the training frame, and the validation frame (here the test data).
fit_gbm_h2o <- h2o.gbm(x = 2:23, y = 1,
training_frame = mushroom_train,
validation_frame = mushroom_test)
##   |                                                                 |   0%
##   |=====                                                            |   8%
##   |=================================================================| 100%
Let’s take a look at the validation performance metrics. The model performs perfectly: accuracy is 100 %.
fit_gbm_h2o@model$validation_metrics
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
## 
## MSE:  1.280867e-05
## RMSE:  0.003578921
## LogLoss:  0.003369485
## Mean Per-Class Error:  0
## AUC:  1
## pr_auc:  0.9974635
## Gini:  1
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           e    p    Error     Rate
## e      1660    0 0.000000  =0/1660
## p         0 1577 0.000000  =0/1577
## Totals 1660 1577 0.000000  =0/3237
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.951888 1.000000   5
## 2                       max f2  0.951888 1.000000   5
## 3                 max f0point5  0.951888 1.000000   5
## 4                 max accuracy  0.951888 1.000000   5
## 5                max precision  0.999142 1.000000   0
## 6                   max recall  0.951888 1.000000   5
## 7              max specificity  0.999142 1.000000   0
## 8             max absolute_mcc  0.951888 1.000000   5
## 9   max min_per_class_accuracy  0.951888 1.000000   5
## 10 max mean_per_class_accuracy  0.951888 1.000000   5
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
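The same metrics can also be pulled with helper functions, and the contribution of the single attributes can be inspected. A minimal sketch (function names as in recent h2o releases; treat them as an assumption for older versions):

h2o.performance(fit_gbm_h2o, valid = TRUE)  # validation metrics, same object as above
h2o.auc(fit_gbm_h2o, valid = TRUE)          # just the AUC, reported as 1 above
h2o.varimp(fit_gbm_h2o)                     # variable importance of the attributes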
Model Predictions
H2O’s h2o.predict() works like the base R predict() function. Its parameters are the model and the data. The result is an “H2OFrame”, which can be converted to an R data frame. From this, a confusion matrix is created.
mushroom_predict <- h2o.predict(fit_gbm_h2o, newdata = mushroom_test)
##   |                                                                 |   0%
##   |=================================================================| 100%
mushroom_predict <- as.data.frame(mushroom_predict)
actual <- as.data.frame(mushroom_test)
actual <- as.factor(actual$edibility)
prediction <- mushroom_predict$predict
table(actual, prediction)
##       prediction
## actual    e    p
##      e 1660    0
##      p    0 1577
This is the same result as shown before. There is much more to discover and learn; this is just a kick-off and should help you get started.
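When you are done, the local H2O instance can be shut down from within R. Note that this stops the cluster and discards all frames and models held in its memory:

h2o.shutdown(prompt = FALSE)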
More Information
- H2O Online http://www.h2o.ai/
- Mushroom Data Set http://archive.ics.uci.edu/ml/datasets/Mushroom