5) Loading Data into H2O with R

This Blog entry is from the Deep Learning section in Learn R.

Start by loading the FraudRisk.csv file into R using readr:

library(readr)
FraudRisk <- read_csv("C:/Users/Richard/Desktop/Bundle/Data/FraudRisk/FraudRisk.csv")
1.png

Run the block of script to console:

2.png

The training process will make use of a training dataset and a cross-validation dataset. The preferred method to randomly split a data frame is to create a vector of random values, then append this vector to the data frame.  Using vector subsetting, the data frame is then split based on the random value.

Start by observing the number of records in the data frame, taking the length of any one of its columns:

length(FraudRisk$Dependent)
3.png

Run the line of script to console:

4.png

Having established that the data frame has 1827 records, use this value to create a vector of the same size containing random values between 0 and 1.  The runif function is used to create vectors of a prescribed length with random values drawn uniformly from a given range:

RandomDigit <- runif(1827,0,1)
5.png

Run the line of script to console:

6.png

A vector of random digits, of the same length as the data frame, has been created.  Validate the vector by typing:

RandomDigit
7.png

Run the line of script to console:

8.png

The random digits are written out, showing that values between 0 and 1 have been created on a random basis with a high degree of precision.  Append this vector to the data frame using the mutate function from dplyr:

library(dplyr)
FraudRisk <- mutate(FraudRisk,RandomDigit)
9.png

Run the block of script to console:

10.png

The RandomDigit vector is now appended to the FraudRisk data frame and can be used in subsetting and splitting.  Create the cross-validation dataset by filtering on RandomDigit and assigning the result to a new data frame:

CV <- filter(FraudRisk,RandomDigit < 0.2)
11.png

Run the line of script to console:

12.png

A new data frame by the name of CV has been created.  Observe the CV data frame length:

length(CV$Dependent)
13.png

Run the line of script to console:

14.png

It can be seen that the data frame has 386 records, which is broadly 20% of the FraudRisk data frame's records.  The task remains to create the training dataset, which is done in the same way, albeit subsetting on the opposing, larger range of the random digit:

Training <- filter(FraudRisk,RandomDigit >= 0.2)
15.png

Run the line of script to console:

16.png

Validate the length of the Training data frame:

length(Training$Dependent)
17.png

Run the line of script to console:

18.png

It can be observed that the Training dataset is 1463 records in length, which is broadly 80% of the file.  So as not to accidentally use the RandomDigit vector in training, drop it from the Training and CV data frames:

CV$RandomDigit <- NULL
Training$RandomDigit <- NULL
19.png

Run the block of script to console:

20.png
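The splitting steps above can be condensed into a few lines. The following is a minimal sketch rather than the method used in this entry: it assumes a toy data frame (Example, with hypothetical columns), swaps base R subsetting in for dplyr, sizes the random vector with nrow instead of a typed value, and sets a seed so that the split can be reproduced.

```r
# Toy stand-in for FraudRisk; the column names are hypothetical.
set.seed(42)  # assumption: a fixed seed makes the split repeatable
Example <- data.frame(
  Dependent = rbinom(1000, 1, 0.1),
  Amount = rnorm(1000)
)

# One random value per row, sized from the data rather than typed by hand.
Example$RandomDigit <- runif(nrow(Example), 0, 1)

# Rows below 0.2 go to cross-validation, all other rows go to training.
# The RandomDigit column is dropped in the same step.
InCV <- Example$RandomDigit < 0.2
KeepColumns <- names(Example) != "RandomDigit"
CVExample <- Example[InCV, KeepColumns]
TrainingExample <- Example[!InCV, KeepColumns]

# Every record lands in exactly one of the two data frames.
nrow(CVExample) + nrow(TrainingExample) == nrow(Example)
```

Because both data frames are derived from the same logical vector, the two subsets cannot overlap and their record counts always sum to the size of the original data frame.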

H2O requires that the dependent variable is a factor, as this is a classification problem.  Convert the dependent variable to a factor in both the training and cross-validation datasets:

21.png

Run the line of script to console:

22.png
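The conversion script itself survives only as a screenshot, so the following is a hedged sketch of what such a conversion typically looks like, using the base R as.factor function on two toy data frames (ToyTraining and ToyCV are hypothetical names, as is the Amount column).

```r
# Toy stand-ins for the Training and CV data frames.
ToyTraining <- data.frame(Dependent = c(0, 1, 0, 1), Amount = c(10, 250, 35, 900))
ToyCV <- data.frame(Dependent = c(1, 0), Amount = c(400, 20))

# Replace the numeric column with a factor of the same values.
ToyTraining$Dependent <- as.factor(ToyTraining$Dependent)
ToyCV$Dependent <- as.factor(ToyCV$Dependent)

# Confirm the conversion and inspect the class labels.
is.factor(ToyTraining$Dependent)
levels(ToyTraining$Dependent)

# Applied to this entry's data frames, the same pattern would read:
#   Training$Dependent <- as.factor(Training$Dependent)
#   CV$Dependent <- as.factor(CV$Dependent)
```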

At this stage, there exists a randomly selected training dataset as well as a randomly selected cross-validation dataset.  Keep in mind that H2O requires that the data frame is converted to the native hex format, achieved through the creation of a parsed data object for each dataset. Think of this process as the loading of data into the H2O server, more so than a conversion to hex:

Training.hex <- as.h2o(Training)
CV.hex <- as.h2o(CV)
23.png

Run the block of script to console:

24.png
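Once a frame has been loaded, it can be worth confirming what the server actually holds. The sketch below assumes that the h2o package is installed and that a local server can be started, and it uses a toy data frame (ToyTraining, hypothetical) in place of the real Training data.

```r
library(h2o)
h2o.init()  # connects to a running local server, or starts one

# Toy stand-in for the Training data frame.
ToyTraining <- data.frame(
  Dependent = as.factor(c(0, 1, 0, 1, 0, 1)),
  Amount = c(10, 250, 35, 900, 15, 700)
)

# Load the data frame into the H2O server.
ToyTraining.hex <- as.h2o(ToyTraining)

# Row count as held on the server, not in R.
h2o.nrow(ToyTraining.hex)

# Check that the factor survived the transfer as a categorical column.
h2o.isfactor(ToyTraining.hex$Dependent)

# Per-column summary of types, missing values and ranges.
h2o.describe(ToyTraining.hex)
```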

All models that are available to be trained via the Flow interface are available via the R interface, with the hex files being ready to be passed as parameters.
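As an illustration of passing hex frames as parameters, the following sketch trains a logistic regression with h2o.glm. It is a minimal example under stated assumptions: the data is simulated, the column names (Amount, Velocity) are hypothetical, and the frame is split with h2o.splitFrame purely to keep the example short.

```r
library(h2o)
h2o.init()

# Simulated data with a binary dependent variable (hypothetical columns).
set.seed(1)
n <- 500
Toy <- data.frame(Amount = rnorm(n), Velocity = rnorm(n))
Toy$Dependent <- as.factor(rbinom(n, 1, plogis(1.5 * Toy$Amount - Toy$Velocity)))

# Load into H2O and split into training and validation frames.
Toy.hex <- as.h2o(Toy)
Parts <- h2o.splitFrame(Toy.hex, ratios = 0.8, seed = 1)

# family = "binomial" makes the generalised linear model a logistic regression.
Model <- h2o.glm(
  x = c("Amount", "Velocity"),
  y = "Dependent",
  training_frame = Parts[[1]],
  validation_frame = Parts[[2]],
  family = "binomial"
)

# Area under the ROC curve on the validation frame.
h2o.auc(Model, valid = TRUE)
```

With the data frames from this entry, Training.hex and CV.hex would take the place of Parts[[1]] and Parts[[2]], and the x argument would list the real independent variables.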

4) Recalling a Logistic Regression model with Flow

To recall this logistic regression model from Flow, navigate to:

Scores >>> Predict

1.png

The predict cell will be added to the Flow:

2.png

Recalling a model would ordinarily assume that a new frame has been created in Flow, but for this example the validation frame will be scored using the trained logistic regression model.  First, select the model to recall:

3.png

Thereafter, select the data frame to process through the model:

4.png

Upon selecting the input parameters, click the predict button to complete the prediction.  A cell detailing the output will be created:

5.png

It is sensible at this stage to combine the predictions with the original dataset.  To do so, click the Combine Predictions with Frame button:

6.png

Upon combining the predictions with the original dataset, the dataset will be available for download:

7.png

To interact with the newly created data frame click on the View Frame button:

8.png

The View Frame functionality provides for the downloading and further manipulation of the data frame:

9.png

The process thus far uses the Flow user interface to create something akin to a script, where it is the Flow tool that sends instructions to the H2O API.  It would be far less cumbersome to use R scripting to achieve such flows.
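By way of comparison, the predict and combine steps might look as follows in R. This is a sketch on simulated data with hypothetical column names, not a reproduction of the Flow session above; it relies on h2o.predict to score a frame and h2o.cbind to join the predictions to it.

```r
library(h2o)
h2o.init()

# Simulated data and a small logistic regression to recall (hypothetical columns).
set.seed(2)
n <- 400
Toy <- data.frame(Amount = rnorm(n), Velocity = rnorm(n))
Toy$Dependent <- as.factor(rbinom(n, 1, plogis(Toy$Amount)))
Parts <- h2o.splitFrame(as.h2o(Toy), ratios = 0.75, seed = 2)
Train.hex <- Parts[[1]]
Valid.hex <- Parts[[2]]

Model <- h2o.glm(
  x = c("Amount", "Velocity"),
  y = "Dependent",
  training_frame = Train.hex,
  family = "binomial"
)

# Score the validation frame, the equivalent of the Predict cell.
Predictions <- h2o.predict(Model, newdata = Valid.hex)

# Join the predictions to the scored frame, column-wise.
Combined <- h2o.cbind(Valid.hex, Predictions)

# Pull the combined frame back into R for download or further work.
CombinedLocal <- as.data.frame(Combined)
head(CombinedLocal)
```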

3) Creating a Logistic Regression model in H2O (GLM)

With the data loaded, a model now needs to be trained.  Navigate to Models to see the available algorithms:

Models

1.png

In this case, the algorithm is Generalised Linear Modelling (GLM), of which logistic regression is one form.  Click this model to create the cell in Flow:

2.png

There are a multitude of parameters that are quite outside the scope of this document.  For the purposes of this document, simply specify the training and validation hex frames:

3.png

Thereafter, specify the dependent variable, known as the Response Column in H2O:

4.png

In this case, the dependent variable column is itself named Dependent:

5.png

Scroll to the base of the cell and click Build Model to initiate the training process:

6.png

The training process will begin with progress being written out to a newly created job cell:

7.png

At this stage a logistic regression model has been created. It is a good idea to save the flow by navigating to:

Flow >>> Save Flow

8.png

2) Loading Data into H2O with Flow

In this Blog entry the data for a logistic regression model will be loaded using Flow, with the aim of achieving the same results as the GLM functions of R and Exhaustive.

In the Flow user interface, start by navigating:

Flow >>> New Flow

1.png

If prompted to create a new workbook, affirm this:

2.png

To add a cell for the importing of data, navigate to:

Data >>> Import Files

3.png

It can be seen that an Import Files cell has been added to the Flow:

4.png

In the Search dialog box, enter the location of the FraudRisk.csv file until a drop down is populated, for example:

5.png

Click on the Search Icon to bring back the contents of this directory:

6.png

Click on the file or plus sign to add the file to the cell:

7.png

Click the Import Button to import the file to H2O:

8.png

Note that the file has not yet been parsed into the H2O column-compressed format, known as hex.  To achieve parsing, simply click the button titled 'Parse These Files':

9.png

The next screen allows the parse settings and column data types to be configured in more detail.  In this example, a cursory check to ensure that the data types are correct is sufficient:

10.png

Once satisfied, click Parse to mount the dataset in H2O as hex:

11.png

A background job will start the process of transforming the data from FraudRisk.csv to the H2O hex format:

12.png

H2O has robust support for training and validation datasets, hence the hex frame needs to be split into training and validation.  To split a hex frame, navigate to:

Data >>> Split Frame

13.png

Click on the menu item to create the split data frame cell:

14.png

Select the frame to be split, in this case FraudRisk.hex:

15.png

The default frame split is 75% by 25%.  Confirm this by clicking the Create button:

16.png

There now exist two frames in the flow, the smaller of which will be used for validation:

17.png
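For readers who prefer scripting, the import, parse and split steps above have rough equivalents in the h2o R package. The sketch below is illustrative only: it writes a small temporary CSV (hypothetical columns) so that it can run without FraudRisk.csv, and it assumes a local H2O server that can read the local file system.

```r
library(h2o)
h2o.init()

# Write a small CSV to a temporary location to stand in for FraudRisk.csv.
set.seed(3)
CsvPath <- tempfile(fileext = ".csv")
write.csv(
  data.frame(Dependent = rbinom(200, 1, 0.2), Amount = rnorm(200)),
  CsvPath,
  row.names = FALSE
)

# Import and parse in one step, the counterpart of Import Files and Parse.
Example.hex <- h2o.importFile(path = CsvPath, destination_frame = "Example.hex")

# Split 75% by 25%, the counterpart of the Split Frame cell.
Parts <- h2o.splitFrame(Example.hex, ratios = 0.75, seed = 1234)

# Record counts of the two resulting frames.
h2o.nrow(Parts[[1]])
h2o.nrow(Parts[[2]])
```

Note that h2o.splitFrame assigns rows probabilistically, so the two frames will be close to, but not necessarily exactly, 75% and 25% of the records.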

1) Install H2O package, instantiate and browse to the Flow User Interface

Even though H2O is server software and runs externally to R, it can be installed and initialised from within R.  Installing the entire H2O server is no more complex than installing any other R package.

To install H2O, use RStudio and begin by installing the h2o package:

1.png

Wait for the installation to complete.  This will take a little longer than for most packages, as the package is large:

2.png
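The installation can also be scripted rather than performed through the RStudio package dialog. The following is a minimal sketch that assumes installation from the default CRAN repository and skips the download if the package is already present.

```r
# Install the h2o package only if it is not already available.
if (!requireNamespace("h2o", quietly = TRUE)) {
  install.packages("h2o")
}

# Report the installed version as a quick confirmation.
packageVersion("h2o")
```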

Load the H2O package by typing:

library(h2o)
3.png

Run the line of script to console:

4.png

The H2O server needs to be started externally, but this can be achieved through a helper function available in the h2o package.  To start the H2O server, use the h2o.init function with the default parameters (i.e. no parameters):

H2oServer <- h2o.init()
5.png

Run the line of script to console and wait for confirmation that the H2O server has been started externally to R:

6.png

The H2O server acts as a web server which serves up the Flow application. To navigate to the Flow application, open a browser such as Chrome:

7.png

Navigate to the URL:

http://localhost:54321

8.png

The H2O server is now installed and available for use via the Flow user interface, API or R commands.
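The server state can also be checked from R without opening a browser. The sketch below spells out some h2o.init arguments for illustration; the thread and memory settings shown are assumptions rather than recommendations, and the address and port simply mirror the URL above.

```r
library(h2o)

# Start (or connect to) a local server with explicit settings.
h2o.init(
  ip = "localhost",
  port = 54321,          # the port used in the Flow URL above
  nthreads = -1,         # assumption: use all available cores
  max_mem_size = "2G"    # assumption: a modest memory ceiling
)

# TRUE when the server is reachable from this R session.
h2o.clusterIsUp()

# Print version, node count, memory and other cluster details.
h2o.clusterInfo()

# When finished, the server can be stopped from R:
#   h2o.shutdown(prompt = FALSE)
```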