7) Recalling a Neural Network with R

This Blog entry is from the Deep Learning section in Learn R.

Once a model is trained in H2O it can be recalled very gracefully with the h2o.predict() function of the H2O package. It is a simple matter of passing the trained model and the hex dataframe to be used for recall:

Scores <- h2o.predict(Model, CV.hex)

Run the line of script to console:


A progress bar is broadcast from the H2O server and will be written out to the console.  To review the output, enter the object:

Scores

Run the line of script to console:


The Scores output appears similar to a matrix, but its first column is a vector detailing the actual prediction for each record; hence, it can be subset to a final vector detailing the predictions:

Predict <- Scores[1]

Run the line of script to console:


The Predict vector can be compared to the Dependent vector of the CV dataframe in the same manner as previous models within R to obtain Confusion Matrices as well as ROC curves.
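As an illustration, a minimal sketch of such a comparison, assuming the CV data frame and the CV.hex frame from the data-loading entry remain in scope (predict is the name H2O gives to the prediction column):

Predictions <- as.data.frame(Predict)                          # pull the predictions back from the H2O cluster
table(Actual = CV$Dependent, Predicted = Predictions$predict)  # a simple confusion matrix
Perf <- h2o.performance(Model, newdata = CV.hex)               # performance metrics on the validation frame
h2o.auc(Perf)                                                  # area under the ROC curve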

6) Creating a Neural Network with R

This Blog entry is from the Deep Learning section in Learn R.

Although all of the work is offloaded to H2O, the instruction to train a model looks a lot like previous examples where a variety of R packages have been used.  In this example the h2o.deeplearning() function of the H2O package is going to be used (this is really the only reason that we are using H2O in the first place).

In order to make the command easier to understand, the parameters will be specified by name, with the following values:

x: c("Count_Transactions_1_Day","Authenticated","Count_Transactions_PIN_Decline_1_Day","Count_Transactions_Declined_1_Day","Count_Unsafe_Terminals_1_Day","Count_In_Person_1_Day","Count_Internet_1_Day","ATM","Count_ATM_1_Day","Count_Over_30_SEK_1_Day","In_Person","Transaction_Amt","Sum_Transactions_1_Day","Sum_ATM_Transactions_1_Day","Foreign","Different_Country_Transactions_1_Week","Different_Merchant_Types_1_Week","Different_Decline_Reasons_1_Day","Different_Cities_1_Week","Count_Same_Merchant_Used_Before_1_Week","Has_Been_Abroad","Cash_Transaction","High_Risk_Country")
y: c("Dependent")
training_frame: Training.hex
validation_frame: CV.hex
standardize: TRUE
activation: Rectifier
epochs: 50
seed: 12345
hidden: 5
variable_importances: TRUE
nfolds: 5
adaptive_rate: FALSE

The h2o.deeplearning() function takes two string vectors that name the dependent and independent variables.  For readability, create these string vectors to be passed to the function in advance, rather than using the c() function inside the function call.  To create the list of eligible independent variables for the purposes of this example, enter:

x <- c("Count_Transactions_1_Day","Authenticated","Count_Transactions_PIN_Decline_1_Day","Count_Transactions_Declined_1_Day","Count_Unsafe_Terminals_1_Day","Count_In_Person_1_Day","Count_Internet_1_Day","ATM","Count_ATM_1_Day","Count_Over_30_SEK_1_Day","In_Person","Transaction_Amt","Sum_Transactions_1_Day","Sum_ATM_Transactions_1_Day","Foreign","Different_Country_Transactions_1_Week","Different_Merchant_Types_1_Week","Different_Decline_Reasons_1_Day","Different_Cities_1_Week","Count_Same_Merchant_Used_Before_1_Week","Has_Been_Abroad","Cash_Transaction","High_Risk_Country")

Run the line of script to console:


To instruct H2O to begin deep learning, enter:

Model <- h2o.deeplearning(x = x, y = "Dependent", training_frame = Training.hex, validation_frame = CV.hex, activation = "Rectifier", epochs = 50, seed = 12345, hidden = 5, variable_importances = TRUE, nfolds = 5, adaptive_rate = FALSE, standardize = TRUE)

Run the line of script to console:


Feedback from the H2O cluster will be received, detailing training progress.
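Once training completes, the model can be inspected from R; a minimal sketch (h2o.varimp() returns the importances requested by setting variable_importances to TRUE):

summary(Model)      # training, validation and cross-validation metrics
h2o.varimp(Model)   # the variable importance table requested at training time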

5) Loading Data into H2O with R

This Blog entry is from the Deep Learning section in Learn R.

Start by loading the FraudRisk.csv file into R using readr:

library(readr)
FraudRisk <- read_csv("C:/Users/Richard/Desktop/Bundle/Data/FraudRisk/FraudRisk.csv")

Run the block of script to console:


The training process will make use of a training dataset and a cross-validation dataset. The preferred method to randomly split a dataframe is to create a vector comprising random values, then append this vector to the dataframe.  Using vector subsetting, the data frame can then be split based on the random value.

Start by observing the length of the dataframe by typing (any column of the dataframe will do):

length(FraudRisk$Dependent)

Run the line of script to console:


Having established that the dataframe has 1827 records, use this value to create a vector of the same size containing random values between 0 and 1.  The runif() function is used to create vectors of a prescribed length with random values within a given range:

RandomDigit <- runif(1827,0,1)

Run the line of script to console:

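As an aside, for a reproducible split the random number generator can be seeded before the draw; a minimal sketch (the seed value here is arbitrary):

set.seed(12345)                   # fix the random number generator so the split can be reproduced
RandomDigit <- runif(1827, 0, 1)  # the same draw as above, now repeatable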

A vector containing random digits, of the same length as the dataframe, has been created.  Validate the vector by typing:

RandomDigit

Run the line of script to console:


The random digits are written out, showing values created at random between 0 and 1 with a high degree of precision.  Append this vector to the dataframe using dplyr's mutate() function:

library(dplyr)
FraudRisk <- mutate(FraudRisk,RandomDigit)

Run the block of script to console:


The RandomDigit vector is now appended to the FraudRisk dataframe and can be used in subsetting and splitting.  Create the cross-validation dataset with a filter, assigning the result to a new data frame:

CV <- filter(FraudRisk,RandomDigit < 0.2)

Run the line of script to console:


A new data frame by the name of CV has been created.  Observe the CV data frame length:

length(CV$Dependent)

Run the line of script to console:


It can be seen that the data frame has 386 records, which is broadly 20% of the FraudRisk data frame's records.  The task remains to create the training dataset, which is similar, albeit subsetting on the larger, opposing random digit filter:

Training <- filter(FraudRisk,RandomDigit >= 0.2)

Run the line of script to console:


Validate the length of the Training data frame:

length(Training$Dependent)

Run the line of script to console:


It can be observed that the Training dataset is 1463 records in length, which is broadly 80% of the file.  So as not to accidentally use the RandomDigit vector in training, drop it from the Training and CV data frames:

CV$RandomDigit <- NULL
Training$RandomDigit <- NULL

Run the block of script to console:


H2O requires that the Dependent variable is a factor; it is, after all, a classification problem.  Convert the dependent variable to a factor for the training and cross-validation datasets:
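A minimal sketch of the conversion, using base R's as.factor():

Training$Dependent <- as.factor(Training$Dependent)  # dependent variable as a factor for training
CV$Dependent <- as.factor(CV$Dependent)              # and likewise for cross validation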


Run the block of script to console:


At this stage, there now exists a randomly selected Training dataset as well as a randomly selected Cross Validation dataset.  Keep in mind that H2O requires that the dataframe is converted to the native hex format, achieved through the creation of a parsed data object for each dataset. Think of this process as loading the data into the H2O server, more so than a conversion to hex:

Training.hex <- as.h2o(Training)
CV.hex <- as.h2o(CV)

Run the block of script to console:

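Should these commands fail because no H2O cluster is available, one can be started, or connected to, first; a minimal sketch:

library(h2o)
h2o.init()   # starts, or connects to, a local H2O cluster
h2o.ls()     # lists the keys now held on the cluster, including the hex frames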

All models that are available to be trained via the Flow interface are available via the R interface, with the hex frames ready to be passed as parameters.

4) Recalling a Logistic Regression model with Flow

This Blog entry is from the Deep Learning section in Learn R.

To recall this logistic regression model from Flow, navigate to:

Score >>> Predict


The predict cell will be added to the Flow:


The recall of the model might assume that a new frame has been created in Flow, but for this example the validation frame will be recalled through the trained logistic regression model.  Firstly, set the model to recall:


Thereafter, select the data frame to process through the model:


Upon selecting the input parameters, click the Predict button to complete the prediction.  A cell detailing the output will be created:


It is sensible at this stage to combine the predictions with the original dataset; simply click the Combine Predictions with Frame button:


Upon combining the predictions with the original dataset, the dataset will be available for download:


To interact with the newly created data frame, click on the View Frame button:


The View Frame functionality provides for the downloading and further manipulation of the data frame:


The process thus far uses the Flow user interface to create something akin to a script, where it is the Flow tool that is sending instructions to the H2O API.  It would be far less cumbersome to use R scripting to achieve such flows.
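By way of illustration, the same recall scripted from R might look as follows; a hedged sketch, in which GLMModel is a hypothetical name for the trained GLM and CV.hex is the validation frame from the data-loading entry:

Scores <- h2o.predict(GLMModel, CV.hex)   # score the validation frame through the trained model
Combined <- h2o.cbind(CV.hex, Scores)     # combine the predictions with the original frame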

3) Creating a Logistic Regression model in H2O (GLM)

This Blog entry is from the Deep Learning section in Learn R.

With the data loaded, a model now needs to be trained.  Navigate to Models to see the available algorithms:

Models


In this case, the algorithm is Generalised Linear Modelling (with a binomial family, this is Logistic Regression).  Click this model to create the cell in Flow:


There are a multitude of parameters that are quite outside the scope of this document; for present purposes, simply specify the Training and Validation hex sets:


Thereafter, specify the dependent variable, known as the Response Column in H2O:


In this case the dependent variable column is titled just that, Dependent:


Scroll to the base of the cell and click Build Model to initiate the training process:


The training process will begin with progress being written out to a newly created job cell:


At this stage a Logistic Regression model has been created. It is a good idea to save the flow by navigating:

Flow >>> Save Flow

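For reference, the same training can be issued from R rather than Flow; a hedged sketch, assuming hex frames and a predictor vector x like those created in the data-loading entry, with the binomial family making the GLM a logistic regression:

GLMModel <- h2o.glm(x = x, y = "Dependent", training_frame = Training.hex, validation_frame = CV.hex, family = "binomial")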