3) Creating a Logistic Regression model in H2O (GLM)

This Blog entry is from the Deep Learning section in Learn R.

With the data loaded, a model now needs to be trained.  Navigate to Models to see the available algorithms:

Models

1.png

In this case, the algorithm is Generalised Linear Modelling (GLM), which is Logistic Regression when the binomial family is used.  Click this model to create the cell in the flow:

2.png

There are a multitude of parameters that are beyond the scope of this document.  For present purposes, simply specify the Training and Validation Hex frames:

3.png

Thereafter, specify the dependent variable, known as the Response Column in H2O:

4.png

In this case the dependent variable column is titled Dependent:

5.png

Scroll to the base of the cell and click Build Model to initiate the training process:

6.png

The training process will begin with progress being written out to a newly created job cell:

7.png
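For readers who prefer to drive H2O from R rather than from the Flow interface, a roughly equivalent model build is sketched below.  This is illustrative only; the frame identifiers and the response column name are assumptions standing in for the names visible in the screenshots:

library(h2o)
h2o.init()

# Retrieve the training and validation frames by their H2O identifiers
# (the identifiers below are assumed placeholders for the split frames)
training <- h2o.getFrame("FraudRisk_training.hex")
validation <- h2o.getFrame("FraudRisk_validation.hex")

# Build a binomial GLM, i.e. a logistic regression, against the Dependent column
glmModel <- h2o.glm(y = "Dependent",
                    training_frame = training,
                    validation_frame = validation,
                    family = "binomial")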

At this stage a Logistic Regression model has been created. It is a good idea to save the flow by navigating:

Flow >>> Save Flow

8.png

2) Loading Data into H2O with Flow

This Blog entry is from the Deep Learning section in Learn R.

In this Blog entry a logistic regression model will be created using Flow, achieving the same results as those achieved with the GLM functions of R and Exhaustive.

In the Flow user interface, start by navigating:

Flow >>> New Flow

1.png

If prompted to create a new workbook, affirm this:

2.png

To add a cell for the importing of data, navigate to:

Data >>> Import Files

3.png

It can be seen that an Import Files cell has been added to the Flow:

4.png

In the Search dialog box, enter the location of the FraudRisk.csv file until a drop-down is populated, for example:

5.png

Click on the Search Icon to bring back the contents of this directory:

6.png

Click on the file or plus sign to add the file to the cell:

7.png

Click the Import Button to import the file to H2O:

8.png

Note that the file has not yet been parsed into the H2O column-compressed format, known as Hex.  To achieve parsing, simply click the button titled 'Parse These Files':

9.png

The next screen allows the column specification and data types to be configured more precisely.  In this example, a cursory check to ensure that the data types are correct is sufficient:

10.png

Once satisfied, click Parse to mount the dataset in H2O as Hex:

11.png

A background job will start the process of transforming the data from FraudRisk.csv to the H2O hex format:

12.png
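As an aside, the same import and parse can be achieved from R with the h2o package; a minimal sketch follows, in which the file path is a placeholder to be substituted with the actual location of FraudRisk.csv:

library(h2o)
h2o.init()

# Import and parse the CSV into an H2O Hex frame in a single call
# (the path below is a placeholder)
FraudRisk.hex <- h2o.importFile(path = "C:/Data/FraudRisk.csv",
                                destination_frame = "FraudRisk.hex")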

H2O supports the concept of training and validation datasets robustly, hence the Hex frame needs to be split into training and validation portions.  To split a Hex frame, navigate to:

Data >>> Split Frame

13.png

Click on the menu item to create the split data frame cell:

14.png

Select the frame to be split, in this case FraudRisk.hex:

15.png

The default frame split is 75% to 25%; confirm this by clicking the Create button:

16.png

There now exist two frames in the flow, the smaller of which will be used for validation:

17.png
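The split can likewise be reproduced from R; the sketch below assumes the FraudRisk.hex frame created above and uses the same 75% to 25% ratio:

# Split the Hex frame into training (75%) and validation (25%) frames
splits <- h2o.splitFrame(data = FraudRisk.hex, ratios = 0.75, seed = 1234)
training <- splits[[1]]
validation <- splits[[2]]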

9) Grading the ROC Performance with AUC.

This Blog entry is from the Logistic Regression section in Learn R.

Visually, the ROC Curve plot created in the previous Blog entry suggests that the model has some predictive power.  A more succinct measure of model performance is the Area Under Curve (AUC) statistic, which can be calculated with ease by requesting "auc" as the measure when creating the performance object:

AUC <- performance(ROCRPredictions,measure = "auc")
1.png

Run the line of script to console:

2.png

To write out the contents of the AUC object:

AUC

3.png

Run the line of script to console:

4.png

The value to gravitate towards is y.values, which will range between 0.5 and 1:

5.png
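If the figure is needed later in script, it can be extracted directly from the y.values slot of the S4 performance object:

# Pull the numeric AUC figure out of the performance object
AUCValue <- AUC@y.values[[1]]
AUCValue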

In this example, the AUC value is 0.827767, which suggests that the model has excellent utility. By way of grading, AUC scores correspond as follows:

·         A: Outstanding (AUC > 0.9)

·         B: Excellent (0.8 < AUC <= 0.9)

·         C: Acceptable (0.7 < AUC <= 0.8)

·         D: Poor (0.6 < AUC <= 0.7)

·         E: Junk (0.5 < AUC <= 0.6)
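Purely as an illustration, the grading above can be expressed as a small helper function in R using cut(); the function name is arbitrary:

# Map an AUC figure onto the grades listed above (illustrative only)
GradeAUC <- function(auc) {
  cut(auc,
      breaks = c(0.5, 0.6, 0.7, 0.8, 0.9, 1),
      labels = c("E: Junk", "D: Poor", "C: Acceptable", "B: Excellent", "A: Outstanding"))
}
GradeAUC(0.827767)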

8) Creating a ROC Curve.

This Blog entry is from the Logistic Regression section in Learn R.

The ROCR package provides a set of functions that simplify the process of appraising the performance of classification models, comparing the actual outcome with a probability prediction.  It can be noted that although a logistic regression model outputs a value between -5 and +5, that value was converted to an intuitive probability in the previous Blog entry, and it is the probability that will be used here.

Firstly, install the ROCR package from the RStudio package installation utility.

1.png

Click install to proceed with the installation:

2.png

Reference the ROCR library:

library(ROCR)
3.png

Run the block of script to console:

4.png

Two input vectors are needed to create the visualisation: the first is the predictions expressed as a probability, the second is the actual outcome.  In this example, these are the vectors FraudRisk$PAutomaticLogisticRegression and FraudRisk$Dependent.  To create the prediction object in ROCR:

ROCRPredictions <- prediction(FraudRisk$PAutomaticLogisticRegression, FraudRisk$Dependent)
5.png

Once the prediction object has been created, it needs to be transformed into a performance object using the performance() function.  The performance() function takes the prediction object as well as an indication of the performance measures to be used, in this case true positive rate (tpr) vs false positive rate (fpr).  The performance() function outputs an object that can be used in conjunction with the base graphics plot() function:

ROCRPerformance <- performance(ROCRPredictions,measure = "tpr",x.measure = "fpr")
6.png

Run the line of script to console:

7.png

Simply plot the ROCRPerformance object by passing it as an argument to the plot() base graphics function:

plot(ROCRPerformance)
8.png

Run the line of script to console:

9.png

It can be seen that a curve plot has been created in the plots window in RStudio:

10.png

It can be seen that the curve sits above the diagonal, leading to the inference that the model has some predictive power.
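To make that comparison easier to judge, the diagonal representing a model with no predictive power can be overlaid using the base graphics abline() function; a short sketch:

# Re-plot the ROC curve and overlay the line of no discrimination
plot(ROCRPerformance)
abline(a = 0, b = 1, lty = 2)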

7) Output Logistic Regression Model as Probability.

This Blog entry is from the Logistic Regression section in Learn R.

The logistic regression output ranges from –5 to +5, yet oftentimes it is substantially more intuitive to present this output as a probability.  The formula to convert a logistic regression output to a probability is:

P = exp(Output) / (1 + exp(Output))

It follows that vector arithmetic can be used, simply swapping the output with a vector of values created by the logistic regression model:
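A minimal sketch of this step is shown below; the vector of raw model outputs is assumed here to be called LogisticRegressionOutput, a placeholder for whatever name was given to the scores produced in the earlier Blog entries:

# Convert the raw logistic regression output vector to probabilities
# (LogisticRegressionOutput is an assumed placeholder name)
PAutomaticLogisticRegression <- exp(LogisticRegressionOutput) / (1 + exp(LogisticRegressionOutput))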

1.png

Run the line of script to console:

2.png

For completeness, merge the probability values into the FraudRisk data frame using the mutate() function from the dplyr package (loaded in earlier Blog entries):

FraudRisk <- mutate(FraudRisk, PAutomaticLogisticRegression)
3.png

Run the line of script to console:

4.png