3) Filter Data Frame for Activation and Produce Summary Statistics to prescribe

This Blog entry is from the Monte Carlo Model section in Learn R.

Keeping in mind that the H2O neural network was trained on real data and is a very good approximation of fraud, simulating millions of random variables through this model, while saving these simulations, makes it feasible to present summary statistics which explain what the activation scenario most likely looks like.

The task is to create summary statistics over the simulations for only those records which have been activated.  Start by filtering the records classified as fraud into a new data frame (keeping in mind dplyr has already been loaded):

SimulatedAndActivated <- filter(SimulatedDataFrame, SimulatedActvations == 1)
1.png

Run the line of script to console:

2.png

The SimulatedAndActivated data frame is now a picture of the activated scenario only, hence a series of summary statistics can be executed against this data frame to begin to understand the environment of fraud. In the following example, a summary of Count_Transactions_1_Day is produced:

summary(SimulatedAndActivated$Count_Transactions_1_Day)
3.png

Run the line of script to console:

4.png

In this case, it would seem that the average number of transactions on a fraudulent account is 10.  Taken together with other such summary statistics, and compared against the summary statistics observed for the full simulated dataset, this can provide compelling prescriptions.
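The filter-then-summarise pattern above can be sketched on toy data. The data frame below and its values are invented stand-ins for illustration, not the FraudRisk simulation itself:

```r
# Toy sketch of the filter-then-summarise pattern.
# The data frame and its values are invented stand-ins.
set.seed(12345)
sims <- data.frame(
  Count_Transactions_1_Day = rpois(1000, lambda = 4),
  SimulatedActvations      = rbinom(1000, 1, 0.1)
)
activated <- subset(sims, SimulatedActvations == 1)  # base-R analogue of dplyr::filter
summary(activated$Count_Transactions_1_Day)          # activated records only
summary(sims$Count_Transactions_1_Day)               # full simulated population
```

Comparing the two summaries side by side is what surfaces how the activated subset differs from the simulated population as a whole.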

2) Process Random Data Frame against Neural Network Model

This Blog entry is from the Monte Carlo Model section in Learn R.

The data frame can be used with all of the machine learning algorithms presented in this guide thus far, although to use it with H2O it first needs to be loaded into H2O as a hex object.

To load the data frame into H2O use:

SimulatedHex <- as.h2o(SimulatedDataFrame)
1.png

Run the line of script to console:

2.png

As before, use the h2o.predict() function to execute the model, passing the simulated data frame in place of real data:

SimulatedScores <- h2o.predict(Model,SimulatedHex)
3.png

Parse the activation column to a standalone vector:

SimulatedActvations <- as.vector(SimulatedScores[1])
4.png

Run the line of script to console:

5.png

Append the vector to the simulations data frame (keeping in mind that dplyr is already loaded):

SimulatedDataFrame <- mutate(SimulatedDataFrame, SimulatedActvations)
6.png

Run the line of script to console:

7.png

View the simulated data frame, scrolling to the last column:

View(SimulatedDataFrame)
8.png

It can be seen that the simulated data frame has been passed through the H2O neural network as if it were production data.  The last column contains the predicted activation, in this case fraud.  This data frame can now be used to describe the most likely scenario surrounding an activation.
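With the activation flag appended, a quick sanity check is the overall simulated activation rate. A minimal sketch, where the 0/1 values below are stand-ins for the real SimulatedActvations vector:

```r
# Stand-in 0/1 activation flags for illustration only
SimulatedActvations <- c(0, 0, 1, 0, 1, 0, 0, 0, 1, 0)
table(SimulatedActvations)   # counts of non-activated vs activated
mean(SimulatedActvations)    # proportion of simulations activated: 0.3
```

On the real simulation, the same two lines give the share of the 100,000 random records the model would flag as fraud.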

1) Create Discrete Vectors with triangle for each model parameter

This Blog entry is from the Monte Carlo Model section in Learn R.

In this example, the result is the simulation of the neural network model that was created in H2O.  It follows that we need to create a data frame with the same specification as the training data set.

For the purposes of our example, we are going to create triangular distributions comprising the Minimum Value, the Maximum Value and the Mean.  This simulated data frame will be 100,000 records in length.

This procedure will focus on creating this vector for a single variable, before providing a block of script to achieve this for each variable at the end of the procedure.

Firstly, install the triangle package:

install.packages("triangle")
1.png

Load the library:

library(triangle)
2.png

Run the line of script to console:

3.png

The rtriangle() function accepts four parameters:

Simulations: the size of the return vector and the number of simulations to create (example: 100000).

Min: the smallest value to be created in the simulation (example: 0).

Max: the largest value to be created in the simulation (example: 100).

Mean or Mode: the mean or mode used to skew the distribution to more closely align to the real data (example: 10).
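For intuition about what these four parameters produce, the draw can be sketched in base R with the inverse-CDF method. The rtri() function below is an illustrative stand-in written for this sketch, not the package function:

```r
# Inverse-CDF sampler for a triangular distribution:
# a = min, b = max, c = mode (the fourth parameter in the table above)
rtri <- function(n, a, b, c) {
  u <- runif(n)
  ifelse(u < (c - a) / (b - a),
         a + sqrt(u * (b - a) * (c - a)),
         b - sqrt((1 - u) * (b - a) * (b - c)))
}
set.seed(1)
x <- rtri(100000, 0, 100, 10)   # the example values from the table above
range(x)                        # stays within [0, 100]
mean(x)                         # close to (0 + 100 + 10) / 3
```

Setting the mode near the minimum, as here, is what skews the mass of the distribution toward the low end, mimicking count data whose mean sits well below its maximum.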

The dataframe needs to be as closely aligned to the real data as possible and as such the triangular distribution points are going to be taken from the training dataframe rather than created manually.  To create a vector for the first variable used in H2O model training use the following line of script:

Count_Transactions_1_Day <- rtriangle(100000,min(FraudRisk$Count_Transactions_1_Day),max(FraudRisk$Count_Transactions_1_Day),mean(FraudRisk$Count_Transactions_1_Day))
4.png

Run the line of script to console:

5.png

Validate the vector by inspecting it as a histogram:

hist(Count_Transactions_1_Day)
6.png

Run the line of script to console:

7.png

It can be seen that a triangular distribution has been created, slightly skewed toward the lower end of the axis.  The task now remains to repeat this for each of the variables required by the H2O model.  The construct and principle are the same for each variable:

Authenticated <- rtriangle(100000,min(FraudRisk$Authenticated),max(FraudRisk$Authenticated),mean(FraudRisk$Authenticated))
Count_Transactions_PIN_Decline_1_Day <- rtriangle(100000,min(FraudRisk$Count_Transactions_PIN_Decline_1_Day),max(FraudRisk$Count_Transactions_PIN_Decline_1_Day),mean(FraudRisk$Count_Transactions_PIN_Decline_1_Day))
Count_Transactions_Declined_1_Day <- rtriangle(100000,min(FraudRisk$Count_Transactions_Declined_1_Day),max(FraudRisk$Count_Transactions_Declined_1_Day),mean(FraudRisk$Count_Transactions_Declined_1_Day))
Count_Unsafe_Terminals_1_Day <- rtriangle(100000,min(FraudRisk$Count_Unsafe_Terminals_1_Day),max(FraudRisk$Count_Unsafe_Terminals_1_Day),mean(FraudRisk$Count_Unsafe_Terminals_1_Day))
Count_In_Person_1_Day <- rtriangle(100000,min(FraudRisk$Count_In_Person_1_Day),max(FraudRisk$Count_In_Person_1_Day),mean(FraudRisk$Count_In_Person_1_Day))
Count_Internet_1_Day <- rtriangle(100000,min(FraudRisk$Count_Internet_1_Day),max(FraudRisk$Count_Internet_1_Day),mean(FraudRisk$Count_Internet_1_Day))
ATM <- rtriangle(100000,min(FraudRisk$ATM),max(FraudRisk$ATM),mean(FraudRisk$ATM))
Count_ATM_1_Day <- rtriangle(100000,min(FraudRisk$Count_ATM_1_Day),max(FraudRisk$Count_ATM_1_Day),mean(FraudRisk$Count_ATM_1_Day))
Count_Over_30_SEK_1_Day <- rtriangle(100000,min(FraudRisk$Count_Over_30_SEK_1_Day),max(FraudRisk$Count_Over_30_SEK_1_Day),mean(FraudRisk$Count_Over_30_SEK_1_Day))
In_Person <- rtriangle(100000,min(FraudRisk$In_Person),max(FraudRisk$In_Person),mean(FraudRisk$In_Person))
Transaction_Amt <- rtriangle(100000,min(FraudRisk$Transaction_Amt),max(FraudRisk$Transaction_Amt),mean(FraudRisk$Transaction_Amt))
Sum_Transactions_1_Day <- rtriangle(100000,min(FraudRisk$Sum_Transactions_1_Day),max(FraudRisk$Sum_Transactions_1_Day),mean(FraudRisk$Sum_Transactions_1_Day))
Sum_ATM_Transactions_1_Day <- rtriangle(100000,min(FraudRisk$Sum_ATM_Transactions_1_Day),max(FraudRisk$Sum_ATM_Transactions_1_Day),mean(FraudRisk$Sum_ATM_Transactions_1_Day))
Foreign <- rtriangle(100000,min(FraudRisk$Foreign),max(FraudRisk$Foreign),mean(FraudRisk$Foreign))
Different_Country_Transactions_1_Week <- rtriangle(100000,min(FraudRisk$Different_Country_Transactions_1_Week),max(FraudRisk$Different_Country_Transactions_1_Week),mean(FraudRisk$Different_Country_Transactions_1_Week))
Different_Merchant_Types_1_Week <- rtriangle(100000,min(FraudRisk$Different_Merchant_Types_1_Week),max(FraudRisk$Different_Merchant_Types_1_Week),mean(FraudRisk$Different_Merchant_Types_1_Week))
Different_Decline_Reasons_1_Day <- rtriangle(100000,min(FraudRisk$Different_Decline_Reasons_1_Day),max(FraudRisk$Different_Decline_Reasons_1_Day),mean(FraudRisk$Different_Decline_Reasons_1_Day))
Different_Cities_1_Week <- rtriangle(100000,min(FraudRisk$Different_Cities_1_Week),max(FraudRisk$Different_Cities_1_Week),mean(FraudRisk$Different_Cities_1_Week))
Count_Same_Merchant_Used_Before_1_Week <- rtriangle(100000,min(FraudRisk$Count_Same_Merchant_Used_Before_1_Week),max(FraudRisk$Count_Same_Merchant_Used_Before_1_Week),mean(FraudRisk$Count_Same_Merchant_Used_Before_1_Week))
Has_Been_Abroad <- rtriangle(100000,min(FraudRisk$Has_Been_Abroad),max(FraudRisk$Has_Been_Abroad),mean(FraudRisk$Has_Been_Abroad))
Cash_Transaction <- rtriangle(100000,min(FraudRisk$Cash_Transaction),max(FraudRisk$Cash_Transaction),mean(FraudRisk$Cash_Transaction))
High_Risk_Country <- rtriangle(100000,min(FraudRisk$High_Risk_Country),max(FraudRisk$High_Risk_Country),mean(FraudRisk$High_Risk_Country))
8.png

Run the block of script to console:

9.png

There now exist many randomly simulated vectors, created using a triangular distribution, one for each input variable of the H2O neural network model.  They now need to be brought together into a data frame using the data.frame() function:

SimulatedDataFrame <- data.frame(Count_Transactions_1_Day,Authenticated,Count_Transactions_PIN_Decline_1_Day,Count_Transactions_Declined_1_Day,Count_Unsafe_Terminals_1_Day,Count_In_Person_1_Day,Count_Internet_1_Day,ATM,Count_ATM_1_Day,Count_Over_30_SEK_1_Day,In_Person,Transaction_Amt,Sum_Transactions_1_Day,Sum_ATM_Transactions_1_Day,Foreign,Different_Country_Transactions_1_Week,Different_Merchant_Types_1_Week,Different_Decline_Reasons_1_Day,Different_Cities_1_Week,Count_Same_Merchant_Used_Before_1_Week,Has_Been_Abroad,Cash_Transaction,High_Risk_Country)
10.png

Run the line of script to console:

11.png

On viewing the SimulatedDataFrame, it can be seen that a new data frame has been created comprising random values.  This data frame can now be used in model recall in a variety of R models:

View(SimulatedDataFrame)
12.png

Run the line of script to console:

13.png
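As an aside, the one-line-per-variable block above can be generated in a single pass by looping over the training columns. A sketch on toy data, where rtri() is a base-R stand-in for triangle::rtriangle and this miniature FraudRisk is an invented stand-in for the real training data:

```r
# Generate all simulated columns in one pass instead of one line per
# variable. rtri() is a base-R stand-in for triangle::rtriangle, and
# this FraudRisk is a tiny invented stand-in for the real training data.
rtri <- function(n, a, b, c) {
  u <- runif(n)
  ifelse(u < (c - a) / (b - a),
         a + sqrt(u * (b - a) * (c - a)),
         b - sqrt((1 - u) * (b - a) * (b - c)))
}
set.seed(12345)
FraudRisk <- data.frame(Count_Transactions_1_Day = rpois(500, 4),
                        Transaction_Amt          = runif(500, 0, 5000))
# One triangular simulation per column, bound into a data frame
SimulatedDataFrame <- as.data.frame(lapply(FraudRisk, function(v)
  rtri(100000, min(v), max(v), mean(v))))
```

The same pattern applied to the real FraudRisk data frame (restricted to the model's input columns) reproduces the long block of rtriangle() calls and the data.frame() step in a few lines.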

7) Recalling a Neural Network with R

This Blog entry is from the Deep Learning section in Learn R.

Once a model is trained in H2O it can be recalled very gracefully with the h2o.predict() function of the H2O package.  It is a simple matter of passing the trained model and the hex data frame to be used for recall:

Scores <- h2o.predict(Model,CVHex.hex)
1.png

Run the line of script to console:

2.png

A progress bar is broadcast from the H2O server and will be written out to the console.  To review the output, enter the name of the object:

Scores
3.png

Run the line of script to console:

4.png

The Scores output appears similar to a matrix, but its first column is in effect a vector detailing the actual prediction for each record; hence, it can be subset to a final vector of predictions:

Predict <- Scores[1]
5.png

Run the line of script to console:

6.png

The Predict vector can be compared to the Dependent vector of the CV data frame in the same manner as previous models within R to obtain Confusion Matrices as well as ROC curves.
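That comparison can be sketched with base R's table() function. The two 0/1 vectors below are toy stand-ins for the Predict vector and the CV Dependent column:

```r
# Toy stand-ins for the model predictions and the actual outcomes
Predict   <- c(1, 0, 1, 1, 0, 0, 1, 0)
Dependent <- c(1, 0, 0, 1, 0, 1, 1, 0)
# Cross-tabulate predictions against actuals: a confusion matrix
ConfusionMatrix <- table(Predicted = Predict, Actual = Dependent)
ConfusionMatrix
sum(diag(ConfusionMatrix)) / sum(ConfusionMatrix)   # accuracy, here 6/8
```

The diagonal of the table holds the correct classifications; the off-diagonal cells are the false positives and false negatives that a ROC curve would trade off against each other.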

6) Creating a Neural Network with R

This Blog entry is from the Deep Learning section in Learn R.

Although all of the work is offloaded to H2O, the instruction to train a model looks a lot like previous examples where a variety of R packages have been used.  In this example the h2o.deeplearning function of the H2O package is going to be used (this is really the only reason that we are using H2O in the first place).

In order to make the command easier to understand, named parameters will be used as follows:

x: c("Count_Transactions_1_Day","Authenticated","Count_Transactions_PIN_Decline_1_Day","Count_Transactions_Declined_1_Day","Count_Unsafe_Terminals_1_Day","Count_In_Person_1_Day","Count_Internet_1_Day","ATM","Count_ATM_1_Day","Count_Over_30_SEK_1_Day","In_Person","Transaction_Amt","Sum_Transactions_1_Day","Sum_ATM_Transactions_1_Day","Foreign","Different_Country_Transactions_1_Week","Different_Merchant_Types_1_Week","Different_Decline_Reasons_1_Day","Different_Cities_1_Week","Count_Same_Merchant_Used_Before_1_Week","Has_Been_Abroad","Cash_Transaction","High_Risk_Country")

y: c("Dependent")

training_frame: TrainingHex

validation_frame: CVHex

standardize: FALSE

activation: Rectifier

epochs: 50

seed: 12345

hidden: 5

variable_importances: TRUE

nfolds: 5

adaptive_rate: FALSE

The h2o.deeplearning function takes two string vectors that name the dependent and independent variables.  For readability, create these string vectors in advance, rather than using the c() function inside the function call.  To create the list of eligible independent variables for the purposes of this example, enter:

x <- c("Count_Transactions_1_Day","Authenticated","Count_Transactions_PIN_Decline_1_Day","Count_Transactions_Declined_1_Day","Count_Unsafe_Terminals_1_Day","Count_In_Person_1_Day","Count_Internet_1_Day","ATM","Count_ATM_1_Day","Count_Over_30_SEK_1_Day","In_Person","Transaction_Amt","Sum_Transactions_1_Day","Sum_ATM_Transactions_1_Day","Foreign","Different_Country_Transactions_1_Week","Different_Merchant_Types_1_Week","Different_Decline_Reasons_1_Day","Different_Cities_1_Week","Count_Same_Merchant_Used_Before_1_Week","Has_Been_Abroad","Cash_Transaction","High_Risk_Country")
1.png

Run the line of script to console:

2.png

To instruct H2O to begin deep learning, enter:

Model <- h2o.deeplearning(x=x, y="Dependent", training_frame=TrainingHex.hex, validation_frame=CVHex.hex, activation="Rectifier", epochs=50, seed=12345, hidden=5, variable_importances=TRUE, nfolds=5, adaptive_rate=FALSE, standardize=TRUE)
3.png

Run the line of script to console:

4.png

Feedback from the H2O cluster will be received, detailing training progress.