3) Filter Data Frame for Activation and Produce Summary Statistics to Prescribe

This Blog entry is from the Monte Carlo Model section in Learn R.

Keeping in mind that the H2O neural network was trained on real data and is a good approximation of fraud, simulating millions of random records through this model and saving the results makes it feasible to produce summary statistics that describe what the activation scenario most likely looks like.

The task is to produce summary statistics from the simulations for only those records which have been activated. Start by filtering the records classified as fraud into a new data frame (keeping in mind dplyr has already been loaded):

SimulatedAndActivated <- filter(SimulatedDataFrame, SimulatedActvations == 1)

Run the line of script to console:

The SimulatedAndActivated data frame is now a picture of the activated scenario only, hence a series of summary statistics can be executed against this data frame to begin to understand the environment of fraud. In the following example, a summary of the Count_Transactions_1_Day column is produced:

summary(SimulatedAndActivated$Count_Transactions_1_Day)

Run the line of script to console:

In this case, it would seem that the average number of transactions on a fraudulent account is 10. When combined with other such summary statistics, and compared against the summary statistics observed for the full simulated dataset, this can provide compelling prescriptions.
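As a minimal, self-contained sketch of the idea (toy values, base R subsetting in place of dplyr's filter):

```r
# Toy simulated data frame with a hypothetical activation column
SimulatedDataFrame <- data.frame(
  Count_Transactions_1_Day = c(2, 14, 3, 9, 11),
  Transaction_Amt          = c(10, 250, 15, 180, 300),
  SimulatedActvations      = c(0, 1, 0, 1, 1)
)

# Keep only the activated (fraud) records, then summarise every column at once
SimulatedAndActivated <- SimulatedDataFrame[SimulatedDataFrame$SimulatedActvations == 1, ]
summary(SimulatedAndActivated)
mean(SimulatedAndActivated$Count_Transactions_1_Day)
```

On real data the same pattern applies, only with 100,000 rows and the full set of model variables.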

2) Process Random Data Frame against Neural Network Model

This Blog entry is from the Monte Carlo Model section in Learn R.

The data frame can be used with all of the machine learning algorithms presented in this guide so far, although to use it with H2O it first needs to be loaded into H2O as a hex object:

To load the data frame into H2O use:

SimulatedHex <- as.h2o(SimulatedDataFrame)

Run the line of script to console:

As before, use the H2O predict function to execute the model, passing the simulated data frame in place of real data:

SimulatedScores <- h2o.predict(Model,SimulatedHex)

Parse the activation (the predict column of the scores frame) to a standalone vector:

SimulatedActvations <- as.vector(SimulatedScores$predict)

Run the line of script to console:

Append the vector to the simulations data frame (keeping in mind that dplyr is already loaded):

SimulatedDataFrame <- mutate(SimulatedDataFrame, SimulatedActvations)

Run the line of script to console:

View the simulated data frame, scrolling to the last column:

View(SimulatedDataFrame)

It can be seen that the simulated data frame has been passed through the H2O neural network as if it were production data. The last column contains the predicted activation, in this case fraud. This data frame can now be used to describe the most likely scenario surrounding an activation.
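One quick statistic the appended column makes available is the overall simulated activation rate. A toy sketch with a hypothetical activation vector:

```r
# Toy stand-in for the activation column appended above
SimulatedActvations <- c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0)

# Proportion of simulated records the model would classify as fraud
ActivationRate <- mean(SimulatedActvations == 1)
ActivationRate
```

Against the real 100,000-record simulation, the same one-liner gives the share of random scenarios the network treats as fraudulent.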

1) Create Discrete Vectors with triangle for each model parameter

This Blog entry is from the Monte Carlo Model section in Learn R.

In this example, the result is the simulation of the neural network model that was created in H2O. It follows that we need to create a data frame with the same specification as the training dataset.

For the purposes of our example, we are going to create triangular distributions comprising the minimum value, the maximum value and the mean. This simulated data frame will be 100,000 records in length.

This procedure will focus on creating this vector for a single variable, before providing a block of script to achieve this for each variable at the end of the procedure.

Firstly, load the triangle package (installing it beforehand with install.packages("triangle") if it is not already present):

library(triangle)

Run the line of script to console:

The rtriangle() function accepts four parameters:

Simulations: the size of the return vector, i.e. the number of simulations to create (e.g. 100000).
Min: the smallest value to be created in the simulation (e.g. 0).
Max: the largest value to be created in the simulation (e.g. 100).
Mean or Mode: the mode used to skew the distribution to more closely align to the real data (e.g. 10).
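To make the mechanics concrete, a triangular draw can be reproduced in base R with inverse-transform sampling. This is only an illustrative stand-in, not the triangle package's actual implementation; rtriangle_sketch is a hypothetical name:

```r
# Illustrative base-R sketch of rtriangle(n, a, b, c): inverse-transform
# sampling from a triangular distribution with min a, max b and mode c.
rtriangle_sketch <- function(n, a, b, c) {
  u <- runif(n)
  f <- (c - a) / (b - a)                        # CDF value at the mode
  ifelse(u < f,
         a + sqrt(u * (b - a) * (c - a)),       # left of the mode
         b - sqrt((1 - u) * (b - a) * (b - c))) # right of the mode
}

set.seed(123)
x <- rtriangle_sketch(100000, 0, 100, 10)
mean(x)   # theoretical mean of a triangular distribution is (a + b + c) / 3
```

The mode parameter is what pulls the mass of the distribution towards the real data's centre, which is exactly how it is used in the lines of script below.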

The data frame needs to be as closely aligned to the real data as possible, and as such the triangular distribution points are going to be taken from the training data frame rather than created manually. To create a vector for the first variable used in H2O model training, use the following line of script:

Count_Transactions_1_Day <- rtriangle(100000, min(FraudRisk$Count_Transactions_1_Day), max(FraudRisk$Count_Transactions_1_Day), mean(FraudRisk$Count_Transactions_1_Day))

Run the line of script to console:

Validate the vector by inspecting it as a histogram:

hist(Count_Transactions_1_Day)

Run the line of script to console:

It can be seen that a triangular distribution has been created, skewed towards the mean of the real data. It now remains to repeat this for each of the variables required by the H2O model. The construct and principle are the same for each variable:

Authenticated <- rtriangle(100000, min(FraudRisk$Authenticated), max(FraudRisk$Authenticated), mean(FraudRisk$Authenticated))
Count_Transactions_PIN_Decline_1_Day <- rtriangle(100000, min(FraudRisk$Count_Transactions_PIN_Decline_1_Day), max(FraudRisk$Count_Transactions_PIN_Decline_1_Day), mean(FraudRisk$Count_Transactions_PIN_Decline_1_Day))
Count_Transactions_Declined_1_Day <- rtriangle(100000, min(FraudRisk$Count_Transactions_Declined_1_Day), max(FraudRisk$Count_Transactions_Declined_1_Day), mean(FraudRisk$Count_Transactions_Declined_1_Day))
Count_Unsafe_Terminals_1_Day <- rtriangle(100000, min(FraudRisk$Count_Unsafe_Terminals_1_Day), max(FraudRisk$Count_Unsafe_Terminals_1_Day), mean(FraudRisk$Count_Unsafe_Terminals_1_Day))
Count_In_Person_1_Day <- rtriangle(100000, min(FraudRisk$Count_In_Person_1_Day), max(FraudRisk$Count_In_Person_1_Day), mean(FraudRisk$Count_In_Person_1_Day))
Count_Internet_1_Day <- rtriangle(100000, min(FraudRisk$Count_Internet_1_Day), max(FraudRisk$Count_Internet_1_Day), mean(FraudRisk$Count_Internet_1_Day))
ATM <- rtriangle(100000, min(FraudRisk$ATM), max(FraudRisk$ATM), mean(FraudRisk$ATM))
Count_ATM_1_Day <- rtriangle(100000, min(FraudRisk$Count_ATM_1_Day), max(FraudRisk$Count_ATM_1_Day), mean(FraudRisk$Count_ATM_1_Day))
Count_Over_30_SEK_1_Day <- rtriangle(100000, min(FraudRisk$Count_Over_30_SEK_1_Day), max(FraudRisk$Count_Over_30_SEK_1_Day), mean(FraudRisk$Count_Over_30_SEK_1_Day))
In_Person <- rtriangle(100000, min(FraudRisk$In_Person), max(FraudRisk$In_Person), mean(FraudRisk$In_Person))
Transaction_Amt <- rtriangle(100000, min(FraudRisk$Transaction_Amt), max(FraudRisk$Transaction_Amt), mean(FraudRisk$Transaction_Amt))
Sum_Transactions_1_Day <- rtriangle(100000, min(FraudRisk$Sum_Transactions_1_Day), max(FraudRisk$Sum_Transactions_1_Day), mean(FraudRisk$Sum_Transactions_1_Day))
Sum_ATM_Transactions_1_Day <- rtriangle(100000, min(FraudRisk$Sum_ATM_Transactions_1_Day), max(FraudRisk$Sum_ATM_Transactions_1_Day), mean(FraudRisk$Sum_ATM_Transactions_1_Day))
Foreign <- rtriangle(100000, min(FraudRisk$Foreign), max(FraudRisk$Foreign), mean(FraudRisk$Foreign))
Different_Country_Transactions_1_Week <- rtriangle(100000, min(FraudRisk$Different_Country_Transactions_1_Week), max(FraudRisk$Different_Country_Transactions_1_Week), mean(FraudRisk$Different_Country_Transactions_1_Week))
Different_Merchant_Types_1_Week <- rtriangle(100000, min(FraudRisk$Different_Merchant_Types_1_Week), max(FraudRisk$Different_Merchant_Types_1_Week), mean(FraudRisk$Different_Merchant_Types_1_Week))
Different_Decline_Reasons_1_Day <- rtriangle(100000, min(FraudRisk$Different_Decline_Reasons_1_Day), max(FraudRisk$Different_Decline_Reasons_1_Day), mean(FraudRisk$Different_Decline_Reasons_1_Day))
Different_Cities_1_Week <- rtriangle(100000, min(FraudRisk$Different_Cities_1_Week), max(FraudRisk$Different_Cities_1_Week), mean(FraudRisk$Different_Cities_1_Week))
Count_Same_Merchant_Used_Before_1_Week <- rtriangle(100000, min(FraudRisk$Count_Same_Merchant_Used_Before_1_Week), max(FraudRisk$Count_Same_Merchant_Used_Before_1_Week), mean(FraudRisk$Count_Same_Merchant_Used_Before_1_Week))
Cash_Transaction <- rtriangle(100000, min(FraudRisk$Cash_Transaction), max(FraudRisk$Cash_Transaction), mean(FraudRisk$Cash_Transaction))
High_Risk_Country <- rtriangle(100000, min(FraudRisk$High_Risk_Country), max(FraudRisk$High_Risk_Country), mean(FraudRisk$High_Risk_Country))

Run the block of script to console:

There now exist many randomly simulated vectors, created using a triangular distribution for each input variable of the H2O neural network model. They now need to be brought together into a data frame using the data.frame() function:

SimulatedDataFrame <- data.frame(Count_Transactions_1_Day, Authenticated, Count_Transactions_PIN_Decline_1_Day, Count_Transactions_Declined_1_Day, Count_Unsafe_Terminals_1_Day, Count_In_Person_1_Day, Count_Internet_1_Day, ATM, Count_ATM_1_Day, Count_Over_30_SEK_1_Day, In_Person, Transaction_Amt, Sum_Transactions_1_Day, Sum_ATM_Transactions_1_Day, Foreign, Different_Country_Transactions_1_Week, Different_Merchant_Types_1_Week, Different_Decline_Reasons_1_Day, Different_Cities_1_Week, Count_Same_Merchant_Used_Before_1_Week, Cash_Transaction, High_Risk_Country)

Run the line of script to console:

On viewing the SimulatedDataFrame, it can be seen that a new data frame has been created comprising random values. This data frame can now be used for model recall in a variety of R models:

View(SimulatedDataFrame)

Run the line of script to console:
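As an aside, the long block of rtriangle() calls above can be generated programmatically by looping over the column names. The sketch below is self-contained: a toy FraudRisk data frame stands in for the real training data, and runif() stands in for rtriangle() so the sketch runs without the triangle package; in the real script, substitute rtriangle(n, min(v), max(v), mean(v)).

```r
# Toy stand-in for the FraudRisk training data (hypothetical values)
FraudRisk <- data.frame(
  Count_Transactions_1_Day = c(0, 2, 5, 40),
  Transaction_Amt          = c(1, 20, 35, 900)
)

# Simulate one vector per training column, then bind them into a data frame
simulate_columns <- function(df, n) {
  as.data.frame(vapply(names(df), function(nm) {
    v <- df[[nm]]
    # Real script: rtriangle(n, min(v), max(v), mean(v))
    runif(n, min(v), max(v))
  }, numeric(n)))
}

set.seed(1)
SimulatedDataFrame <- simulate_columns(FraudRisk, 1000)
```

This produces the same shape of output as the manual block, with one simulated column per training column, and avoids repeating the min/max/mean boilerplate twenty-odd times.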

7) Recalling a Neural Network with R

This Blog entry is from the Deep Learning section in Learn R.

Once a model is trained in H2O it can be recalled very gracefully with the h2o.predict() function of the H2O package. It is a simple matter of passing the trained model and the hex data frame to be used for recall:

Scores <- h2o.predict(Model,CVHex.hex)

Run the line of script to console:

A progress bar is broadcast from the H2O server and written out to the console. To review the output, enter the object name:

Scores

Run the line of script to console:

The Scores output appears similar to a matrix, but its predict column details the actual prediction for each record; hence, this can be subset to a final vector detailing the predictions:

Predict <- as.vector(Scores$predict)

Run the line of script to console:

The Predict vector can be compared to the Dependent vector of the CV data frame in the same manner as previous models within R to obtain confusion matrices as well as ROC curves.
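For instance, a confusion matrix falls out of base R's table() once both vectors are to hand; the toy vectors below are hypothetical stand-ins for the real Dependent and Predict vectors:

```r
# Toy actual and predicted vectors (hypothetical values)
Dependent <- c(0, 0, 1, 1, 0, 1)
Predict   <- c(0, 1, 1, 1, 0, 0)

# Rows are actuals, columns are predictions; the diagonal holds correct calls
ConfusionMatrix <- table(Dependent, Predict)
ConfusionMatrix
```

The same pattern was used for the earlier models in this guide, so the H2O output slots straight into the existing evaluation workflow.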

6) Creating a Neural Network with R

This Blog entry is from the Deep Learning section in Learn R.

Although all of the work is offloaded to H2O, the instruction to train a model looks a lot like previous examples where a variety of R packages have been used. In this example the h2o.deeplearning function of the H2O package is going to be used (this is really the only reason that we are using H2O in the first place).

In order to make the command easier to understand, named parameters will be used as follows:

x: c("Count_Transactions_1_Day","Authenticated","Count_Transactions_PIN_Decline_1_Day","Count_Transactions_Declined_1_Day","Count_Unsafe_Terminals_1_Day","Count_In_Person_1_Day","Count_Internet_1_Day","ATM","Count_ATM_1_Day","Count_Over_30_SEK_1_Day","In_Person","Transaction_Amt","Sum_Transactions_1_Day","Sum_ATM_Transactions_1_Day","Foreign","Different_Country_Transactions_1_Week","Different_Merchant_Types_1_Week","Different_Decline_Reasons_1_Day","Different_Cities_1_Week","Count_Same_Merchant_Used_Before_1_Week","Has_Been_Abroad","Cash_Transaction","High_Risk_Country")
y: c("Dependent")
training_frame: TrainingHex
validation_frame: CVHex
standardize: FALSE
activation: Rectifier
epochs: 50
seed: 12345
hidden: 5
variable_importances: TRUE
nfolds: 5
adaptive_rate: FALSE

The h2o.deeplearning function takes two character vectors that name the dependent and independent variables. For readability, create these string vectors in advance, rather than using the c() function inside the function call. To create a list of eligible independent variables for the purposes of this example, enter: