9) Grading the ROC Performance with AUC.

This Blog entry is from the Logistic Regression section in Learn R.

Visually the ROC Curve plot created in the previous Blog entry suggests a that the model created has some predictive power.  A more succinct method to measure model performance is the Area Under Curve statistics which can be calculated with ease by requesting "auc" as the measure to the performance object:

AUC <- performance(ROCRPredictions,measure = "auc")
1.png

Run the line of script to console:

2.png

To write out the contents of the AUC object:

AUC

3.png

Run the line of script to console:

4.png

The value to gravitate towards is the y.values,  which will have a value ranging between 0.5 and 1:

5.png

In this example, the AUC value is 0.827767 which suggests that the model has an excellent utility. By way of grading, AUC scores would correspond:

·         A: Outstanding > 0.9

·         B: Excellent > 0.8 and <= 0.9

·         C: Acceptable > 0.7 and <= 0.8

·         D: Poor > 0.6 and <= 0.7

·         E: Junk > 0.5 and <= 0.6

2) Create an Abstraction Deviation Independent Vector.

This Blog entry is from the Logistic Regression section in Learn R.

In behavioural analytics, especially, one of the most powerful improvements that can be made to a variable is a transformation to compare the value for that records against the value typically observed in this vector for a customer \ product \ portfolio.  There are of course several normalisations that are appropriate for such a task, such as a Z score, however in this instance given the data being skewed a range normalisation may be more appropriate.

A range normalisation will establish the largest value observed in the vector, the smallest value and establish where a test value exists on that range in percentage terms.  In this example, a range normalisation will be performed on the columns Count_Transactions_1_Day.  Firstly, establish the maximum and minimum values:

Min_Count_Transactions_1_Day <- min(FraudRisk$Count_Transactions_1_Day)
Max_Count_Transactions_1_Day <- max(FraudRisk$Count_Transactions_1_Day)
1.png

Run the block of script to console:

2.png

At this stage, the minimum and maximum values have been stored as vectors for Count_Transactions_1_Day.  To create a new vector as a range normalisation:

Range_Deviation_Count_Transactions_1_Day <- (FraudRisk$Count_Transactions_1_Day - Min_Count_Transactions_1_Day) / (Max_Count_Transactions_1_Day - Min_Count_Transactions_1_Day)
3.png

Run the line of script to console:

4.png

Append the newly created vector to the FraudRisk data frame:

FraudRisk <- mutate(FraudRisk, Range_Deviation_Count_Transactions_1_Day)
5.png

Run the line of script to console:

6.png

It can be seen that a plot has been created between the variable Count_Unsafe_Terminals_1_Day and the Dependent variable, and on the basis, that the fraud can either be or not, it has plotted nothing between the points on the Y axis: