9) Grading the ROC Performance with AUC.

This Blog entry is from the Logistic Regression section in Learn R.

Visually, the ROC curve plot created in the previous Blog entry suggests that the model has some predictive power.  A more succinct measure of model performance is the Area Under Curve (AUC) statistic, which can be calculated with ease by requesting "auc" as the measure when creating the performance object:

``AUC <- performance(ROCRPredictions,measure = "auc")``

Run the line of script to console:

To write out the contents of the AUC object:

``AUC``

Run the line of script to console:

The value to look for is y.values, which for a model with any predictive power will range between 0.5 (no better than chance) and 1 (perfect discrimination).
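Because y.values is stored as a list inside the performance object, the number itself can be pulled out for later use.  A minimal sketch, assuming the ROCR package is installed and using small made-up vectors (ToyScores and ToyLabels are hypothetical stand-ins for the FraudRisk columns):

```r
library(ROCR)

# Hypothetical toy data: predicted probabilities and actual outcomes (1 = fraud)
ToyScores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
ToyLabels <- c(1, 1, 0, 1, 0, 1, 0, 0)

ToyPredictions <- prediction(ToyScores, ToyLabels)
ToyAUC <- performance(ToyPredictions, measure = "auc")

# y.values is a list; take its first element as a plain number
AUCValue <- ToyAUC@y.values[[1]]
print(AUCValue)
```

In the walkthrough itself, the equivalent line would presumably be AUC@y.values[[1]].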

In this example, the AUC value is 0.827767, which suggests that the model has excellent utility. By way of grading, AUC scores correspond as follows:

· A: Outstanding > 0.9
· B: Excellent > 0.8 and <= 0.9
· C: Acceptable > 0.7 and <= 0.8
· D: Poor > 0.6 and <= 0.7
· E: Junk > 0.5 and <= 0.6
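The grading above can be sketched as a small lookup in base R.  GradeAUC is a hypothetical helper name, not part of ROCR, and the breaks simply mirror the list above:

```r
# Hypothetical helper: translate an AUC value into the letter grades listed above
GradeAUC <- function(AUCValue) {
  Grades <- cut(AUCValue,
                breaks = c(0.5, 0.6, 0.7, 0.8, 0.9, 1),
                labels = c("E", "D", "C", "B", "A"),
                right = TRUE)  # each band excludes its lower bound and includes its upper bound
  as.character(Grades)
}

print(GradeAUC(0.827767))
```

Values of 0.5 or below fall outside every band and come back as NA, which matches the list above not assigning them a grade.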

8) Creating a ROC Curve.

The ROCR package provides a set of functions that simplify the process of appraising the performance of classification models, comparing the actual outcome with a probability prediction.  Note that although a logistic regression model outputs log-odds, typically between about -5 and +5, the predictions used here are those already converted to an intuitive probability.

Firstly, install the ROCR package from the RStudio package installation utility.

Click install to proceed with the installation:

Reference the ROCR library:

``library(ROCR)``

Run the block of script to console:

Two vectors are needed as inputs to create the visualisation: the first is the predictions expressed as a probability, the second is the actual outcome.  In this example, these are the vectors FraudRisk$PAutomaticLogisticRegression and FraudRisk$Dependent.  To create the predictions object in ROCR:

``ROCRPredictions <- prediction(FraudRisk$PAutomaticLogisticRegression, FraudRisk$Dependent)``

Once the prediction object has been created, it needs to be turned into a performance object using the performance() function.  The performance() function takes the prediction object and also an indication of the performance measures to be used, in this case true positive rate (tpr) vs false positive rate (fpr).  The performance() function outputs an object that can be used in conjunction with the base graphic plot() function:

``ROCRPerformance <- performance(ROCRPredictions,measure = "tpr",x.measure = "fpr")``

Run the line of script to console:

Simply plot the ROCRPerformance object by passing as an argument to the plot() base graphic function:
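A minimal sketch of the plotting step, assuming the ROCR package is installed.  The ToyScores and ToyLabels vectors are hypothetical stand-ins for the FraudRisk columns, and the dashed diagonal drawn with abline() is an optional addition that marks where a model with no predictive power would sit:

```r
library(ROCR)

# Hypothetical toy data standing in for the FraudRisk columns
ToyScores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
ToyLabels <- c(1, 1, 0, 1, 0, 1, 0, 0)

ToyPredictions <- prediction(ToyScores, ToyLabels)
ToyPerformance <- performance(ToyPredictions, measure = "tpr", x.measure = "fpr")

# Draw the ROC curve, then add a dashed diagonal as the "no predictive power" reference
plot(ToyPerformance)
abline(a = 0, b = 1, lty = 2)
```

In the walkthrough itself, the equivalent call would presumably be plot(ROCRPerformance).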

Run the line of script to console:

A curve has been plotted in the Plots window in RStudio:

The plotted line departs from the diagonal, leading to the inference that the model has some predictive power.

7) Output Logistic Regression Model as Probability.

The logistic regression output is a log-odds value, typically ranging from about -5 to +5, yet oftentimes it is substantially more intuitive to present this output as a probability.  The formula to convert a logistic regression output to a probability is:

P = exp(Output) / (1 + exp(Output))

It follows that vector arithmetic can be used, simply swapping the output with a vector of values created by the logistic regression model:
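A minimal sketch of that vector arithmetic in base R, using a made-up vector (ToyOutput is hypothetical and stands in for the model output).  In the walkthrough, the line would presumably assign exp(AutomaticLogisticRegression) / (1 + exp(AutomaticLogisticRegression)) to PAutomaticLogisticRegression:

```r
# Hypothetical log-odds values standing in for the model output vector
ToyOutput <- c(-5, -1, 0, 1, 5)

# Element-wise conversion of log-odds to probability
ToyProbability <- exp(ToyOutput) / (1 + exp(ToyOutput))
print(round(ToyProbability, 4))

# Base R's plogis() applies the same logistic transformation
print(all.equal(ToyProbability, plogis(ToyOutput)))
```

An output of zero maps to a probability of exactly 0.5, which is why zero is used as the cut-off when activating the model.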

Run the line of script to console:

For completeness merge the probability values into the FraudRisk data frame:

``FraudRisk <- mutate(FraudRisk, PAutomaticLogisticRegression)``

Run the line of script to console:

6) Activating Logistic Regression and Creating a Confusion Matrix.

A logistic regression model outputs log-odds, typically falling between about -5 and +5, with large negative values approaching zero probability and large positive values approaching 100% probability.  Zero represents a 50/50 probability; anything greater than zero denotes the outcome being more likely than not.

Suppose that activation is to take place on the balance of probabilities, so that anything greater than 0 is treated as a prediction of fraud.  The ifelse() function can facilitate the creation of an activation function:

``ActivateAutomaticLogisticRegression <- ifelse(AutomaticLogisticRegression > 0,1,0)``

Run the line of script to console:

For completeness merge the Activated Logistic Regression model into the fraud risk data frame:

``FraudRisk <- mutate(FraudRisk, ActivateAutomaticLogisticRegression)``

Run the line of script to console:

To create a confusion matrix with the table() function, comparing the actual outcome (Dependent) with the prediction (ActivateAutomaticLogisticRegression):

``table(FraudRisk$Dependent, FraudRisk$ActivateAutomaticLogisticRegression)``

Run the line of script to console to output the confusion matrix:

In this example, it can be seen that of 901 records in total, 576 were judged to be fraudulent by the model and were fraudulent in actuality, some 63.9%, a figure for which improvement should be sought via stepwise logistic regression.
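Rather than reading such figures off the table by hand, they can be calculated from the table object.  A minimal sketch in base R, using made-up vectors (ToyActual and ToyPredicted are hypothetical and are not the FraudRisk data):

```r
# Hypothetical toy vectors: actual outcome and activated prediction (1 = fraud)
ToyActual    <- factor(c(1, 1, 1, 0, 0, 0, 1, 0), levels = c(0, 1))
ToyPredicted <- factor(c(1, 1, 0, 0, 0, 1, 1, 0), levels = c(0, 1))

# Rows are the actual outcome, columns the prediction
ToyConfusion <- table(Actual = ToyActual, Predicted = ToyPredicted)

# Records predicted as fraud that were fraud in actuality
TruePositives <- ToyConfusion["1", "1"]

# Share of all records that are true positives
TruePositiveShare <- TruePositives / sum(ToyConfusion)

# Share of all records classified correctly (the diagonal of the table)
Accuracy <- sum(diag(ToyConfusion)) / sum(ToyConfusion)

print(ToyConfusion)
print(Accuracy)
```

Fixing the factor levels to 0 and 1 keeps the table square even if one class is never predicted, so the diagonal always holds the correct classifications.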

Calculating the performance of the confusion matrix in this manner is quite laborious, and there exist several packages that help lay out the confusion matrix with more readily available performance measures.  Install the gmodels package:

Once the gmodels library is installed, it needs to be referenced.  To create the confusion matrix, the line of script resembles the table() call almost exactly, except that it makes use of the CrossTable() function of the gmodels package:

``library("gmodels")``
``CrossTable(FraudRisk$Dependent, FraudRisk$ActivateAutomaticLogisticRegression)``

Run the line of script to console:

It can be seen that a confusion matrix has been created in much the same manner, except that it also includes summary statistics across both axes of the table.

5) Recalling a Logistic Regression Model

Deploying a logistic regression model is fairly self-explanatory: recall is performed in the same manner as for a linear regression model, as described beforehand.  As with lm(), a glm() model has a predict.glm() function to create a prediction for every row in a data frame.  The signature closely resembles that of the predict.lm() function:

``AutomaticLogisticRegression <- predict.glm(LogisticRegressionModel,FraudRisk)``

Run the line of script to console:

It can be seen that a new vector has been created in the environment pane, containing the predictions for each entry in the FraudRisk data frame.
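By default these predictions are on the log-odds scale.  As an aside, predict() can also return probabilities directly via type = "response".  A minimal sketch, using a toy model fitted to the built-in mtcars data (ToyModel is hypothetical and unrelated to LogisticRegressionModel):

```r
# Toy model on the built-in mtcars data (not the FraudRisk data from this walkthrough)
ToyModel <- glm(am ~ wt, data = mtcars, family = binomial)

# The default prediction is on the link (log-odds) scale
ToyLogOdds <- predict(ToyModel, mtcars)

# type = "response" returns probabilities between 0 and 1 directly
ToyProbabilities <- predict(ToyModel, mtcars, type = "response")

# The two agree once the log-odds are passed through the logistic function
print(all.equal(plogis(ToyLogOdds), ToyProbabilities))
```

The walkthrough keeps the default log-odds output because the activation step that follows uses zero as its cut-off.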

For completeness, merge the newly created vector into the FraudRisk data frame:

``FraudRisk <- mutate(FraudRisk, AutomaticLogisticRegression)``

Run the line of script to console: