9) Boosting and Recalling in C5.

This Blog entry is from the Probability and Trees section in Learn R.

Boosting is a mechanism inside the C5 package that creates many different models and then gives each model the opportunity to vote for a classification, with the most widely voted classification prevailing. It could be argued that this is a form of abstraction.

Simply add the trials argument, set to 10, to indicate that there should be ten trials to vote:

C50Tree <- C5.0(CreditRisk[-1],CreditRisk$Dependent,trials = 10) 

Run the line of script to console:

The summary function will produce a report:

summary(C50Tree)

In this instance, however, upon scrolling up, it can be seen that several different models (trials) have been created:


In the above example, the decision tree for the 9th trial has been evidenced. Prediction takes place in exactly the same manner, using the predict() function, except that it will run several models and establish a voted majority classification. This is boosting:

CreditRiskPrediction <- predict(C50Tree,CreditRisk)

Run the line of script to console:
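
The voted nature of the boosted classification can also be inspected. As a hedged sketch, assuming the C50Tree and CreditRisk objects created above, the type argument of predict() can be set to "prob" to return the class probabilities underlying the vote, rather than the winning classification alone:

# A minimal sketch: class probabilities from the boosted model.
# Assumes C50Tree was trained with trials = 10 as above.
CreditRiskProbabilities <- predict(C50Tree, CreditRisk, type = "prob")
head(CreditRiskProbabilities)   # one column per class, reflecting the weighted vote across trials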

A confusion matrix can be created to compare this object with that created in procedure 100:

CrossTable(CreditRisk$Dependent, CreditRiskPrediction)

Run the line of script to console:

In this example, it can be observed that 281 accounts were predicted to be Bad. Taking the CreditRiskPrediction column-wise, it can be observed that only 1 account was classified as Bad in error. Out of 281 classifications as Bad, the error rate is therefore just 0.3%. Referring to the original model as created, it can be seen that an 11% increase in performance has been achieved from boosting.
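
As a hedged sketch of that arithmetic, assuming the Dependent and CreditRiskPrediction vectors above use the labels "Bad" and "Good", the column-wise error rate quoted here can be recomputed directly from a plain contingency table:

# A minimal sketch of the column-wise error-rate arithmetic.
# The class labels "Bad" and "Good" are assumptions about how Dependent is coded.
ConfusionCounts <- table(Actual = CreditRisk$Dependent,
                         Predicted = CreditRiskPrediction)

BadColumn <- ConfusionCounts[, "Bad"]     # everything predicted as Bad
BadColumn["Good"] / sum(BadColumn)        # proportion of those that were in fact Good, circa 0.003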

There is such a thing as a model being too good, which would indicate that the model is perhaps over-fit. Over-fitting is dealt with in more detail while exploring Gradient Boosting Machines and Neural Networks; however, at this stage it is sufficient to explain that one should never test a model on the same data used to train it.
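
As a hedged sketch of that principle, assuming the CreditRisk data frame from above, the data could be partitioned so that the boosted model is assessed on records it has never seen; the 70/30 split and the seed are arbitrary illustrative choices:

# A minimal sketch of a train / hold-out split for honest assessment.
library(C50)

set.seed(123)                                         # arbitrary seed for reproducibility
TrainRows <- sample(nrow(CreditRisk), floor(0.7 * nrow(CreditRisk)))

CreditRiskTrain <- CreditRisk[TrainRows, ]
CreditRiskTest  <- CreditRisk[-TrainRows, ]

# Train the boosted tree on the training portion only.
C50TreeHoldOut <- C5.0(CreditRiskTrain[-1], CreditRiskTrain$Dependent, trials = 10)

# Assess on the hold-out portion, which played no part in training.
HoldOutPrediction <- predict(C50TreeHoldOut, CreditRiskTest)
mean(HoldOutPrediction == CreditRiskTest$Dependent)   # hold-out accuracy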

8) Expressing Business Rules from C5.

This Blog entry is from the Probability and Trees section in Learn R.

In traversing the C5 decision tree, it is almost certain that, when coming to deploy the model beyond using the predict() function, it will be expressed or programmed as logical statements, for example:

If Status_Of_Existing_Checking_Account < 200 EUR

AND Credit_History in ("All_Paid","No_Credit_Open_Or_All_Paid")

AND Housing = "Owner"

AND Purpose = "New Car"

AND Duration_In_Month < 22 THEN "Good"


To display the model as rules rather than a tree, it is necessary to rebuild the model, specifying the rules argument as TRUE:

C50Tree <- C5.0(CreditRisk[-1],CreditRisk$Dependent,rules=TRUE)

Thereafter, the summary() function can be used to output a series of rules created in the rebuild as opposed to a decision tree:

summary(C50Tree)

Run the line of script to console:

Scrolling up in the console, it can be observed, towards the top, that in place of a decision tree a series of rules has been created:


These rules can be deployed, with very small modification, far more intuitively in a variety of languages, not least SQL.
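
As a hedged sketch of such a deployment in R, taking the illustrative rule above (the column names and thresholds are assumptions, not the exact rule set produced by summary(C50Tree)), one rule can be expressed as a single vectorised logical statement:

# A minimal sketch of deploying one rule as a logical statement.
ApplyRule <- function(Accounts) {
  ifelse(Accounts$Status_Of_Existing_Checking_Account < 200 &
           Accounts$Credit_History %in% c("All_Paid", "No_Credit_Open_Or_All_Paid") &
           Accounts$Housing == "Owner" &
           Accounts$Purpose == "New Car" &
           Accounts$Duration_In_Month < 22,
         "Good",
         NA)                               # NA where this particular rule does not fire
}

# Usage, assuming CreditRisk carries the columns named above:
# RuleClassification <- ApplyRule(CreditRisk)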

6) Creating a Confusion Matrix for a C5 Decision Tree.

This Blog entry is from the Probability and Trees section in Learn R.

Beyond the summary statistic created, the confusion matrix is the most convenient means to appraise the utility of a classification model. The confusion matrix for the C5 decision tree model will be created using the CrossTable() function of the gmodels package:

library("gmodels")
CrossTable(CreditRisk$Dependent, CreditRiskPrediction)

Run the line of script to console:

The overall utility of the C5 decision tree model can be inferred in the same manner as procedure 100.

The confusion matrix shows that 206 records were correctly classified as Bad; taking CreditRiskPrediction column-wise, it can be seen that 28 records were classified as Bad yet were in fact Good. It can be said that there is an 11.9% error rate on records classified as Bad by the model. Taking note of this metric, in procedure 112 boosting will be attempted, which should bring about an improvement of this model.

7) Visualising a C5 Decision Tree.

This Blog entry is from the Probability and Trees section in Learn R.

To visualise a C5 decision tree, the generic plot() function can be used, passing the C5 decision tree model as the argument:

plot(C50Tree)

Run the line of script to console:

It can be seen that a visualisation has been written out to the plots pane:


If the tree is very large, then the zoom feature will need to be used to ensure that the plot fits the screen. Even with zoom, it is possibly more appropriate to communicate the product of C5 decision trees as a list of rules, as covered in the Blog entries that follow.
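
As a hedged sketch of one workaround, assuming the C50Tree object from the preceding Blog entry, a very large tree can instead be written to an image file at a generous size using the base png() graphics device; the file name and dimensions are arbitrary illustrative choices:

# A minimal sketch: render a large C5 tree to a file rather than the Plots pane.
png("C50Tree_plot.png", width = 3000, height = 1500, res = 150)
plot(C50Tree)
dev.off()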

4) Creating a C5 Decision Tree object.

This Blog entry is from the Probability and Trees section in Learn R.

Install the C50 package using RStudio:


Click Install to download and install the package:


The CreditRisk data frame contains loan application data and a dependent variable, titled Dependent for consistency, which details the overall loan performance. The first and most obvious difference between this data frame and those used previously is the extent to which the data is categorical and string based:

View(CreditRisk)

Run the line of script to console:
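
As a hedged sketch, assuming the CreditRisk data frame loaded above, the categorical nature of the columns can also be confirmed programmatically rather than by eye:

# A minimal sketch: inspect the column types of CreditRisk.
str(CreditRisk)               # compact structure, showing factor and character columns
sapply(CreditRisk, class)     # the class of each column as a named vector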

To emphasise, the dataset is far more categorical in nature. To begin training a C5 decision tree, load the library:

library(C50)

Run the line of script to console:

The input parameters to the C5.0() function, which is used to train a decision tree, are slightly different from those observed in preceding Blog entries. A data frame containing only the independent variables (no dependent variable), followed by a vector containing the dependent variable, is required to train a model, and in this regard it differs from many of the other Blog entries.

In this example, the CreditRisk data frame contains both dependent and independent variables and needs splitting, in this case using negative sub-setting to drop the first column, then referencing the dependent variable explicitly:

C50Tree <- C5.0(CreditRisk[-1],CreditRisk$Dependent)

Run the line of script to console:
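
As an aside, the C50 package also provides a formula interface to C5.0(), so the same model can be specified without splitting the data frame by hand. A hedged sketch, assuming Dependent is the dependent variable in CreditRisk:

# A minimal sketch of the equivalent formula interface.
# Dependent ~ . means: model Dependent using all of the remaining columns.
C50TreeFormula <- C5.0(Dependent ~ ., data = CreditRisk)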

The C5 decision tree has now been created and stored in the C50Tree object. To view basic information about the tree:

C50Tree

Run the line of script to console:

Use the summary() function to output the C5 decision tree and view the logic required to implement the classification tool:

summary(C50Tree)

Run the line of script to console:

The summary output is overwhelming; however, scrolling up through the pane of results reveals the decision tree:


The interpretation of this decision tree is very similar to that of a regression tree. One such branch in this example would suggest that the following scenario would yield a bad account:

If Housing = Owner AND Purpose = "New Car" AND the Loan_Duration <= 22 Months Then BAD

In the above example, out of 1000 cases, it can be seen that 7 cases had this disposition:


Scrolling down further, below the tree output, are the performance measures of the model overall:


It can be seen in this example that the error rate has been assessed at 12.2%, suggesting that the model classified correctly 87.8% of the time. A confusion matrix has been written out; however, it is more convenient to use the CrossTable() function for the purposes of understanding false positive ratios.
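
As a hedged sketch of that arithmetic, assuming the C50Tree object and the CreditRisk data frame from above, the quoted accuracy can be reproduced by predicting back onto the training data, noting that testing on the training data flatters the model:

# A minimal sketch reproducing the summary() accuracy figure.
TrainingPrediction <- predict(C50Tree, CreditRisk)
mean(TrainingPrediction == CreditRisk$Dependent)   # circa 0.878, i.e. a 12.2% error rate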