10) Creating a Gradient Boosting Machine.

This Blog entry is from the Probability and Trees section in Learn R.

A relatively underutilised classification tool, built upon the concept of boosted decision trees, is the Gradient Boosting Machine, or GBM.  The GBM is a fairly black-box implementation of the methods covered thus far in this section.  The concept of boosting refers to singling out the cases that were classified poorly and fitting further models that target that weaker-performing data, so that each successive tree concentrates on the errors of the trees before it.  The GBM is provided by the gbm package, so install that package:

1.png

Click Install to download and install the package:

2.png
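Alternatively, and assuming an internet connection is available, the same installation can be performed directly from the console:

# Download and install the gbm package from CRAN
install.packages("gbm")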

Load the library:

library(gbm)
3.png

Run the line of script to console:

4.png

The warning messages can be ignored as we can be reasonably assured of backward compatibility between the package build and this version of R.

Creating a GBM is similar to the familiar regression interfaces, except for a few additional parameters relating to the taming of the GBM:

gbm = gbm(Dependent ~ .,
          data = CreditRisk,
          n.trees = 1000,            # number of trees (boosting iterations) to fit
          shrinkage = 0.01,          # learning rate; smaller values need more trees
          distribution = "gaussian",
          interaction.depth = 7,     # maximum depth of each tree
          bag.fraction = 0.9,        # proportion of the data sampled for each tree
          cv.folds = 10,             # number of cross-validation folds
          n.minobsinnode = 50        # minimum observations in a terminal node
)

Run the block of script to console:

5.png

Run the line of script to console; it may take some time:

6.png

To review the performance statistics of the GBM, simply recall the model:

gbm
7.png

Run the line of script to console:

8.png

The most salient information in this summary is that 1000 iterations were performed, with the cross-validation error diverging at tree 542.  The cross-validation performance can be inspected visually with:

gbm.perf(gbm)
9.png

Run the line of script to console:

10.png

It can be seen that a line has been drawn at the point where the divergence started:

11.png
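As a minimal sketch, the optimal number of trees identified by the cross-validation can also be captured as a value and passed to predict(), so that scoring stops at the tree where divergence began (the method = "cv" argument and the reuse of the CreditRisk data frame here are assumptions for illustration):

# Return the estimated optimal number of trees from cross-validation
best.trees <- gbm.perf(gbm, method = "cv")

# Score the training data using only the optimal number of trees
predictions <- predict(gbm, CreditRisk, n.trees = best.trees)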

As decision trees can become a little unwieldy, it might be prudent to inspect the relative importance of each of the independent variables with a view to pruning and rerunning the GBM training.  To understand the importance of each Independent Variable, wrap the summary function around the GBM:

summary(gbm)
12.png

Run the line of script to console:

13.png

The most useful and important variable is written out first, with the least important written out last.  This is also displayed in a bar chart, giving the overall usefulness of the independent variables at a glance:

14.png
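As a minimal sketch, the same relative influence figures can be captured as a data frame (the plotit argument suppresses the bar chart) and used to shortlist variables before pruning and retraining; the 1% cut-off used here is an arbitrary assumption for illustration:

# Capture the relative influence of each independent variable without plotting
importance <- summary(gbm, plotit = FALSE)

# Keep only the variables contributing more than 1% of the relative influence
importance[importance$rel.inf > 1, ]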

7) Visualising a C5 Decision Tree.

This Blog entry is from the Probability and Trees section in Learn R.

To visualise a C5 Decision tree, the plot() function from the R base functions can be used, passing the C5 decision tree model as the argument:

plot(C50Tree)
1.png

Run the line of script to console:

2.png

It can be seen that a visualisation has been written out to the plots pane:

3.png

If the tree is very large, then the zoom feature will need to be used to ensure that the plot fits the screen.  Even with zoom, it is possibly more appropriate to communicate the product of C5 decision trees as a list of rules, as covered in Blog entries that follow.
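Where the full tree is too large to read even when zoomed, a minimal sketch of another option is to plot only part of the tree; the plot method for C5.0 objects accepts a subtree argument, and the node number 5 used here is an arbitrary assumption for illustration:

# Plot only the branch of the tree below node 5
plot(C50Tree, subtree = 5)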

4) Creating a C5 Decision Tree object.

This Blog entry is from the Probability and Trees section in Learn R.

Install the C50 package using RStudio:

1.png

Click Install to download and install the package:

2.png

The CreditRisk data frame contains loan application data and a dependent variable which details the overall loan performance, titled Dependent for consistency.  The first and most obvious difference between this data frame and those used previously is the extent to which data is categorical and string based:

View(CreditRisk)
3.png

Run the line of script to console:

4.png
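The class of each column can also be confirmed from the console with the base str() function, which is a quick way to see how much of the data is categorical or string-based:

# List each column of the data frame along with its class
str(CreditRisk)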

To emphasise, the dataset is far more categorical in nature.  To begin training a C5 Decision Tree, load the library:

library(C50)
5.png

Run the line of script to console:

6.png

The input parameters to the C5.0() function, which is used to train a decision tree, are slightly different to those observed in preceding Blog entries.  To train a model, a data frame containing only the independent variables (no dependent variable) is required, followed by a vector containing the dependent variable; in this regard, it differs from many of the other Blog entries.

In this example, the CreditRisk data frame contains both the dependent and independent variables and needs splitting, in this case using negative sub-setting to drop the first column and then referencing the dependent variable explicitly:

C50Tree <- C5.0(CreditRisk[-1], CreditRisk$Dependent)
7.png

Run the line of script to console:

8.png

The C5 decision tree has now been created and stored in the C50Tree object.  To view basic information about the tree:

C50Tree
9.png

Run the line of script to console:

10.png

Use the summary() function to output the C5 decision tree and view the logic required to implement the classification tool:

summary(C50Tree)
11.png

Run the line of script to console:

12.png

The summary output is overwhelming; however, scrolling up through the pane of results reveals the decision tree:

13.png

The interpretation of this decision tree is very similar to that of a regression tree.  One branch in this example suggests that the following scenario would yield a bad account:

If Housing = Owner AND Purpose = "New Car" AND the Loan_Duration <= 22 Months Then BAD

In the above example, out of 1000 cases, it can be seen that 7 cases had this disposition:

14.png

Scrolling down further, below the tree output, are the performance measures of the model overall:

15.png

It can be seen in this example that the error rate has been assessed at 12.2%, suggesting that the model classified correctly 87.8% of the time.  A confusion matrix has been written out; however, it is more convenient to use the CrossTable function for the purposes of understanding false positive ratios.
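As a minimal sketch of that approach, assuming the gmodels package is installed, the training data can be scored with predict() and cross-tabulated against the actual outcomes:

library(gmodels)

# Score the training data with the C5 decision tree
predictions <- predict(C50Tree, CreditRisk[-1])

# Cross-tabulate actual classifications against predicted classifications
CrossTable(CreditRisk$Dependent, predictions, prop.chisq = FALSE)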