11) Recalling a Gradient Boosting Machine.

This Blog entry is from the Probability and Trees section in Learn R.

Recalling the GBM is quite intuitive and obeys the standardised predict() signature.  To recall the GBM:

GBMPredictions <- predict(GBM, CreditRisk, type = "response")
1.png

Run the line of script to console:

2.png
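
As an aside, when n.trees is omitted, predict() on a gbm object selects a tree count itself and reports it to the console.  The cross-validated optimum can instead be pinned down explicitly; a minimal sketch, assuming the model was trained with cross validation as in the companion entry:

BestTrees <- gbm.perf(GBM, method = "cv", plot.it = FALSE)
GBMPredictions <- predict(GBM, CreditRisk, n.trees = BestTrees, type = "response")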

A distinct peculiarity, given that the dependent variable of the CreditRisk data frame is a factor, is that the binary classification has been modelled between 1 and 2, these being the underlying integer codes of the factor levels, with 1 being Bad and 2 being Good:

3.png
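
The coding can be confirmed directly from the factor itself:

levels(CreditRisk$Dependent)   # "Bad" "Good", i.e. codes 1 and 2 respectively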

It follows that predictions closer to 2 than to 1 would be considered Good, and those closer to 1, Bad.  To appraise the model performance, a confusion matrix should be created.  Create a vector using the ifelse() function to classify between Good and Bad:

CreditRiskGBMClassifications <- ifelse(GBMPredictions >= 1.5, "Good", "Bad")
4.png

Run the line of script to console:

5.png
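
Before building the confusion matrix, it is worth a quick sanity check on the split produced by the 1.5 cut-off:

table(CreditRiskGBMClassifications)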

Create a confusion matrix between the actual value and the value predicted by the GBM, using the CrossTable() function from the gmodels package:

library(gmodels)
CrossTable(CreditRisk$Dependent, CreditRiskGBMClassifications)
6.png

Run the line of script to console:

7.png
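
As a cross-check on the CrossTable() output, the overall accuracy across both classes can be computed directly; a minimal one-line sketch:

mean(CreditRiskGBMClassifications == as.character(CreditRisk$Dependent))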

It can be seen in this example that the GBM has produced a strong performance.  Of the 220 accounts that were Bad, the GBM classified 182 correctly, an accuracy of roughly 82% on that class. This is a more realistic figure than the one achieved with C5 boosting, as over-fitting has been contended with.

10) Creating a Gradient Boosting Machine.

This Blog entry is from the Probability and Trees section in Learn R.

A relatively underutilised classification tool, built upon the concept of boosted decision trees, is the Gradient Boosting Machine, or GBM.  The GBM is a fairly black-box implementation of the methods covered thus far in this section.  Boosting refers to fitting trees in sequence, with each new tree concentrating on the observations the previous trees classified poorly, so that the weaker-performing data receives a dedicated model.  The GBM is provided by the gbm package, as such install that package:

1.png

Click Install to download and install the package:

2.png
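
Equivalently, the installation can be scripted rather than performed through the Install dialog:

install.packages("gbm")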

Load the library:

library(gbm)
3.png

Run the line of script to console:

4.png

The warning messages can be ignored as we can be reasonably assured of backward compatibility between the package build and this version of R.

Creating a GBM is similar to the familiar regression interfaces, except that a few extra parameters are available for taming the GBM:

gbm = gbm(Dependent ~ .,
          data = CreditRisk,           # data must be passed via the data argument
          distribution = "gaussian",   # the factor codes (1 = Bad, 2 = Good) are modelled as numeric
          n.trees = 1000,              # number of boosting iterations
          shrinkage = 0.01,            # learning rate
          interaction.depth = 7,       # maximum depth of each tree
          bag.fraction = 0.9,          # proportion of the data sampled for each tree
          cv.folds = 10,               # ten-fold cross validation
          n.minobsinnode = 50          # minimum observations in a terminal node
)
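
Because bag.fraction sampling and cross validation both draw random samples, the fitted model will vary slightly from run to run.  If reproducibility is wanted, a seed can be set before running the block (an optional step, not part of the original script):

set.seed(42)   # any fixed value will do; 42 is arbitrary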

Run the block of script to console:

5.png

Run the line of script to console; it may take some time:

6.png

To review the performance statistics of the GBM, simply recall the model:

gbm
7.png

Run the line of script to console:

8.png

The most salient information in this summary is that 1000 iterations were performed, with the cross validation diverging at tree 542.  A visual inspection of the cross validation can be produced with:

gbm.perf(gbm)
9.png

Run the line of script to console:

10.png

It can be seen that the vertical line has been drawn at the point where the divergence started:

11.png
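
gbm.perf() also returns the estimated optimal number of trees, so the iteration at which the line is drawn can be captured for later use, for example in predict():

BestIteration <- gbm.perf(gbm, method = "cv", plot.it = FALSE)
BestIteration   # in this example, expected to be in the region of tree 542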

As decision trees can become a little unwieldy, it may be prudent to inspect the relative importance of each of the independent variables, with a view to pruning and rerunning the GBM training.  To understand the importance of each independent variable, wrap the summary() function around the GBM:

summary(gbm)
12.png

Run the line of script to console:

13.png
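
The same ranking is also returned as a data frame, which is convenient when choosing variables to prune programmatically; a minimal sketch using the plotit argument to suppress the chart:

GBMImportance <- summary(gbm, plotit = FALSE)
head(GBMImportance)   # columns var and rel.inf, ordered by relative influence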

The most important variable is written out first and the least important last.  This is also displayed as a bar chart, giving the overall usefulness of the independent variables at a glance:

14.png