10) Creating a Gradient Boosting Machine.

This blog entry is from the Probability and Trees section in Learn R.

A relatively underutilised classification tool, built upon the concept of boosted decision trees, is the Gradient Boosting Machine, or GBM.  The GBM is a fairly black-box implementation of the methods covered thus far in this section.  The concept of boosting refers to taking under-performing classifications and singling them out for boosting, or rather creating dedicated models targeting the weaker-performing data.  The GBM is provided by the gbm package; as such, install that package by clicking Install in the Packages pane to download and install it.
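Equivalently, the package can be installed from the console rather than the Packages pane; a minimal sketch:

install.packages("gbm")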

Load the library:

library(gbm)

Run the line of script to console.

The warning messages can be ignored, as we can be reasonably assured of backward compatibility between the package build and this version of R.
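If preferred, the load can be wrapped so that the warnings are silenced rather than merely ignored; a minimal sketch:

suppressWarnings(library(gbm))   # load the package, discarding warnings such as the build-version notice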

Creating a GBM is similar to the familiar regression interfaces, except for a few additional parameters relating to the taming of the GBM:

gbm = gbm(Dependent ~ ., data = CreditRisk,  # pass the data by name; the second positional argument of gbm() is distribution, not data
          n.trees = 1000,                    # number of boosting iterations (trees)
          shrinkage = 0.01,                  # learning rate; smaller values generally need more trees
          distribution = "gaussian",         # loss function to optimise
          interaction.depth = 7,             # maximum depth of each individual tree
          bag.fraction = 0.9,                # fraction of the data randomly sampled to build each tree
          cv.folds = 10,                     # ten-fold cross validation
          n.minobsinnode = 50                # minimum number of observations in a terminal node
)

Run the block of script to console; fitting 1,000 trees with ten-fold cross validation may take some time.

To review the performance statistics of the GBM, simply recall the model:

gbm
Run the line of script to console.

The most salient information in this summary is that 1000 iterations were performed, with the best cross-validation iteration being tree 542.  A visual inspection of the cross-validation error can be presented by:

gbm.perf(gbm)
Run the line of script to console.

It can be seen that a vertical line has been drawn at the point where the cross-validation error started to diverge from the training error, in this case at tree 542.
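gbm.perf() also returns the estimated optimal number of trees, which can be stored for later use when scoring; a minimal sketch, assuming the model object created above (the variable name best.iter is illustrative):

best.iter = gbm.perf(gbm, method = "cv")   # optimal iteration under cross validation; also draws the plot
best.iter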

As decision trees can become a little unwieldy, it might be prudent to inspect the relative importance of each of the independent variables, with a view to pruning them and rerunning the GBM training.  To understand the importance of each independent variable, wrap the summary function around the GBM model object:

summary(gbm)

Run the line of script to console.

The most important variable is written out first, with the least important written out last.  The same information is displayed in a bar chart, giving the overall usefulness of the independent variables at a glance.
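The relative influence can also be captured as a data frame, and the fitted model used to score data at the optimal number of trees; a minimal sketch, assuming the CreditRisk data frame used in training and the best.iter value stored above (the names ri and predictions are illustrative):

ri = summary(gbm, plotit = FALSE)                                      # relative influence table, without drawing the bar chart
predictions = predict(gbm, newdata = CreditRisk, n.trees = best.iter)  # score using only the optimal number of trees
head(predictions)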