9) Boosting and Recalling in C5.

This Blog entry is from the Probability and Trees section in Learn R.

Boosting is a mechanism inside the C50 package that creates many different models and then gives each model the opportunity to vote on a classification, with the classification voted for by the majority of models prevailing. It could be argued that this is a form of ensemble learning.

Simply add the argument trials = 10 to indicate that there should be ten trials to vote:

C50Tree <- C5.0(CreditRisk[-1],CreditRisk$Dependent,trials = 10) 

Run the line of script to console.
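As a minimal sketch, assuming the C50 package is installed and that CreditRisk is a data frame whose first column, Dependent, is a factor holding the known classification, the boosted model could be built as follows:

library(C50)                                                    # the C5.0 algorithm is provided by the C50 package
CreditRisk$Dependent <- as.factor(CreditRisk$Dependent)         # C5.0 expects the target to be a factor
C50Tree <- C5.0(CreditRisk[-1], CreditRisk$Dependent, trials = 10)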

The summary function will produce a report:

summary(C50Tree)

In this instance, however, upon scrolling up, it can be seen that several different models (trials) have been created.

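The number of boosting iterations actually built can also be confirmed from the fitted object itself; this is a sketch assuming the C50Tree object created above, and that the trials element behaves as documented in recent versions of the C50 package:

C50Tree$trials     # named vector giving the number of trials requested and the number actually used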

In the summary output, the decision tree for the 9th trial is shown.  Prediction takes place in exactly the same manner, using the predict() function, except that it runs the several models and establishes the voted majority classification.  This is boosting:

CreditRiskPrediction <- predict(C50Tree,CreditRisk)

Run the line of script to console.
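The predict() function for C5.0 objects also accepts a type argument; as a minimal sketch, assuming the objects created above, the class probabilities underlying the vote can be returned instead of the voted label (type = "class" is the default):

CreditRiskProbability <- predict(C50Tree, CreditRisk, type = "prob")   # per-class probabilities rather than the voted label
head(CreditRiskProbability)                                            # inspect the first few rows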

A confusion matrix can be created, in the same manner as procedure 100, to compare the predicted classifications against the actual classifications:

library(gmodels)   # CrossTable() is provided by the gmodels package (assumed loaded in an earlier procedure)
CrossTable(CreditRisk$Dependent, CreditRiskPrediction)

Run the line of script to console.

In this example, taking the CreditRiskPrediction column-wise, it can be observed that 281 accounts were predicted to be bad, and that only 1 of those classifications as bad was made in error.  Out of 281 classifications as bad, the error rate is just 0.3%.  Compared with the original model, an 11% increase in performance has been achieved through boosting.
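The same figures can be computed directly from the predictions rather than read off the cross tabulation; a minimal sketch, assuming the objects created above:

ConfusionMatrix <- table(Actual = CreditRisk$Dependent, Predicted = CreditRiskPrediction)
ConfusionMatrix                                            # rows are actual, columns are predicted classifications
1 / 281                                                    # the error among accounts predicted bad, roughly 0.3%
mean(CreditRiskPrediction != CreditRisk$Dependent)         # overall misclassification rate across all accounts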

There is such a thing as a model being too good, which would suggest that the model is perhaps over-fit. Over-fitting is dealt with in more detail when exploring Gradient Boosting Machines and Neural Networks; at this stage it is sufficient to say that one should never test a model on the same data that was used to train it.
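As a minimal sketch of that point, the CreditRisk data frame could be split into a training set and a hold-out set before the boosted model is built; the 70/30 split and the seed below are illustrative assumptions:

set.seed(123)                                                          # illustrative seed for reproducibility
TrainIndex <- sample(nrow(CreditRisk), floor(0.7 * nrow(CreditRisk)))  # random 70% of the rows
CreditRiskTrain <- CreditRisk[TrainIndex, ]                            # used to train the model
CreditRiskTest  <- CreditRisk[-TrainIndex, ]                           # held back for testing
C50TreeHoldOut <- C5.0(CreditRiskTrain[-1], CreditRiskTrain$Dependent, trials = 10)
HoldOutPrediction <- predict(C50TreeHoldOut, CreditRiskTest)
mean(HoldOutPrediction != CreditRiskTest$Dependent)                    # error rate on data the model has not seen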