Hi there! Get ready to become a bagged tree expert! Bagged trees are famous for improving the predictive capability of a single decision tree. The way we use & evaluate them in R is also very similar to decision trees. Check out my other post on decision trees if you aren’t familiar with them, as they play into the performance of bagged trees. https://datasciencelessons.com/2019/07/30/learn-classification-with-decision-trees-in-r/
Why use bagged trees
The main idea behind bagged trees is that rather than depending on a single decision tree, you are depending on many decision trees, which allows you to leverage the insight of many models.
When considering the performance of a model, we often consider what’s known as the bias-variance trade-off. Variance describes how sensitive the model is to small fluctuations in the training data, and how much those fluctuations can throw off its predictions. Bias, on the other hand, results in under-fitting: the model makes incorrect assumptions about the relationships between variables.
You could say the issue with high variance is that while your model may be directionally correct, it isn’t very precise; a highly biased model may have low variance, but it can be directionally incorrect entirely.
The biggest issue with decision trees in general is that they have high variance: any minor change to the data can result in major changes to the model and its future predictions.
This is where bagging comes into play: one of the benefits of bagged trees is that they help minimize variance while holding bias roughly constant.
Why not use bagged trees
One of the main issues with bagged trees is that they are incredibly difficult to interpret. In the decision trees lesson, we learned that a major benefit of decision trees is that they are considerably easy to interpret. Bagged trees are the opposite in this regard, as their process lends itself to complexity. I’ll explain that in more depth shortly.
What is bagging?
Bagging stands for Bootstrap Aggregation. It is what is known as an ensemble method, an approach that combines the predictions of multiple models trained on different samples of the data.
So now you might be thinking… ok cool, so what is bootstrap aggregation…
What happens is that the model samples a subset of the data (with replacement, meaning the same record can be included multiple times) and trains a decision tree on it; no different from a single decision tree so far. But then additional bootstrap samples are taken, new trees are trained on each, and their predictions are aggregated: averaged for regression, or decided by majority vote for classification. A bagged tree model could include 5 trees, 50 trees, 100 trees, and so on. Each tree in your ensemble may end up with different features, terminal node counts, data, etc.
As you can imagine, a bagged tree is very difficult to interpret.
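To make the process above concrete, here is a minimal sketch of the bagging idea in base R. It uses single-split “stumps” in place of full decision trees, and the toy data, the stump logic, and the `predict_bagged` helper are all made up for illustration; this is not how `ipred` implements bagging internally.

```r
# Minimal sketch of bagging by hand: train B models on bootstrap
# samples, then aggregate their predictions by majority vote.
set.seed(123)

# Toy data: predict a binary label from one numeric feature
df <- data.frame(x = rnorm(100))
df$y <- ifelse(df$x + rnorm(100, sd = 0.5) > 0, "yes", "no")

B <- 25  # number of bootstrap samples, matching ipred's default of 25 trees

# Each "model" here is just a stump: one split on x. A real bagged
# ensemble would grow a full decision tree on each bootstrap sample.
stumps <- lapply(1:B, function(b) {
  boot  <- df[sample(nrow(df), replace = TRUE), ]  # sample WITH replacement
  split <- median(boot$x)                          # crude split point
  list(split = split,
       above = names(which.max(table(boot$y[boot$x > split]))),
       below = names(which.max(table(boot$y[boot$x <= split]))))
})

# Aggregate: each stump votes, and the majority wins
# (for regression you would average instead)
predict_bagged <- function(stumps, x) {
  votes <- sapply(stumps, function(s) if (x > s$split) s$above else s$below)
  names(which.max(table(votes)))
}

predict_bagged(stumps, 1.5)
```

Because each stump sees a different bootstrap sample, their individual quirks tend to cancel out in the vote, which is exactly the variance reduction described above.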
Train a Bagged Tree
To start off we’ll break out our training and test sets. I’m not going to talk much about the train/test split here. We’ll be doing this with the Titanic dataset from the `titanic` package.
```r
n <- nrow(titanic_train)
n_train <- round(0.8 * n)
set.seed(123)
train_indices <- sample(1:n, n_train)
train <- titanic_train[train_indices, ]
test <- titanic_train[-train_indices, ]
```
Now that we have our train & test sets broken out, let’s load up the `ipred` package. This will allow us to run the `bagging` function.
A couple of things to keep in mind: the formula indicates that we want to model `Survived` by `Pclass + Sex + Age + SibSp + Parch + Fare + Embarked`. From there you can see that we’re using the train dataset to train this model. Finally, you can see the parameter `coob`; this confirms whether we’d like to test performance on an out-of-bag sample.
Remember how I said that each tree re-samples the data? Well, that process leaves a handful of records that are never used for training, and they make up an excellent dataset for testing the model’s performance. This happens within the `bagging` function, as you’ll see when we print the model.
```r
library(ipred)
set.seed(123)
model <- bagging(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
                 data = train,
                 coob = TRUE)
print(model)
```
As you can see we trained the default of 25 trees in our bagged tree model.
We use the same process to predict for our test set as we use for decision trees.
```r
pred <- predict(object = model,
                newdata = test,
                type = "class")
print(pred)
```
Now that we’ve trained our model and predicted for our test set, it’s time to break down different methods of performance evaluation.
ROC Curve & AUC
ROC Curve or Receiver Operating Characteristic Curve is a method for visualizing the capability of a binary classification model to diagnose or predict correctly. The ROC Curve plots the true positive rate against the false positive rate at various thresholds.
Our target for the ROC Curve is that the true positive rate is 100% and the false positive rate is 0%. That curve would fall in the top left corner of the plot.
AUC is intended to measure the degree of separability, or the ability to correctly predict the class. The higher the AUC the better: 1 would be perfect, and 0.5 would be no better than random guessing.
We’ll be using the `Metrics` package to calculate the AUC for our dataset.
```r
library(Metrics)
pred <- predict(object = model,
                newdata = test,
                type = "prob")
auc(actual = test$Survived, predicted = pred[, "yes"])
```
Here you can see that I changed the type to `"prob"` to return a percentage likelihood rather than the classification. This is needed to calculate AUC.
This returned an AUC of 0.89, which is not bad at all.
In classification, the idea of a cutoff threshold means that, given a certain percent likelihood for a given outcome, you would classify it accordingly. Wow, was that a mouthful. In other words, if you predict survival at 99%, then you’d probably classify it as survival. Now let’s say you look at another passenger that you predict to survive with a 60% likelihood. Well, they’re still more likely to survive than not, so you’d probably classify them as surviving. When selecting `type = "prob"` you have the flexibility to specify your own cutoff threshold.
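To make that concrete, here’s a small sketch of applying a custom cutoff to predicted probabilities. The toy `pred` matrix below stands in for the output of `predict(..., type = "prob")`, and the 0.6 cutoff is just an assumed value for illustration.

```r
# Toy probability matrix standing in for predict(model, type = "prob")
# output; in the real example, `pred` has one column per class.
pred <- matrix(c(0.10, 0.90,
                 0.45, 0.55,
                 0.80, 0.20),
               ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("no", "yes")))

cutoff <- 0.6  # stricter than the default 0.5
pred_class <- ifelse(pred[, "yes"] > cutoff, "yes", "no")
pred_class
# With a 0.6 cutoff, only the 0.90 case is classified as "yes"
```

Raising or lowering the cutoff trades false positives for false negatives, which is exactly the trade-off the ROC curve visualizes.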
Accuracy
This metric is very simple: what percentage of your predictions were correct? The `confusionMatrix` function from the `caret` package is incredibly useful for assessing classification model performance. Load up the package and pass it your predictions and the actuals.
```r
library(caret)
# re-generate class predictions (pred was overwritten with probabilities above)
pred <- predict(object = model,
                newdata = test,
                type = "class")
confusionMatrix(data = pred,
                reference = test$Survived)
```
The first thing this function shows you is what’s called a confusion matrix. This shows you a table of how predictions and actuals lined up. The diagonal cells, where the prediction and reference are the same, represent what we got correct. Counting those up gives 149 (106 + 43), and dividing by the total number of records, 178, we arrive at our accuracy number of 83.7%.
True positive: The cell in the quadrant where both the reference and the prediction are 1. This indicates that you predicted survival and they did in fact survive.
False positive: Here you predicted positive, but you were wrong.
True negative: When you predict negative, and you are correct.
False negative: When you predict negative, and you are incorrect.
A couple more key metrics to keep in mind are sensitivity and specificity. Sensitivity is the percentage of actual positive records that you predicted correctly.
Specificity, on the other hand, measures the portion of the actual negative records that you predicted correctly.
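As a quick sketch, both metrics can be computed directly from confusion matrix counts. The correct counts (106 and 43) come from the matrix above; how the remaining 29 errors split into false positives and false negatives is an assumption here, purely for illustration.

```r
# Confusion matrix counts: 106 + 43 correct out of 178 records.
# The FP/FN split of the 29 errors is assumed for this example.
TP <- 43; TN <- 106; FP <- 12; FN <- 17

sensitivity <- TP / (TP + FN)  # share of actual positives caught
specificity <- TN / (TN + FP)  # share of actual negatives caught
round(c(sensitivity = sensitivity, specificity = specificity), 3)
# sensitivity 0.717, specificity 0.898
```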
Sensitivity is one to keep in mind when predicting on an imbalanced dataset. A very common example of this is classifying email spam: 99% of the time a message isn’t spam, so if you predicted that nothing was ever spam you’d have 99% accuracy, but your sensitivity for the spam class would be 0, leading to all spam slipping through.
I hope you enjoyed this quick lesson on bagged trees. Let me know if there was something you wanted more info on or if there’s something you’d like me to cover in a different post.
Happy Data Science-ing!
If you’re interested in learning more on this topic be sure to subscribe! I am currently writing a book that dives into this and other principles in far greater detail!