How are bagged trees & random forests similar?
Random forests are similar to bagged trees in that each tree in either model is trained on a random subset of the data, specifically a bootstrap sample drawn with replacement. In fact, this process of sampling different groups of the data to train separate models and then combining their predictions is an ensemble method called bagging (bootstrap aggregation); hence the name bagged trees. (See lesson on bagged trees here: https://datasciencelessons.com/2019/08/05/learn-bagged-trees-for-classification-in-r/ )
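To make the sampling step concrete, here is a minimal base R sketch of drawing one bootstrap sample. The row count n and the variable names are made up for illustration; they are not taken from the Titanic data.

```r
set.seed(14)
n <- 1000  # hypothetical number of training rows

# A bootstrap sample: n row indices drawn with replacement
in_bag <- sample(n, size = n, replace = TRUE)

# Rows that were never drawn are "out of bag" for this tree
out_of_bag <- setdiff(seq_len(n), in_bag)

share_in_bag <- length(unique(in_bag)) / n  # roughly 0.63
share_oob    <- length(out_of_bag) / n      # roughly 0.37
print(c(in_bag = share_in_bag, out_of_bag = share_oob))
```

Each tree in the ensemble would get its own draw like this, which is why every tree sees a slightly different version of the training data.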
How do they differ?
Where the random forest diverges from the bagged tree is that the bagged tree has access to all of the variables at every split of every tree, so the main differentiator for the bagged tree is just the sampling of data. For the random forest, only a random subset of the variables is considered at each split within each tree (the mtry argument in the randomForest package controls how many). This process is known as feature bagging, as it’s very similar to bootstrap aggregation as a concept, just directed towards variables.
Many wonder why feature bagging would lead to better performance. The reason is that each decision tree has a greater likelihood of being different from the others, so the trees are less correlated and each gives a different explanation of the variation in the dependent variable. Averaging less correlated trees reduces the variance of the combined prediction.
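As a rough illustration of feature bagging, the sketch below mimics how a random subset of variables could be drawn for a few splits. In the randomForest package the subset is drawn at each split, and for classification the default subset size (mtry) is the floor of the square root of the number of predictors. The loop is only a simulation of that idea, not the package's internal code.

```r
predictors <- c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")

# Default mtry for classification in randomForest: floor(sqrt(p))
mtry <- floor(sqrt(length(predictors)))  # 2 for these 7 predictors

set.seed(14)
for (split in 1:3) {
  # Each split only gets to choose among a fresh random subset of variables
  candidates <- sample(predictors, mtry)
  cat("Split", split, "considers:", paste(candidates, collapse = ", "), "\n")
}
```

A bagged tree corresponds to setting mtry equal to the total number of predictors, so every split can consider every variable.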
Because random forests build on bagged trees, they typically perform better and, in many instances, are also simpler to tune.
I’m going to use the same dataset here that I used for my bagging post.
Let’s dive in
We’ll be using the randomForest package. One thing to keep in mind when tuning the ntree hyperparameter is that more trees almost always means better, or at least more stable, model performance; you’ll just have to weigh that against the time to compute and whatever performance gains you may be getting, which tend to level off as trees are added. (The default is 500 trees.)
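One way to judge whether more trees are still helping is to look at the out-of-bag error after each additional tree. The sketch below uses the built-in iris data as a stand-in, since the Titanic train set is not defined in this post; model_iris and oob are illustrative names. It relies on the err.rate matrix that randomForest stores for classification models.

```r
library(randomForest)

set.seed(14)
# iris is only a stand-in dataset for this illustration
model_iris <- randomForest(Species ~ ., data = iris, ntree = 500)

# err.rate has one row per tree; the "OOB" column is the
# out-of-bag error using all trees up to that row
oob <- model_iris$err.rate[, "OOB"]
print(oob[c(10, 50, 100, 250, 500)])
```

If the error has flattened out well before the last tree, adding more trees mostly adds compute time. Calling plot(model_iris) draws the same curves.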
Using the same dataset as I have in the decision tree & bagged tree tutorials, I run the same formula into each model.
As I’ve broken out in other tutorials, the formula section below is where you get to tell the function what you want to understand. The variable that precedes the ~ is your dependent variable, or what you want to understand. The ~ sign means "by" or "explained by", and everything that follows is what you are using to explain the variation of your dependent variable.
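If it helps to see the two sides of a formula pulled apart, base R can do that without fitting anything. This is just a small sketch; f, dependent and explanatory are illustrative names.

```r
# A formula is an ordinary R object; nothing is evaluated until a model uses it
f <- as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked

# Left of the ~ : the dependent variable
dependent <- all.vars(f[[2]])

# Right of the ~ : the variables used to explain it
explanatory <- attr(terms(f), "term.labels")

print(dependent)
print(explanatory)
```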
In the model below, we are trying to understand whether or not someone was likely to survive aboard the Titanic given their gender, age, sibling count, fare, and so forth.
library(randomForest)

set.seed(14)  # make the random sampling reproducible
model <- randomForest(formula = as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
                      data = train)
print(model)
Here you can see the model printed out. Included are a number of details about the model itself, like type, tree count, variable count, etc. The one that is most interesting is the OOB estimate of error rate.
OOB stands for out of bag. Similar to a bagged tree, when you take samples of data to train each tree, some data is left out of each sample. For that left-out, or out-of-bag, data, the model produces predictions using only the trees that did not see it and compares them to the actuals, giving us the error rate you see above. The overall OOB estimate is the share of all out-of-bag predictions that were wrong. The class.error column next to the confusion matrix is calculated per actual class, by taking the misclassified count and identifying what portion of that class it accounted for. In that printed matrix, rows are the actual classes. For example, 87 + 193 records actually belong to class 1, or survived. The 31% class error is calculated by dividing what we got wrong, 87, by the size of that class, 87 + 193.
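The arithmetic behind the 31% figure can be checked in a couple of lines, using only the two counts quoted above. The variable names are illustrative.

```r
wrong <- 87    # records in this group that were misclassified
right <- 193   # records in this group that were classified correctly

class_error <- wrong / (wrong + right)
print(round(class_error, 2))  # 0.31
```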
Now enough of that! Let’s get predicting!
Here we predict the class of each record within test, using the model that we just trained. From there, we declare the prediction a factor; the confusionMatrix function will not work unless both fields are factors with the same levels.
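library(caret)

test$pred <- predict(model, test)
test$pred <- as.factor(test$pred)
# The reference must be a factor too, with the same levels as the prediction
confusionMatrix(test$pred, as.factor(test$Survived))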
As you can see above, this is the output from the confusionMatrix function. This function shows you a table of how predictions and actuals lined up. The diagonal cells, where the prediction and reference are the same, represent what we got correct.
I’m going to give you a bit of a lesson on how to interpret a confusion matrix below:
True positive: The cell in the quadrant where both the reference and the prediction are 1. This indicates that you predicted survival and they did in fact survive.
False positive: Here you predicted positive, but you were wrong.
True negative: When you predict negative, and you are correct.
False negative: When you predict negative, and you are incorrect.
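The four cells can be counted by hand with base R's table function. The vectors below are made-up toy data, not Titanic results, with 1 treated as the positive class.

```r
# Toy labels for illustration only (1 = survived, 0 = did not)
actual    <- factor(c(1, 1, 1, 0, 0, 0, 0, 1), levels = c(0, 1))
predicted <- factor(c(1, 0, 1, 0, 1, 0, 0, 1), levels = c(0, 1))

# Rows are predictions, columns are the reference (actual) values
cm <- table(Prediction = predicted, Reference = actual)
print(cm)

tp <- cm["1", "1"]  # predicted 1, actually 1
fp <- cm["1", "0"]  # predicted 1, actually 0
tn <- cm["0", "0"]  # predicted 0, actually 0
fn <- cm["0", "1"]  # predicted 0, actually 1

# Accuracy is the diagonal divided by everything
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)  # 0.75
```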
A couple more key metrics to keep in mind are sensitivity and specificity. Sensitivity is the percentage of actual positive records that you predicted correctly: true positives / (true positives + false negatives).
Specificity, on the other hand, measures what portion of the actual negative records you predicted correctly: true negatives / (true negatives + false positives).
Specificity is one to keep in mind when predicting on an imbalanced dataset. A very common example of this is classifying email spam. Say 99% of email is not spam, and "not spam" is treated as the positive class. If you predicted nothing was ever spam you’d have 99% accuracy, but your specificity would be 0, leading to all spam being accepted. (If you treat spam as the positive class instead, the same failure shows up as a sensitivity of 0.)
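Here is the spam example worked through with made-up counts: 1,000 emails, 10 of them spam, and a "model" that never flags anything. It assumes "not spam" is the positive class, which is the reading under which specificity comes out as 0.

```r
n_ham  <- 990  # legitimate emails (hypothetical count)
n_spam <- 10   # spam emails (hypothetical count)

# Every email is predicted "not spam", and "not spam" is the positive class
tp <- n_ham   # legitimate mail correctly let through
fn <- 0       # legitimate mail wrongly flagged
fp <- n_spam  # spam wrongly let through
tn <- 0       # spam correctly flagged

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

Accuracy looks excellent at 0.99, yet specificity is 0 because not a single spam message was caught.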
ROC Curve & AUC
ROC Curve or Receiver Operating Characteristic Curve is a method for visualizing the capability of a binary classification model to diagnose or predict correctly. The ROC Curve plots the true positive rate against the false positive rate at various thresholds.
Our target for the ROC Curve is that the true positive rate is 100% and the false positive rate is 0%. That curve would fall in the top left corner of the plot.
AUC, or area under the curve, is intended to measure the degree of separability, or the ability to correctly predict class. The higher the AUC the better. 1 would be perfect, and 0.5 would be random.
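library(Metrics)

# type = "prob" returns one column of class probabilities per factor level
pred <- predict(object = model, newdata = test, type = "prob")
# Column "1" is the survival probability, assuming Survived is coded 0/1 as in the output above
auc(actual = test$Survived, predicted = pred[, "1"])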
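To see where the ROC curve and AUC come from, the sketch below builds both by hand in base R. The labels and scores are toy values for illustration, not Titanic predictions, and the trapezoid rule is one common way to compute the area; it is not necessarily how the Metrics package does it internally.

```r
# Toy data: 1 = positive class, score = predicted probability of class 1
actual <- c(1, 1, 0, 1, 0, 0)
score  <- c(0.9, 0.8, 0.7, 0.6, 0.3, 0.1)

# Sweep the threshold from strict (nothing predicted positive) to lenient
thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))

# True positive rate and false positive rate at each threshold
tpr <- sapply(thresholds, function(th) sum(score >= th & actual == 1) / sum(actual == 1))
fpr <- sapply(thresholds, function(th) sum(score >= th & actual == 0) / sum(actual == 0))

# Area under the (fpr, tpr) curve using the trapezoid rule
auc_value <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc_value)  # 8/9, about 0.889

# plot(fpr, tpr, type = "b") would draw the ROC curve itself
```

The result matches the pairwise reading of AUC: of the 9 possible positive and negative pairs here, the positive record has the higher score in 8.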
That is my introduction to random forest for classification in R! I hope it was helpful! If you’d like to follow along as I write about data science, machine learning, and the like come visit me at datasciencelessons.com.
Happy Data Science-ing!