When it comes to classification, a decision tree is one of the easiest classifiers to use.
Why to use a decision tree
- Incredibly easy to interpret
- Handles missing data and outliers well, so it requires far less up-front cleaning
- You get to forgo categorical variable encoding, as decision trees handle categoricals well!
- Decision trees can model non-linear relationships (without diving into the specifics, this comes from recursive partitioning).
Why not to use a decision tree
With all that good said, they’re not always the perfect option.
- In the same way they can be simple, they can also become overly complicated, making them nearly impossible to conceptualize or interpret.
- To take this idea a tad further, a tree that is overly complicated may cater too closely to its training data and, as a result, overfit: it scores well on the data it was trained on and poorly on new data.
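To make the overfitting point concrete, here is a minimal sketch comparing a deliberately unconstrained tree with a shallow one. It assumes the kyphosis data frame that ships with rpart as a stand-in dataset; the 70/30 split, the control settings, and the helper name accuracy are illustrative choices, not recommendations.

```r
library(rpart)

# Assumption: kyphosis (bundled with rpart) stands in for your own data.
set.seed(123)
n <- nrow(kyphosis)
k_idx <- sample(seq_len(n), round(0.7 * n))
k_train <- kyphosis[k_idx, ]
k_test  <- kyphosis[-k_idx, ]

# A deliberately unconstrained tree: no complexity penalty, tiny nodes allowed.
deep <- rpart(Kyphosis ~ ., data = k_train, method = "class",
              control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))

# A constrained tree: at most two levels of splits.
shallow <- rpart(Kyphosis ~ ., data = k_train, method = "class",
                 control = rpart.control(maxdepth = 2))

# Hypothetical helper: share of rows whose predicted class matches the label.
accuracy <- function(model, data) {
  mean(predict(model, newdata = data, type = "class") == data$Kyphosis)
}

results <- data.frame(
  tree  = c("deep", "shallow"),
  train = c(accuracy(deep, k_train), accuracy(shallow, k_train)),
  test  = c(accuracy(deep, k_test), accuracy(shallow, k_test))
)
print(results)
```

A large gap between the train and test columns for the deep tree is the usual signature of overfitting; on a dataset this small the exact numbers will vary with the seed.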
With that said, let’s jump into it. I won’t talk about cross-validation or the train/test split much, but will post the code below. Be sure to comment if there’s something you’d like more explanation on.
First we’ll break the data into training & test sets.
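Also note that we’ll be using the classic Titanic passenger data, here as the titanic_train data frame (available, for example, from the titanic package). The Titanic object included in base R is a summary table of counts and does not have the passenger-level columns used below.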
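library(titanic)  # assumed source of the titanic_train data frame
n <- nrow(titanic_train)
n_train <- round(0.8 * n)  # 80% of rows go to training
set.seed(123)  # make the random split reproducible
train_indices <- sample(1:n, n_train)
train <- titanic_train[train_indices, ]
test <- titanic_train[-train_indices, ]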
Now we’ll train the model using the rpart function from the rpart package. The key thing to notice here is that the variable we want to predict is Survived, so we want to understand the likelihood that any given individual survived according to some data. ~ can be interpreted as “by”; in other words, let’s understand Survived by some variables. If after the ~ there is a ., that means we want to use every other variable in the dataset to predict Survived. Alternatively, as shown below, we can call out the variables we want to use explicitly.
Another thing to note is that the method is class. That is because we want to create a classification tree predicting categorical outcomes, as opposed to a regression tree that would be used for numerical outcomes. And finally, the data we’re using to train the model is train.
library(rpart)
model <- rpart(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked,
               data = train,  # fit on the training split only
               method = "class")
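To make the formula and method options concrete, here is a small sketch of the three variations described above. It assumes the kyphosis data frame bundled with rpart as a stand-in, so the column names (Kyphosis, Age, Number, Start) are that dataset’s, not the Titanic’s.

```r
library(rpart)

# Assumption: kyphosis (bundled with rpart) is a stand-in dataset.
# "." means "every other column": here Age, Number and Start.
all_vars <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")

# Naming the predictors restricts the tree to those columns.
some_vars <- rpart(Kyphosis ~ Age + Start, data = kyphosis, method = "class")

# For a numeric outcome, method = "anova" grows a regression tree instead.
reg_tree <- rpart(Age ~ Number + Start, data = kyphosis, method = "anova")

print(all_vars$method)
print(reg_tree$method)
print(attr(some_vars$terms, "term.labels"))
```

Spelling out method is a defensive habit: if you leave it off, rpart guesses from the type of the outcome column, and a 0/1 outcome stored as a number would be treated as a regression target.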
As previously mentioned, one of the things that makes a decision tree so easy to use is that it’s incredibly easy to interpret. You’re able to follow the different branches of the tree to different outcomes.
It’s a bit difficult to read there, but if you zoom in a tad, you’ll see that the first criterion for whether someone likely lived or died on the Titanic was whether they were male. If you were male, you move to the left branch and work down two nodes: whether you were an adult, and your sibling/spouse count onboard. So if you were a single man, your odds of survival were pretty slim.
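If you want to inspect a tree like this yourself, one possible approach is sketched here using only what ships with rpart: print() for a text listing and plot() plus text() for a basic diagram. The kyphosis data is again an assumed stand-in, and the margin and use.n settings are just readable defaults.

```r
library(rpart)

# Assumption: kyphosis (bundled with rpart) stands in for the Titanic model.
fit <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")

# Text view: one line per node with the split rule, case count and predicted class.
print(fit)

# Base-graphics view: draw the branches, then label the splits and leaves.
plot(fit, margin = 0.1)
text(fit, use.n = TRUE)
```

The rpart.plot package offers a more polished rendering through rpart.plot(fit), if you have it installed.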
Before we break out the metrics, let’s predict values for the test set. Similar to the call to train, you pass the data and the type of prediction. The core difference is that you pass the fitted model rather than a formula.
test$pred <- predict(object = model, newdata = test, type = "class")
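The type argument controls what predict hands back. As a rough sketch on an assumed stand-in dataset (kyphosis, bundled with rpart), "class" gives a label per row while "prob" gives the class probabilities behind those labels:

```r
library(rpart)

# Assumption: kyphosis (bundled with rpart) is a stand-in dataset.
fit <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")
new_rows <- head(kyphosis, 5)

# type = "class" returns one predicted label per row, as a factor.
labels <- predict(fit, newdata = new_rows, type = "class")

# type = "prob" returns a matrix with one column of probabilities per class.
probs <- predict(fit, newdata = new_rows, type = "prob")

print(labels)
print(probs)
```

The probabilities are handy if you later want to choose your own cutoff instead of taking the most likely class.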
There are a variety of performance evaluation metrics which will come in very handy when understanding the efficacy of your decision tree.
Accuracy is very simple: what percentage of your predictions were correct. The confusionMatrix function from caret includes this.
The confusionMatrix function from the caret package is incredibly useful for assessing classification model performance. Load up the package, and pass it your predictions and the actuals.
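library(caret)
# confusionMatrix() expects both arguments as factors with the same levels
confusionMatrix(data = test$pred,
                reference = factor(test$Survived, levels = levels(test$pred)))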
The first thing this function shows you is what’s called a confusion matrix: a table of how predictions and actuals lined up. The diagonal cells, where the prediction and reference are the same, represent what we got correct. Counting those up gives 149 (106 + 43); dividing by the total number of records, 178, we arrive at our accuracy of about 83.7%.
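The accuracy arithmetic can be checked directly. The first half of this sketch uses the figures quoted above (106 and 43 on the diagonal, 178 records); the second half shows the same calculation on a table, using small made-up vectors that are purely illustrative.

```r
# Figures quoted in the text: 106 and 43 on the diagonal, 178 test records.
correct <- 106 + 43
total <- 178
accuracy <- correct / total
print(round(100 * accuracy, 1))

# The same calculation from any prediction/reference table:
# the diagonal holds the matches, so accuracy = sum(diagonal) / sum(all cells).
# Assumption: these tiny vectors are made-up illustration data.
pred   <- factor(c(0, 0, 1, 1, 0, 1), levels = c(0, 1))
actual <- factor(c(0, 1, 1, 1, 0, 0), levels = c(0, 1))
cm <- table(Prediction = pred, Reference = actual)
toy_accuracy <- sum(diag(cm)) / sum(cm)
print(toy_accuracy)
```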
True positive: The cell in the quadrant where both the reference and the prediction are 1. This indicates that you predicted survival and they did in fact survive.
False positive: Here you predicted positive, but you were wrong.
True negative: When you predict negative, and you are correct.
False negative: When you predict negative, and you are incorrect.
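The four cells can be counted by hand, which is a useful sanity check on any confusion matrix. This sketch uses made-up labels and assumes 1 (survived) is the positive class.

```r
# Assumption: made-up labels, with 1 = survived treated as the positive class.
pred   <- c(1, 1, 0, 0, 1, 0, 0, 1)
actual <- c(1, 0, 0, 1, 1, 0, 0, 1)

tp <- sum(pred == 1 & actual == 1)  # predicted survival, did survive
fp <- sum(pred == 1 & actual == 0)  # predicted survival, did not survive
tn <- sum(pred == 0 & actual == 0)  # predicted death, did not survive
fn <- sum(pred == 0 & actual == 1)  # predicted death, did survive

print(c(TP = tp, FP = fp, TN = tn, FN = fn))
```

One thing to watch for: caret’s confusionMatrix treats the first factor level as the positive class unless told otherwise, so with levels 0 and 1 you would pass positive = "1" to get the reading used here.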
A couple more key metrics to keep in mind are sensitivity and specificity. Sensitivity is the percentage of actual positive records that you predicted correctly: true positives divided by all actual positives.
Specificity, on the other hand, measures what portion of the actual negative records you predicted correctly: true negatives divided by all actual negatives.
Specificity is one to keep in mind when predicting on an imbalanced dataset. A very common example of this is classifying email spam. Say 99% of email is not spam and you treat “not spam” as the positive class: if you predicted nothing was ever spam you’d have 99% accuracy, but your specificity would be 0, because you would never correctly flag an actual spam message, leading to all spam being accepted.
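The spam scenario is easy to reproduce with made-up numbers. This sketch assumes an inbox of 1000 emails of which 10 are spam, and a “model” that never flags anything. It takes "not_spam" as the positive class, which is the reading under which specificity comes out as 0.

```r
# Assumption: a made-up inbox of 1000 emails, 990 legitimate and 10 spam.
actual <- factor(c(rep("not_spam", 990), rep("spam", 10)),
                 levels = c("not_spam", "spam"))

# A "model" that never flags anything as spam.
pred <- factor(rep("not_spam", 1000), levels = levels(actual))

# Assumption: "not_spam" is the positive class.
positive <- "not_spam"
tp <- sum(pred == positive & actual == positive)
tn <- sum(pred != positive & actual != positive)
fp <- sum(pred == positive & actual != positive)
fn <- sum(pred != positive & actual == positive)

accuracy    <- (tp + tn) / length(actual)
sensitivity <- tp / (tp + fn)  # share of actual positives predicted correctly
specificity <- tn / (tn + fp)  # share of actual negatives predicted correctly

print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

If you flip the convention and call spam the positive class, the same model has sensitivity 0 and specificity 1, so it is worth stating which class is positive whenever you report these numbers.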
There are some additional metrics you can use to assess the predictive quality of your model, like log-loss, AUC, etc., but I’ll save those for another post.
I hope you enjoyed this quick lesson in decision trees. Let me know if there was something you wanted more info on or if there’s something you’d like me to cover in a different post.
Happy Data Science-ing!