Logistic regression can be pretty difficult to understand! As such, I’ve put together an intuitive explanation of the why, what, and how of logistic regression. We’ll start with some building blocks that should lend themselves to a clearer understanding, so hang in there! Over the course of the post, I hope to send you on your way to understanding, building, and interpreting logistic regression models. Enjoy!
What is Logistic Regression?
Logistic regression is a very popular approach to predicting or understanding a binary variable (hot or cold, big or small, this one or that one– you get the idea). Logistic regression falls into the machine learning category of classification.
One more example to help distinguish between linear and logistic regression: rather than predicting how much something will sell for, you are predicting whether it will sell at all. Without further ado, let’s dive right in!
Understanding Linear Regression as a Precursor
Let’s talk about the output of a linear regression. For those of you who aren’t familiar with linear regression it would be best to start there. You can visit this post to learn about Simple Linear Regression & this one for Multiple Linear Regression.
Now, knowing a bit about linear regression, you’d know that its output amounts to the equation of a line. Simply enough, that’s all we want: a way to use one variable to lend insight into another.
So knowing this, let’s run a linear regression on a binary variable. Binary being yes or no, 1 or 0. This variable will be the candidate for our logistic regression, but we’ll get there shortly.
Building a Linear Regression to Understand Logistic Regression
Today we’ll work with the mtcars dataset. This is a classic dataset for data science learning that details fuel consumption, engine characteristics, and other attributes for a variety of automobiles.
Quick glimpse at the dataset:
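Here is a minimal sketch of one way to take that glimpse. I'm assuming plain base R here (head, dim, and table) rather than whatever produced the original output, and I've limited the preview to the two columns this post leans on.

```r
# mtcars ships with base R (the datasets package), so no import is needed
data(mtcars)

# First few rows, limited to the two columns used throughout this post
head(mtcars[, c("mpg", "vs")])

# Number of rows and columns in the full dataset
dim(mtcars)

# How many cars fall into each value of vs
table(mtcars$vs)
```

The table() call is a quick way to confirm that vs really only takes the values 0 and 1.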
Linear Regression for a Binary Variable
In this dataset, the binary variable we’ll focus on is vs (mtcars also has a second binary column, am, for transmission type). Not knowing much about cars, I won’t be able to give you a detailed explanation of what vs means, but at a high level it represents the engine configuration: 0 for a V-shaped engine and 1 for a straight engine. I also know that the configuration has an impact on things like power and efficiency, which is something we’d be able to tease out through our models. So hopefully it will be easy to determine the difference!
Let’s build that regression model! We’ll seek to understand vs as a function of miles per gallon.
fit <- lm(vs ~ mpg, data = mtcars)
summary(fit)
Feel free to peek at the regression output below:
You can see an R-squared of .44, which means we can explain 44% of the variation in y with variation in x. Not bad. We also see a p-value of less than .05. Two thumbs up there.
Now here’s where things get tricky. Let’s take a look at the y-intercept. We have a y-intercept of -.67. That seems odd, as vs isn’t something that can go negative; in fact, it can only be 0 or 1. This is going to constitute a major issue for using linear regression on a binary variable, but more on that in a moment. First, we’ll seek to better understand the data we’re working with.
Visualizing Our Output
Let’s quickly visualize the two variables with a scatter plot and add the regression line to give us some additional insight.
library(dplyr)    # for %>%
library(ggplot2)  # for ggplot()
mtcars %>%
  ggplot(aes(x = mpg, y = vs)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE)
For starters, you can see that although the y-axis is drawn as a continuous variable, all of the points sit at either 0 or 1.
As far as the x-axis goes, the points are concentrated towards the right when the dependent variable is 1 and towards the left when it is 0. This lines up with what I mentioned earlier: cars with a vs of 1 tend to get more miles per gallon than those with a vs of 0.
As you can see from the chart, the line cuts right through the two groups in linear fashion.
The Obvious Issue?
The obvious issue here is that the line goes on forever in either direction. The output at either extreme literally wouldn’t make sense, since it falls below 0 or above 1.
For this reason, we can’t use linear regression as is. That doesn’t mean it’s useless; we just have to modify it so that our predictions stay bounded at the extremes.
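To make the issue concrete, here is a rough sketch that asks the linear model for predictions at a few mpg values. The values in new_cars are hypothetical ones I picked for illustration (some sit outside the range of the data), and fit_lm is just a local name for the same linear model as above.

```r
# Refit the linear model from above so this snippet stands alone
fit_lm <- lm(vs ~ mpg, data = mtcars)

# Hypothetical mpg values chosen for illustration, some beyond the data's range
new_cars <- data.frame(mpg = c(5, 10, 20, 35, 50))
new_cars$predicted_vs <- predict(fit_lm, newdata = new_cars)

# Flag predictions that cannot be read as a probability
new_cars$outside_0_1 <- new_cars$predicted_vs < 0 | new_cars$predicted_vs > 1
new_cars
```

The low mpg values come back below 0 and the high ones above 1, which is exactly the behavior we need to rein in.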
“Generalizing” Linear Regression
The way we get to logistic regression is through what is called a “generalized linear model”. You can think about the function or equation of a line we just created through our simple linear regression. Through the linear model we have an understanding of y based on a function that we relate to x. For logistic regression, we take that function and effectively wrap it in an additional function that is responsible for “generalizing” the model.
What’s the Purpose of the Generalization or Generalizing Function?
What we’re trying to do is classify a given datapoint, or in other words assign a vehicle of a given mpg to one of two groups, either v (0) or straight (1).
A helpful way to think about this is from the perspective of probability. For any given mpg datapoint, that automobile has a given probability to be either v or straight. As such it would make sense that the output of a model intended to shed light on that relationship would do so with probability.
To sum up this idea, we want to generalize the linear output in a way that’s representative of probability.
How to Generalize
The process of generalization, as I mentioned earlier, has to do with wrapping our linear function in yet another function. The function that ties the two together is what’s known as a link function. Just as I mentioned a moment ago, it lets us scale the linear output to a probability between 0 and 1. For logistic regression, a sub-category of generalized linear models, the logit link function is used. (Strictly speaking, the logit maps a probability onto the linear scale, and its inverse, the logistic function, maps the linear output back to a probability.)
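The wrapping idea can be sketched in a few lines of R. The helper names logit and inv_logit are my own for illustration, and the linear_output values are made up; qlogis() and plogis() are the equivalent functions that ship with base R.

```r
# Logit: maps a probability in (0, 1) onto the whole real line (log-odds)
logit <- function(p) log(p / (1 - p))

# Inverse logit (the logistic function): maps any real number back into (0, 1)
inv_logit <- function(x) 1 / (1 + exp(-x))

# Illustrative values standing in for the output of a linear function
linear_output <- c(-10, -2, 0, 2, 10)
probs <- inv_logit(linear_output)
probs

# Round trip: applying the logit recovers the linear values
all.equal(logit(probs), linear_output)

# Base R offers the same pair as qlogis() (logit) and plogis() (inverse logit)
all.equal(plogis(linear_output), probs)
```

However large or small the linear value gets, the wrapped output stays strictly between 0 and 1.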
Visualizing Logistic Regression
Let’s look at the same graph as before but fit the logistic curve this time.
mtcars %>%
  ggplot(aes(x = mpg, y = vs)) +
  geom_point() +
  geom_smooth(method = "glm", se = FALSE, method.args = list(family = "binomial"))
We have included the same code as before, only now our method is “glm”, which specifies that we want a generalized linear model, and we set family = “binomial”. The family determines which link function to use; by default, “binomial” uses the logit function.
Let’s interpret our new chart. As you can see, the curve flattens at either extreme, so it approaches but never reaches 0 or 1.
We can see that the line towards the middle is very straight and similar to that of our linear model.
Building Our First Logistic Regression Model
Let’s go ahead and jump into building our own logistic regression model. This reads very similar to the linear regression call, with two key differences. The first is that the call is glm; the second is that the family is “binomial”, similar to what you saw in the geom_smooth call. Same rationale for its use here.
fit <- glm(vs ~ mpg, data = mtcars, family = 'binomial')
summary(fit)
The first thing I want to call out with the glm function is how the dependent variable should be encoded. The simplest option is 1 or 0. glm will also accept a two-level factor (treating the first level as 0), but not a character vector. Other classification algorithms may be more forgiving, but in the case of logistic regression, to reiterate, it is a wrapping of a linear output, so the outcome ultimately has to be read as 0 or 1.
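To see the “wrapping” at work in the fitted model, here is a small sketch that rebuilds the predicted probabilities by hand. The names fit_glm, linear_part, manual_prob, and model_prob are mine for illustration; the model itself is the same vs ~ mpg fit as above.

```r
# Same logistic model as above, refit so this snippet stands alone
fit_glm <- glm(vs ~ mpg, data = mtcars, family = "binomial")

# Step 1: the linear part, intercept + slope * mpg (on the log-odds scale)
b <- coef(fit_glm)
linear_part <- b["(Intercept)"] + b["mpg"] * mtcars$mpg

# Step 2: wrap it in the inverse logit to get probabilities
manual_prob <- plogis(linear_part)

# predict() with type = "response" should give the same numbers
model_prob <- predict(fit_glm, type = "response")
all.equal(unname(manual_prob), unname(model_prob))
```

If the comparison comes back TRUE, the model’s probabilities are nothing more than a linear function of mpg passed through the inverse logit.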
Logistic Regression Interpretation
Now it’s time to talk about interpretation. Also not an incredibly simple topic, but we’ll approach it as intuitively as possible.
There are three ways for one to think about logistic regression interpretation:
- Probability
- Odds
- Log-odds
Each has different trade-offs when it comes to interpretability. But first…
Definitions
Simply enough, probability is the measure of likelihood expressed between 0 and 1.
Odds, on the other hand, represent how frequently something happens (probability) relative to how often it doesn’t (1-probability).
The formula for that looks like this: O = p/(1-p).
One thing to think about here is that odds aren’t on a linear scale; they curve upward sharply as probability rises (they are the exponential of the log-odds).
Let’s write a bit of quick code to make the shape of that scale more intuitive. We’ll create a sequence from 0 to 1, by .05. Then we’ll create an odds field based on the above formula. Lastly, we’ll plot the line!
probs_vs_odds <- data.frame(prob = seq(0, 1, by = 0.05))
probs_vs_odds <- probs_vs_odds %>%
  mutate(inverse_prob = 1 - prob,
         odds = prob / inverse_prob)  # prob = 1 gives odds of Inf
probs_vs_odds %>%
  ggplot(aes(x = prob, y = odds)) +
  geom_line()
I’ll also add the dataframe below to give additional illustration. Hopefully this makes it pretty clear to think about. When the probability is 5%, your odds are 1 to 19; conversely, when the probability is 95%, your odds are 19 to 1. That is not a linear change.
Log-odds are very similar to odds, with one change: we take the log to mitigate the exponential curve. Once we take the log of the odds, we’re able to visualize our model as a line once again. This is great for function interpretation, but pretty horrible when it comes to output interpretation.
My visuals below are intended to relay the spectrum of interpretability for the function & the output. Probability’s output is very simple to interpret, but its function is non-linear. Odds make sense, but aren’t the easiest thing to wrap your mind around, and as a function they follow an exponential curve, which doesn’t make for easy reading either. Finally, log-odds are just about impossible to interpret as an output, but their function is linear, which is great for interpretation.
No one of these is outright the best. While for the output, probability is the easiest to interpret, the probability function itself is non-linear. You may find yourself working with some combination to communicate predictions versus the function and so forth.
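Here is a rough sketch of moving between the three scales for the same model. The mpg values in mpg_values are ones I picked for illustration, and fit_glm is a local name for the vs ~ mpg logistic model; predict() with type = "link" returns log-odds, and type = "response" returns probabilities.

```r
# Same logistic model as above, refit so this snippet stands alone
fit_glm <- glm(vs ~ mpg, data = mtcars, family = "binomial")

# Illustrative mpg values, five miles per gallon apart
mpg_values <- data.frame(mpg = c(15, 20, 25))

# Log-odds: the scale the coefficients live on (linear in mpg)
log_odds <- predict(fit_glm, newdata = mpg_values, type = "link")

# Odds: exponentiate the log-odds
odds <- exp(log_odds)

# Probability: odds / (1 + odds)
prob <- odds / (1 + odds)

data.frame(mpg = mpg_values$mpg, log_odds, odds, prob)

# Exponentiating the mpg coefficient gives an odds ratio: the factor by
# which the odds are multiplied for each additional mile per gallon
exp(coef(fit_glm)["mpg"])
```

The log-odds column climbs by the same amount at each step, the odds column is multiplied by the same factor at each step, and the probability column follows the S-shaped curve.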
For this last section, I’m going to set you up with a couple of tools that will be key in model performance evaluation.
We’ll be using a simple form of validation known as a train/test (or holdout) split, a close cousin of cross-validation. All this means is that rather than training a model with all of your datapoints, you’ll pull some amount of them out, wait until the model is trained, then generate predictions for them, and then make a comparison between the predictions & the actuals.
Below we break out the train and test groups, & generate the model.
n <- nrow(mtcars)
n_train <- round(0.8 * n)
set.seed(123)
train_indices <- sample(1:n, n_train)
train <- mtcars[train_indices, ]
test <- mtcars[-train_indices, ]
fit <- glm(vs ~ mpg, data = train, family = "binomial")
From here, we’ll generate a prediction for our test group. If you were to look at the pred field, you would actually see the probability of vs being 1.
The challenge this leaves us with is that rather than saying these cars likely have configuration 1 and these ones have 0; we’re left with probabilities.
You’ll see in the second line of the below code I round the prediction. Anything above .5 becomes 1, and anything below becomes 0 (R’s round() sends an exact .5 to 0).
.5 is used as a pretty standard classification threshold– although there are certainly situations that would necessitate a higher or lower threshold.
Finally, we use the confusionMatrix function (from the caret package) to visualize the confusion matrix and to deliver a handful of performance evaluation metrics: accuracy, p-value, sensitivity, specificity, and so forth.
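test$pred <- predict(fit, test, type = 'response')
test$pred <- factor(round(test$pred), levels = c(0, 1))
test$vs <- factor(test$vs, levels = c(0, 1))
# fixed levels keep both classes even if one is never predicted
caret::confusionMatrix(test$pred, test$vs)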
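If it helps to see what sits behind those metrics, here is a minimal base-R sketch of the same idea. It is not the caret implementation, just a hand-rolled table and accuracy figure. The threshold variable and the names prob, pred_class, actual, conf_mat, and accuracy are mine for illustration, and the split is recreated so the snippet runs on its own.

```r
# Recreate the split and model from above so the snippet runs on its own
set.seed(123)
n <- nrow(mtcars)
train_indices <- sample(1:n, round(0.8 * n))
train <- mtcars[train_indices, ]
test <- mtcars[-train_indices, ]
fit <- glm(vs ~ mpg, data = train, family = "binomial")

# Predicted probabilities, then a hard 0/1 label at an adjustable threshold
threshold <- 0.5
prob <- predict(fit, newdata = test, type = "response")
pred_class <- factor(as.integer(prob > threshold), levels = c(0, 1))
actual <- factor(test$vs, levels = c(0, 1))

# Rows are predictions, columns are actual values
conf_mat <- table(predicted = pred_class, actual = actual)
conf_mat

# Accuracy: the share of test cars that land on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

Changing threshold to something other than 0.5 is an easy way to see how the trade-off between the two kinds of error shifts.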
In our case, p-value was high & accuracy was mid-tier.
If you’ve made it this far then hopefully you’ve learned a thing or two about logistic regression and will feel comfortable building, interpreting, & communicating your own logistic regression models.
Through the course of the post, we’ve run through the following:
- Model building
- Performance Evaluation
I hope this proves useful! Check out my other data science lessons at datasciencelessons.com. Happy Data Science-ing!