Regression is a staple in the world of data science, and as such it’s useful to understand it in its simplest form.
I recently wrote a post that went into more detail on regression. You can find that here. Building on the ideas we explored there, today we will create regression models where the explanatory variable is categorical.
As I mentioned, it’s important to have a good understanding of the application & methodology from the ground up. This will serve you well as you use machine learning algorithms & other statistical analyses that leverage regression in different ways.
Let’s Start with a bit of EDA
When it comes to visualizing the relationship between a numeric dependent variable and a categorical independent variable, there are a couple of standard visuals you should always consider.
What we want to see is a pattern or relationship. When working with two numeric variables, a scatter plot is the obvious choice.
In this case, a couple of great options are faceted histograms & boxplots.
Let’s kick this off with a faceted histogram. Facet just means that rather than creating a single histogram, we actually have a histogram for each level of a given categorical variable.
ggplot makes this very easy to do.
As you can see below, I create a histogram as usual to represent the distribution of price, but I also include the facet_wrap command, instructing ggplot to draw a histogram for each value of the field passed to facet_wrap(~). As there are only two values for the waterfront field, we will see two adjacent panes, each containing a histogram of price for one value of waterfront.
housing %>% ggplot(aes(x = price)) + geom_histogram(binwidth = 50000) + facet_wrap(~waterfront)
We can see that the vast majority of homes don’t have a waterfront, but the plot doesn’t necessarily suggest that all waterfront homes are priced higher. If we looked at the mean price for these two groups, we would see a higher mean for waterfront homes, as their distribution isn’t nearly as concentrated toward the lower values.
A density plot gives a better relative view of the distribution:
housing %>% ggplot(aes(x = price)) + geom_density() + facet_wrap(~waterfront)
Now we better capture the relative concentration of a given portion of the distribution.
Let’s now visualize this same data using a boxplot. As you can see below, the ggplot syntax is almost entirely the same.
housing %>% ggplot(aes(x = as.factor(waterfront), y = price)) + geom_boxplot() + facet_wrap(~waterfront)
The center line of a boxplot marks the median of each group. While both visualization approaches surface the distribution in some form, the beauty of a boxplot is that we get a precise measurement and comparison of the median, IQR, etc.
It can make things a little easier to interpret at a glance.
Is one type of visualization better than the other? I would say each is better for different things. In exploratory data analysis, or any analysis you conduct as a data scientist, it’s easy to reach for a tool just because you know that’s what people do, but your work with a given tool will be far more meaningful if you have a clear purpose and intention for using it.
In this case, a histogram will serve you better for understanding the shape of a distribution, while a boxplot will serve you better for making clear comparisons between groupings of a dataset.
Let’s Build A Regression Model
When building a regression model, it’s important to understand what exactly is going on under the hood.
Rather than re-explaining how to interpret various regression outputs, you can refer to this post, and we will continue to build on that here.
As you now know, each explanatory variable used in a regression model is assigned a coefficient. That coefficient comprises the slope of the line in the equation of a line that we generate through our regression.
Let’s quickly run our regression passing only the waterfront variable as an explanatory variable.
fit <- lm(price ~ waterfront, data = housing)
fit
As we’ve seen before, the fitted linear model comprises a y-intercept, which here is 545,462, and a coefficient (also called the slope or beta), which here is 906,159.
So our formula is Y = 545,462 + 906,159*X
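This line can be sketched as a simple R function, using the fitted intercept and slope from above:

```r
# Hand-rolled prediction using the fitted intercept and slope from above
predict_price <- function(waterfront) 545462 + 906159 * waterfront

predict_price(0)  # 545462
predict_price(1)  # 1451621
```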
As waterfront takes only two values, 0 or 1, let’s pass each value to our equation of a line and predict Y.
Without a waterfront:
545,462 = 545,462 + 906,159*0
In the case that there is no waterfront, we’d pass a 0, cancelling that coefficient out, leaving us with just the value of the y-intercept.
With a waterfront:
1,451,621 = 545,462 + 906,159*1
Conversely, when there is a waterfront, we’d treat X as 1, effectively adding the y-intercept and coefficient together, giving us roughly $1.45M.
The interpretation here is simple enough. Now let’s have a look under the hood.
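Of course, we don’t have to do this arithmetic by hand. Here’s a minimal sketch on a made-up dataset (toy stands in for housing, and the prices are invented) showing that predict() evaluates intercept + coefficient * X for us:

```r
# Toy data standing in for `housing` -- the prices here are invented
toy <- data.frame(
  waterfront = c(0, 0, 0, 1, 1),
  price      = c(400000, 500000, 600000, 1400000, 1500000)
)
toy_fit <- lm(price ~ waterfront, data = toy)

# predict() evaluates intercept + coefficient * waterfront for each row
predict(toy_fit, newdata = data.frame(waterfront = c(0, 1)))
# First value: the intercept alone (waterfront = 0);
# second value: intercept + coefficient (waterfront = 1).
```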
We’ll kick this off by looking at the mean price for each value of waterfront.
housing %>% group_by(waterfront) %>% summarize(mean_price = mean(price))
And this is what we get:
We can see a mean price of 545,462 for homes without a waterfront, and a mean price of 1,451,621 for homes with one.
Notice anything familiar about those two numbers?
If you noticed that the group mean for non-waterfront homes matches your model’s y-intercept, or that the group mean for waterfront homes matches the model’s output for a waterfront home, you’ve got it.
So what exactly is happening here…
When you pass a categorical variable to a regression model (in this case, the waterfront variable), the baseline group mean of 545K is assigned as the y-intercept. The variable’s coefficient, now labeled waterfront1 (notice the 1), is actually the difference between the baseline group mean (where waterfront = 0) and the group mean when waterfront = 1. As a note, the baseline group is established according to alphabetical order of the variable’s values.
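We can verify this directly. Below is a sketch on the same invented toy data (standing in for housing), comparing the group means to the fitted coefficients:

```r
# Toy data standing in for `housing` -- the prices here are invented
toy <- data.frame(
  waterfront = c(0, 0, 0, 1, 1),
  price      = c(400000, 500000, 600000, 1400000, 1500000)
)

# Group means, computed in base R as a stand-in for the dplyr pipeline above
means <- tapply(toy$price, toy$waterfront, mean)

toy_fit <- lm(price ~ as.factor(waterfront), data = toy)
coef(toy_fit)
# The intercept equals the waterfront = 0 group mean, and the second
# coefficient equals the difference between the two group means.
```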
If the variable had three values, the third value’s coefficient would likewise be the difference between its group mean and the baseline group mean.
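To illustrate the three-value case, here’s a sketch with a hypothetical three-level factor (a made-up view variable and invented prices, not part of the housing data):

```r
# Hypothetical three-level factor; all values here are invented
toy <- data.frame(
  view  = factor(c("average", "average", "good", "good", "great", "great")),
  price = c(400000, 500000, 700000, 800000, 1000000, 1200000)
)
toy_fit <- lm(price ~ view, data = toy)
coef(toy_fit)
# (Intercept) = mean of the baseline "average" group (first alphabetically)
# viewgood    = "good" group mean minus the baseline group mean
# viewgreat   = "great" group mean minus the baseline group mean
```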
To summarize our learnings,
When conducting EDA where you want to assess the relationship between your dependent variable that is numeric and an independent variable that is categorical, a few great visualization options are:
- Histogram (faceted)
- Density Chart (faceted)
- Boxplot (faceted)
In a regression model that takes an explanatory/independent variable that is categorical:
- the y intercept is equal to the baseline group mean
- the baseline group is established according to alphabetical order of variable values
- the coefficients are equal to the relative difference between a given value of the categorical variable and the baseline group mean (or y-intercept)
It’s important to understand the inner workings of the tools that we use. I hope this primer in using categorical variables for regression proves useful as you leverage these and other tools to conduct analysis.
Happy Data Science-ing!