Whatever your exposure to data science and statistics, you have likely at least heard of regression. As a precursor to this quick lesson on multiple regression, you should have some familiarity with simple linear regression. If you don’t, my post on simple linear regression is a good place to start. Otherwise, let’s dive in with multiple linear regression.
The distinction we draw between simple linear regression and multiple linear regression is simply the number of explanatory variables that help us understand our dependent variable.
Multiple linear regression is an incredibly popular statistical technique among data scientists and is foundational to many of the more complex methods they use.
Multiple Linear Regression
In my post on simple linear regression, I gave the example of predicting home prices using a single numeric variable — square footage.
This post aligns very closely with another post I’ve made on multiple linear regression; the distinction lies in the data types of the variables explaining our dependent variable. That post explains multiple linear regression using one numeric and one categorical variable, also known as a parallel slopes model.
What we’ll run through below gives insight into a multiple linear regression model that uses multiple numeric variables to explain our dependent variable, and into how we can visualize that model effectively using a heatmap. Enjoy!
Let’s Build a Regression Model
Similar to what we’ve built in the aforementioned posts, we’ll create a linear regression model where we add another numeric variable.
The dataset we’re working with is a Seattle home prices dataset. Each record is a home, with details such as price, square footage, number of bedrooms, number of bathrooms, and so forth.
Through the course of this post, we’ll be trying to explain price as a function of other numeric variables in the dataset.
With that said, let’s dive in. Similar to the other posts, we’re using sqft_living to predict price, only here we’ll add another variable:
fit <- lm(price ~ sqft_living + bathrooms, data = housing)
summary(fit)
Including additional numeric explanatory variables in a regression model is simple both syntactically and mathematically.
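Written out, the model fit above has roughly the following form. The coefficient symbols are my own notation rather than anything from the summary output: \beta_0 is the intercept, \beta_1 and \beta_2 are the slopes for each explanatory variable, and \varepsilon is the error term.

```latex
\text{price} = \beta_0 + \beta_1 \,\text{sqft\_living} + \beta_2 \,\text{bathrooms} + \varepsilon
```

Each additional numeric variable simply contributes one more slope term to the sum.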
While you can technically layer numeric variables one after another into the same model, it can quickly become difficult to visualize and understand.
In the case of our model, we have three separate dimensions we’ll need to be able to assess.
Over the next bit, we’ll review different approaches to visualizing models with increasing complexity.
Break Out the Heatmap
The purpose of our visualization is to understand how the variables relate to one another. A simple scatter plot is a very intuitive choice for two numeric variables. The moment we include a third variable, things get a bit more confusing.
The first option we’ll review is the heatmap. Overlaid on a scatter plot, it does a good job of communicating how our model output changes as the combination of our explanatory variables changes.
First things first, we need to create a grid containing every combination of our two variables. This is key, as we want an exhaustive view of how our model output varies with respect to the explanatory variables.
Once we do this, we can assign a prediction to each combination, giving us a clear picture of the model’s output across all potential values of our numeric variables.
Below I’ll use the table function to get an idea of the range of values for each variable, which informs the sequences you can see in the code below. Alternatively, you could pass the unique occurrences of a given variable, for example unique(housing$sqft_living), into the expand.grid function.
We use expand.grid to create a dataframe with all of the various variable combinations.
table(housing$bathrooms)
table(housing$sqft_living)
all_combinations <- expand.grid(
  sqft_living = seq(370, 13540, by = 10),
  bathrooms = seq(0.75, 8, by = 0.25)
)
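To make it concrete what expand.grid returns, here is a minimal sketch on two tiny vectors. The values are made up for illustration and are not taken from the housing data; the names sqft_values, bath_values, and toy_grid are hypothetical.

```r
# Toy values, made up for illustration (not from the housing dataset)
sqft_values <- c(1000, 2000, 3000)
bath_values <- c(1, 2)

# expand.grid returns one row per combination of its inputs,
# so 3 square footage values x 2 bathroom values gives 6 rows
toy_grid <- expand.grid(sqft_living = sqft_values, bathrooms = bath_values)

print(toy_grid)
nrow(toy_grid)
```

The first variable cycles fastest, so the three square footage values repeat once for each bathroom value. The full grid above works the same way, just with far more rows.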
Now that we have our dataframe, let’s generate predictions using the augment function from the broom package:
library(broom)
combos_aug <- augment(fit, newdata = all_combinations)
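If it helps to see what those predictions are made of, the sketch below does the same thing with base R’s predict function on simulated data. Everything here is an assumption for illustration: the toy data frame, its sample size, and the coefficients used to simulate price are made up and are not the housing data. As far as I know, augment with newdata gives the same fitted values as predict for an lm model.

```r
set.seed(42)

# Simulated stand-in for the housing data (all numbers are made up)
n <- 200
toy <- data.frame(
  sqft_living = runif(n, 500, 4000),
  bathrooms   = sample(seq(1, 4, by = 0.25), n, replace = TRUE)
)
toy$price <- 50000 + 250 * toy$sqft_living + 20000 * toy$bathrooms +
  rnorm(n, sd = 30000)

toy_fit <- lm(price ~ sqft_living + bathrooms, data = toy)

# A small grid of combinations to predict on
toy_grid <- expand.grid(
  sqft_living = seq(500, 4000, by = 500),
  bathrooms   = seq(1, 4, by = 1)
)

# predict() returns one fitted value per grid row
toy_grid$.fitted <- predict(toy_fit, newdata = toy_grid)

# Each prediction is intercept + slope * sqft_living + slope * bathrooms
b <- coef(toy_fit)
manual <- b["(Intercept)"] + b["sqft_living"] * toy_grid$sqft_living +
  b["bathrooms"] * toy_grid$bathrooms

# Compare the hand-computed values with predict()'s output
all.equal(unname(manual), unname(toy_grid$.fitted))
```

The final comparison checks that each tile’s fill value is nothing more than the fitted equation evaluated at that tile’s coordinates.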
Let’s move on to the visualization.
housing %>%
  ggplot(aes(x = sqft_living, y = bathrooms)) +
  geom_point(aes(color = price))
Here we see the scatter between our explanatory variables with the color gradient assigned to the dependent variable price.
Let’s add our tile. The code is the same as above; we’re just now including the geom_tile function with the model predictions:
housing %>%
  ggplot(aes(x = sqft_living, y = bathrooms)) +
  geom_point(aes(color = price)) +
  geom_tile(data = combos_aug, aes(fill = .fitted), alpha = 0.5)
As you can see, there is a more distinct gradient moving across sqft_living on the x-axis. With that said, we can also see some gradient across bathrooms on the y-axis. Similarly, the price, as visualized by the point color gradient, is far darker/lower on the bottom right of our chart.
Creating a model with tons of different explanatory variables can be very easy to do. Whether or not that creates deeper understanding of a given variable is the question. While this is a simple example, I hope that this proves helpful as you seek to make sense of some of your more complex multiple linear regression models.
In the course of the post we’ve covered the following:
- Multiple linear regression definition
- Building an MLR model
- Visualization/interpretation limitations
- Using Heatmaps in conjunction with scatter plots
If this was helpful, feel free to check out my other posts at datasciencelessons.com. Happy Data Science-ing!