Over the next few minutes, I’ll send you on your way to using linear regression for a bit more than explanation or prediction; instead, you’ll use it for inference.
We will leverage simulation for inference in three ways:
- Understanding model sensitivity
- P-values
- Confidence intervals
In this post, we’ll mostly be exploring the first one. It will be foundational to my next posts on using simulation to determine p-values and confidence intervals.
If you’re not familiar with how linear regression works in general, jump over to this post.
You can jump over here to find various posts on different variations of linear regression, from building models to understanding and explaining them.
Traditionally, we use linear regression to assess the relationship between variables. Building on that assessment, what we are going to learn here is how you can adjust the inputs of various regression models to drive a deeper understanding of the sensitivity, or variability, of the relationship between your explanatory & response variables.
So how might we go about determining the variability of the relationship of two variables?
Think about it like this…
What is the key output of a linear regression? If you guessed a line, then you’ve got it right! The regression output is effectively the equation of a line, and the slope of that equation indicates the relationship between the explanatory variable and Y. When seeking to understand the variation in the relationship between response & explanatory variables, it’s the slope that we’re after. Let’s say you ran your linear regression over different samples. The questions we would have: does our slope vary? How much does it vary? Is it positive sometimes and negative other times? And so on.
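To make that concrete, here’s a toy sketch in R. It uses simulated data (not the housing dataset we’ll introduce below) with a known true slope of 2, and shows that the fitted slope shifts a little from sample to sample:

```r
set.seed(42)

# A toy "population" with a true slope of 2
x <- runif(1000, 0, 10)
y <- 2 * x + rnorm(1000, sd = 5)
pop <- data.frame(x, y)

# Fit the same model on a fresh random sample of 50 rows each call
slope_of_sample <- function() {
  s <- pop[sample(nrow(pop), 50), ]
  coef(lm(y ~ x, data = s))[["x"]]
}

slope_of_sample()  # one sample's slope, near 2
slope_of_sample()  # a slightly different slope
```

Each call lands near 2, but never exactly on it; that spread is precisely what we’re going to study.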
The Punchline We’re After
We’ve done a bit of exposition to get to the punchline here, but hopefully it gives you a solid foundational footing to really understand and use this in practice.
To sum up our introduction, it comes down to this:
We want to understand the variability, and sensitivity to variability, of the relationship between two variables when we vary the sample driving the model.
Let’s Get Our First Slope!
The dataset we’re working with is a Seattle home prices dataset. I’ve used this dataset many times before and find it particularly flexible for demonstration. Each record is a home, with fields for price, square footage, # of beds, # of baths, and so forth.
Through the course of this post, we’ll be trying to explain price as a function of square footage.
There is certainly a lot of exploratory data analysis (EDA) you’d want to engage in before you jumped right into this section. There are also certain data pre-requisites that you’d confirm, but for the sake of this illustration, let’s dive in.
housing <- housing %>%
  mutate(sqft_living_log = log(sqft_living),
         price_log = log(price))

fit <- lm(price_log ~ sqft_living_log, data = housing)
summary(fit)
Perfect! We’ve got ourselves a linear model; let’s go ahead and visualize it. Keep in mind that I have taken the log of both variables to clean up and standardize their distributions.
housing %>%
  mutate(sqft_living_log = log(sqft_living),
         price_log = log(price)) %>%
  ggplot(aes(x = sqft_living_log, y = price_log)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
In this dataset, we’re just working with a sample of 4,600 homes; this is not an exhaustive population. As such, we are going to use a repeated sampling technique to generate many “perspectives”. Those perspectives will drive how we go about understanding the sensitivity of the relationship between our response and explanatory variables.
Sampling variability makes it difficult to draw conclusions about an underlying population. These many perspectives, or samples, of the data are how we mitigate the potentially adverse effects of sampling variability.
So above we have one line… but what we need is many lines, one for each of many samples.
What we’re going to do next is sample our housing data in smaller groups and give each group its own regression model.
First things first, we are going to use the rep_sample_n function (from the infer package) to randomly select a group of 100 homes… and we’ll repeat that process a total of 100 times.
samples <- housing %>% rep_sample_n(size = 100, reps = 100)
Now that we have our samples dataset, let’s visualize it much as we did before, only this time grouping the visualization by replicate. This lets us distinguish which replicate each point belongs to. As you can see in the code above, there will be 100 replicates of 100 records each.
ggplot(samples, aes(x = sqft_living_log, y = price_log, group = replicate)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
What you’ll see above are the various regression lines fit to each of the disparate samples of 100. Some slopes are steeper, others shallower. This is the foundation for understanding the range of slopes that could plausibly apply to the underlying population.
As you’d imagine, changing how we sample will change the amount of variation in slope. Below I’ve run the same code, but drawing only 10 random records for each replicate.
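For reference, that smaller-sample version is just the same pipeline with size = 10; this sketch assumes the housing dataframe already carries the log columns created earlier:

```r
library(dplyr)
library(ggplot2)
library(infer)

# Same resampling as before, but only 10 homes per replicate
small_samples <- housing %>%
  rep_sample_n(size = 10, reps = 100)

ggplot(small_samples, aes(x = sqft_living_log, y = price_log, group = replicate)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

With only 10 records per replicate, you should see the fitted lines fan out far more than they did with 100.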
So here you have the visualization, but you don’t yet have the actual details of the linear regression itself.
We’re going to need to run a separate regression for each replicate.
Since we already have our simulated dataset, we just need to group by replicate; in this case, not for the sake of aggregation, but to model at the group level. Once we declare our group_by, we leverage the do function to specify our group action. For the group action, we want to run a separate model for each group.
Now what we have is 100 regression outputs.
While there are many relevant pieces of the output, we are targeting the term for our explanatory variable.
Take a look at the code below!
# tidy() comes from the broom package
coefs <- samples %>%
  group_by(replicate) %>%
  do(lm(price_log ~ sqft_living_log, data = .) %>% tidy()) %>%
  filter(term == 'sqft_living_log')
We now have a dataframe with each replicate and the corresponding coefficient for our term of interest.
Let’s take a peek at our distribution of slopes.
ggplot(coefs, aes(x = estimate)) + geom_histogram()
We can see a mostly normal distribution. If we ran it with more replicates, it would look smoother still.
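Since we’ll build toward confidence intervals in the next post, it’s worth quantifying that distribution rather than just eyeballing the histogram. A minimal sketch, assuming the coefs dataframe from above:

```r
library(dplyr)

coefs %>%
  ungroup() %>%
  summarize(
    mean_slope = mean(estimate),            # center of the slope distribution
    sd_slope   = sd(estimate),              # spread, i.e. sampling variability of the slope
    lower      = quantile(estimate, 0.025), # rough bounds covering the middle 95%
    upper      = quantile(estimate, 0.975)
  )
```

The lower and upper values here are a preview of the percentile-style interval we’ll formalize later.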
One thing to keep in mind: I’m not suggesting that every time you run a linear regression, you need to arbitrarily run 100 of them over various samples of your data. For many business applications, your data may be representative of the entire population. Even in cases when you don’t have the entire population, the purposes of the two approaches are different. Here we are leveraging simulation and many linear regression models to eventually make inferential claims about the underlying population. It still makes sense to leverage linear regression in other formats for things like modeling for explanation/description, or prediction.
Variation in Slope
As we seek to understand the distribution of slope coefficients, it can be very helpful to vary the data that supports that distribution. As displayed above, increasing the sample size of each replicate reduces the variation in slope, while shrinking it increases that variation.
Another thing that will drive greater variation in our slope is a reduction in variation in our explanatory variable. It may come as a bit of a surprise, but with a broader range of explanatory data points, our model has more information with which to explain a relationship; restrict that range and the slope estimates become less stable.
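To see this for yourself, try restricting the explanatory variable to a narrow band and comparing the spread of the resulting slopes. A hedged sketch reusing the earlier pipeline; the ±0.2 cutoffs around the median are arbitrary, chosen only for illustration:

```r
library(dplyr)
library(broom)
library(infer)

# Keep only homes within a narrow band of (log) living area,
# then rerun the same replicate-level regressions.
narrow_coefs <- housing %>%
  filter(sqft_living_log > median(sqft_living_log) - 0.2,
         sqft_living_log < median(sqft_living_log) + 0.2) %>%
  rep_sample_n(size = 100, reps = 100) %>%
  group_by(replicate) %>%
  do(lm(price_log ~ sqft_living_log, data = .) %>% tidy()) %>%
  filter(term == "sqft_living_log")

# Compare slope variability between the narrow-range and full-range runs
sd(narrow_coefs$estimate)
sd(coefs$estimate)
```

You should find the standard deviation of the narrow-range slopes is noticeably larger, confirming the point above.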
We have done a lot in a short amount of time. It’s easy to get lost when dissecting statistics concepts like inference. My hope is that you now have a strong foundational understanding of both the need for simulation and how to execute it to better understand the relationship between our response and explanatory variables.
If this was helpful, feel free to check out my other posts at datasciencelessons.com. Happy Data Science-ing!