Welcome to this lesson on calculating p-values.
Before we jump into how to calculate a p-value, it’s important to think about what the p-value is really for.
Hypothesis Testing Refresher
Without going into too much detail for this post, when establishing a hypothesis test, you will determine a null hypothesis. Your null hypothesis represents the world in which the two variables your assessing don’t have any given relationship. Conversely the alternative hypothesis represents the world where there is a statistically significant relationship such that you’re able to reject the null hypothesis in favor of the alternative hypothesis.
Before we move on from the idea of hypothesis testing… think about what we just said. You effectively need to prove that with little room for error, what we’re seeing in the real world could not be taking place in a world where these variables are not related or in a world where the relationship is independent.
Sometimes when learning concepts in statistics, you hear the definition, but take little time to conceptualize. There is often a lot of memorization of rule sets… I find that understanding the intuitive foundation of these principles will serve you far better when finding their practical applications.
Continuing on this vein of thought. If you want to compare your real world stat with the fake world, that’s exactly what you should do.
As you’d guess we can calculate our observed statistic by creating a linear regression model where we explain our response variable as a function of our explanatory variable. Once we’ve done this we can quantify the relationship between these two variables using the slope or coefficient identified through our ols regression.
But now we need to come up with a this idea of the null world… or the world where these variables are independent. This is something we don’t have, so we’ll need to simulate it. For our convenience, we’re going to leverage the infer package.
Let’s Calculate our Observed Statistic
First things first, let’s get our observed statistic!
The dataset we’re working with is a Seattle home prices dataset. I’ve used this dataset many times before and find it particularly flexible for demonstration. The record level of the dataset is by home and details price, square footage, # of beds, # of baths, and so forth.
Through the course of this post, we’ll be trying to explain price through a function of square footage.
Let’s create our regression model
fit <- lm(price_log ~ sqft_living_log, data = housing) summary(fit)
As you can see in the output above, the statistic we’re after is the
Estimate for our explanatory variable,
A very clean way to do this is to tidy our results such that rather than a linear model, we get a tibble. Tibbles, tables, or data frames are going to make it a lot easier for us to systematically interact with.
We’ll then want to filter down to the
sqft_living_log term and we’ll wrap it up by using the
pull function to return the estimate itself. This will return the slope as a number, which will make things easier to compare with our null distribution later on.
Take a look!
lm(price_log ~ sqft_living_log, data = housing)%>% tidy()%>% filter(term == 'sqft_living_log')%>% pull(estimate)
Time to Simulate!
To kick things off, you should know there are various types of simulation. The one we’ll be using here is what’s called permutation.
Permutation is particularly helpful when it comes to showing a world where variables are independent of one another.
While we won’t be going into the specifics of how a permutation sample is created under the hood; it’s worth noting that the sample will be normal and center around 0 for the observed statistic.
In this case, the slope would center around 0 as we’re operating under the premise that there is no relationship between our explanatory and response variables.
A few things for you to know:
- specify is how we determine the relationship we’re modeling:
- hypothesize is where we designate
- generate is how we determine the number of replications of our dataset we want to make. Note that if you did, one replicate and did not
calculateit would return a sample dataset of the same size as the original dataset.
- calculate allows you to determine the calculation in question (slope, mean, median, diff in means, etc.)
library(infer) set.seed(1) perm <- housing %>% specify(price_log ~ sqft_living_log) %>% hypothesize(null = 'independence') %>% generate(reps = 100, type = 'permute') %>% calculate('slope') perm hist(perm$stat)
Same distribution with 1000 reps
Null Sampling Distribution
Ok we’ve done it! We’ve created what is known as the null sampling distribution. What we’re seeing above is a distribution of 1000 slopes each modeled after 1000 simulations of independent data.
This gives us just what we needed. A simulated world against which we can compare reality.
Taking the visual we just made, let’s use a density plot and add a vertical line for our observed slope, marked in red.
ggplot(perm, aes(stat)) + geom_density()+ geom_vline(xintercept = obs_slope, color = 'red')
Visually, you can see that this is happening far beyond the occurrences of random chance.
As you can guess from visually looking at this the p-value here is going to be 0. As to say, in 0% of the null sampling distribution is greater than or equal to our observed statistic.
If in fact we were seeing cases where our permuted data was greater than or equal to our observed statistic, we would know that it was just random.
The reiterate the message here, the purpose of p-value is to give you an idea of how feasible it is that we saw such a slope randomly versus a statistically significant relationship.
While we know what our p-value will be here, let’s get you set up with the calculation for p-value.
To re-prime this idea; p-value is the portion of replicates that were (randomly) greater than or equal to our observed slope.
You’ll see in our
summarise function that we’re checking to see whether our stat or slope is greater than or equal to the observed slope. Each record will be assigned TRUE or FALSE accordingly.. When you wrap that in a mean function, TRUE will represent 1 and FALSE 0, resulting in a proportion of the cases stat was greater than or equal to our observed slope.
perm %>% summarise(p_val = 2 * round(mean(stat >= obs_slope),2))
For the sake of identifying the case of a weaker relationship in which we would not have sufficient evidence to reject the null hypothesis, let’s look at price explained as a function of the year it was built.
Using the same calculation as above, this results in a p-value of 12%; which according to a standard confidence level of 95%, is not sufficient evidence to reject the null hypothesis.
Final Notes on P-value Interpretation
One final thing I want to highlight just one more time….
The meaning of 12%. We saw that when we randomly generated an independent sample… a whole 12% of the time, our randomly generated slope was as or more extreme…
You might see such a result as much as 12% just due to random chance
That’s it! You’re a master of the calculating & understanding p-value.
In a few short minutes we have learned a lot:
- hypothesis testing
- linear regression refresher
- sampling explanation
- learning about infer package
- building a sampling distribution
- visualizing p-value
- calculating p-value
It’s easy to get lost when dissecting statistics concepts like p-value. My hope is that having a strong foundational understanding of the need and corresponding execution allows you to understand and correctly apply this to any variety of problems.
If this was helpful, feel free to check out my other posts at https://medium.com/@datasciencelessons. Happy Data Science-ing!