The chi-square statistic is a useful tool for understanding the relationship between two categorical variables.
For the sake of example, let’s say you work for a tech company that has rolled out a new product and you want to assess the relationship between this product and customer churn. In the age of data, many companies, tech or otherwise, run the risk of treating anecdotal evidence or a high-level visualization as proof of a given relationship. The chi-square statistic gives us a way to quantify and assess the strength of the relationship between a given pair of categorical variables.
Let’s explore chi-square from this lens of customer churn.
You can download the customer churn dataset that we’ll be working with from Kaggle. This dataset provides details for a variety of telecom customers and whether or not they “churned”, that is, closed their account.
Regardless of what company, teams, products, or industries you work with, the following example should be very generalizable.
Now that we have our dataset, let’s quickly use dplyr’s select() to pull out the fields we’ll be working with. For simplicity’s sake, I’ll also drop the number of levels in StreamingTV down to two. You can certainly run a chi-square test on categorical variables with more than two levels, but as we venture to understand it from the ground up, we’ll keep it simple.
churn <- churn %>%
  select(customerID, StreamingTV, Churn) %>%
  mutate(StreamingTV = ifelse(StreamingTV == 'Yes', 1, 0))
Churn is going to be classified as a Yes or a No. As you just saw, StreamingTV will be encoded with either a 1 or 0.
Exploratory Data Analysis
I won’t go into great depth on exploratory data analysis here, but I will give you two quick tools for assessing the relationship between two categorical variables.
Proportion tables are a great way to establish some fundamental understanding of the relationship between two categorical variables.
table() gives us a quick idea of the counts in any given level; wrapping that in prop.table() allows us to see the percentage breakdown.
Let’s now pass both variables to our table() function:
table(churn$StreamingTV, churn$Churn)
round(prop.table(table(churn$StreamingTV, churn$Churn), 1), 2)
Once you pass a second variable into the proportion table, you can choose the direction in which relative proportions are computed. In this case, the second argument we pass to prop.table(), 1, specifies that we’d like to see the relative proportion of records within each row, that is, within each value of StreamingTV. As you can see in the above table, customers without streaming tv remained active 76% of the time, whereas customers with streaming tv actually stuck around less often, at 70%.
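To make the role of that second argument concrete, here is a minimal sketch on made-up data (the `toy` data frame is invented for illustration and is not the Kaggle dataset). It contrasts row-wise and column-wise proportions.

```r
# Toy data (made up for illustration; not the Kaggle churn dataset)
toy <- data.frame(
  StreamingTV = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  Churn       = c("No", "No", "No", "Yes", "No", "No", "No", "Yes", "Yes", "Yes")
)

counts <- table(toy$StreamingTV, toy$Churn)

# margin = 1: each row sums to 1 (churn split within each StreamingTV value)
by_row <- prop.table(counts, margin = 1)

# margin = 2: each column sums to 1 (StreamingTV split within each Churn value)
by_col <- prop.table(counts, margin = 2)

round(by_row, 2)
round(by_col, 2)
```

Row-wise proportions answer "of the streamers, how many churned?", while column-wise proportions answer "of the churners, how many were streamers?". For the churn question in this post, the row-wise view is the one we want.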
Now, before we get ahead of ourselves and say that having streaming tv is most certainly causing more people to churn, we need to assess whether we really have grounds to make such a claim. Yes, the proportion of retained customers is lower, but the difference could be random noise. More on this shortly.
Time to Visualize
This will give us similar information to what we just saw, but a visualization tends to make relative values quicker to grasp.
Let’s start off with a quick bar plot with StreamingTV across the x-axis and Churn as the fill.
churn %>%
  ggplot(aes(x = StreamingTV, fill = Churn)) +
  geom_bar()
As you can see, nearly as many tv streamers churned as non-streamers, despite a substantially lower total customer count. Similar to what we saw with proportion tables, a 100% stacked bar chart helps assess relative distribution among values of a categorical variable. All we have to do is pass position = 'fill' to geom_bar().
churn %>%
  ggplot(aes(x = StreamingTV, fill = Churn)) +
  geom_bar(position = 'fill')
Diving into the Chi-square Statistic
Now there appears to be some sort of relationship between the two variables, yet we don’t have an assessment of its statistical significance. In other words, is there something real behind the relationship between tv streaming and churn? Did streamers hate the service so much that they churned at a higher rate? Does their overall bill appear way too high as a product of the streaming plan, such that they churn altogether?
All great questions, and we won’t have the answer to them just yet, but what we are doing is taking the first steps to assessing whether this larger investigative journey is worthwhile.
Before we dive into the depths of creating a chi-square statistic, it’s very important that you understand the purpose conceptually.
We can see two categorical variables that appear to be related; however, we don’t definitively know whether the disparate proportions are a product of randomness or of some underlying effect. This is where chi-square comes in. The chi-square test statistic is effectively a comparison of our distribution to the distribution we would expect if the two variables were indeed perfectly independent.
So first things first, we need a dataset to represent said independence.
Generating Our Sample Dataset
We will be making use of the infer package. This package is incredibly useful for creating sample data for hypothesis testing, creating confidence intervals, etc.
I won’t break down all of the details on how to use infer, but at a high level, you’re creating a new dataset. In this case, we want a dataset that looks a lot like the churn dataset we just saw, only this time we want to ensure the two variables are independent, i.e. tv streamers shouldn’t show a greater occurrence of churn.
An easy way to think about infer is as three steps: specify, hypothesize, and generate. We specify the relationship we’re modeling, we state the null hypothesis (here, independence), and finally we specify the number of replicates we want to generate. A replicate in this case will mirror the row count of our original dataset. There are instances in which you would create many replicates of the same dataset and make calculations on top of them, but not for this part of the process.
churn_perm <- churn %>%
  specify(Churn ~ StreamingTV) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1, type = "permute")
Let’s quickly take a look at this dataset.
As you can see, we have the two variables we specified, as well as replicate. All records in this table will have replicate equal to 1, as we only made a single replicate.
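As a rough intuition for what a permutation-based replicate is doing, here is a base R sketch on made-up data. It assumes that permuting under an independence null amounts to shuffling one column while leaving the other in place; infer’s internals may differ in detail, and the `toy` data frame is invented for illustration.

```r
set.seed(42)  # arbitrary seed so the shuffle is reproducible

# Toy data (made up for illustration; not the Kaggle churn dataset)
toy <- data.frame(
  StreamingTV = rep(c(0, 1), times = c(60, 40)),
  Churn       = c(rep(c("No", "Yes"), times = c(48, 12)),
                  rep(c("No", "Yes"), times = c(24, 16)))
)

# Shuffle the Churn column, leaving StreamingTV untouched
toy_perm <- toy
toy_perm$Churn <- sample(toy$Churn)

# The overall Yes/No counts are unchanged by the shuffle...
table(toy$Churn)
table(toy_perm$Churn)

# ...but any link between the two columns is now down to chance alone
table(toy_perm$StreamingTV, toy_perm$Churn)
```

Because the shuffle keeps the same number of streamers and the same number of churners, the only thing it breaks is the pairing between them, which is exactly what independence means here.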
Let’s quickly plot our independent dataset to see the relative proportions now.
churn_perm %>%
  ggplot(aes(x = StreamingTV, fill = Churn)) +
  geom_bar(position = 'fill')
As desired, you can see that the relative proportions line up almost exactly. There is some randomness at play, so the two may not line up perfectly, but that’s really the point. We’re not doing this quite yet, but remember when I mentioned the idea of creating many replicates?
What might the purpose of that be?
If we create this sample dataset tons of times, do we ever see a gap as wide as the 70% versus 76% retention we saw in our observed dataset? If so, how often do we see it? Is it so often that we don’t have grounds to chalk up the difference to anything more than random noise?
Alright, enough of that rant. On to assessing how much our observed data differs from our sample data.
Let’s Get Calculating
Now that we really understand our purpose, let’s go ahead and calculate our statistic. Simply enough, our intent is to calculate the distance between each cell of our table of observed counts and the corresponding cell of our sample counts.
The formula for said “distance” looks like this:
sum(((obs - sample)^2)/sample)
- We subtract the sample count from the observed count in each cell,
- but square the differences so that they don’t cancel each other out.
- We divide each squared difference by the sample count to prevent any single cell from having too great a presence due to its size,
- and finally we take the sum.
The chi-square statistic that we get is: 20.1
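The code behind that number isn’t shown, so here is a hedged sketch of the same formula on made-up counts (not the churn data, so it will not reproduce 20.1). One swap to note: instead of counts from a single permuted sample, it uses the textbook expected counts under independence (row total times column total, divided by the grand total), which is what base R’s chisq.test() uses as well.

```r
# Made-up 2x2 counts for illustration (not the churn data)
obs <- matrix(c(48, 12,
                24, 16),
              nrow = 2, byrow = TRUE,
              dimnames = list(StreamingTV = c("0", "1"),
                              Churn = c("No", "Yes")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Same formula as above, with expected counts in place of the sample counts
chi_sq <- sum((obs - expected)^2 / expected)

# Cross-check against base R (correct = FALSE turns off the continuity correction)
chi_sq_builtin <- unname(chisq.test(obs, correct = FALSE)$statistic)

chi_sq
chi_sq_builtin
```

The two values should agree, which is a useful sanity check that the hand-rolled formula matches the standard Pearson chi-square statistic.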
So, great. We understand the purpose of the chi-square statistic, we even have it… but what we still don’t know is… is a chi-square stat of 20.1 meaningful?
Earlier in the post, we spoke about how we can use the infer package to create many, many replicates. A hypothesis test is precisely the time for that type of sampling.
Let’s use infer again, only this time we’ll generate 500 replicates and calculate a chi-square statistic for each replicate.
churn_null <- churn %>%
  specify(Churn ~ StreamingTV) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 500, type = "permute") %>%
  calculate(stat = "Chisq")
churn_null
Based on the above output, you can see that each replicate has its own stat.
Let’s use a density plot to see what our distribution of chi-square statistics looks like.
churn_null %>%
  ggplot(aes(x = stat)) +
  # Add density layer
  geom_density()
At first glance, we can see that the distribution of chi-square statistics is very right-skewed. We can also see that our statistic of 20.1 is not even on the plot.
Let’s add a vertical line to show how our observed chi-square compares to the permuted distribution.
# obs_chi_sq holds the observed chi-square statistic calculated earlier (20.1)
churn_null %>%
  ggplot(aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = obs_chi_sq, color = "red")
When it comes to having sufficient evidence to reject the null hypothesis, namely that there is no relationship between the two variables, this is promising.
As a final portion of this lesson on how to use chi-square statistics, let’s talk about how we should go about calculating the p-value.
Earlier I mentioned the idea that we might want to know if our simulated chi-square stat was ever as large as our observed chi-square stat, and if so how often that occurred.
That is the essence of the p-value.
When taking the chi-square stat of two variables that we know are independent of one another (the simulated case), what percentage of these replicates’ chi-square stats are greater than or equal to our observed chi-square stat?
# One-sided: only large chi-square values count as extreme, so no doubling is needed
churn_null %>%
  summarise(p_value = mean(stat >= obs_chi_sq))
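In the case of our sample, we get a p-value of 0, which is to say that over the course of 500 replicates we never matched or surpassed a chi-square stat of 20.1. Strictly speaking, this tells us the p-value is below 1 in 500 rather than exactly zero.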
As such, we would reject the null hypothesis that churn and streaming tv are independent.
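For readers who want to see the whole loop without infer, here is a self-contained base R sketch of a permutation p-value on made-up data. The data, the seed, and the helper `chi_sq_stat()` are all invented for illustration; it is one way to approximate the idea, not a reproduction of the analysis above.

```r
set.seed(123)  # arbitrary seed for reproducibility

# Made-up data for illustration (not the churn data)
streaming <- rep(c(0, 1), times = c(60, 40))
churned   <- c(rep(c("No", "Yes"), times = c(48, 12)),
               rep(c("No", "Yes"), times = c(24, 16)))

# Hypothetical helper: Pearson chi-square without continuity correction
chi_sq_stat <- function(x, y) {
  unname(suppressWarnings(chisq.test(table(x, y), correct = FALSE))$statistic)
}

obs_stat <- chi_sq_stat(streaming, churned)

# Null distribution: recompute the statistic after shuffling one variable
n_reps <- 500
null_stats <- replicate(n_reps, chi_sq_stat(streaming, sample(churned)))

# One-sided p-value: share of shuffled statistics at least as large as observed
p_value <- mean(null_stats >= obs_stat)

# Variant that can never return exactly 0 (counts the observed data as one replicate)
p_value_adj <- (sum(null_stats >= obs_stat) + 1) / (n_reps + 1)

c(p_value = p_value, p_value_adj = p_value_adj)
```

The adjusted variant is a common convention when a simulated p-value comes out as 0: with 500 replicates, the smallest value it can report is 1/501, which is a more honest summary than claiming the probability is zero.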
We have done a lot in a short amount of time. It’s easy to get lost when dissecting statistics concepts like the chi-square statistic. My hope is that a strong foundational understanding of why this statistic is needed and how it is calculated gives you the instinct to recognize the right opportunity to put this tool to work.
In just a few minutes, we have covered:
- A bit of EDA for pairs of categorical variables
- Proportion tables
- Bar Charts
- 100% Stacked Bar
- Chi-square explanation & purpose
- How to calculate a chi-square statistic
- Hypothesis testing with infer
- Calculating p-value
If this was helpful, feel free to check out my other posts at datasciencelessons.com. Happy Data Science-ing!