## Bayesian Statistics at the Heart of Data Science

Data science has deep roots in Bayesian statistics. Rather than giving the historical background of Thomas Bayes, I’ll give you a high-level perspective on Bayesian statistics, Bayes’ theorem, and how to leverage it as a tool in your work! Bayesian ideas are rooted in so many aspects of data science & machine learning that a strong foundation in these principles is incredibly important.

## Bayes’ Theorem at Its Core

The main idea behind Bayes’ theorem is that if you have some insight or knowledge related to an upcoming event, that insight can be used to help describe the likely outcome of that event.

One way to consider this might be in the case of customer churn as it relates to customer activity.

## What Is Our Question?

To formulate this as a problem we can solve using Bayes’ theorem, it’s helpful to identify exactly what our question is.

Say you know that a customer churned and, given that information, you want to identify how likely it is that the customer belonged to your active customer group rather than your inactive customer group. So our question might be phrased as, “Given that a customer churned, what is the probability that they were active?”

So now we need to figure out how to use the information we know to get us to that answer.
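Written out with Bayes’ theorem, the question looks like the expression below. The notation is my own shorthand rather than anything used elsewhere in this post: $A$ means the customer is active, $I$ means the customer is inactive, and $C$ means the customer churned.

```latex
P(A \mid C) = \frac{P(C \mid A)\, P(A)}{P(C \mid A)\, P(A) + P(C \mid I)\, P(I)}
```

Here $P(A)$ and $P(I)$ are the shares of active and inactive customers before we know anything about churn (the prior), while $P(C \mid A)$ and $P(C \mid I)$ are the churn probabilities within each group.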

Let’s start to conceptualize this through code.

## Let’s Code It!

As you’ll see below, we will perform 100K ‘draws’ for each scenario, where each draw samples 100 customers and counts how many of them churned. The first simulation represents active customers & the second represents inactive customers, with churn probabilities of 20% & 50% respectively.

```
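# 100,000 draws; each draw counts churners among 100 customers
active <- rbinom(100000, 100, .2)    # active customers: 20% churn probability
inactive <- rbinom(100000, 100, .5)  # inactive customers: 50% churn probability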
```

If you take the average churn count across the simulations, you will see that it lingers around the probability (`prob`) multiplied by the `size`: 20 & 50 respectively.

Let’s quickly look at the distribution of outcomes for each simulation.
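A quick way to check this is to compare the simulated averages against `size * prob`. This is a minimal sketch: the `set.seed()` call and the `_sim` variable names are my additions so the check is repeatable and does not overwrite the objects above.

```r
# Re-create the two simulations with a fixed seed (the seed is an assumption
# added for repeatability; the original code does not set one)
set.seed(42)
active_sim   <- rbinom(100000, size = 100, prob = 0.2)
inactive_sim <- rbinom(100000, size = 100, prob = 0.5)

# Simulated average churn count per 100 customers
mean_active   <- mean(active_sim)
mean_inactive <- mean(inactive_sim)

# Theoretical mean of a binomial: size * prob
expected_active   <- 100 * 0.2
expected_inactive <- 100 * 0.5

print(c(simulated = mean_active, expected = expected_active))
print(c(simulated = mean_inactive, expected = expected_inactive))
```

The simulated means should sit very close to 20 and 50, with only a small amount of sampling noise.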

`hist(active)`

`hist(inactive)`

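As mentioned, you can see that each distribution centers on its mean and is approximately normal in shape.

Now let’s look only at the draws where exactly 35 of the 100 customers churned, in either group. In other words, if we observed 35 churners out of 100, which group is that result more likely to have come from?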
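To see how close “normal” is here, one option is to compare the exact binomial probabilities with a normal density that has the same mean and standard deviation. This is only a sketch; the handful of churn counts in `k` are values I picked for illustration.

```r
# Parameters of the active-customer simulation
size <- 100
prob <- 0.2

# Mean and standard deviation of the matching binomial distribution
mu    <- size * prob                      # 20
sigma <- sqrt(size * prob * (1 - prob))   # 4

# A few churn counts around the mean (illustrative choices)
k <- c(12, 16, 20, 24, 28)

comparison <- data.frame(
  churned  = k,
  binomial = dbinom(k, size = size, prob = prob),  # exact probability
  normal   = dnorm(k, mean = mu, sd = sigma)       # normal approximation
)
print(round(comparison, 4))
```

The two columns track each other closely near the center. The approximation is weakest out in the tails, which is worth keeping in mind for extreme values.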

```
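# Number of draws in each simulation with exactly 35 churners
active_sample <- sum(active == 35)
inactive_sample <- sum(inactive == 35)
# Share of those draws that came from each group
active_sample / (active_sample + inactive_sample)
inactive_sample / (active_sample + inactive_sample)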
```

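What we can see here is that, in cases where the simulated churn was exactly 35, only 12.6% of the draws came from the active user cohort, versus an 87.4% probability that those draws were actually inactive users. Your exact figures will differ somewhat from run to run, since the simulation is random and only a small number of draws land on exactly 35.

Also take a look back at the histograms around the value of 35 to solidify this idea. Although 35 sits in the tail of both distributions, relative to each distribution’s spread it is closer to the center of the inactive one, which is why most of the matching draws come from that group.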
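The simulation is approximating a quantity that can also be computed directly with `dbinom()` and Bayes’ theorem. The sketch below assumes a 50/50 prior, which is what equal draw counts for the two groups amount to; the variable names are my own.

```r
# Probability of exactly 35 churners out of 100 under each group's churn rate
lik_active   <- dbinom(35, size = 100, prob = 0.2)
lik_inactive <- dbinom(35, size = 100, prob = 0.5)

# Equal draw counts for both groups correspond to a 50/50 prior (assumption)
prior_active   <- 0.5
prior_inactive <- 0.5

# Bayes' theorem: weight each likelihood by its prior, then normalize
post_active <- (lik_active * prior_active) /
  (lik_active * prior_active + lik_inactive * prior_inactive)
post_inactive <- 1 - post_active

# Expected number of draws (out of 100,000 per group) landing on exactly 35
expected_hits <- 100000 * c(active = lik_active, inactive = lik_inactive)

print(round(c(active = post_active, inactive = post_inactive), 3))
print(round(expected_hits))
```

The exact calculation puts the active share at roughly 18%. Only on the order of a hundred draws in total are expected to hit exactly 35, so a single simulation run can land a few percentage points away from that.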

## Accounting for Distribution Between Simulations

So up to this point we’ve been simulating an even split between active and inactive customers, but let’s say we don’t have an even distribution of active to inactive customers. We can incorporate that idea by creating simulations with different draw counts.

Here we do the same thing, but this time we recreate an 80/20 split between the two groups, meaning that 80% of our customer base is active.

```
# 80% of the 100,000 draws are active customers, 20% are inactive
active <- rbinom(80000, 100, .2)
inactive <- rbinom(20000, 100, .5)
```

Now when we go through the same process as before…

```
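# Number of draws in each simulation with exactly 35 churners
active_sample <- sum(active == 35)
inactive_sample <- sum(inactive == 35)
# Share of those draws that came from each group
active_sample / (active_sample + inactive_sample)
inactive_sample / (active_sample + inactive_sample)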
```

We actually see a large drop in the probability that it is an inactive customer. With good reason, of course: we’ve updated our analysis according to the prior distribution of active to inactive customers.
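The effect of the prior can be made explicit with a small helper that applies Bayes’ theorem directly. `posterior_active()` is a hypothetical function I’m introducing for illustration, with the churn rates from this post as its defaults.

```r
# Posterior probability that a result of k churners came from the active group.
# posterior_active() is a hypothetical helper, not part of any package.
posterior_active <- function(k, prior_active, size = 100,
                             p_active = 0.2, p_inactive = 0.5) {
  num <- dbinom(k, size = size, prob = p_active) * prior_active
  den <- num + dbinom(k, size = size, prob = p_inactive) * (1 - prior_active)
  num / den
}

# Same observation (35 churners out of 100), two different priors
even_split   <- posterior_active(35, prior_active = 0.5)
skewed_split <- posterior_active(35, prior_active = 0.8)

print(round(c(even = even_split, eighty_twenty = skewed_split), 3))
```

Moving from a 50/50 prior to an 80/20 prior lifts the active share from roughly 18% to roughly 47%, so the inactive share falls accordingly. The data are unchanged; only the prior moved.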

## Conclusion

There is sooo much more we could go into with Bayesian statistics, but I hope this application served as a strong introduction. Be sure to comment on where you’d like more detail or if there were things you liked.

Check out my other posts, lessons, & insights at datasciencelessons.com & follow me on twitter @data_lessons!