You’re Not a Data Scientist Until You Understand the Binomial Distribution

Inference at the heart of data analysis

What is the point of inference?

Inference is about drawing conclusions about a larger population from a sample of observed data. For example, you have a sample of the country’s opinion on the president and you’d like to draw some conclusions about the population at large. Obviously you won’t be asking every single citizen; rather, you will make an inference about the underlying population using the sample data.

Where does probability come in?

While inference is key, we can only make inferences from data that is generated by some model; cue probability… Probability describes the process of generating that data from a given model.

To put this all together in a simple way, consider the following:

Observed/sample data -> Model -> Probability/data generation -> inference

Classic Coin Flips and the binomial distribution

Time to break down the magic of the binomial distribution.

Coin flips are a classic example that helps break down binomial distributions.

Just as we set up a moment ago, we are going to create a model that allows us to generate random data all based upon the idea of a simple coin flip.

Pop quiz! What’s the likelihood that any given flip returns heads or tails? You nailed it! 😉 50/50.

So let’s jump into R to generate this type of experiment. For this purpose specifically, there is a great function in R called rbinom. We will use this to simulate a coin flip.

Throw the command below into R and see what you get. The first parameter is the number of ‘pulls’ or draws, the second is the number of flips in each draw, and the third is the probability of heads on each flip. Run it as many times as you want and you’ll see that approximately 50% of the time the function returns 1 and the other 50% of the time it returns 0.

rbinom(1,1,.5)

With a single flip per draw, rbinom can only return 0 or 1. Calling 1 ‘heads’ and 0 ‘tails’, our first flip was heads. Let’s do it again, but 10 times.

rbinom(10,1,.5)

This time we got heads 80% of the time, that is, on 8 of the 10 flips.

If you run this over and over again, the share of heads across all runs will approach 50%. You should also try it with larger and larger numbers of flips.

A nice way to see the percentage of the time a given flip occurs…

flips = rbinom(10000, 1, .5)
mean(flips == 1)

Almost exactly 50% of the time it was heads.
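To see the share of heads settle toward 50% as the number of flips grows, here is a minimal sketch. The seed value and the names n_flips and share_heads are assumptions chosen for illustration; they are not part of the original walkthrough.

```r
# Sketch: watch the share of heads settle near 0.5 as the number of flips grows.
set.seed(42)  # arbitrary seed (an assumption), only so the run is reproducible

n_flips <- c(10, 100, 1000, 10000, 100000)

# For each sample size, simulate that many single fair flips and
# record the proportion that came up heads (1).
share_heads <- sapply(n_flips, function(n) mean(rbinom(n, 1, 0.5) == 1))

data.frame(n_flips, share_heads)
```

Small samples can land well away from 0.5 (like the 80% above), while the larger ones should sit very close to it.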

Now let’s change it up a bit: we are going to perform 1M draws of our experiment, but this time each draw consists of 10 flips, each with a 50% chance of heads, and returns the number of heads out of those 10.

flips = rbinom(1000000, 10, .5)

When we pass flips into a hist() function, we get the following:

hist(flips)

The tallest bar sits at 5: about 25% of the experiments (24.6% in theory) produced exactly 5 heads out of 10. That proportion is known as the density of the binomial at that point.

Simulation helps us answer questions about a distribution and its behavior.
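The share of draws with exactly 5 heads can be checked against the exact binomial probability, which base R exposes as dbinom. This is a sketch under stated assumptions: the seed and the names simulated and exact are illustrative only.

```r
# Sketch: compare the simulated density at 5 heads with the exact value.
set.seed(42)  # arbitrary seed (an assumption), only so the run is reproducible

flips <- rbinom(1000000, 10, 0.5)

# Proportion of simulated experiments with exactly 5 heads out of 10.
simulated <- mean(flips == 5)

# Exact probability of 5 heads in 10 fair flips: choose(10, 5) / 2^10.
exact <- dbinom(5, size = 10, prob = 0.5)

c(simulated = simulated, exact = exact)
```

The exact value is 252/1024, roughly 0.246, and the simulated share should agree with it to two or three decimal places.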

Cumulative Density

The next idea is what is called cumulative density. Just as we looked at the density of the binomial at a single point, we can add up the density at that point and everything below it (or above it), which gives a cumulative probability.

Let’s work through the same example as earlier. The condition that flips be less than or equal to 5 evaluates to TRUE for approximately 62% of the draws. It is well above 50% because the 5-heads outcome, which alone accounts for about 25% of the draws, is included.

flips = rbinom(100000, 10, .5)
mean(flips <= 5)
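The simulated cumulative value can be compared with the exact one from base R’s pbinom. Again a sketch: the seed and the names simulated_le5, exact_le5 and exact_gt5 are assumptions made for illustration.

```r
# Sketch: compare the simulated cumulative density with the exact value.
set.seed(42)  # arbitrary seed (an assumption), only so the run is reproducible

flips <- rbinom(100000, 10, 0.5)

# Share of simulated experiments with 5 or fewer heads.
simulated_le5 <- mean(flips <= 5)

# Exact P(X <= 5) for 10 fair flips: 638/1024, about 0.623.
exact_le5 <- pbinom(5, size = 10, prob = 0.5)

# Exact P(X > 5), the "above" side: 386/1024, about 0.377.
exact_gt5 <- pbinom(5, size = 10, prob = 0.5, lower.tail = FALSE)

c(simulated = simulated_le5, exact = exact_le5, above = exact_gt5)
```

The two exact values sum to 1, and the simulated share should land within a fraction of a percentage point of the exact one.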

Conclusion

Well done! There is a lot to take in there, but I hope it proved useful as a primer on the binomial distribution and on how to think about using a sample to make inferences about the population behind it.

If you found this useful, feel free to check out all of my other posts at datasciencelessons.com & as always, Happy Data-Sciencing
