What Every Data Scientist Needs to Know About Clustering

Introduction to Machine Learning

Machine learning is a term that gets a lot of buzz, yet its different areas are often poorly understood.

One of the first distinctions made within machine learning is between what’s called supervised and unsupervised learning.

Having a basic understanding of this distinction and the purposes and applications of each will be incredibly helpful as you pave your way in the world of machine learning and data science.

Supervised & Unsupervised Learning

Supervised learning is certainly the more famous category.

Supervised learning consists of classification & regression, which effectively means you determine a response variable and explanatory variables and use them to model a relationship, whether that’s for the sake of explanation or prediction. (You can learn more about that distinction here).

Classification means modeling the relationship between a categorical variable and some combination of other variables. This could be predicting which inbox an email belongs to, predicting whether someone will default on a loan, or predicting whether a sales lead will convert or not.

Regression also attempts to model a relationship between dependent and independent variables; in this case, however, we’re trying to model a continuous variable. Regression is probably what you’ve heard about the most. This could be modeling home prices, revenue numbers, the age of a tree, the top speed of a car, customer product usage, etc.

Unsupervised learning

Now that we have all of that out of the way, let’s talk about unsupervised learning.

To put it simply, rather than pre-determining what we want our algorithm to find, we provide the algorithm little to no guidance ahead of time.

In other words, rather than explicitly telling our algorithm what we’d like to predict or explain, we kick back and hand the baton off to the algorithm to identify prominent patterns among given features.

Let’s kick this off with an example of how you might use supervised and unsupervised learning methods on housing data.

One supervised approach could be to use the data to train a model to predict home prices.

An unsupervised approach could be to identify natural groupings of price or some combination of variables, e.g. price & number of rooms.

What I just explained is actually the technique we’re here to learn about. With that let’s dive in.

What Exactly Is Clustering or Cluster Analysis?

To facilitate the best understanding of clustering, a good start is to understand its fundamental purpose.


If you’re a data scientist, one of the pre-requisites to any analysis is coming to understand your data in some capacity. One aspect of that understanding comes from the notion of similarity between records of a dataset.

To further define this idea, we want to understand the relative similarity or dissimilarity between any and all records.

One mechanism for accomplishing this is by identifying natural groupings of records. These groupings can be defined as the records that are most similar to one another and most dissimilar from records of other groupings.

If it’s not obvious, this is where clustering comes into play. It is the algorithmic approach to creating said groupings.

To further drive this point home, clustering analysis helps you answer one fundamental question: How similar or dissimilar are any two observations?

How Is it Measured?

We are trying to assess the similarity of two records, and we use the distance between those records to help define that.

For a similarity score that ranges from 0 to 1, the dissimilarity, or distance, is defined as 1 – similarity.

The greater the distance, the greater the dissimilarity, and vice versa.

Let’s illustrate with the following hypothetical dataset.

chess <- data.frame(
                 y = c(2, 4),
                 x = c(5, 3))
row.names(chess) <- c('knight', 'king')

I created a little dataset that gives the x & y coordinates of two pieces on a chess board. (For the record, the columns of a chess board are called files and are labeled with letters, and the rows are called ranks and are numbered, but for simplicity’s sake, go along with me.)

We will use the dist function, which defaults to the Euclidean distance between two points. It puts the distance between the knight and the king at roughly 2.83.

dist_chess <- dist(chess)

You can also manually calculate the Euclidean distance between the two pieces. It also comes out to roughly 2.83.

knight <- chess[1,]
king <- chess[2,]

piece_distance <- sqrt((knight$y - king$y)^2 + (knight$x - king$x)^2)

Let’s throw up our pieces on a plot!

library(ggplot2)

ggplot(chess, aes(x = x, y = y)) +
  geom_point() +
  lims(x = c(0, 8), y = c(0, 8))
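
With more than two records, dist returns the distance between every pair, which is what lets us compare "any and all records." Here is a minimal sketch that adds a third piece; the rook and its coordinates are hypothetical and only there for illustration.

```r
# Same chess example with one extra, made-up piece (the rook)
pieces <- data.frame(
  y = c(2, 4, 1),
  x = c(5, 3, 1))
row.names(pieces) <- c('knight', 'king', 'rook')

# dist() returns every pairwise Euclidean distance
piece_dists <- dist(pieces)

# as.matrix() turns the result into a full square matrix that is easier to read
piece_matrix <- as.matrix(piece_dists)
print(round(piece_matrix, 2))
```

The smallest off-diagonal value marks the two most similar records, which in this made-up layout is the knight and the king.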

Applications of Clustering

Now after all of that explanation, what in the world is clustering for? Why should you spend your time learning it?

Cluster analysis is incredibly useful any time you want an assessment of similarity.

You may work at a software company where you want to understand how different users are similar or dissimilar, possibly to alter the offering, messaging, etc.

The applications run far beyond business as well: analysis of plant & animal species, user behavior, weather, and just about anything where we can measure a pattern.

When Is the Best Time to Use it?

There are many appropriate times to use cluster analysis. One of the most prominent is during exploratory data analysis. If you’re not familiar with exploratory data analysis, you can learn more on exploratory data analysis fundamentals here.

Without diving too deep into the principles of exploratory data analysis (EDA), the key intention of EDA is to familiarize yourself with the dataset you’re working with.

Clustering can be immensely helpful during this process.

Clustering Prep

Let’s jump into some of the pre-processing steps that are required before we can perform our analysis.

Don’t Forget to Scale!

Let’s jump back to the chess example. Each 1 unit increase in a value represents one cell in a given direction. This is a great example of when Euclidean distance makes perfect sense.

However, what if you are clustering with metrics that don’t exist on the same scale, something like annual revenue and number of employees, or foot size and vertical jump?

The challenge comes when the values we’re using aren’t comparable to one another, as displayed in my previous example.

Think of the following two scenarios.

Scenario 1:

You have two companies, both with the same number of employees, but one with 1000 more dollars in revenue than the other.

Scenario 2:

Now let’s swap the differing variable: they have the same revenue, but one has 1000 more employees than the other.

The first scenario describes two companies that would likely be incredibly similar; being only a thousand dollars apart on revenue is a minor difference and likely signifies two companies of very similar value. The second scenario, by contrast, highlights two companies that are vastly different. They could vary massively in industry, market segment, or whatever.

While the difference in both scenarios was 1000, that difference signified two very different things in each scenario.

The problem with these variables is that they have different averages and different variation, which is exactly the situation we just reviewed.
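
To make the two scenarios concrete, here is a small sketch on made-up numbers. The firms data frame and every value in it are hypothetical. Companies a & b differ only by 1000 dollars of revenue, while a & c differ only by 1000 employees.

```r
# Hypothetical companies: a vs b is scenario 1, a vs c is scenario 2
firms <- data.frame(
  revenue   = c(1000000, 1001000, 1000000),
  employees = c(50, 50, 1050))
row.names(firms) <- c('a', 'b', 'c')

# Euclidean distance on the raw, unscaled values
raw_dist <- as.matrix(dist(firms))

raw_dist['a', 'b']  # 1000 dollars apart
raw_dist['a', 'c']  # 1000 employees apart
```

On the raw values both pairs sit exactly the same distance apart, even though we would call the first pair nearly identical and the second pair very different.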

As such, when performing cluster analysis, it’s very important that we scale our metrics to have the same average and variability.

We will use an approach called standardization that effectively will bring the mean of our metric to 0 and the standard deviation to 1.

Technically speaking, we can manually scale a given variable as seen below.

scaled_var = (var - mean(var))/sd(var)

While it’s good to familiarize yourself with the logic of the calculation, it’s also very convenient to just use the scale function in R.
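
As a quick check, the sketch below applies both the manual formula and scale to a made-up set of companies. The firms data frame and its values are hypothetical; the only point is to see that the two approaches agree and what scaling does to the distances.

```r
# Hypothetical companies; a vs b differ by 1000 dollars, a vs c by 1000 employees
firms <- data.frame(
  revenue   = c(1000000, 1001000, 1000000, 5000000, 20000000),
  employees = c(50, 50, 1050, 200, 600))
row.names(firms) <- c('a', 'b', 'c', 'd', 'e')

# Manual standardization of one column
manual_revenue <- (firms$revenue - mean(firms$revenue)) / sd(firms$revenue)

# scale() standardizes every column and returns a matrix
firms_scaled <- scale(firms)

# The manual result matches the revenue column produced by scale()
all.equal(manual_revenue, as.numeric(firms_scaled[, 'revenue']))

# Distances on the scaled data
scaled_dist <- as.matrix(dist(firms_scaled))
scaled_dist['a', 'b']  # tiny: 1000 dollars is small relative to the spread in revenue
scaled_dist['a', 'c']  # much larger: 1000 employees is large relative to the spread in headcount
```

After scaling, a difference of 1000 is judged against how much each variable typically varies, so the two scenarios no longer look the same.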

Scaling & Distancing Housing Data

Let’s go through the same exercise with housing data.

I pulled down this Seattle home prices dataset from Kaggle. You can find that here.

Let’s do a quick visualization of the first two datapoints in the dataset.

Here are the datapoints we’re working with:

# assumes `housing` already holds the full Kaggle dataset
housing_subset <- housing[1:2, c('price', 'sqft_lot')]
ggplot(housing_subset, aes(x = sqft_lot, y = price)) +
  geom_point()  # geom assumed; the original plot layer was cut off

housing_dist <- dist(housing_subset)

# when I scale, I'm going to do so using the entire dataset, but
# when I take the dist I'll just use the subset

housing_scaled <- scale(housing[, c('price', 'sqft_lot')])
housing_scaled_dist <- dist(housing_scaled[1:2, ])

Similarity Score for Categorical Data

We have spent the entirety of the time so far talking about the Euclidean distance between two points and using that as our proxy for dissimilarity/similarity.

What about in the case of categorical data?

Just as we use Euclidean distance for numeric data, for categoricals we use something called the Jaccard index.

Let me explain the Jaccard index.

Let’s say you have two cases, a & b. The Jaccard index gives us the ratio of the instances when both a & b occurred relative to the number of times either occurred.

You can also think of it as the ratio of the intersection of a & b to the union of a & b.

We can use the same dist function we used earlier; just change the method to 'binary' and you’ll have a measure of distance. (The binary method returns 1 minus the Jaccard index, so identical records get a distance of 0.)

Let’s first create a dataset to play around with. Below you can see I’ve come up with two categorical variables for each company.

companies <- data.frame(
  industry = c('retail', 'retail', 'tech', 'finance', 'finance', 'retail'),
  segment = c('smb', 'smb', 'mid market', 'enterprise', 'mid market', 'enterprise'))
row.names(companies) <- c('a', 'b', 'c', 'd', 'e', 'f')
companies$industry <- as.factor(companies$industry)
companies$segment <- as.factor(companies$segment)

Make sure to declare your categoricals as factors!

Now we’ll turn our categoricals into dummy variables; you may have also heard the term one-hot encoding. The idea is that each value of the categorical is turned into a column and the row value is populated with either a 1 or a 0. We’ll use the dummy.data.frame function from the dummies package in R.

library(dummies)

companies_dummy <- dummy.data.frame(companies)

As mentioned, you’ll notice that each value of industry (finance, retail, & tech) now has its own column. We see the same thing for segment.
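
If the dummies package isn’t available to you, the same kind of 1/0 columns can be built with base R. This is only a sketch: the one_hot helper below is a hypothetical name of my own, not part of any package, and the resulting column names differ slightly from what dummy.data.frame produces.

```r
# Same companies data as above, declared as factors up front
companies <- data.frame(
  industry = c('retail', 'retail', 'tech', 'finance', 'finance', 'retail'),
  segment = c('smb', 'smb', 'mid market', 'enterprise', 'mid market', 'enterprise'),
  stringsAsFactors = TRUE)
row.names(companies) <- c('a', 'b', 'c', 'd', 'e', 'f')

# Hypothetical helper: one 1/0 column per level of a factor
one_hot <- function(f) {
  sapply(levels(f), function(lvl) as.integer(f == lvl))
}

# Bind the indicator columns for both categoricals side by side
companies_onehot <- cbind(one_hot(companies$industry), one_hot(companies$segment))
rownames(companies_onehot) <- row.names(companies)

print(companies_onehot)
```

Each row ends up with exactly two 1s, one for its industry and one for its segment.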

Let’s now run our dist function.

# named companies_dist so the result doesn't share a name with the dist function
companies_dist <- dist(companies_dummy, method = 'binary')

Here each company is compared to every other company. The distance between A & B is 0 because, if you recall, they were both smb retail. A distance of 1 would mean two companies held no similarity. You’ll notice that C & E had only one similarity, hence the distance of .67.
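
The .67 for C & E can be reproduced by hand from the intersection and union described earlier. In this sketch the 1/0 vectors are typed out manually for the two companies, and the element names are just labels I chose for readability.

```r
# Dummy-encoded rows for companies c (tech, mid market) and e (finance, mid market)
c_vec <- c(finance = 0, retail = 0, tech = 1, enterprise = 0, mid_market = 1, smb = 0)
e_vec <- c(finance = 1, retail = 0, tech = 0, enterprise = 0, mid_market = 1, smb = 0)

# Intersection: attributes both companies have (mid market only)
intersection_size <- sum(c_vec == 1 & e_vec == 1)

# Union: attributes at least one company has (finance, tech, mid market)
union_size <- sum(c_vec == 1 | e_vec == 1)

jaccard_index <- intersection_size / union_size   # similarity
jaccard_distance <- 1 - jaccard_index             # dissimilarity

# Cross-check against dist() with the binary method
dist_check <- as.numeric(dist(rbind(c_vec, e_vec), method = 'binary'))

round(jaccard_distance, 2)
round(dist_check, 2)
```

One shared attribute out of three gives a similarity of one third, and 1 minus that is the two thirds that dist reports.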


I hope you’ve enjoyed this breakdown of clustering.

We’ve covered the two main areas of machine learning, as well as the definition, purpose, applications, and measurement of clustering.

We’ve learned about preprocessing and how to compute the distance between two points, whether numeric or categorical.

Each of these lessons will prove incredibly foundational as you continue to learn and implement different clustering approaches in your analysis.

Happy Data Science-ing

