Kmeans clustering

Introduction

Clustering is a machine learning technique that falls into the unsupervised learning category. Without going into a ton of detail on the different machine learning categories, I’ll give a high-level description of unsupervised learning.

To put it simply, rather than pre-determining what we want our algorithm to find, we provide the algorithm little to no guidance ahead of time.

In other words, rather than explicitly telling our algorithm what we’d like to predict or explain, we kick back and hand the baton off to the algorithm to identify prominent patterns among given features.

If you’d like to learn more about that delineation, you can check out this post.

Clustering has a broad variety of applications and is an incredibly useful tool to have in your data science toolbox.

We will be talking about a very specific implementation of clustering, Kmeans. With that said, a foundational understanding of clustering in general is a helpful precursor to practical application. We won’t be diving into that here, but you can learn more at a conceptual level through this post.

Getting started with Kmeans

The K in Kmeans

K represents the number of groups or clusters you are seeking to identify. If you were performing a clustering analysis where you already knew there were three groupings, you would use that context to inform the algorithm that you need three groups. You won’t always know how many natural groupings there are, but you may know how many groupings you need. In the below examples we’ll use the Iris dataset, which is a collection of measurements associated with various iris species, but there are many other use cases: grouping customers based on usage, size, etc., or prospects based on likelihood to buy and likely purchase amounts. There are a wide variety of applications in business, biology, and elsewhere.

Learning about centroids

Once we’ve determined k, the algorithm will allocate k points randomly, and these points will operate as your “centroids”. What is a centroid, you might ask? A centroid is simply the center point of a cluster.

Once these center points, or centroids, have been randomly allocated, the Euclidean distance is calculated between each point and each centroid.
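
For two points, that distance is just the square root of the summed squared differences between their coordinates. A quick sketch in R, using a made-up point and centroid in the petal length/width plane:

point    <- c(1.4, 0.2)  # a flower’s petal length & width
centroid <- c(4.3, 1.3)  # a hypothetical centroid

sqrt(sum((point - centroid)^2))  # Euclidean distance between the two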

Once the distance is calculated, each point is assigned to the closest centroid.

We now have a first version of what these clusters could look like, but we’re not done. Once we’ve arrived at this first version of our clusters, each centroid is moved to the absolute center of the group of points assigned to it. The Euclidean distances are then recalculated, and in the event that a given point is now closer to a different centroid, it is reassigned accordingly. This process repeats until the centroids reach stability and points are no longer being reassigned. The combination of points assigned to a given centroid is what comprises each cluster.
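
To make that loop concrete, here is a minimal from-scratch sketch of the procedure in R. It is purely illustrative (we’ll use R’s built-in kmeans function for the real work below), the function name simple_kmeans is my own, and it naively assumes no cluster ever ends up empty:

simple_kmeans <- function(data, k, max_iter = 100) {
  data <- as.matrix(data)

  # Step 1: randomly pick k observations to act as the initial centroids
  centroids <- data[sample(nrow(data), k), , drop = FALSE]

  for (i in seq_len(max_iter)) {
    # Step 2: Euclidean distance from every point to every centroid
    dists <- sapply(seq_len(k), function(j)
      sqrt(rowSums(sweep(data, 2, centroids[j, ])^2)))

    # Step 3: assign each point to its closest centroid
    clusters <- apply(dists, 1, which.min)

    # Step 4: move each centroid to the mean of its assigned points
    new_centroids <- t(sapply(seq_len(k), function(j)
      colMeans(data[clusters == j, , drop = FALSE])))

    # Step 5: stop once the centroids are stable (no more reassignment)
    if (all(abs(new_centroids - centroids) < 1e-9)) break
    centroids <- new_centroids
  }

  list(cluster = clusters, centers = centroids)
}

Calling simple_kmeans(iris[, 3:4], k = 3) should recover roughly the same groupings as the built-in version, though the built-in kmeans adds smarter initialization and safeguards that this naive sketch skips.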

Let’s build our first model

A bit of exploration to start

For this analysis, we’ll use the classic Iris dataset, as I mentioned earlier. This dataset has been used time and time again to teach the concept of clustering. Edgar Anderson collected sepal & petal length/width data across three species of iris. If you’re interested in more information on this, check out the Wikipedia explanation (https://en.wikipedia.org/wiki/Iris_flower_data_set).

Quick EDA

library(dplyr)  # provides glimpse(), plus mutate() which we’ll use later

head(iris)      # first six rows of the dataset
glimpse(iris)   # compact overview of each column’s type & values

We are going to visualize petal length & width in a scatter plot, with species overlaid as the color.

library(ggplot2)  # for ggplot() & geom_point()

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = factor(Species))) +
  geom_point()

Let’s do the same for a couple more variable combinations just for fun.

ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = factor(Species))) +
  geom_point()
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = factor(Species))) +
  geom_point()

We could go through this same exercise exhaustively for each combination of variables, but for the sake of brevity, we’ll carry on.
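
As an aside, if you ever do want the exhaustive view, the ggpairs() function from the GGally package (a separate package you’d need to install) plots every pairwise combination in one call:

library(GGally)  # install.packages("GGally") if needed

ggpairs(iris, aes(color = Species))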

Building our first clustering model

set.seed(3)  # fix the random number generator so the random centroid placement is reproducible
k_iris <- kmeans(iris[, 3:4], centers = 3)  # cluster on columns 3 & 4: petal length & width

When we set the seed to a given number, we ensure we can reproduce our results. This matters here because kmeans places its initial centroids randomly, so runs with different seeds can yield different cluster labels.

We call the kmeans function & pass the relevant data & columns. In this case, we are using the petal length & width to build our model. We declare 3 centers as we know there are three different species.

If you then call your cluster model by name, you will get an output akin to the following:
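
k_iris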

A few useful pieces of information worth highlighting:

  • the number of clusters, as previously determined by the centers parameter
  • the means of each variable within each naturally determined cluster
  • “Within cluster sum of squares” – the sum of squared distances between each point and its cluster’s centroid
  • Available components – here the model delivers up a handful of other pieces of information. We’ll leverage the “cluster” component here shortly!
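
Each of these components is just a named element of the list that kmeans returns, so you can pull any of them out with the $ operator:

k_iris$centers       # coordinates of the final centroids
k_iris$cluster       # the cluster assigned to each row of the data
k_iris$tot.withinss  # total within-cluster sum of squares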

Performance Assessment

Here we are allowing an unsupervised machine learning method to identify naturally occurring groups within our data, but we also have access to the actual species, so let’s assess the algorithm’s performance!

A very simple way is to look at a table of species versus assigned cluster. Here you’ll see that we reference the model object and pull out the cluster determined for each record:

table(k_iris$cluster, iris$Species)
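
The cluster numbers themselves are arbitrary labels that depend on the random seed, so yours may be ordered differently; based on the assignments described below, the table comes out along these lines:

    setosa versicolor virginica
  1      0          2        46
  2     50          0         0
  3      0         48         4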

Here we can see that all of the setosa species were placed in the same cluster, with no other species added to that group. For versicolor, kmeans accurately captured 48/50 of the versicolor in cluster 3, and 46/50 of the virginica in cluster 1.

We can very easily assign these cluster labels to our iris dataset with the mutate function from the dplyr library (which we loaded earlier):

iris <- mutate(iris, cluster = k_iris$cluster)  # append each row’s cluster assignment as a new column

Now let’s recreate our first scatter plot, swapping out species for cluster:

ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = factor(cluster))) +
  geom_point()

Above we can see a near-identical representation of the natural species groupings across these two variables.

We could certainly test a variety of variable combinations to arrive at an even better approximation of each species of iris, but now you have the foundational tools needed to do exactly that!
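
For example, a quick sketch of refitting on all four measurements instead of just the petal columns (your exact results will vary with the seed):

set.seed(3)
k_iris_all <- kmeans(iris[, 1:4], centers = 3)  # cluster on all four measurements
table(k_iris_all$cluster, iris$Species)         # compare against the true species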

Conclusion

In a few short minutes we’ve learned how Kmeans clustering works and how to implement it:

  • Clustering is an unsupervised learning method
  • Kmeans is among the most popular clustering techniques
  • K is the predetermined number of clusters
  • How the algorithm actually works
  • How to create our own cluster model

I hope this quick post on the practical application of Kmeans proves useful to you, whatever you end up applying the technique to.

Happy data science-ing!
