Principal Component Analysis in R

Hi there! Welcome to my blog on principal component analysis in R.

Purpose:

PCA is a dimensionality reduction technique. Think of each variable you include in your modeling process as one dimension; the more variables you add, the more dimensions you’re working with.

What does it do?:

In terms of what PCA actually does, it takes a dataset with high dimensionality and reduces it down to a handful of uncorrelated components. The thing to think about is the idea of an uncorrelated component. If I am predicting customer revenue using their income (if I happened to have that) and I then add a variable representing the median income in their zip code, those two variables are very likely correlated, and the second one is unlikely to meaningfully improve my model’s ability to accurately predict revenue.

The idea I just described is multicollinearity: the idea that one predictor variable could itself be predicted by the other predictor variables. To tie it back to the example, actual income could be decently predicted by the median income in the customer’s area; the two would likely be collinear.
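If you want a quick gut check on that in R, a correlation matrix makes the collinearity easy to spot. Here’s a rough sketch using mtcars, the dataset we’ll work with below:

# Pairwise correlations in mtcars -- note how strongly cyl, disp, hp, and wt move together
data(mtcars)
round(cor(mtcars), 2)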

Getting Started:

How are components determined?

Each component is determined because it accounts for the most variance, and each subsequent component is chosen according to the next greatest portion of variance accounted for. So component one will account for the most variance, component 2 will account for the second most variance, and so forth.

Steps:

  • Scale the dataset
  • Create the PCA object with prcomp
  • Print the eigenvalues

First things first, load up the R dataset, mtcars

data(mtcars)

Next, PCA works best with continuous numeric data, so you’ll want to filter out any variables that are really categorical. In mtcars, vs and am are stored as numbers but are just binary flags, so we’ll use the dplyr select function to remove them.

library(dplyr)

mtcars <- mtcars %>% select(-c(vs, am))

Let’s throw the updated, numeric-only dataset into the PCA function, prcomp.

pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

Something to keep in mind here is scaling. The purpose of scaling is to standardize the variance across variables. Why would someone care, you ask? Because each component is trying to account for as much variance as possible; without scaling, your high-variance variables will be over-represented in the components.
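To see why this matters for our data specifically, compare the raw variances of the variables; this is just an illustrative check, not part of the PCA itself:

# Raw variances before scaling -- disp dwarfs everything else,
# so without scaling it would dominate the first component
round(apply(mtcars, 2, var), 2)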

Now, how many components should we keep?:

Before we kick off this section, remember that we’re trying to reduce our dimensions, so the goal is to explain as much variance as we can with as few components as possible.

I am going to give you three main things to consider when determining how many components to keep.

For the first one, follow along!

summary(pca)

What we’re looking for here is the proportion of variance. As we’ve been talking about, the idea is all about explaining variation; if you follow along, you’ll see that the first component, PC1, accounts for 62.8% of the variation, PC2 for 23%, and so forth.

With this information, you can make a judgment call about how much variance you need explained by the fewest components. You might decide that 63% from the first component alone is enough, or take the first two for 86%, or three for 91%.
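If you’d rather pull the cumulative numbers out programmatically instead of reading them off the summary, the importance matrix that summary() returns has them; a small convenience sketch:

# Cumulative proportion of variance explained by each component
summary(pca)$importance["Cumulative Proportion", ]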

One helpful tool for considering a problem like this is a scree plot.
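There are packages that will draw one for you, but a bare-bones version in base R only takes a couple of lines; here’s a minimal sketch using the pca object from above:

# Bare-bones scree plot: proportion of variance explained per component
pve <- pca$sdev^2 / sum(pca$sdev^2)
plot(pve, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained")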

You’ll see that on the scree plot, I’ve plotted the component on the x-axis and the importance of that component on the y-axis. What we’re seeing above is that after the second component there is a significant drop-off in the incremental impact of each additional component. While one might reason that the impact of the third component is significant enough to be worth including, it would be unlikely one would reason to add more than that.

The next evaluation you might make is what’s called the Kaiser-Guttman criterion. The rule here is simple enough: only keep components whose eigenvalue is greater than 1.

If you don’t know how to get the eigenvalues off the top of your head, come on a quick journey with me. The eigenvalue of each component is simply its variance, which you get by squaring the standard deviations stored in your pca object, as seen below.

pca$sdev ^ 2

As mentioned earlier, you would eliminate any component that doesn’t have an eigenvalue greater than 1. The idea behind this is that if the eigenvalue is less than 1, the component accounts for less variance than a single original variable contributed.
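As a quick sketch of applying that rule in code, again using the pca object from above:

# Keep only the components whose eigenvalue (variance) exceeds 1
which(pca$sdev^2 > 1)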

Based on the methods we’ve used, it’s clear that our route here is to use the first two components only.

Understanding your components:

It’s important to know what your components are made up of. In other words, which variables contribute to any given component?

print(pca$rotation)

Looking at the various contributions to the first component, I might suggest that it largely represents car power or performance, with variables like cyl, hp, wt, and disp contributing the most.
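If eyeballing the full rotation matrix feels clumsy, you can sort the absolute loadings for a single component; a rough sketch for PC1:

# Variables ordered by the size of their contribution to PC1
sort(abs(pca$rotation[, "PC1"]), decreasing = TRUE)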

Another thing you can do to easily see the grouping of different variables is a biplot.

pca %>% biplot(cex = .5)

As I mentioned previously, you can see cyl, hp, disp, and wt all grouped together.

Compare your components to the original variables using linear regression:

In order to determine the potential predictive impact of using the original variables versus the principal components, let’s run a regression with each group and compare the r-squared of each model.

For each regression, we’ll try to predict MPG using every other variable in the mtcars dataset.

We’ll now create the first model predicting with all variables as is.

fit_1 <- lm(mpg ~ ., data = mtcars)

Now let’s create the next dataset, containing mpg and the first two components.

components <- cbind(mpg = mtcars[, "mpg"], pca$x[, 1:2]) %>%
  as.data.frame()

Now we’ll train the second model using these components.

fit_2 <- lm(mpg ~ ., data = components)

Now for the moment of truth, let’s compare the r-squared of each model!

summary(fit_1)$adj.r.squared

summary(fit_2)$adj.r.squared
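If you prefer to see the comparison side by side, a tiny data frame does the trick; just a convenience sketch:

# Side-by-side comparison of adjusted r-squared for the two models
data.frame(model  = c("all variables", "first two components"),
           adj_r2 = c(summary(fit_1)$adj.r.squared,
                      summary(fit_2)$adj.r.squared))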

As you can see above, there’s a not-so-insignificant bump in the model’s adjusted r-squared when using just the two components.

Conclusion

I hope this was helpful! The final thing to keep in mind is that the whole purpose of PCA here is to simplify your feature space without losing too much predictive ability. Happy Data Science-ing!
