The dplyr package has a rich set of tools & functions that you can use for data wrangling, exploratory data analysis, feature engineering, and the like.
In the next few minutes, we’ll run through the functions that are absolutely pivotal and that you’ll find yourself using every day as a data scientist.
Select: Surface the Variables That Matter
Functions available to you in dplyr relate very closely to what you might use with SQL or even Python depending on what tasks you are performing.
select function in
dplyr acts very similar to how you would specify a columns that you want in a
SQL select statement,
The major difference here is that with sql your code might look like this… (and that you wont include aggregation functions in
dplyr select, but more on that later)
SELECT mpg, cyl, carb FROM mtcars
dplyr we first start by specifying the dataset we want to manipulate, we then use what’s known as a pipe operator
%>% to effectively say, “hey this is our dataset and the next action or function I call out should be done to it!”
So we’ll effectively see the function flip with a slight change in syntax. To continue with that example, “Hey mtcars is my dataset, and I’d like to select mpg, cyl, and carb from it.”
You can pipe from one function to another. It’s super convenient and easy to get used to. More on that in a moment.
mtcars %>% select(mpg, cyl, carb)
I personally prefer the approach of dplyr when it comes to data manipulation. It’s a lot easier to logically breakdown any data manipulation you may be performing or reviewing because you pipe function to function effectively recreating your dataset in each line.
SQL your output would look something like this:
You can think of filter as dplyr’s answer to SQL’s where statement. If you don’t think in SQL, this is pretty much how you get rid of anything that doesn’t fulfill your desired criteria.
For instance, let’s say you only want to look at cars that have 4 cylinders.
You would declare
mtcars %>% then call the
filter function containing the criteria you care about.
mtcars %>% filter(cyl == 4)
Your output will look like this:
You’ll see that all of the records contain 4 cylinders.
Other Conditional Operators
You can use other conditional operators as well, like
>. You can do this on strings, logicals, etc.
mtcars %>% filter(cyl >= 4)
We can now see a variety of examples where the
cyl column is greater than or equal to 4.
Imagine a scenario where, yes you want to filter to records where
cyl is greater than or equal to 4… great; but you also want to filter down to records where
mpg is less than 20.
You can combine those criteria using
& means that both criteria have to be fulfilled, while
| will include records where at least one of the two criteria are fulfilled.
You aren’t limited to two or three or four criteria. You can chain together as many filters as you’d like!
Check out these examples!
mtcars %>% filter(cyl >= 4 & mpg < 20)
mtcars %>% filter(cyl >= 4 | mpg < 20)
Off the top of your head which code snippet do you think will return more records?
The second one will always return more (unless all vehicles with 4 cylinders or more all have less than 20
mpg, then they’d return the same number of records.)
This is because the filter is less strict. Only one of the two criteria have to be fulfilled, versus both having to be fulfilled with
You can also separate criteria using a comma and it will behave the same as the
Don’t Forget to Pipe Your Functions Together
So far you’ve learned about two key functions,
filter. You very likely will want to combine the outputs in succession. The secret to this is the pipe operator.
After you perform the
select operation, you now have a new dataset. It’s
mtcars plus whatever operation you did on top. When you follow with the pipe operator (
%>%), you effectively tell
R that the next function will be acting as a continuation of that lines output! Check out below how we combine those last two actions.
mtcars %>% select(cyl, mpg, carb)%>% filter(cyl >= 4 & mpg < 20)
For instance, if you didn’t include the
gear field in your
select function and then piped that output into a filter that said
gear == 4, it would give you this error:
It does this because
mtcars is no longer the dataset you’re working with, it’s the version of
mtcars that only includes the three columns listed in the
Always Feel Free to Save What You’ve Done
Before we move onto other functions you can use, it’s also worth telling you how to effectively save your dataset.
You’ll do it the same way you’d declare any variable in
mtcars_sub <- mtcars %>% select(cyl, mpg, carb)%>% filter(cyl >= 4 & mpg < 20)
Arrange: Let’s Get Orderly!
Alright, now that that’s out of the way.. lets jump into the
To continue relating each function to its
arrange is the
ORDER BY equivalent.
mtcars %>% filter(cyl >= 4)%>% arrange(mpg)
After we filter we then pipe into an
arrange function where we include
mpg. We will now see everything sorted smallest to largest
Now lets say you want to reverse the order on this looking at
mpg highest to lowest. All you will do is throw in a
mtcars %>% filter(cyl >= 4)%>% arrange(desc(mpg))
We now see the same output sorted from greatest to least
mutate function is how we go about creating a new column. This could be adding multiple column values together, creating categorical buckets according to a columns values, binning categoricals to be less granular, or additional columns for whatever else.
I’m going to jump into an example here, but keep in mind that my familiarity with cars is limited… so my understanding of the utility of the metric I’m about to create is a little naive.
Let’s say you want to know what the horsepower per cylinder would be. You’d declare the new variable name, and then include the logic for how to calculate this new field.
mtcars %>% mutate(hp_per_cyl = hp / cyl)
Group_by & Summarize
To wrap this up, the final major
dplyr operations to cover is its grouping & aggregating functionality.
In SQL when performing an aggregation, you use
GROUP BY to declare your grouping variables. However the logic for aggregation (mean, sum, max, min, count, etc.) sits in the
SELECT statement– which we visited at the beginning.
dplyr, we first will pipe into the
group_by() function specifying our grouping variables. A little different, but honestly a lot simpler to follow. Then all logic for aggregation follows in the
summarize function that you pipe into right after.
For this example, lets group by the cylinder count & then look at the mean miles per gallon and horse power.
You’ll see we pipe into
cyl, and then pipe into
summarise— where we first declare the name of our new aggregate field and then perform the operation.
mtcars %>% group_by(cyl)%>% summarize(mean_mpg = mean(mpg), mean_hp = mean(hp))
As you may have guessed, higher cylinder count is inversely related to miles per gallon, and relates closely with horse power.
Realistically, you would probably just evaluate the correlation across these specific variables, but it serves to show operation of
Dplyr is an incredibly useful library that includes the functionality of other data wrangling languages and its format for piping one function to another allows you to keep track of what otherwise could be an incredibly confusing query.
Dplyr also offers much much more that we didn’t get a chance to detail here, but what we’ve covered will serve as an excellent foundation as you branch into some of
dplyr‘s more complex functionality like
summarise_if, and many others.
Let me know what you liked and what you’d like to learn more about!
As always, happy data science-ing!