Translating Core Dplyr Functions to Python

Introduction

You may have a tool belt full of R tools and, for whatever reason, need to reproduce the same functionality in Python. On the surface, the jump from the convenience and simplicity of R can seem a bit daunting: the Python landscape, while ample, often offers what feels like too many translations for a given piece of functionality. Finding a straightforward translation is not always easy and may leave us running back to the safety of the tidyverse.

Through this series we’ll aim to simplify the translations between R and Python and give a bit of rationale for why we use what, without going overboard of course.

What you’ll learn

For this article, we’ll be breaking down primary translations for R’s standout library, dplyr.

By reading this article, you can expect to learn the primary Python translations for the following functions:

  • filter
  • mutate
  • select
  • rename
  • group_by
  • summarise

Let’s talk python

If you’d like to follow along, I’ll be using the classic Iris dataset that ships with sklearn.

You can access iris in a pandas dataframe like so:

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["target"] = iris.target

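When it comes to Python for data scientists, pandas is a mainstay. Just keep in mind that it is not a one-to-one replacement for dplyr.

One heads-up: load_iris labels the measurement columns like "sepal length (cm)", while the examples below use shorter names such as sepal_length. Rename the columns first (see the Rename section) if you want to run the examples as written.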
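The later examples refer to columns such as sepal_length, whereas load_iris returns labels like "sepal length (cm)". Here is a minimal sketch of one way to get snake_case names in a single step. The small frame is a hypothetical stand-in with made-up rows so the snippet runs without sklearn; the same list comprehension should work on the real iris_df.

```python
import pandas as pd

# Toy stand-in for iris_df, using the column labels load_iris returns.
# The values are illustrative, not the full dataset.
iris_df = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, 6.3],
    "sepal width (cm)": [3.5, 3.0, 3.3],
    "petal length (cm)": [1.4, 1.4, 6.0],
    "petal width (cm)": [0.2, 0.2, 2.5],
    "target": [0, 0, 2],
})

# Drop the " (cm)" suffix and swap spaces for underscores.
iris_df.columns = [
    name.replace(" (cm)", "").replace(" ", "_") for name in iris_df.columns
]
clean_names = list(iris_df.columns)
```

Assigning to iris_df.columns replaces every label at once, which is handy when all the names follow the same pattern.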

Pipes

Coming to terms with the lack of a good alternative to the tidyverse’s pipes is a key step in building better muscle memory in Python. Before you yell at me: yes, I know there are various alternatives in Python that aim to create parity, but I feel that they fall short. Please let me know if you disagree and what your go-to is. Perhaps in a future post I’ll talk a bit more about this and enumerate which I think is… best.
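For reference, the closest built-in option is pandas’ own method chaining. The sketch below is only an illustration of that style, not a claim that it matches %>%. It assumes a small hypothetical frame with snake_case column names, and it uses the pandas methods query and assign.

```python
import pandas as pd

# Hypothetical toy data with snake_case column names.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "sepal_width": [3.5, 3.0, 3.3],
})

# Wrapping the chain in parentheses lets each step sit on its own line,
# which reads a little like a dplyr pipeline.
result = (
    df
    .query("sepal_length > 5")  # roughly filter()
    .assign(sepal_ratio=lambda d: d["sepal_length"] / d["sepal_width"])  # roughly mutate()
    .loc[:, ["sepal_length", "sepal_ratio"]]  # roughly select()
)
```

Each method returns a new dataframe, so df itself is left unchanged.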

Now on to dplyr functions!

Filter

Filter is critical to almost any data science workflow, enabling users to subset their data according to certain criteria.

It is one of the most used functions that also has a ton of translations; not to worry, as promised, we’ll keep this simple.

Below I’ll drop an example in each language and explain:

R
iris_df %>%
   filter(sepal_length > 5)
Python
iris_df[iris_df['sepal_length'] > 5]
Breakdown

I am a big fan of boolean indexing (the above method) for the following reasons:

  • It’s quite similar to what you’ve already seen in R. The comparison iris_df['sepal_length'] > 5 is valid in both R and Python and gives you a set of TRUE/FALSE values in each
  • This method is fast! The comparison is vectorized, so there is no explicit loop. Note that the result is a new dataframe and the original is left untouched
  • You don’t have to deal with the nuances of methods like loc or iloc

As you continue to add to your skills in Python, certainly enrich what you know; but I am a major proponent of finding common ground with what you already understand, especially at first.
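Boolean indexing extends to more than one condition. Here is a minimal sketch, assuming a hypothetical toy frame with snake_case column names. The points to remember are that pandas uses & and | rather than and / or, and that each comparison needs its own parentheses.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative.
iris_df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.3, 2.7],
})

# dplyr: filter(sepal_length > 5, sepal_width > 3)
both = iris_df[(iris_df["sepal_length"] > 5) & (iris_df["sepal_width"] > 3)]

# dplyr: filter(sepal_length > 6 | sepal_width < 3)
either = iris_df[(iris_df["sepal_length"] > 6) | (iris_df["sepal_width"] < 3)]
```

Without the parentheses, Python’s operator precedence applies & before the comparisons, which usually raises an error.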

Mutate

In dplyr, mutate is how we go about creating new columns, typically via calculations on data that is already in our dataframe.

For this example, let’s create a column that represents the ratio of sepal length to width:

R
iris_df <- iris_df %>%
    mutate(sepal_ratio = sepal_length /  sepal_width)
Python
iris_df['sepal_ratio'] = iris_df['sepal_length'] / iris_df['sepal_width']
Breakdown

What you see above is fairly self-explanatory. The main difference is that bracket assignment changes iris_df in place instead of being one step in a chain. A function that shows up often alongside mutate is ifelse. I won’t go into it in detail here, but np.where is my preferred alternative to ifelse.
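As a quick illustration of np.where standing in for ifelse, here is a minimal sketch. The frame, the size column, and the "long" / "short" labels are hypothetical and chosen only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
iris_df = pd.DataFrame({"sepal_length": [5.1, 4.9, 6.3]})

# dplyr: mutate(size = ifelse(sepal_length > 5, "long", "short"))
# np.where(condition, value_if_true, value_if_false) works element-wise.
iris_df["size"] = np.where(iris_df["sepal_length"] > 5, "long", "short")
```

The argument order (condition, then the "true" value, then the "false" value) mirrors ifelse, which makes it easy to carry over.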

Select

Select is exactly what you would think: it’s our way to choose specific columns from a dataframe.

In the below examples, we’ll pretend we only want the sepal_length & sepal_width columns:

R
iris_df <- iris_df %>%
    select(sepal_length, sepal_width)
Python
iris_df[['sepal_length', 'sepal_width']]
Breakdown

As you can see, dplyr allows us to simplify a lot of the syntax of column subsetting or selecting, but fortunately the Python alternative is quite simple as well. One note: when selecting multiple columns you need the double brackets shown above. When selecting a single field, only one set of brackets is necessary.
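The single versus double bracket distinction also changes what you get back. Here is a minimal sketch on a hypothetical toy frame.

```python
import pandas as pd

# Hypothetical toy data.
iris_df = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "sepal_width": [3.5, 3.0],
    "target": [0, 0],
})

one_col = iris_df["sepal_length"]  # single brackets: a pandas Series
one_col_df = iris_df[["sepal_length"]]  # double brackets: a one-column DataFrame
two_cols = iris_df[["sepal_length", "sepal_width"]]  # like select(sepal_length, sepal_width)
```

The inner brackets are just a Python list of column names, so a list built elsewhere works too.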

Rename

Renaming Columns

The rename function in dplyr is used to change the names of columns in a data frame. The equivalent method in pandas is also called rename.

R
iris %>%
  rename('sepal_length' = Sepal.Length)
Python
iris_df.rename(columns={"sepal length (cm)": "sepal_length"})
Breakdown

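Here is one method that won’t introduce too many alternatives. The key things to recall are 1. the curly-brace and colon (dictionary) format, and 2. that the names go old to new, the reverse of dplyr’s new = old. Also note that rename returns a new dataframe, so assign the result back if you want to keep it.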
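To make the old-to-new direction concrete, here is a minimal sketch that renames two columns and keeps the result. The toy frame and the variable name renamed are hypothetical.

```python
import pandas as pd

# Toy frame with the raw labels load_iris produces (values are illustrative).
iris_df = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9],
    "sepal width (cm)": [3.5, 3.0],
})

# rename returns a new dataframe; iris_df keeps its original labels
# unless the result is assigned back to it.
renamed = iris_df.rename(columns={
    "sepal length (cm)": "sepal_length",  # old name: new name
    "sepal width (cm)": "sepal_width",
})
```

Keys that don’t match an existing column are ignored by default, so a typo in the old name fails silently. It is worth checking the resulting column names.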

Group by & Summarise

Sometimes the methods here can be a bit nuanced, mainly for the sake of specifying column names. Now sure, no matter the method you use, it’s simple to memorize the syntax… but it doesn’t take away from the fact that some of these aggregation methods feel syntactically clunky.

That said, we’re going to go with an approach that reuses a bit of what we’ve already seen (bracket syntax to specify column names, lengthy but intuitive) and stays as parallel as possible to what we know from dplyr.

R
mtcars %>% group_by(cyl) %>% summarise(mean_mpg = mean(mpg))
Python
mtcars.groupby("cyl")["mpg"].mean()
Breakdown

In the R code, we use group_by to group the data by the cyl column and summarise to calculate the mean mpg for each group. The Python code assumes mtcars has already been loaded into a pandas dataframe. There, we use the groupby method to group by cyl, pick the column we’d like to aggregate with the same bracket syntax we used for select, and finally call the aggregation method.

One limitation is that this approach doesn’t let you name the new column on the spot, which is sometimes needed. The result is also a Series indexed by cyl rather than a dataframe. If you go this route, call reset_index() to get a dataframe back and then use the rename function we reviewed above.
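If naming the column on the spot matters, pandas’ named aggregation is one option. The sketch below is offered as an alternative, not as the approach used above. The mtcars frame is a hypothetical stand-in with made-up values, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for mtcars; the numbers are made up for illustration.
mtcars = pd.DataFrame({
    "cyl": [4, 4, 6, 6, 8],
    "mpg": [30.0, 20.0, 21.0, 19.0, 15.0],
})

# The approach from the example above: returns a Series indexed by cyl.
mean_by_cyl = mtcars.groupby("cyl")["mpg"].mean()

# Named aggregation: new_column=(source_column, aggregation).
# reset_index() turns the cyl index back into a regular column.
named = (
    mtcars
    .groupby("cyl")
    .agg(mean_mpg=("mpg", "mean"))
    .reset_index()
)
```

The keyword on the left of the equals sign becomes the new column name, which is about as close as pandas gets to summarise(mean_mpg = mean(mpg)).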

Summary

Thanks for taking the time to learn with us. In just a few minutes we covered a lot of ground, so let’s review some of the key takeaways:

  • Python is powerful, widely used, and easy to integrate with a myriad of other technologies
  • We learned key Python translations for each of the following functions:
    • filter
    • mutate
    • select
    • rename
    • group_by
    • summarise

While these approaches are simple and recommended, as your expertise and needs evolve, so will the methods you use. Try new things and be sure to let me know what works best for you!
