Using Mutate to Feature Engineer a New Categorical
Among the most helpful functions from dplyr is mutate; it allows you to create new variables, typically by layering some logic on top of the other variables in your dataset.
Let’s say that you’re analyzing user data and you want to categorize users according to usage volume.
You decide that you want four tiers: inactive users, limited users, healthy users, & active users.
Your code might look something like this. To create the new user_type variable, we use the mutate function, declare the new variable name, then leverage ifelse to determine which value applies to each band. As you can see below, if actions within the app are 5 or fewer, the lowest threshold, then we’ll call them inactive. If that criterion is not met, we move on to the next ifelse statement, which establishes the next set of criteria.
    library(dplyr)  # provides mutate and %>%

    df %>%
      mutate(user_type = ifelse(
        app_actions <= 5, 'inactive user',
        ifelse(
          app_actions <= 10, 'limited user',
          ifelse(app_actions <= 15, 'healthy user', 'active user')
        )
      ))
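If you want to run the nested ifelse approach end to end, here is a minimal sketch. The data frame is made up purely for illustration (the user_id column and the action counts are my own assumptions, not real data); only app_actions and user_type come from the example above.

```r
library(dplyr)

# Hypothetical sample data, invented for illustration only.
df <- data.frame(
  user_id     = 1:6,
  app_actions = c(0, 5, 6, 10, 15, 40)
)

# Each ifelse handles one band and hands everything else to the next one.
tiered_ifelse <- df %>%
  mutate(
    user_type = ifelse(
      app_actions <= 5, "inactive user",
      ifelse(
        app_actions <= 10, "limited user",
        ifelse(app_actions <= 15, "healthy user", "active user")
      )
    )
  )

print(tiered_ifelse)
```

Note that the boundary values 5, 10, and 15 land in the lower tier, since every comparison uses <=.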
While ifelse is a staple and very useful, when you start nesting too many ifelse calls, a couple of problems arise.
- Messy code that is hard to interpret & edit
- You write a lot of redundant code
I should add that what I wrote above wasn’t too crazy, but you can very quickly end up needing double-digit numbers of ifelse statements, which creates the exact problems we’re talking about.
case_when to Save The Day
In many ways, R presents a lot more flexibility than SQL, but with that said, one SQL construct that many miss, unnecessarily, is CASE WHEN. Luckily for us, case_when is actually a thing in R, courtesy of dplyr.
Check out the exact same code snippet written with case_when:
    df %>%
      mutate(user_type = case_when(
        app_actions <= 5  ~ 'inactive user',
        app_actions <= 10 ~ 'limited user',
        app_actions <= 15 ~ 'healthy user',
        TRUE              ~ 'active user'
      ))
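To convince yourself that the two versions agree, here is a small runnable sketch that builds the column both ways and compares them. The sample data frame is hypothetical (invented for this check), and it deliberately contains no missing values.

```r
library(dplyr)

# Hypothetical sample data, invented for illustration only.
df <- data.frame(
  user_id     = 1:6,
  app_actions = c(0, 5, 6, 10, 15, 40)
)

# Nested ifelse version.
with_ifelse <- df %>%
  mutate(
    user_type = ifelse(
      app_actions <= 5, "inactive user",
      ifelse(
        app_actions <= 10, "limited user",
        ifelse(app_actions <= 15, "healthy user", "active user")
      )
    )
  )

# case_when version: one condition ~ value pair per line.
with_case_when <- df %>%
  mutate(
    user_type = case_when(
      app_actions <= 5  ~ "inactive user",
      app_actions <= 10 ~ "limited user",
      app_actions <= 15 ~ "healthy user",
      TRUE              ~ "active user"
    )
  )

# Both approaches assign the same tier to every row of this sample.
same_result <- identical(with_ifelse$user_type, with_case_when$user_type)
print(same_result)
```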
Again, this is a very simple example, but when you have to handle twenty condition/value combinations, this saves a lot of time and improves clarity & readability. The main difference here is that the left side is effectively reserved for conditions, the ~ sign operates as the divider between condition & value, and on the right is the value assigned when the condition is met. Conditions are checked from top to bottom, and each row gets the value of the first condition it matches, which is why each band only needs an upper bound. As a final note on this, TRUE acts as a final catchall, akin to an else statement.
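Two details of case_when are worth seeing in action: the order of the conditions matters, and the TRUE catchall also picks up missing values. The sketch below uses a small made-up vector (the values are my own assumption, chosen to hit each case) rather than a full data frame.

```r
library(dplyr)

# Hypothetical values, invented for illustration; note the missing one.
app_actions <- c(3, 12, NA)

# Conditions are checked top to bottom; the first match wins.
ordered <- case_when(
  app_actions <= 5  ~ "inactive user",
  app_actions <= 10 ~ "limited user",
  app_actions <= 15 ~ "healthy user",
  TRUE              ~ "active user"
)

# Putting the widest band first swallows the narrower ones:
# 3 now matches `<= 15` before it ever reaches `<= 5`.
reversed <- case_when(
  app_actions <= 15 ~ "healthy user",
  app_actions <= 5  ~ "inactive user",
  TRUE              ~ "active user"
)

# A comparison against NA is NA, which case_when treats as "no match",
# so the NA row falls through to the TRUE catchall in both versions above.
# Handling NA explicitly, before the other conditions, keeps it missing.
na_guarded <- case_when(
  is.na(app_actions) ~ NA_character_,
  app_actions <= 5   ~ "inactive user",
  app_actions <= 10  ~ "limited user",
  app_actions <= 15  ~ "healthy user",
  TRUE               ~ "active user"
)

print(ordered)
print(reversed)
print(na_guarded)
```

This is one place where the two approaches can differ: ifelse returns NA when its test is NA, while the TRUE catchall in case_when will happily label a missing value as an active user unless you guard against it.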
In short, while ifelse statements have their place and are incredibly useful, case_when makes a simple & easy to interpret alternative when you may be confronting a myriad of nested ifelse statements.
I hope you find this useful in all of your feature engineering endeavors! If you found this useful and enjoyable, come check out some of our other data science posts at datasciencelessons.com! Happy data science-ing!