Using Mutate to Feature Engineer a New Categorical
Among the most helpful functions from dplyr is mutate; it allows you to create new variables, typically by layering some logic on top of the other variables in your dataset.
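As a minimal sketch of that idea, the snippet below adds one derived column to a small data frame. The data and the column names (user_id, app_actions, days_active, actions_per_day) are made up for illustration and are not from any real dataset.

```r
library(dplyr)

# Hypothetical sample data; the column names are illustrative only
df <- tibble(
  user_id     = 1:4,
  app_actions = c(2, 8, 14, 30),
  days_active = c(1, 4, 7, 10)
)

# mutate() adds a new column computed from existing ones
df_with_rate <- df %>%
  mutate(actions_per_day = app_actions / days_active)

df_with_rate
```

The original columns are left untouched; mutate simply appends the new one to the result.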
Quick Example
Let’s say that you’re analyzing user data and you want to categorize users according to usage volume.
You decide that you want four tiers: inactive users, limited users, healthy users, and active users.
Your code might look something like this. To create the new user_type variable, we use the mutate function, declare the new variable name, then leverage ifelse to decide which value applies to each band. As you can see below, if actions within the app are 5 or fewer, the lowest threshold, then we’ll call them inactive; if that condition is not true, we move on to the next ifelse statement, which establishes the next condition.
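library(dplyr)  # provides mutate() and the %>% pipe

# Each nested ifelse() handles the next band up
df %>%
  mutate(user_type = ifelse(
    app_actions <= 5, 'inactive user', ifelse(
      app_actions <= 10, 'limited user', ifelse(
        app_actions <= 15, 'healthy user', 'active user'
      )
    )
  )
  )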
While ifelse is a staple and very useful, a couple of problems arise when you start nesting too many ifelse calls:
- Messy code that is hard to interpret & edit
- You write a lot of redundant code
I should add that what I wrote above wasn’t too crazy, but you can very quickly end up needing a double-digit number of ifelse statements, which creates the exact problems we’re talking about.
case_when to Save the Day
In many ways, R offers a lot more flexibility than SQL, but one SQL construct that many people miss, unnecessarily, is CASE WHEN. Luckily for us, dplyr provides an equivalent in R: case_when.
Check out the same logic written with case_when.
df %>%
  mutate(user_type = case_when(
    app_actions <= 5  ~ 'inactive user',
    app_actions <= 10 ~ 'limited user',
    app_actions <= 15 ~ 'healthy user',
    TRUE              ~ 'active user'
  )
  )
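To check that the two versions really do agree, here is a small self-contained sketch that runs both on the same data. The sample values are invented for illustration and chosen to sit on and around each threshold; type_ifelse and type_case_when are hypothetical column names.

```r
library(dplyr)

# Hypothetical sample data covering each tier and the boundary values
df <- tibble(app_actions = c(0, 5, 6, 10, 11, 15, 16, 40))

compared <- df %>%
  mutate(
    # Nested ifelse() version
    type_ifelse = ifelse(
      app_actions <= 5, 'inactive user', ifelse(
        app_actions <= 10, 'limited user', ifelse(
          app_actions <= 15, 'healthy user', 'active user'
        )
      )
    ),
    # case_when() version
    type_case_when = case_when(
      app_actions <= 5  ~ 'inactive user',
      app_actions <= 10 ~ 'limited user',
      app_actions <= 15 ~ 'healthy user',
      TRUE              ~ 'active user'
    )
  )

# TRUE when both columns agree on every row
identical(compared$type_ifelse, compared$type_case_when)
```

This sample has no missing values. If app_actions can be NA, the two versions are worth comparing separately, since ifelse passes the NA through while the TRUE catchall in case_when does not.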
Again, this is a very simple example, but when you have twenty condition/value combinations, case_when saves a lot of time and adds clarity & readability. The main difference here is that the left side of each line is reserved for a condition, the ~ sign acts as the divider between condition & value, and the right side is the value assigned when that condition is met. Conditions are checked in order, and each row takes the value of the first condition it satisfies, which is why the bands can be written as a simple ascending list. As a final note, TRUE acts as a catchall, akin to an else statement.
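Two details here are easy to trip over: the order of the conditions, and what happens when nothing matches. The sketch below illustrates both on a plain vector, under the assumption that case_when returns the value for the first matching condition. The actions vector and the result names (in_order, broad_first, no_catchall) are made up for this example.

```r
library(dplyr)

# Hypothetical values, including one missing value
actions <- c(3, 8, 12, 20, NA)

# Conditions are checked top to bottom; the first one that is TRUE wins
in_order <- case_when(
  actions <= 5  ~ 'inactive user',
  actions <= 10 ~ 'limited user',
  actions <= 15 ~ 'healthy user',
  TRUE          ~ 'active user'
)

# Same conditions with the broadest first: it shadows the narrower ones
broad_first <- case_when(
  actions <= 15 ~ 'healthy user',
  actions <= 10 ~ 'limited user',
  actions <= 5  ~ 'inactive user',
  TRUE          ~ 'active user'
)

# Without a catchall, anything that matches no condition becomes NA
no_catchall <- case_when(
  actions <= 5  ~ 'inactive user',
  actions <= 10 ~ 'limited user'
)
```

Note that in the first two versions the missing value also lands in the TRUE catchall and gets labeled 'active user'. If that is not what you want, one option is to put an explicit is.na(actions) condition at the top.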
Conclusion
In short, while ifelse statements have their place and are incredibly useful, case_when makes for a simple & easy-to-interpret alternative when you are confronted with a myriad of ifelse statements.
I hope you find this useful in all of your feature engineering endeavors! If you found this useful and enjoyable, come check out some of our other data science posts at datasciencelessons.com! Happy data science-ing!