Assuming you already have some background with the other more common types of joins, inner, left, right, and outer; adding semi and anti can prove incredibly useful saving you what could have alternatively taken multiple steps.
In a previous post, I outlined the benefits of semi-joins and how to use them. Here I’ll be following that up with a very similar explanation of anti-joins.
If you want to brush up on semi-joins first you can find that here.
Anti-joins & semi-joins are quite a bit different than the other four that I just highlighted; the number one difference being they are actually classified as what’s known as filtering joins.
Syntactically it feels very similar to any other join, but the intention is not to enhance a dataset with additional columns or rows, the intent is to use these joins to perform filtering.
A filtering join is not classified by the additional of new columns of information, rather it facilitates one being able to keep or reduce records in a given dataset.
The anti-join is used with the intent of taking a dataset and filtering it down based on if a common identifier is not located in some additional dataset. If it shows up in both datasets… it’s excluded.
A good way to drill this idea home is to code the alternative.
Territory Analysis Example
Let’s say you’re a data analyst helping out your sales team who’s just hired on a new account executive. This new rep needs accounts, and the sales leaders want to make sure these accounts aren’t currently being worked. For this problem, let’s say we have two datasets; one that contains all accounts and another one that logs all sales activity this would have an activity id as well as an account id.
| account_id | current_owner| account_name | revenue
| activity_id | account_id |activity_type |
What you’ll need to do is filter the first dataset using the second dataset.
In a World Without anti-joins
Let’s first try to figure this out without the use of anti-joins
Based on what we know now around left joins… we could do the following:
accounts %>% left_join(activity, by = 'account_id')%>% filter(is.na(activity_type))
As you can see, we can left join the activity dataset to our accounts dataset. For whichever accounts there are not matches in this dataset activity type would be populated as NULL. As such, if you want a list of the accounts not in the second dataset you would filter where activity type is NULL.
This is fine and effectively performs the same functionality I explained above. Two annoying things are 1. it gives you a new field, activity_type; 2. is that if the same account shows up many times in the activity dataset, when you join it will create a new record for as many matches as there are.
We could also throw on a select statement thereafter to pull that field off, which just adds another line of code for you to write to meet this functionality.
accounts %>% left_join(activity, by = 'account_id')%>% filter(is.na(activity_type))%>% select(-activity_type)
Simplifying with Anti-joins
Now let’s simplify things even more with an anti-join.
accounts %>% semi_join(activity, by = 'account_id')
This will get us to the exact same output as each of the above examples. It will filter out records of accounts where they show up in the activity dataset. As such only accounts that are not being worked are going to get moved to the new rep in our scenario. This approach wont add columns or rows to your dataset. It exclusively exists and is used with the intent to filter.
There you have it, in just a few minutes we’ve covered a lot, and unlocked a bit of dplyr functionality that can simplify your code & workflow.
- The difference between mutating joins and filtering joins
- How to execute a “filtering-join” in the absence of anti-joins
- The specific output and intent for an anti-join
- How to use a anti-join
I hope this proves helpful in your day to day as a data professional.
Happy Data Science-ing!