Assuming you already have some background with the other more common types of joins, inner, left, right, and outer; adding semi and anti can prove incredibly useful saving you what could have alternatively taken multiple steps.
In this post, I’ll be focusing on just semi-joins; with that said, there is a lot of overlap between semi & anti, so get ready to learn a bit about both.
Semi & anti joins are quite a bit different than the other four that I just highlighted; the number one difference being they are actually classified as what’s known as filtering joins.
Syntactically it feels very similar to any other join, but the intention is not to enhance a dataset with additional columns or rows, the intent is to use these joins to perform filtering.
A filtering join is not classified by the additional of new columns of information, rather it facilitates one being able to keep or reduce records in a given dataset.
The semi join is used with the intent of taking a dataset and filtering it down based on whether a common identifier is located in some additional dataset.
A good way to drill this idea home is to code the alternative.
Opportunity Dataset Example
Let’s say we have a dataset from salesforce that contains all deals or opportunities that we’ve worked on or are currently working on.
This opportunity dataset gives us a lot of really good information around the deal itself. Let’s suppose it looks something like this:
| opp_id | account_id | created_date | close_date | amount | stage |
Now let’s say we need to filter this dataset down such that we’re only including enterprise accounts. Well it just so happens that there is no segment field on our opportunity dataset that we can use to filter… we’ll have to leverage information from elsewhere.
Let’s pretend that the only place enterprise accounts have been tracked is in a random excel file. Let’s say the dataset looks like this:
| account_id | customer_segment |
In a World Without Semi-joins
Based on what we know now around left joins… we could do the following:
opportunities %>% left_join(enterprise_accounts, by = 'account_id')%>% filter(!is.na(customer_segment))
As you can see, we can left join the enterprise accounts dataset to our main opps dataset and in the event that there is no matching value, customer segment would be null, as such you can add a filter statement saying you only want non null cases.
This is fine and effectively performs the same functionality I explained above. One annoying thing is it gives you a new field, customer_segment, that is the same for every record.
We could also throw on a select statement thereafter to pull that field off, which just adds another line of code for you to write to meet this functionality.
opportunities %>% left_join(enterprise_accounts, by = 'account_id')%>% filter(!is.na(customer_segment))%>% select(-customer_segment)
Assuming you’ve learned about inner joins as well, we could also achieve a similar functionality there with slightly simpler code.
opportunities %>% inner_join(enterprise_accounts, by = 'account_id')%>% select(-customer_segment)
Simplifying with Semi-joins
Now let’s simplify things even more with a semi-join.
opportunities %>% semi_join(enterprise_accounts, by = 'account_id')
This will get us to the exact same output as each of the above examples. It will filter out records of opportunity where there is no matching account_id in the enterprise accounts table. It wont add columns or rows to your dataset. It exclusively exists and is used with the intent to filter.
There you have it, in just a few minutes we’ve covered a lot, and unlocked a bit of dplyr functionality that can simplify your code, your workflow.
- The difference between mutating joins and filtering joins
- How to execute a “filtering join” in the absence of semi-joins
- The specific output and intent for a semi-join
- How to use a semi-join
I hope this proves helpful in your day to day as a data professional.
Happy Data Science-ing!