Understand the Internet’s Relevance Metric in a Hurry: TF-IDF

intro

With roots in the 1950s, TF-IDF is a cornerstone of modern applications that determine the relevance of each word in a document. At first glance, one could take the simple approach of looking at volume alone, i.e. “how many times did each term show up?” But TF-IDF takes us a big step further: it considers not only volume but also a proxy for relevance. More on that in a moment.

Section takeaway: the most important terms in a document aren’t always the ones that show up the most

purpose

Put simply, TF-IDF exists to enable better term prioritization. If I write 1,000 words, how does a search engine characterize my article? If I used the word “and” 100 times, is that how my article is characterized? Should my article come up any time someone types “and” into the search bar?

If you have some basic NLP experience, you might say that you’d remove stop words; fair enough, *stop words removed*, but now what if “data” is the next most frequent term? While that might separate the document from some others, it’s still very general. The meat of the topic itself could be obscured, and as a result the article may not be found when it counts.

I’ve been referring to TF-IDF as a metric for relevance, but perhaps a better way of describing the IDF in TF-IDF is that it allows us to identify novelty.

method

If you’re not familiar with some of the NLP basics like bag of words, stop words, etc., I’d pause and familiarize yourself before moving forward.

Now, when we talk about stop words, we’re talking about the ‘my’s, ‘the’s, ‘and’s, ‘it’s, ‘for’s, and so many others. What these words make up in volume, they lack in… you guessed it, relevance.
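As a quick aside, here’s a minimal sketch of what stop-word removal looks like in practice. This uses sklearn’s built-in English stop-word list as an assumption; the post doesn’t prescribe a particular list, and any reasonable one works:

```python
# Minimal stop-word removal sketch using sklearn's built-in English list
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

sentence = "Mars is the fourth planet from the Sun"

# keep only the words that are NOT in the stop-word list
kept = [w for w in sentence.lower().split() if w not in ENGLISH_STOP_WORDS]
print(kept)
```

Words like “is”, “the”, and “from” are dropped, while content-bearing words like “planet” survive.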

Now, why even bring that up? Well, there are other terms that are not stop words but still occur with great frequency, so much so that they almost become a given; they lack substantive meaning and don’t really add any new insight.

Now that we have enough of a conceptual idea to chew on, let’s break down the formula.

Section takeaway: it’s quite easy to wash out relevant terms when we only care about volume

Let’s break it down

We’ll break the concept into two obvious pieces: TF (term frequency) and IDF (inverse document frequency). The final score is simply their product: tf-idf = tf × idf.

Term Frequency

This is exactly what it sounds like: of all the words in our document, what percentage of them does a given term make up (the given term being our *term of interest*)?

Formula

tf = # of times the term occurs in our document of interest / # of total words in our document of interest
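The formula above translates almost directly into code. This is a minimal sketch for illustration (whitespace tokenization, no punctuation handling), not part of the sklearn example later in the post:

```python
# Term frequency: what fraction of the document's words is the term of interest?
def term_frequency(term, document):
    words = document.lower().split()
    return words.count(term.lower()) / len(words)

# "planet" is 1 of the 8 words in this sentence
print(term_frequency("planet", "Mars is the fourth planet from the Sun"))  # 0.125
```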

Inverse Document Frequency

This is where things get a bit more complicated, so bear with me a moment.

The first part is the ratio of total documents in the corpus to the number of documents that contain the term of interest.

Part two is that we take the log of that ratio. If you take a moment to imagine the likely ratio of the total number of documents to those containing the term of interest, it could be astronomically huge, but at a certain point the effect of additional documents begins to taper off. Using the log function dampens those incremental values at the extremes.

Formula

idf = log(# of documents / # of documents that contain the term of interest)
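The same sketch treatment works for IDF. Note this is the plain textbook formula above; sklearn’s `TfidfVectorizer` applies smoothing by default, so its exact numbers will differ slightly:

```python
import math

# Inverse document frequency: log of (total docs / docs containing the term)
def inverse_document_frequency(term, corpus):
    containing = sum(1 for doc in corpus if term.lower() in doc.lower().split())
    return math.log(len(corpus) / containing)

corpus = [
    "Mars is the fourth planet",
    "Neptune is the eighth planet",
    "Saturn is a gas giant",
]

# "planet" appears in 2 of 3 docs: log(3/2), a low score
print(inverse_document_frequency("planet", corpus))
# "mars" appears in only 1 of 3 docs: log(3/1), a higher score
print(inverse_document_frequency("mars", corpus))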

Time to code

Housekeeping

To get started, let’s import the necessary libraries; we’ll be using sklearn’s tooling for TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

Data Collection

Next, I’ve taken the first paragraph about Mars, Neptune, and Saturn from Wikipedia and created our corpus from those documents, breaking them into separate list items by sentence.

docs = [
    "Saturn is the sixth planet from the Sun and the second-largest in the Solar System after Jupiter",
    "It is a gas giant with an average radius of about nine",
    "Mars is the fourth planet from the Sun and the second-smallest planet in the Solar System being larger than only Mercury",
    "In English Mars carries the name of the Roman god of war",
    "Neptune is the eighth and farthest-known Solar planet from the Sun",
    "In the Solar System it is the fourth-largest planet by diameter the third-most-massive planet and the densest giant planet",
    "It is 17 times the mass of Earth and slightly more massive than its near-twin Uranus",
]
# sources:
# https://en.wikipedia.org/wiki/Neptune
# https://en.wikipedia.org/wiki/Saturn
# https://en.wikipedia.org/wiki/Mars

Modeling

Similar to any run-of-the-mill sklearn algorithm, you’ll want to instantiate the model:

tfIdfVectorizer = TfidfVectorizer(use_idf=True)

After that, we’ll simply fit the model to our list of documents:

tfIdf = tfIdfVectorizer.fit_transform(docs)

Output

From here, let’s prepare our output.

Now, we’ll convert our scipy matrix to a dataframe.

# index the first document's scores; note that get_feature_names() was
# removed in newer sklearn releases in favor of get_feature_names_out()
output = pd.DataFrame(tfIdf[0].T.todense(), index=tfIdfVectorizer.get_feature_names_out(), columns=["tfidf"])

Finally, we’ll sort the dataframe to see some of our highest-relevance terms.

output = output.sort_values('tfidf', ascending=False)

What’s interesting to note here is that a term like “planet” comes in quite a ways down the list despite showing up throughout the corpus. The reason is that its very prevalence across documents causes the IDF component of the calculation to return a very low measure. (“Neptune” also scores zero here, but simply because it doesn’t appear in the first document, which is the one we indexed above.)
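A quick back-of-the-envelope check of that intuition, using the plain IDF formula from earlier (sklearn’s default applies smoothing, so its exact numbers differ, but the direction is the same):

```python
import math

# a term found in every one of our 7 documents is maximally "un-novel":
# idf bottoms out at log(7/7) = 0
print(math.log(7 / 7))  # 0.0

# a term found in only 1 of 7 documents is far more distinctive
print(math.log(7 / 1))
```

No matter how often an everywhere-term repeats, its TF is multiplied by an IDF near zero, which is exactly why volume alone can’t surface the relevant terms.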

Conclusion

I hope this was a useful crash course in TF-IDF.

In a short time we’ve covered the following about TF-IDF:

  • a brief overview of its history
  • its purpose and use case
  • a breakdown of term frequency & inverse document frequency
  • a code implementation of tf-idf on some wikipedia documents

I hope you’ve enjoyed this post and that it helps you in your work! Please share what worked and what didn’t!

Feel free to check out some of my other posts at datasciencelessons.com

Happy Data Science-ing!
