Relevant keywords using TF-IDF
TF-IDF is a numerical statistic that tells us how important a word is in a document. It stands for term frequency - inverse document frequency. Many words, such as *the*, *a*, and *and*, have a high term frequency, but by themselves they convey little meaning. In other words, they are not keywords.
The second part of the statistic is the inverse document frequency, which tells us how common a word is across all documents. It is calculated by dividing the total number of documents in a collection by the number of documents containing the specific word we want to analyse, and then taking the logarithm of that quotient.
TF-IDF = count(word in document) × log2(count(all documents) / count(documents containing word))
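As a minimal sketch of the formula above (the function and parameter names are mine, not from the post):

```python
from math import log2

def tfidf(term_count, total_documents, documents_with_term):
    """Term frequency multiplied by the base-2 log of the
    inverse document frequency."""
    return term_count * log2(total_documents / documents_with_term)

# A term appearing twice in a document, found in 2 of 8 documents:
# 2 * log2(8 / 2) = 2 * 2 = 4.0
print(tfidf(2, 8, 2))
```

Note that a word appearing in every document gets an IDF of log2(1) = 0, so its TF-IDF is zero no matter how frequent it is: exactly the behaviour we want for words like *the*.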
Let’s compute TF-IDF for a few keywords in this blog post.
*The values in the TF column include the title of this post and exclude everything below this paragraph. The values in the No. docs containing term column come from googling the term being analysed. It is assumed, for demonstration purposes, that Google has 50 billion documents indexed.*
| Term | TF | No. docs containing term | TF-IDF |
|---|---|---|---|
| tf-idf | 3 | | 18.72 |
| statistic | 2 | | 19.39 |
| keyword | 3 | | 16.97 |
| frequency | 4 | | 28 |
| blog | 1 | | 2.42 |
As you can see, the term blog does not make a good keyword. All other terms are a good fit.
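To illustrate the ranking, here is a sketch that sorts terms by their TF-IDF score. The document counts below are made up for illustration only (the real values came from search results and are not reproduced here); the 50-billion-document index size is the assumption stated above.

```python
from math import log2

TOTAL_DOCS = 50_000_000_000  # assumed index size from the post

# (term frequency, documents containing the term)
# The document counts are hypothetical, for illustration only.
terms = {
    "tf-idf": (3, 700_000_000),
    "frequency": (4, 400_000_000),
    "blog": (1, 9_000_000_000),
}

def tfidf(tf, docs_with_term):
    return tf * log2(TOTAL_DOCS / docs_with_term)

# Highest-scoring candidates first
for term in sorted(terms, key=lambda t: tfidf(*terms[t]), reverse=True):
    print(term, round(tfidf(*terms[term]), 2))
```

Even though *blog* is rarer in the post than *frequency*, its huge document count across the (hypothetical) collection drives its score down, which matches the conclusion above.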