

Photo by Sanwal Deen on Unsplash

Introduction: TF-IDF

TF-IDF stands for “Term Frequency - Inverse Document Frequency”. It is a technique for quantifying the words in a set of documents: we compute a score for each word that signifies its importance within the document and across the corpus. The method is widely used in Information Retrieval and Text Mining.

Take a sentence such as “This building is so tall”. It is easy for us to understand because we know the semantics of the words and of the sentence as a whole. But how can a program (e.g. in Python) interpret it? Programming languages handle textual data far more easily as numerical values. For this reason, we vectorize all of the text so that it is better represented. By vectorizing the documents, we can further perform tasks such as finding relevant documents, ranking, clustering, and so on.

This exact technique is used when you perform a Google search (Google has since moved on to newer, transformer-based techniques). The web pages are called documents, and the text you search with is called a query. The search engine maintains a fixed representation of all the documents. When you search with a query, it computes the relevance of the query to every document, ranks the documents in order of relevance, and shows you the top k results. All of this is done using the vectorized forms of the query and the documents.
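As a rough sketch of that retrieval step, assuming the query and documents are already vectorized and using cosine similarity as the relevance measure (the measure and the function names here are illustrative, not prescribed by the article):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def top_k_documents(query_vec, doc_vecs, k=5):
    """Score every document against the query, rank by relevance,
    and return the indices of the top k documents."""
    scores = [cosine_similarity(query_vec, doc) for doc in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```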

Terminology

TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)

Term Frequency (TF) measures the frequency of a word in a document. It depends heavily on the length of the document and on the generality of the word: a very common word such as “was” can appear multiple times in any document, and if we take two documents of 100 words and 10,000 words respectively, there is a high probability that “was” appears more often in the 10,000-word document. But we cannot say that the longer document is more important than the shorter one. For this exact reason we normalize the frequency value: we divide the raw count by the total number of words in the document.
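In other words, tf(t, d) = (count of t in d) / (number of words in d). A minimal sketch of this, assuming naive whitespace tokenization (the function name is illustrative):

```python
def term_frequency(term, document):
    """tf(t, d): count of the term in the document divided by
    the total number of words in the document."""
    words = document.lower().split()
    return words.count(term) / len(words) if words else 0.0

# "was" appears twice among 8 words, so its TF is 2 / 8 = 0.25
print(term_frequency("was", "This building was tall and it was old"))
```

The length normalization is what makes the “was” counts of a 100-word and a 10,000-word document comparable.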

Recall that we ultimately need to vectorize each document. When we vectorize documents, we cannot consider only the words that appear in that particular document: if we did, the vector lengths would differ from document to document, and it would not be feasible to compute the similarity between them. So what we do is vectorize the documents over the vocab, where the vocab is the list of all possible words in the corpus. To compute TF, we therefore need the counts of all the vocab words in a document along with the length of that document; if a term does not occur in a particular document, its TF value for that document is simply 0.
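A minimal sketch of that idea under the same assumptions as above: build the vocab from the whole corpus, then give every document a TF vector of the same length.

```python
def build_vocab(corpus):
    """The vocab: a sorted list of every unique word in the corpus."""
    return sorted({word for document in corpus for word in document.lower().split()})

def tf_vector(document, vocab):
    """A fixed-length TF vector over the vocab; terms that do not
    occur in the document get a TF of 0."""
    words = document.lower().split()
    return [words.count(term) / len(words) if words else 0.0 for term in vocab]

corpus = ["This building is so tall", "This tower is tall"]
vocab = build_vocab(corpus)
# Every vector has length len(vocab), so document similarities are well defined.
vectors = [tf_vector(document, vocab) for document in corpus]
```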
