Question 4: You are working on a text-mining project in which you are trying to find the tf-idf of the word "HadoopExam". You have taken the content of all three blogs from the following list:

www.QuickTechie.com

www.Training4Exam.com

www.HadoopExam.com

After conducting a tf-idf analysis on the content of all 3 blogs, you derived TFIDF("hadoopexam", document y) = 1.908. You know that the term "HadoopExam" appears only in the blog www.HadoopExam.com. What is the tf of "HadoopExam" on the blog www.HadoopExam.com?

A. 2 based on the following reasoning:

TFIDF = TF * IDF = 1.908

You then know that IDF will equal LOG(3^2)=0.954

Therefore, TFIDF=TF*0.954 = 1.908

TF will then round to 2

B. 4 based on the following reasoning:

TFIDF = TF * IDF = 1.908

You then know that IDF will equal LOG(3/1)=0.477

Therefore, TFIDF=TF*0.477 = 1.908

TF will then round to 4

C. 6 based on the following reasoning:

TFIDF = TF * IDF = 1.908

You then know that IDF will equal 3/1=3

Therefore, TFIDF=TF/3 = 1.908

TF will then round to 6

D. 11 based on the following reasoning:

TFIDF = TF * IDF = 1.908

You then know that IDF will equal LOG(3/2)=0.176

Therefore, TFIDF=TF*0.176 = 1.908

TF will then round to 11
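
The arithmetic in the options can be checked with a short sketch. It assumes that idf is the base-10 logarithm of (number of documents / number of documents containing the term) and that tf is a raw count; both are assumptions inferred from the notation in the options, and other tf-idf variants would give different numbers. The function names are illustrative only. With 3 blogs and the term appearing in only 1 of them, the result agrees with the reasoning shown in option B.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency, assuming a base-10 logarithm."""
    return math.log10(total_docs / docs_with_term)


def tf_from_tfidf(tfidf: float, total_docs: int, docs_with_term: int) -> float:
    """Solve TFIDF = TF * IDF for TF."""
    return tfidf / idf(total_docs, docs_with_term)


# 3 blogs in the corpus, the term appears in exactly 1 of them.
idf_value = idf(3, 1)
tf_value = tf_from_tfidf(1.908, 3, 1)

print(round(idf_value, 3))  # 0.477
print(round(tf_value))      # 4
```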


Amit 2016-09-05 15:48
In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.

One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.
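
A minimal sketch of such a ranking function is shown here. The three toy documents, the whitespace tokenizer, the raw-count tf, and the base-10 idf are all assumptions chosen for illustration; real search engines use more elaborate variants.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from any real data set.
docs = {
    "d1": "the brown cow eats the grass",
    "d2": "the brown dog chases the cat",
    "d3": "the sky is blue",
}

tokenized = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenized)


def tf(term, tokens):
    """Raw count of the term in one document."""
    return Counter(tokens)[term]


def idf(term):
    """log10(N / df); returns 0.0 if the term appears in no document."""
    df = sum(1 for tokens in tokenized.values() if term in tokens)
    return math.log10(n_docs / df) if df else 0.0


def score(query, tokens):
    """Sum of tf * idf over the query terms."""
    return sum(tf(t, tokens) * idf(t) for t in query.split())


ranking = sorted(
    tokenized,
    key=lambda name: score("the brown cow", tokenized[name]),
    reverse=True,
)
print(ranking)  # ['d1', 'd2', 'd3']
```

In this corpus "the" appears in every document, so its idf is 0 and it contributes nothing to any score.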

Suppose we have a set of English text documents and wish to determine which document is most relevant to the query "the brown cow". A simple way to start out is by eliminating documents that do not contain all three words "the", "brown", and "cow", but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document and sum them all together; the number of times a term occurs in a document is called its term frequency.
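
The two steps just described, filtering and then counting, might look like the following sketch. The documents are made up for illustration and tokens are split on whitespace only, which is a simplifying assumption.

```python
from collections import Counter

# Hypothetical toy documents.
docs = {
    "d1": "the brown cow and the other brown cow",
    "d2": "the cow is brown",
    "d3": "the brown dog",
}
query = ["the", "brown", "cow"]

# Term frequency: how often each token occurs in each document.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Step 1: keep only documents that contain every query term.
candidates = {
    name: c for name, c in counts.items() if all(c[t] > 0 for t in query)
}

# Step 2: sum the term frequencies of the query terms per remaining document.
tf_sums = {name: sum(c[t] for t in query) for name, c in candidates.items()}
print(tf_sums)  # {'d1': 6, 'd2': 3}
```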

The first form of term weighting is due to Hans Peter Luhn (1957) and is based on the Luhn Assumption:

The weight of a term that occurs in a document is simply proportional to the term frequency.

Inverse document frequency
Because the term "the" is so common, term frequency will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow". The term "the" is not a good keyword to distinguish relevant and non-relevant documents and terms, unlike the less common words "brown" and "cow". Hence an inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
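
The effect can be seen in a small sketch. It assumes idf = log10(N / df), one common definition among several, and uses a made-up four-document corpus in which "the" appears everywhere while "cow" appears once.

```python
import math

# Hypothetical toy corpus.
docs = [
    "the brown cow",
    "the quick fox",
    "the lazy dog",
    "the brown bear",
]
token_sets = [set(d.split()) for d in docs]


def idf(term):
    """log10(N / df); assumes the term occurs in at least one document."""
    df = sum(term in s for s in token_sets)
    return math.log10(len(token_sets) / df)


for term in ("the", "brown", "cow"):
    print(term, round(idf(term), 3))
# the 0.0
# brown 0.301
# cow 0.602
```

The rarer the term, the larger its idf, so "cow" is weighted more heavily than "brown", and "the" is effectively ignored.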