Home
Name Modified Size InfoDownloads / Week
TF-IDF_stopword_removed.jar 2013-12-08 172.7 kB
README.txt 2013-12-08 4.6 kB
mystopword.txt 2013-12-08 3.9 kB
Totals: 3 Items   181.2 kB 1
=============================================================================================================
                                                                       
                              ======                                   
                              README                                   
                              ======                                   
                                                                       
						  TF-IDF Measure
                          March 04 2013
						Author: Barkin Aygun
				Author: Rushdi Shams, UWO, Canada
						   version: 1.1
                                                                       
		JAR file for measuring TF-IDF of each document in a collection
NOTE: 	THE TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY ARE MEASURED BY EXCLUDING THE STOPWORDS
		ACCORDING TO A COMPREHENSIVE LIST OF STOPWORDS (mystopword.txt)
		
=============================================================================================================

Contents:
---------

1. Overview
2. Operation example

=============================================================================================================

1. Overview:
-----------------------------------
TF-IDF.jar is a Java Archive file to measure TF-IDF of each document 
in a document collection (corpus).

The jar can be used to 
(a) get all the terms in the corpus
(b) get the document frequency (DF) and inverse document frequency (IDF) of 
all the terms in the corpus
(c) get the TF-IDF of each document in the corpus
(d) get each term with their frequency (no. of presence), term frequency (TF) and TF-IDF in every document

To calculate TF the following formula is used.
TF (term) = frequency of the term in the document / no. of terms in the document

It is possible to use a variant of the formula (like 1 + log (frequency) or sqrt (frequency) by
extracting the frequency of each word.

To calculate IDF the following formula is used.
IDF (term) = log (no. of documents in the corpus / document frequency of the term)

To calculate TF-IDF of each term the following formula is used
TF-IDF (term) = TF (term) X IDF (term)

To calculate TF-IDF of each document, the following formula is used
TF-IDF (document) = sqrt (sum (TF-IDF (term)))

It is possible to use a variant of TF-IDF (document) normalization other than using squared root by
extracting the TF-IDF (term)s.

2. Operation example:
------------------------
A test class is provided as follows:

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class TestTF_IDF {
	
	public static void main(String[] args){
		//Test code for TfIdf
		TfIdf tf = new TfIdf("directory of your corpus");
		//Contains words in the documents
		String word;
		//Contains file name being processed
		String file;
		//Variable to hold document frequency and IDF of each word
		Double[] dfIDF;
		
		//Print document frequency and IDF of every word in the corpus
		for (Iterator<String> it = tf.allwords.keySet().iterator(); it.hasNext(); ) {
			word = it.next();
			dfIDF = tf.allwords.get(word);
			//dfIDF[0] is the DF of the word and dfIDF[1] is the IDF of the word
			System.out.println("Term " + word + " " + " Document Frequency " + dfIDF[0] + " " + " IDF " + dfIDF[1]);
		}//for (Iterator<String> it = tf.allwords.keySet().iterator(); it.hasNext(); )	
		
		tf.buildAllDocuments();
		
		//Print TF-IDF of each document in the corpus
		for (Iterator<String> it = tf.documents.keySet().iterator(); it.hasNext(); ) {
			file = it.next();
			System.err.println("File Name " + file + "\t" + "TF-IDF " + tf.documents.get(file).vectorlength);
		}//for (Iterator<String> it = tf.documents.keySet().iterator(); it.hasNext(); )
		
		//Prints each term in a document, its frequency, term frequency and tf-idf
		Map<String, Double[]> myMap = new HashMap<String, Double[]>();
		Double[] values;
		for (Iterator<String> it = tf.documents.keySet().iterator(); it.hasNext(); ) {
			file = it.next();
			System.out.println("File \t" + file);
			
			myMap = tf.documents.get(file).getF_TF_TFIDF();
			
			for (String key : myMap.keySet()) {
				
				values = myMap.get(key);
			    System.out.println("Term = " + key + " Frequency = " + values[0] + " Term Frequency " + values[1] + " TF-IDF " + values[2]);
			  
			}//for (String key : myMap.keySet())
		}//for (Iterator<String> it = tf.documents.keySet().iterator(); it.hasNext(); )
		
	}//public static void main(String[] args)
}//public class TestTF_IDF


=============================================================================================================
Source: README.txt, updated 2013-12-08