| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| TF-IDF_stopword_removed.jar | 2013-12-08 | 172.7 kB | |
| README.txt | 2013-12-08 | 4.6 kB | |
| mystopword.txt | 2013-12-08 | 3.9 kB | |
| Totals: 3 Items | 181.2 kB | 0 |
=============================================================================================================
======
README
======
TF-IDF Measure
March 04 2013
Author: Barkin Aygun
Author: Rushdi Shams, UWO, Canada
version: 1.1
JAR file for measuring TF-IDF of each document in a collection
NOTE: THE TERM FREQUENCY AND INVERSE DOCUMENT FREQUENCY ARE MEASURED BY EXCLUDING THE STOPWORDS
ACCORDING TO A COMPREHENSIVE LIST OF STOPWORDS (mystopword.txt)
=============================================================================================================
Contents:
---------
1. Overview
2. Operation example
=============================================================================================================
1. Overview:
-----------------------------------
TF-IDF.jar is a Java Archive file to measure TF-IDF of each document
in a document collection (corpus).
The jar can be used to
(a) get all the terms in the corpus
(b) get the document frequency (DF) and inverse document frequency (IDF) of
all the terms in the corpus
(c) get the TF-IDF of each document in the corpus
(d) get each term with their frequency (no. of presence), term frequency (TF) and TF-IDF in every document
To calculate TF the following formula is used.
TF (term) = frequency of the term in the document / no. of terms in the document
It is possible to use a variant of the formula (like 1 + log (frequency) or sqrt (frequency) by
extracting the frequency of each word.
To calculate IDF the following formula is used.
IDF (term) = log (no. of documents in the corpus / document frequency of the term)
To calculate TF-IDF of each term the following formula is used
TF-IDF (term) = TF (term) X IDF (term)
To calculate TF-IDF of each document, the following formula is used
TF-IDF (document) = sqrt (sum (TF-IDF (term)))
It is possible to use a variant of TF-IDF (document) normalization other than using squared root by
extracting the TF-IDF (term)s.
2. Operation example:
------------------------
A test class is provided as follows:
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
public class TestTF_IDF {
public static void main(String[] args){
//Test code for TfIdf
TfIdf tf = new TfIdf("directory of your corpus");
//Contains words in the documents
String word;
//Contains file name being processed
String file;
//Variable to hold document frequency and IDF of each word
Double[] dfIDF;
//Print document frequency and IDF of every word in the corpus
for (Iterator<String> it = tf.allwords.keySet().iterator(); it.hasNext(); ) {
word = it.next();
dfIDF = tf.allwords.get(word);
//dfIDF[0] is the DF of the word and dfIDF[1] is the IDF of the word
System.out.println("Term " + word + " " + " Document Frequency " + dfIDF[0] + " " + " IDF " + dfIDF[1]);
}//for (Iterator<String> it = tf.allwords.keySet().iterator(); it.hasNext(); )
tf.buildAllDocuments();
//Print TF-IDF of each document in the corpus
for (Iterator<String> it = tf.documents.keySet().iterator(); it.hasNext(); ) {
file = it.next();
System.err.println("File Name " + file + "\t" + "TF-IDF " + tf.documents.get(file).vectorlength);
}//for (Iterator<String> it = tf.documents.keySet().iterator(); it.hasNext(); )
//Prints each term in a document, its frequency, term frequency and tf-idf
Map<String, Double[]> myMap = new HashMap<String, Double[]>();
Double[] values;
for (Iterator<String> it = tf.documents.keySet().iterator(); it.hasNext(); ) {
file = it.next();
System.out.println("File \t" + file);
myMap = tf.documents.get(file).getF_TF_TFIDF();
for (String key : myMap.keySet()) {
values = myMap.get(key);
System.out.println("Term = " + key + " Frequency = " + values[0] + " Term Frequency " + values[1] + " TF-IDF " + values[2]);
}//for (String key : myMap.keySet())
}//for (Iterator<String> it = tf.documents.keySet().iterator(); it.hasNext(); )
}//public static void main(String[] args)
}//public class TestTF_IDF
=============================================================================================================