Welcome to the Text Model Generator Wiki!


Description

This project offers users a Java-based application to model a collection of textual documents.
The system provides several modelling methods, allowing users to experiment and check which method offers the best results for their needs.

The generated models can be seen as the set of the most representative terms of each document/category according to the applied method.

The system can model collections in Spanish or English. Please note that all the documents should be in the same language in order to obtain valid models; otherwise, the representative terms will depend on the language rather than on the content.

The following sections explain how to use the application, as well as the different techniques available to model the contents.


Using the application

This section describes how to use the application to model a collection of textual documents, as well as the data format required for the correct execution of the program.

Format of the Input Data

The format of the data to be modelled depends on the modelling technique to be applied:
1. TF-IDF and KLD
2. MI and Chi-Squared

TF-IDF and KLD

These two techniques model each document independently, by means of its terms: every document is compared against the rest of the documents in the collection.
To apply these techniques, it is enough to place all the documents to be modelled in the same folder.

MI and Chi-Squared

These two techniques model a set of categories according to the textual documents belonging to them. Thus, the collection to be modelled has to be organized into categories, and the input data has to be formatted accordingly.
To apply these techniques, the collection has to be structured as a set of folders (one per category) inside a root folder. Each category folder has to contain the documents belonging to that category.
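For example, a categorized collection could be laid out as follows (all folder and file names are purely illustrative):

```
my_collection/              <- root folder passed to the application
    sports/                 <- one folder per category
        match_report.txt
        season_review.txt
    politics/
        election_news.txt
        debate_summary.txt
```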

Execution

The application is offered as a JAR file (see the JAR Wikipedia page for more information).

To execute this file, the Java Runtime Environment (JRE) has to be installed. To download JRE or obtain more information about it, see the following link.

With JRE installed, the application can be executed from the command line with the following command:

java -jar TextualModelGenerator.jar COLLECTION_PATH RESULTS_PATH MODELLING_METHOD LANGUAGE [--stopWords]

where:

COLLECTION_PATH: the path where the collection to be modelled is stored
RESULTS_PATH: the path where the user wants to store the results (the model of the collection)
MODELLING_METHOD: the modelling method that the user wants to apply to model the collection
(options: KLD, TF-IDF, MI, XI)
LANGUAGE: the language of the documents in the collection
(options: ES, EN)

Other valid options:
--stopWords: if this option is selected, the stop words (e.g. a, the in English or un, de in Spanish) in the collection will be removed.
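As an illustration, a hypothetical invocation (the paths here are made up for the example) that models an English collection with TF-IDF and stop-word removal could look like:

```
java -jar TextualModelGenerator.jar ./my_collection ./my_results TF-IDF EN --stopWords
```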


Modelling Techniques

The application offers four different approaches for modelling the contents.
These are explained below.

KLD (Kullback-Leibler Divergence)

KLD is used to extract the terminology that most distinguishes a specific document from the rest of the collection. This technique is based on the occurrences of the terms: the more often a term appears in a document and the less it appears in other documents, the more relevant the term is to that document.
The relevance is calculated as follows:

KLD(pD,pC) = pD(t)∙ln(pD(t)/pC(t))

where pD(t) is the probability of term t within a document D (in other words, the frequency of t divided by the total number of terms in document D) and pC(t) is the probability of the same term t within the collection C (the frequency of t divided by the total number of terms in the collection C).
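The per-term score above can be sketched in a few lines of Java. This is a minimal illustration of the formula, not the tool's actual implementation; the class and method names are made up for the example.

```java
public class KldExample {

    // Contribution of a single term t to KLD(pD, pC):
    // pD(t) * ln(pD(t) / pC(t)), where pD(t) is the term's
    // probability in the document and pC(t) its probability
    // in the whole collection.
    public static double kldTerm(double pD, double pC) {
        return pD * Math.log(pD / pC);
    }

    public static void main(String[] args) {
        // A term covering 5% of a document but only 1% of the
        // collection is characteristic of that document.
        System.out.println(kldTerm(0.05, 0.01)); // positive score
        // A term as frequent in the document as in the collection scores 0.
        System.out.println(kldTerm(0.01, 0.01)); // 0.0
    }
}
```

Note that a term that is rarer in the document than in the collection gets a negative score, so such terms naturally fall to the bottom of the ranking.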

For more information see: KLD Wikipedia Page

TF-IDF

TF-IDF is a well-known technique (almost a standard) to model textual contents, that is, to reflect how important a term is to a document in a collection.
TF-IDF weights a term of a document according to the number of times it appears in the document and the frequency of the term in the collection. The formulation of TF-IDF is shown below:

TF-IDF(t,D) = f(t,D)∙log(|D|/|Dt|)

where f(t,D) is the frequency of the term t in a document D, |D| is the number of documents in the collection, and |Dt| is the number of documents in which t appears.
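The formula above translates directly into code. The following is a minimal sketch of that formula only; the names are illustrative and not part of the actual tool.

```java
public class TfIdfExample {

    // TF-IDF(t, D) = f(t, D) * log(|D| / |Dt|)
    // termFreqInDoc:     f(t, D), occurrences of t in document D
    // numDocs:           |D|, documents in the collection
    // docsContainingTerm:|Dt|, documents in which t appears
    public static double tfIdf(int termFreqInDoc, int numDocs,
                               int docsContainingTerm) {
        return termFreqInDoc
                * Math.log((double) numDocs / docsContainingTerm);
    }

    public static void main(String[] args) {
        // Term appears 3 times in the document and in 2 of 10 documents.
        System.out.println(tfIdf(3, 10, 2)); // 3 * ln(5), a high weight
        // A term present in every document carries no discriminative weight.
        System.out.println(tfIdf(3, 10, 10)); // 0.0
    }
}
```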

For more information see: TF-IDF Wikipedia Page

Mutual Information

Mutual Information (MI) can be seen as a metric that measures the mutual dependence of two random variables. In the field of Information Retrieval, MI has been proposed as a useful metric for feature selection. More concretely, for textual representation, MI is useful to select the most representative terms of a category (the set of documents belonging to that category). To this end, MI measures how much the presence or absence of a term contributes to classifying a document containing that term into a category.
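As an illustration, a common way to estimate the term/category association from document counts is the pointwise MI shown below. This is only a sketch under that assumption; the tool's exact formulation may differ, and all names are made up for the example.

```java
public class MiExample {

    // Pointwise mutual information between term t and category c,
    // estimated from document counts:
    //   MI(t, c) = ln( (n11 * n) / (nTerm * nCat) )
    // n11:   documents of category c that contain t
    // nTerm: documents that contain t
    // nCat:  documents of category c
    // n:     total number of documents
    public static double mi(int n11, int nTerm, int nCat, int n) {
        return Math.log(((double) n11 * n) / ((double) nTerm * nCat));
    }

    public static void main(String[] args) {
        // Strong association: 8 of the 10 documents containing t
        // belong to c, while c only holds 20 of the 100 documents.
        System.out.println(mi(8, 10, 20, 100)); // ln(4), clearly positive
        // Independence: t is spread over categories exactly as expected.
        System.out.println(mi(2, 10, 20, 100)); // ln(1) = 0.0
    }
}
```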

For more information see: MI Wikipedia Page

Chi-Squared

Chi-Squared is another feature selection method, similar to MI. Chi-Squared measures the independence of two events. Like MI, Chi-Squared can be used for textual representation, the two events whose independence is checked being the occurrence of a term in a document and the occurrence of a category for the same document.
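For a term/category pair, this independence test is typically computed over a 2x2 contingency table of document counts. The following is a sketch under that assumption; the tool's exact formulation may differ, and the names are illustrative.

```java
public class ChiSquaredExample {

    // Chi-Squared statistic for a 2x2 contingency table:
    // n11: docs in the category that contain the term
    // n10: docs outside the category that contain the term
    // n01: docs in the category without the term
    // n00: docs outside the category without the term
    public static double chiSquared(int n11, int n10, int n01, int n00) {
        double n = n11 + n10 + n01 + n00;
        double diff = (double) n11 * n00 - (double) n10 * n01;
        double num = n * diff * diff;
        double den = (double) (n11 + n10) * (n01 + n00)
                   * (n11 + n01) * (n10 + n00);
        return num / den;
    }

    public static void main(String[] args) {
        // Strong association between term and category.
        System.out.println(chiSquared(8, 2, 12, 78)); // 25.0
        // Perfect independence yields 0.
        System.out.println(chiSquared(2, 8, 18, 72)); // 0.0
    }
}
```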

For more information see: Chi-Squared Wikipedia Page


Contact

This is a project created and supported by:

Angel Castellanos
Juan Cigarrán Recuero
Ana García Serrano

If you want to contact us, please send an email to:

acastellanos@lsi.uned.es


Download