Home

František Dařena

VecText is an application that converts raw text into a structured format suitable for various data mining software (e.g., Weka, C5, CLUTO). The application is written in Perl, an interpreted programming language that runs on more than 100 platforms. Part of the functionality is provided by external modules (e.g., Lingua::Stem::Snowball for stemming) that are freely available from the Comprehensive Perl Archive Network (CPAN). The graphical user interface is implemented in Perl/Tk, a widely used graphical toolkit for Perl, which can also be obtained from CPAN.

The graphical user interface makes the software easy to use without specialized technical skills or knowledge of a particular programming language, library names, their functions, etc. All preprocessing actions are specified using common graphical elements organized into logically related blocks.

In the command line interface mode, all options must be specified as command line parameters. This non-interactive mode makes it possible to incorporate the application into a more complex data mining process that integrates several software packages, or to perform multiple conversions in a batch.

To help users define all necessary and desired parameters for the command line mode, the graphical interface can generate the command line parameters from the current values of all form elements in the application window. The settings are returned as a text string that can simply be copied into, e.g., a batch file or script.
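One way to reuse such a generated string is to save it in a text file and replay it later. The sketch below is only an illustration: the function name `run_with_saved_params` and the file name in the usage note are hypothetical, and it assumes the saved string has the same form as parameters typed on the command line. It relies on the standard `xargs` utility, which splits the string into arguments and keeps double-quoted values with spaces together.

```shell
#!/bin/sh
# Hypothetical helper: replay a parameter string saved from the GUI.
#   $1      file holding the generated parameter string (one line)
#   $2...   the command to run with those parameters
run_with_saved_params() {
    params_file=$1
    shift
    # xargs reads the string from the file, splits it on unquoted blanks,
    # and appends the resulting arguments to the given command.
    xargs "$@" < "$params_file"
}
```

With the string saved in, say, `vectext.params`, the conversion could then be started with `run_with_saved_params vectext.params perl vectext-cmdline.pl`.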

An example of running the application in command line mode:

perl vectext-cmdline.pl --input=data.txt --output_dir=. --output_file=data
                        --local_weights="Term Frequency (TF)" --output_format=arff
                        --print_statistics --create_dictionary_freq --stopwords_file=stop.txt
                        --stemming=English
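Because the command line mode is non-interactive, several files can be converted in one batch. The following is a minimal sketch for a POSIX shell, not part of VecText itself: the function names `convert_one` and `convert_all` and the `VECTEXT_RUNNER` variable are hypothetical, and only parameters from the example above are used. Each output file is named after its input file without the `.txt` extension.

```shell
#!/bin/sh
# Program used to start the script; set VECTEXT_RUNNER=echo for a dry run
# that only prints the arguments instead of running the conversion.
VECTEXT_RUNNER=${VECTEXT_RUNNER:-perl}

# Convert one text file, e.g. reviews.txt -> output file name "reviews".
convert_one() {
    input=$1
    name=$(basename "$input" .txt)
    $VECTEXT_RUNNER vectext-cmdline.pl \
        --input="$input" --output_dir=. --output_file="$name" \
        --local_weights="Term Frequency (TF)" --output_format=arff \
        --stemming=English
}

# Convert every file given as an argument; stop at the first failure.
convert_all() {
    for f in "$@"; do
        convert_one "$f" || return 1
    done
}
```

A call such as `convert_all reviews.txt news.txt` would then run one conversion per file with identical settings.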

See also a [General description] of the entire conversion process.

Page [Parameters] contains a detailed description of application parameters and their allowed values.
