These modules collect information from the stored documents and perform tasks on them, such as extracting document features (for example TF/IDF) or classifying documents.
MODULES
FeaturesExtractorTFIDF
Module for creating TF/IDF features of a text field.
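As a rough illustration of what TF/IDF features of a text field look like, the following is a minimal sketch in plain Python. The function name `tfidf_features`, the whitespace tokenizer, and the weighting (relative term frequency times log(N/df)) are assumptions made for this example; the weighting and storage format the module actually uses are not specified here.

```python
import math
from collections import Counter


def tfidf_features(texts):
    """Return one {term: weight} dict per input text.

    Assumed weighting: tf = count / text length, idf = log(N / df).
    """
    tokenized = [t.lower().split() for t in texts]
    n_docs = len(tokenized)
    # Document frequency: number of texts in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    features = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty texts
        features.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return features


docs = ["the cat sat", "the dog barked", "the cat purred"]
feats = tfidf_features(docs)
```

A term that occurs in every text (here "the") gets weight zero under this weighting, while rarer terms get larger weights.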
SVMTagger
Module for classifying docs based on LibSVM.
WordCount
Module that adds a field to each document dated between the desired dates, holding the number of words in the chosen input field.
InnerProduct
Computes the inner product or the cosine similarity between a text of interest w (provided as an external file) and the field x of each document in the BB, and stores the result in the specified output field.
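The two quantities can be sketched as follows, assuming both the text of interest w and the document field x are turned into bag-of-words count vectors. The helper names and the whitespace tokenizer are hypothetical and not taken from the module.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a text to a sparse vector of word counts."""
    return Counter(text.lower().split())


def inner_product(w, x):
    """Sum of products over the terms that both vectors share."""
    return sum(weight * x[term] for term, weight in w.items() if term in x)


def cosine_similarity(w, x):
    """Inner product divided by the product of the vector norms."""
    norm_w = math.sqrt(inner_product(w, w))
    norm_x = math.sqrt(inner_product(x, x))
    if norm_w == 0 or norm_x == 0:
        return 0.0  # convention chosen here for empty texts
    return inner_product(w, x) / (norm_w * norm_x)


w = bag_of_words("cat sat mat")
x = bag_of_words("the cat sat on the mat")
ip = inner_product(w, x)
cos = cosine_similarity(w, x)
```

The inner product grows with document length, whereas the cosine similarity is normalized to the range [0, 1] for count vectors, which makes documents of different lengths comparable.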
InnerProductWithWeights
Computes the weighted inner product of a text of interest (as an input vocabulary) and documents in the database with a predefined Tag name and Field name, within a period of time. The module writes the result on each document's registry.
UrlFeedFinder
Module for classifying docs based on LibSVM.
ReplaceTags
Queries the input BB for docs having all input Tags. It then replaces all the input tags with all output tags.
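The tag replacement can be pictured with the sketch below. It assumes documents are plain dictionaries with a `tags` list; the real storage layer and query mechanism of the BB are not shown, and `replace_tags` is a hypothetical name.

```python
def replace_tags(docs, input_tags, output_tags):
    """Swap input_tags for output_tags on every doc carrying ALL input_tags.

    Returns the number of documents modified.
    """
    wanted = set(input_tags)
    changed = 0
    for doc in docs:
        tags = set(doc.get("tags", []))
        # Only documents that carry every input tag qualify.
        if wanted and wanted <= tags:
            doc["tags"] = sorted((tags - wanted) | set(output_tags))
            changed += 1
    return changed


docs = [
    {"id": 1, "tags": ["raw", "en"]},
    {"id": 2, "tags": ["raw"]},
]
n = replace_tags(docs, ["raw", "en"], ["processed"])
```

In this example only the first document carries both input tags, so the second one is left untouched.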
LanguageDetector
Queries the input BB for docs having the input Tag, then detects the language of the specified fields. The input tag is then replaced by an output tag that includes the text's language.
BinaryRepresentation
Queries the input BB for all docs in a specific period of dates. It then checks whether the words from INPUT_VOCABULARY_FILENAME are present in the doc's field specified by the user, and adds a tag if the number of words found is greater than a user-specified threshold.
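A minimal sketch of this check follows. It assumes documents are dictionaries with a `date` and a text field, and that "number of words" means the number of distinct vocabulary words found in the field; both are assumptions, as are the function and parameter names.

```python
from datetime import date


def tag_by_vocabulary(docs, vocabulary, field, start, end, threshold, tag):
    """Tag docs dated in [start, end] whose field contains more than
    `threshold` distinct vocabulary words. Returns the number tagged."""
    vocab = {word.lower() for word in vocabulary}
    tagged = 0
    for doc in docs:
        if not (start <= doc["date"] <= end):
            continue  # outside the requested period
        words = set(doc.get(field, "").lower().split())
        if len(vocab & words) > threshold:
            doc.setdefault("tags", []).append(tag)
            tagged += 1
    return tagged


docs = [
    {"date": date(2012, 1, 5), "content": "stocks and bonds rallied", "tags": []},
    {"date": date(2012, 1, 6), "content": "weather was fine", "tags": []},
    {"date": date(2013, 1, 1), "content": "stocks bonds markets", "tags": []},
]
count = tag_by_vocabulary(
    docs, ["stocks", "bonds", "markets"], "content",
    date(2012, 1, 1), date(2012, 12, 31), threshold=1, tag="finance",
)
```

The third document matches the vocabulary but falls outside the date range, so it is not tagged.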
OnLineLearningPerceptronOnWords
Implements online learning using the Perceptron algorithm. It adjusts the weight vector w according to the INPUT_LEARN_TAGS, and the learning information is printed on the screen and also written to the txt file STATISTICAL. It also updates the documents in the database by adding:
a) a new tag to all processed docs (positive or negative according to the predicted output).
b) a field with the y_hat value, where y_hat(t) = <w(t), x(t)> is the inner product of the weight vector and the document's feature vector.
The module works on every document that carries all the tags in the INPUT_TAG field.
OnLineLearningPerceptronOnFeatures
The only difference from the previous module is that it takes already-computed features as input.
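The online update used by the two Perceptron modules can be sketched as below. This is the textbook mistake-driven Perceptron with labels +1/-1 and word-count features; the function names, the zero initial weights, and the tie-breaking rule (y_hat of 0 predicts negative) are assumptions, not details confirmed by the modules.

```python
from collections import Counter


def features(text):
    """Word-count features; the OnFeatures variant would skip this step."""
    return Counter(text.lower().split())


def dot(w, x):
    """y_hat = <w, x> for sparse vectors stored as dicts."""
    return sum(w.get(term, 0.0) * value for term, value in x.items())


def perceptron_step(w, x, y):
    """One online step; y is +1 or -1. Updates w in place on a mistake."""
    y_hat = dot(w, x)
    predicted = 1 if y_hat > 0 else -1
    if predicted != y:
        for term, value in x.items():
            w[term] = w.get(term, 0.0) + y * value
    return y_hat, predicted


w = {}
stream = [("good great film", 1), ("bad awful film", -1), ("great film", 1)]
predictions = []
for text, label in stream:
    y_hat, predicted = perceptron_step(w, features(text), label)
    predictions.append(predicted)
    print(f"y_hat={y_hat:+.1f} predicted={predicted:+d} true={label:+d}")
```

Each processed document would then receive the predicted label as a tag and `y_hat` as a field, as described in a) and b) above.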
OnLineLearningWinnowOnWords
Implements online learning using the Winnow algorithm. It adjusts the weight vector w according to the INPUT_LEARN_TAGS, and the learning information is printed on the screen and also written to the txt file STATISTICAL. It also updates the documents in the database by adding:
a) a new tag to all processed docs (positive or negative according to the predicted output).
b) a field with the y_hat value, where y_hat(t) = <w(t), x(t)> is the inner product of the weight vector and the document's feature vector.
The module works on every document that carries all the tags in the INPUT_TAG field.
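For comparison with the Perceptron, a Winnow step uses multiplicative rather than additive updates. The sketch below follows the common Winnow2 variant over binary word-presence features (promote by alpha on a missed positive, demote by 1/alpha on a false positive). The variant, the initial weights of 1.0, alpha, and the threshold are all assumptions; the module may be configured differently.

```python
def winnow_step(w, active, y, threshold, alpha=2.0):
    """One online Winnow step over binary features.

    w: dict term -> positive weight (unseen terms default to 1.0)
    active: set of terms present in the document (x_i = 1)
    y: true label, 1 or 0
    """
    y_hat = sum(w.get(term, 1.0) for term in active)
    predicted = 1 if y_hat >= threshold else 0
    if predicted != y:
        # Promote active weights on a missed positive, demote otherwise.
        factor = alpha if y == 1 else 1.0 / alpha
        for term in active:
            w[term] = w.get(term, 1.0) * factor
    return y_hat, predicted


w = {}
stream = [
    ({"good", "great"}, 1),
    ({"bad", "awful"}, 0),
    ({"good", "great"}, 1),
    ({"bad", "awful", "film"}, 0),
]
predictions = [winnow_step(w, words, label, threshold=3.0)[1]
               for words, label in stream]
```

Because only the weights of words present in a misclassified document change, the weights stay positive and irrelevant words are driven down quickly.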