From: Davide <Dav...@un...> - 2008-07-01 19:24:49
Hi,

I have downloaded Infomap and am experimenting with it. In the algorithm description on the website (http://infomap-nlp.sourceforge.net/doc/algorithm.html) I found this statement:

"It is at this stage of the preprocessing that the WORDSPACE software can incorporate extra linguistic information such as part of speech tags and multiword expressions, if these are suitably recorded in the corpus."

Unfortunately, no further information is provided. This is exactly what I would like to do: exploit extra linguistic information such as part-of-speech tags and multiword expressions.

Currently I am using the single-file format described in http://infomap-nlp.sourceforge.net/doc/input_formats.html, and it works great with a raw text corpus. As a next step, I would like to supply text that has already been preprocessed (tokenized, lemmatized, POS-tagged, NER-tagged), so that Infomap skips its own preprocessing and directly builds the matrix for the SVD.

How can I do this? What would the matrix look like? Is there a way to make the column dimensions of the matrix only the extra linguistic features, rather than the words? What does "suitably recorded in the corpus" mean? Is there a particular input format?

Thank you.

Sincerely,
Davide