From: Dominic W. <wi...@ma...> - 2006-02-16 19:31:31
Dear Joerg,

Apologies for not replying to your earlier message; glad you got things figured out.

> Now I managed to create a model for my 76 million word text collection
> (using a single file input) using standard settings of infomap. However,
> my retrieval results are quite bad for my test queries. Much worse than
> with standard IR engines such as Apache Lucene.

I'm not surprised at all that you get worse results with Infomap than with a standard engine like Lucene. There's very little evidence in the literature of LSA actually improving straight text retrieval, and this is the main reason why the most interesting work in LSA-type systems in the past 10 years has been in applications like WSD, classification, lexical acquisition, etc. I wouldn't really advise anyone to think of Infomap as a top-class search engine; I'd think of it as a lexical modelling tool that does some document retrieval.

That said, if you were to pursue this, the place I would start would be with term weighting. There's a term_weight function in count_artvec.c that you could call from process_region, and I don't believe this is done by default. Experimenting with term weights (both in document vectors and query vectors?) would be one of the most sensible places to start.

I would still be surprised if you did better than Lucene for most queries using Infomap for document retrieval. However, you might get improvements using Infomap for query expansion on sparse queries. In other words, if you get low recall for a query, try putting the query terms through an Infomap "associate words" query to get neighboring terms, and add those neighbors to the query if their cosine similarity is greater than (e.g.) 0.65. Just a suggestion. In general, I think you'd expect a kind of similarity induction / smoothing engine like Infomap to help with improving recall rather than precision.
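On the term weighting point: I haven't checked this against the actual term_weight() in count_artvec.c, so the function below is only an illustrative sketch of the sort of tf-idf style weighting I have in mind, not Infomap's own code.

/* Illustrative only: a tf-idf style term weight, NOT the actual
 * term_weight() from count_artvec.c.  The idea is to down-weight
 * terms that occur in many documents before the counts go into
 * the vectors. */
#include <math.h>
#include <stdio.h>

/* tf    = frequency of the term in this document/region
 * df    = number of documents the term appears in
 * ndocs = total number of documents in the collection */
double example_term_weight(double tf, double df, double ndocs)
{
    if (tf <= 0.0 || df <= 0.0)
        return 0.0;
    /* log-scaled term frequency times inverse document frequency */
    return (1.0 + log(tf)) * log(ndocs / df);
}

int main(void)
{
    /* a very common word vs. a rarer, more topical word */
    printf("common word: %.3f\n", example_term_weight(50.0, 900.0, 1000.0));
    printf("rare word:   %.3f\n", example_term_weight(5.0, 20.0, 1000.0));
    return 0;
}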
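On the query expansion idea: again purely a toy sketch (the word vectors and the 0.65 cut-off are made up for illustration, and none of this is Infomap's actual API), just to show the cosine-threshold filtering I mean.

/* Toy sketch of query expansion: compare a query term's reduced-space
 * vector against candidate "associate" word vectors and keep the
 * neighbors whose cosine similarity exceeds a threshold. */
#include <math.h>
#include <stdio.h>

#define DIMS 3   /* a real Infomap model would use ~100 SVD dimensions */

static double cosine(const double *a, const double *b, int n)
{
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (int i = 0; i < n; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    if (na == 0.0 || nb == 0.0)
        return 0.0;
    return dot / (sqrt(na) * sqrt(nb));
}

int main(void)
{
    /* invented 3-d vectors standing in for rows of the reduced word matrix */
    const char  *words[] = { "car", "automobile", "banana" };
    const double vecs[][DIMS] = {
        { 0.9, 0.1, 0.0 },   /* query term: "car" */
        { 0.8, 0.2, 0.1 },   /* close neighbor */
        { 0.0, 0.1, 0.9 }    /* unrelated word */
    };
    const double threshold = 0.65;

    /* expand the query "car" with any neighbor above the threshold */
    for (int i = 1; i < 3; i++) {
        double sim = cosine(vecs[0], vecs[i], DIMS);
        printf("%s vs %s: cos = %.2f%s\n", words[0], words[i], sim,
               sim > threshold ? "  -> add to query" : "");
    }
    return 0;
}

In practice you would take the neighbor list from an "associate words" query against your model rather than from hand-made vectors, but the thresholding step would look much like this.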
> I wonder if I should play around with the parameters of Infomap a bit
> more. I started to do some little experiments but without any huge
> improvements.
>
> I have 76 million tokens and I just counted 740,000 different word types
> (with alphabetic characters only). The standard settings for infomap,
> however, are only 20000 rows and 1000 columns. For the first, I'm not
> really sure what the difference is between "content bearing words to be
> used as features" and "words for which to learn word vectors". I'm not
> really sure what that means

The "content bearing words" are columns, the "words for which to learn vectors" are rows. After processing the corpus, each entry in the matrix records the number of times each row-word occurred near each column-word. (After that, the SVD happens.)

> (sorry for asking newbie questions ...)

Don't be. I'm sorry for not having time to reply often!

> I'd like to know if somebody has experience with settings for large text
> collections. Is there any rule-of-thumb how to adjust the parameters to
> get good results? I don't know exactly where to start ... should I
> increase the numbers of rows and columns (and how much) or should I play
> with the dimensions of LSA (SINGVALS)?

My initial suggestion would be to increase the number of rows, if you're going to try increasing anything. Then you have a bigger vocabulary, but described in terms of the same "content bearing words".

> What exactly do the pre- and post-context settings do?

Pre-context is how many words before a row-word a column-word must appear to be counted; post-context is how many words after. I think this is the case, but I haven't used these options myself.

Best wishes,
Dominic
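P.S. In case it helps to make the rows/columns and pre/post-context settings more concrete, here is a toy illustration of the kind of windowed counting that fills the matrix before the SVD. It is not the real code from count_artvec.c; the word lists and window sizes are invented.

/* Toy co-occurrence counter: rows are the words you learn vectors for,
 * columns are the "content bearing" feature words, and pre_context /
 * post_context set how far before and after a row-word a column-word
 * may appear and still be counted. */
#include <stdio.h>
#include <string.h>

#define NROWS 2
#define NCOLS 2

int main(void)
{
    const char *tokens[] = { "the", "car", "engine", "ran", "fine", "today" };
    const int   ntokens  = 6;
    const char *row_words[NROWS] = { "engine", "car" };   /* learn vectors for these */
    const char *col_words[NCOLS] = { "car", "ran" };      /* content-bearing features */
    const int   pre_context = 2, post_context = 2;

    int counts[NROWS][NCOLS] = { { 0 } };

    for (int t = 0; t < ntokens; t++) {
        for (int r = 0; r < NROWS; r++) {
            if (strcmp(tokens[t], row_words[r]) != 0)
                continue;
            /* scan the window around this occurrence of the row-word */
            int lo = t - pre_context  < 0        ? 0           : t - pre_context;
            int hi = t + post_context >= ntokens ? ntokens - 1 : t + post_context;
            for (int w = lo; w <= hi; w++) {
                if (w == t)
                    continue;
                for (int c = 0; c < NCOLS; c++)
                    if (strcmp(tokens[w], col_words[c]) == 0)
                        counts[r][c]++;
            }
        }
    }

    for (int r = 0; r < NROWS; r++)
        for (int c = 0; c < NCOLS; c++)
            printf("count(%s, %s) = %d\n", row_words[r], col_words[c], counts[r][c]);
    return 0;
}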