From: Dominic W. <wi...@ma...> - 2006-02-16 19:31:31
Dear Joerg,

Apologies for not replying to your earlier message; glad you got things figured out.

> Now I managed to create a model for my 76 million word text collection
> (using a single file input) using standard settings of infomap. However,
> my retrieval results are quite bad for my test queries. Much worse than
> with standard IR engines such as Apache Lucene.

I'm not surprised at all that you get worse results with Infomap than with a standard engine like Lucene. There's very little evidence in the literature of LSA actually improving straight text retrieval, and this is the main reason why the most interesting work in LSA-type systems in the past 10 years has been in applications like WSD, classification, lexical acquisition, etc. I wouldn't really advise anyone to think of Infomap as a top-class search engine; I'd think of it as a lexical modelling tool that does some document retrieval.

That said, if you were to pursue this, the place I would start would be with term weighting. There's a term_weight function in count_artvec.c that you could call from process_region, and I don't believe this is done by default. Experimenting with term weights (both in document vectors and query vectors?) would be one of the most sensible places to start.

I would still be surprised if you did better than Lucene for most queries using Infomap for document retrieval. However, you might get improvements using Infomap for query expansion on sparse queries. In other words, if you get low recall for a query, try putting the query terms through an Infomap "associate words" query to get neighboring terms, and add those neighbors to the query if their cosine similarity is greater than (e.g.) 0.65. Just a suggestion. In general, I think you'd expect a kind of similarity induction / smoothing engine like Infomap to help with improving recall rather than precision.
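On the term weighting point: I haven't checked this against the actual term_weight() in count_artvec.c, so the function below is only an illustrative sketch of the sort of tf-idf style weighting I have in mind, not Infomap's own code.

/* Illustrative only: a tf-idf style term weight, NOT the actual
 * term_weight() from count_artvec.c.  The idea is to down-weight
 * terms that occur in many documents before the counts go into
 * the vectors. */
#include <math.h>
#include <stdio.h>

/* tf    = frequency of the term in this document/region
 * df    = number of documents the term appears in
 * ndocs = total number of documents in the collection */
double example_term_weight(double tf, double df, double ndocs)
{
    if (tf <= 0.0 || df <= 0.0)
        return 0.0;
    /* log-scaled term frequency times inverse document frequency */
    return (1.0 + log(tf)) * log(ndocs / df);
}

int main(void)
{
    /* a very common word vs. a rarer, more topical word */
    printf("common word: %.3f\n", example_term_weight(50.0, 900.0, 1000.0));
    printf("rare word:   %.3f\n", example_term_weight(5.0, 20.0, 1000.0));
    return 0;
}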
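On the query expansion idea: again purely a toy sketch (the word vectors and the 0.65 cut-off are made up for illustration, and none of this is Infomap's actual API), just to show the cosine-threshold filtering I mean.

/* Toy sketch of query expansion: compare a query term's reduced-space
 * vector against candidate "associate" word vectors and keep the
 * neighbors whose cosine similarity exceeds a threshold. */
#include <math.h>
#include <stdio.h>

#define DIMS 3   /* a real Infomap model would use ~100 SVD dimensions */

static double cosine(const double *a, const double *b, int n)
{
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (int i = 0; i < n; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    if (na == 0.0 || nb == 0.0)
        return 0.0;
    return dot / (sqrt(na) * sqrt(nb));
}

int main(void)
{
    /* invented 3-d vectors standing in for rows of the reduced word matrix */
    const char  *words[] = { "car", "automobile", "banana" };
    const double vecs[][DIMS] = {
        { 0.9, 0.1, 0.0 },   /* query term: "car" */
        { 0.8, 0.2, 0.1 },   /* close neighbor */
        { 0.0, 0.1, 0.9 }    /* unrelated word */
    };
    const double threshold = 0.65;

    /* expand the query "car" with any neighbor above the threshold */
    for (int i = 1; i < 3; i++) {
        double sim = cosine(vecs[0], vecs[i], DIMS);
        printf("%s vs %s: cos = %.2f%s\n", words[0], words[i], sim,
               sim > threshold ? "  -> add to query" : "");
    }
    return 0;
}

In practice you would take the neighbor list from an "associate words" query against your model rather than from hand-made vectors, but the thresholding step would look much like this.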
> I wonder if I should play around with the parameters of Infomap a bit
> more. I started to do some little experiments but without any huge
> improvements.
>
> I have 76 million tokens and I just counted 740,000 different word types
> (with alphabetic characters only). The standard settings for infomap,
> however, are only 20000 rows and 1000 columns. For the first, I'm not
> really sure what the difference is between "content bearing words to be
> used as features" and "words for which to learn word vectors". I'm not
> really sure what that means

The "content bearing words" are columns, the "words for which to learn vectors" are rows. After processing the corpus, each entry in the matrix records the number of times each row-word occurred near each column-word. (After that, the SVD happens.)

> (sorry for asking newbie questions ...)

Don't be. I'm sorry for not having time to reply often!

> I'd like to know if somebody has experience with settings for large text
> collections. Is there any rule-of-thumb how to adjust the parameters to
> get good results? I don't know exactly where to start ... should I
> increase the numbers of rows and columns (and how much) or should I play
> with the dimensions of LSA (SINGVALS)?

My initial suggestion would be to increase the number of rows, if you're going to try increasing anything. Then you have a bigger vocabulary, but described in terms of the same "content bearing words".

> What exactly do the pre- and post-context settings do?

Pre-context is how many words before a row-word a column-word must appear to be counted; post-context is how many words after. I think this is the case, but I haven't used these options myself.

Best wishes,
Dominic
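P.S. In case it helps to make the rows/columns and pre/post-context settings more concrete, here is a toy illustration of the kind of windowed counting that fills the matrix before the SVD. It is not the real code from count_artvec.c; the word lists and window sizes are invented.

/* Toy co-occurrence counter: rows are the words you learn vectors for,
 * columns are the "content bearing" feature words, and pre_context /
 * post_context set how far before and after a row-word a column-word
 * may appear and still be counted. */
#include <stdio.h>
#include <string.h>

#define NROWS 2
#define NCOLS 2

int main(void)
{
    const char *tokens[] = { "the", "car", "engine", "ran", "fine", "today" };
    const int   ntokens  = 6;
    const char *row_words[NROWS] = { "engine", "car" };   /* learn vectors for these */
    const char *col_words[NCOLS] = { "car", "ran" };      /* content-bearing features */
    const int   pre_context = 2, post_context = 2;

    int counts[NROWS][NCOLS] = { { 0 } };

    for (int t = 0; t < ntokens; t++) {
        for (int r = 0; r < NROWS; r++) {
            if (strcmp(tokens[t], row_words[r]) != 0)
                continue;
            /* scan the window around this occurrence of the row-word */
            int lo = t - pre_context  < 0        ? 0           : t - pre_context;
            int hi = t + post_context >= ntokens ? ntokens - 1 : t + post_context;
            for (int w = lo; w <= hi; w++) {
                if (w == t)
                    continue;
                for (int c = 0; c < NCOLS; c++)
                    if (strcmp(tokens[w], col_words[c]) == 0)
                        counts[r][c]++;
            }
        }
    }

    for (int r = 0; r < NROWS; r++)
        for (int c = 0; c < NCOLS; c++)
            printf("count(%s, %s) = %d\n", row_words[r], col_words[c], counts[r][c]);
    return 0;
}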