From: Dominic W. <wi...@ma...> - 2005-10-21 00:14:46
Hi Gabriel,

If it's a really big corpus, then 500 occurrences may not be enough for
a word to get into the top 20000, which I think is the default number
of rows in the cooccurrence matrix.

The first thing you should try is building a model with more rows,
which I think you can do by changing the "admin/default_params" file.

Try a "grep -n falklands" on the ".dic" file in your model directory if
you want to get a sense of how many rows you should include before you
get to the word "falklands". If it looks like the word should have been
in anyway, then the problem is something else.

Hope this helps.

Best wishes,
Dominic

On Oct 20, 2005, at 8:01 PM, Gabriel Murray wrote:

> I built an Infomap model using a very large corpus of newspaper
> articles (100+ million words). I can use associate to query words, but
> I find that some words that were contained in the corpus and were NOT
> stopwords are for some reason not contained in the model, i.e. I get a
> response of "no word vector for X." Is there some frequency threshold
> set? For example, "falklands" doesn't appear in the model even though
> it appeared more than 500 times in the corpus.
>
> If there is some threshold, can I turn it off?
> Thanks,
> Gabriel Murray
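
For anyone who would rather script the dictionary check than run grep, here is a minimal sketch of the same idea. It assumes the ".dic" file has one entry per line with the word appearing as a whitespace-separated token, and that line order reflects the ranking used to pick rows; neither is confirmed in this thread. The name word_rank is a hypothetical helper, not part of Infomap.

```python
from typing import Optional


def word_rank(dic_path: str, word: str) -> Optional[int]:
    """Return the 1-based line number of the first line that contains
    `word` as a whitespace-separated token, or None if no line does.

    Assumption: one dictionary entry per line (not verified against
    the actual Infomap ".dic" format).
    """
    with open(dic_path, encoding="utf-8", errors="replace") as f:
        for line_no, line in enumerate(f, start=1):
            # Whole-token match, so "falklands" does not match "falkland".
            if word in line.split():
                return line_no
    return None


# Example call (the path is a placeholder for your own model directory):
#   rank = word_rank("path/to/model.dic", "falklands")
#   If rank is larger than the number of rows in the model, the word
#   was cut off by the row limit rather than dropped for another reason.
```

Unlike a bare "grep -n", this matches whole tokens only, so it will not report lines where the word is merely a substring of a longer entry.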