Hi Gabriel,
If it's a really big corpus, then 500 occurrences may not be enough for
a word to get into the top 20000, which I think is the default number
of rows in the cooccurrence matrix.
The first thing you should try is building a model with more rows,
which I think you can do by changing the "admin/default_params" file.
Try a "grep -n falklands" on the ".dic" file in your model directory if
you want to get a sense of how many rows you should include before you
get to the word "falklands". If it looks like the word should have been
included anyway, then the problem is something else.
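If you would rather script the check than run grep, here is a minimal sketch of the same idea. It assumes, without my having verified it, that the ".dic" file is plain text with one entry per line, that the word appears as a whitespace-separated field, and that the line order reflects the frequency ranking. The function name `find_word_rank` is made up for illustration and is not part of Infomap.

```python
def find_word_rank(lines, word):
    """Return the 1-based number of the first line containing `word`.

    Assumption: each line holds one dictionary entry and the word is a
    whitespace-separated field on that line. Returns None if no line
    contains the word as a whole field.
    """
    for lineno, line in enumerate(lines, start=1):
        # Compare whole fields so that "falklands" does not match a
        # longer word that merely contains it.
        if word in line.split():
            return lineno
    return None
```

To use it, open the ".dic" file and pass the file object as `lines`, e.g. `find_word_rank(open(path), "falklands")`, where `path` is whatever your model's dictionary file is called. If the number that comes back is larger than the row limit, the word was cut off by the limit; if it is smaller, or the word is missing altogether, the problem is something else.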
Hope this helps.
Best wishes,
Dominic
On Oct 20, 2005, at 8:01 PM, Gabriel Murray wrote:
> I built an Infomap model using a very large corpus of newspaper
> articles (100+ million words). I can use associate to query words, but
> I find that some words that were contained in the corpus and were NOT
> stopwords are for some reason not contained in the model, i.e. I get a
> response of "no word vector for X." Is there some frequency threshold
> set? For example, "falklands" doesn't appear in the model even though
> it appeared more than 500 times in the corpus.
>
> If there is some threshold, can I turn it off?
> Thanks,
> Gabriel Murray