Re: [infomap-nlp-users] missing words

Hi Gabriel,

If it's a really big corpus, then 500 occurrences may not be enough for
a word to get into the top 20000, which I think is the default number
of rows in the cooccurrence matrix.
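
As a rough illustration of why a rank cutoff can drop a fairly frequent
word, here is a toy Python sketch. It assumes rows are chosen purely by
frequency rank; the function name and the word counts are made up for
the example and are not taken from Infomap itself.

```python
from collections import Counter

def top_n_vocabulary(tokens, n_rows):
    """Return the set of the n_rows most frequent tokens (hypothetical helper)."""
    counts = Counter(tokens)
    return {word for word, _ in counts.most_common(n_rows)}

# Toy corpus: "falklands" occurs 500 times, but three other words occur more often.
tokens = (["government"] * 900 + ["minister"] * 800
          + ["election"] * 700 + ["falklands"] * 500)

vocab_small = top_n_vocabulary(tokens, 3)  # cutoff too low for "falklands"
vocab_large = top_n_vocabulary(tokens, 4)  # one more row lets it in

print("falklands" in vocab_small)  # False
print("falklands" in vocab_large)  # True
```

In a 100+ million word corpus the same thing can happen at a much larger
scale: 500 occurrences may still rank below the cutoff.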

The first thing you should try is building a model with more rows,
which I think you can do by changing the "admin/default_params" file.

Try a "grep -n falklands" on the ".dic" file in your model directory if
you want to get a sense of how many rows you should include before you
get to the word "falklands". If it looks like the word should have been
in anyway, then the problem is something else.
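
If you would rather script the lookup, a minimal Python sketch along the
same lines is below. It assumes the ".dic" file is plain text with one
entry per line and the word as a whitespace-separated field, which I
have not checked against the actual file layout; the function name is
hypothetical.

```python
def first_line_number(path, word):
    """Return the 1-based number of the first line that contains `word`
    as a whitespace-separated token, or None if no line does.

    Assumes a plain-text file with one entry per line (an assumption
    about the ".dic" layout, not a documented fact).
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line_number, line in enumerate(handle, start=1):
            if word in line.split():
                return line_number
    return None
```

Calling first_line_number() with the path to your ".dic" file and
"falklands" gives the line on which the word first appears, or None if
it is not there at all. Unlike a plain "grep -n", this matches whole
tokens only, so it will not report lines where "falklands" is merely
part of a longer string.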

Hope this helps.
Best wishes,
Dominic

On Oct 20, 2005, at 8:01 PM, Gabriel Murray wrote:

> I built an Infomap model using a very large corpus of newspaper
> articles (100+ million words). I can use associate to query words, but
> I find that some words that were contained in the corpus and were NOT
> stopwords are for some reason not contained in the model, i.e. I get a
> response of "no word vector for X." Is there some frequency threshold
> set? For example, "falklands" doesn't appear in the model even though
> it appeared more than 500 times in the corpus.
>
> If there is some threshold, can I turn it off?
> Thanks,
> Gabriel Murray