Re: [infomap-nlp-users] Number of ROWS for a large corpus

Dear Arooj,

Two questions I can think of that affect the answer are:
i. Is space a concern?
ii. How reliable do you want the results to be?

The second question is affected crucially by the token frequency of the
types that are being indexed. In my (ad hoc) experience, by the time
you have about 20 or 30 occurrences of a word type, things are looking
reasonably stable. If you have fewer than 10, things are looking very
hit and miss. Those who have experimented with the King James Bible
corpus will probably have seen for themselves that infrequent words
like "kirioth" seem to crop up in very odd places.

This suggests rephrasing the question as follows. In a corpus with m
tokens belonging to n < m types, how many of the types do you expect to
occur with frequency greater than some "stability threshold"? I believe
that the range 10 to 30 is a pretty good guess for this threshold, but
a guess it remains. One should be able to do a bit of counting and
comparing with Zipf's law to firm up the mathematics of this
suggestion.
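
As a rough illustration of the "counting and comparing with Zipf's law"
idea, here is a minimal Python sketch. It is not part of Infomap; the
function names, the simple regex tokenizer, and the default threshold of
20 are all assumptions made for the example, and "at least the threshold"
is used as the cutoff. The first function counts the types in a text that
reach the threshold; the second estimates the same number under an
idealized Zipf's law f(r) = C / r, which real corpora only follow
approximately.

```python
from collections import Counter
import re


def stable_type_count(text, threshold=20):
    """Count word types that occur at least `threshold` times in `text`.

    The tokenizer is a deliberately crude assumption: lowercase the text
    and take runs of letters and apostrophes as tokens.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    freqs = Counter(tokens)
    return sum(1 for count in freqs.values() if count >= threshold)


def zipf_stable_types(n_tokens, n_types, threshold=20):
    """Estimate the same count assuming an idealized Zipf's law.

    Assumes the type of rank r has frequency f(r) = C / r, with C chosen
    so that the frequencies of all n_types types sum to n_tokens.
    """
    harmonic = sum(1.0 / r for r in range(1, n_types + 1))
    c = n_tokens / harmonic
    # f(r) >= threshold  <=>  r <= C / threshold
    return min(n_types, int(c / threshold))
```

Running stable_type_count over the actual corpus with thresholds of 10,
20 and 30 would show how many types clear the stability range, and the
Zipf estimate gives a figure to compare against when only m and n are
known. Either number is only a guide to how many rows are likely to give
stable results.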

Am I answering the right question?
Best wishes,
Dominic

On Oct 4, 2005, at 2:46 AM, Arooj Asghar wrote:

> Hi,
>
> I've wanted to index a large corpus, with almost 170K words. The
> default 20K doesn't work well for this. Is there a way of telling what
> the optimal number of rows for a corpus of 'n' words should be?
>
> Thanks,
> Arooj