From: Dominic W. <wi...@ma...> - 2005-10-04 14:51:21
Dear Arooj,

Two questions I can think of that affect yours are: (i) is space a concern? and (ii) how reliable do you want the results to be?

The second question is affected crucially by the token frequency of the types that are being indexed. In my (ad hoc) experience, by the time you have about 20 or 30 occurrences of a word type, things are looking reasonably stable. If you have fewer than 10, things are looking very hit and miss. Those who have experimented with the King James Bible corpus will probably have seen for themselves that infrequent words like "kirioth" seem to crop up in very odd places.

This suggests rephrasing the question as follows. In a corpus with m tokens belonging to n < m types, how many of the types do you expect to occur with frequency greater than some "stability threshold"? I believe that the range 10 to 30 is a pretty good guess for this threshold, but a guess it remains. One should be able to do a bit of counting and comparing with Zipf's law to firm up the mathematics of this suggestion.

Am I answering the right question?

Best wishes,
Dominic

On Oct 4, 2005, at 2:46 AM, Arooj Asghar wrote:

> Hi,
>
> I've wanted to index a large corpus, with almost 170K words. The
> default 20K doesn't work well for this. Is there a way of telling what
> the optimal number of rows for a corpus of 'n' words should be?
>
> Thanks,
> Arooj
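
The "bit of counting and comparing with Zipf's law" suggested in the reply can be sketched roughly as below. This is only an illustration, not part of Infomap: the function names are hypothetical, the corpus is assumed to be already tokenized into a list of strings, and the Zipf estimate assumes the idealized form f(r) = C / r (exponent 1), which real corpora follow only approximately.

```python
from collections import Counter


def stable_type_count(tokens, threshold=20):
    """Count the word types that occur at least `threshold` times.

    `tokens` is assumed to be an iterable of already-tokenized words.
    """
    freqs = Counter(tokens)
    return sum(1 for count in freqs.values() if count >= threshold)


def zipf_stable_type_estimate(m, n, threshold=20):
    """Estimate the same count for m tokens and n types under an
    idealized Zipf law f(r) = C / r (an assumption, not a measurement).

    C is chosen so that the frequencies of ranks 1..n sum to m.
    """
    harmonic = sum(1.0 / r for r in range(1, n + 1))
    c = m / harmonic
    # f(r) >= threshold  is equivalent to  r <= C / threshold
    return min(n, int(c / threshold))


if __name__ == "__main__":
    # Tiny hand-made example: only "lord" reaches 20 occurrences.
    toy = ["lord"] * 25 + ["king"] * 12 + ["kirioth"] * 3
    print(stable_type_count(toy, threshold=20))
    print(stable_type_count(toy, threshold=10))

    # Illustrative values only: m and n are not taken from any real corpus.
    for t in (10, 20, 30):
        print(t, zipf_stable_type_estimate(m=1_000_000, n=50_000, threshold=t))
```

Comparing the measured count from `stable_type_count` with the Zipf estimate for thresholds between 10 and 30 gives a rough idea of how many rows would be backed by enough occurrences to look stable.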