Re: [infomap-nlp-users] Number of ROWS for a large corpus


<html><div style='background-color:'><DIV class=RTE>
Dear Dominic,
Thanks for your reply. You are definitely answering the right question, and it is certainly very helpful. Just to elaborate a little more:
&gt;i. is space a concern? No, not really.
&gt;ii. how reliable do you want the results to be? Well, I want them to be reliable enough. When I said 170k words, I meant 170k types. So, as a general rule,&nbsp;types with a frequency greater than 10 should be added to the wordvector, right?
I haven't completely understood the SVD part of Infomap. Could you please explain why this problem of "infrequent words cropping up in odd places" occurs?
Thanks, Arooj </DIV>
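The counting suggested in the quoted message can be sketched as follows. This is only a minimal sketch and not part of Infomap: the function names `rows_for_threshold` and `zipf_rows_estimate` are hypothetical, the toy token list stands in for a real tokenized corpus, and the second function assumes an idealized Zipf law f(r) = C / r (exponent 1), which real corpora follow only roughly. The first function counts the types that occur at least `threshold` times; the second estimates the same number from the token count m and type count n alone.

```python
from collections import Counter


def rows_for_threshold(tokens, threshold=10):
    """Count the word types that occur at least `threshold` times."""
    counts = Counter(tokens)
    return sum(1 for freq in counts.values() if freq >= threshold)


def zipf_rows_estimate(num_tokens, num_types, threshold=10):
    """Estimate the same count under an ideal Zipf law f(r) = C / r.

    Assumption: frequencies follow C / r exactly, so C = m / H_n, where
    H_n is the n-th harmonic number, and a type at rank r reaches the
    threshold when r <= C / threshold.
    """
    harmonic = sum(1.0 / r for r in range(1, num_types + 1))
    c = num_tokens / harmonic
    return min(num_types, int(c // threshold))


# Toy data only; a real corpus would be read and tokenized from disk.
tokens = ["the"] * 40 + ["lord"] * 12 + ["kirioth"] * 2
stable_types = rows_for_threshold(tokens, threshold=10)
estimate = zipf_rows_estimate(len(tokens), len(set(tokens)), threshold=10)
print(stable_types, estimate)
```

Running the empirical count with a few thresholds in the 10 to 30 range gives candidate values for the number of rows; the Zipf estimate is only a rough cross-check when just m and n are known.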
<DIV></DIV>
<BLOCKQUOTE style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #a0c6e5 2px solid; MARGIN-RIGHT: 0px">
<HR color=#a0c6e5 SIZE=1>

<DIV></DIV>From:&nbsp;&nbsp;Dominic Widdows &lt;wi...@ma...&gt;
To:&nbsp;&nbsp;"Arooj Asghar" &lt;aro...@ho...&gt;
CC:&nbsp;&nbsp;inf...@li...
Subject:&nbsp;&nbsp;Re: [infomap-nlp-users] Number of ROWS for a large corpus
Date:&nbsp;&nbsp;Tue, 4 Oct 2005 10:51:12 -0400

&gt;Dear Arooj,
&gt;
&gt;Two questions I can think of that affect your question are
&gt;i. is space a concern?
&gt;and
&gt;ii. how reliable do you want the results to be?
&gt;
&gt;The second question is affected crucially by the token frequency of
&gt;the types that are being indexed. In my (ad hoc) experience, by the
&gt;time you have about 20 or 30 occurrences of a word type, things are
&gt;looking reasonably stable. If you have fewer than 10, things are
&gt;looking very hit and miss. Those who have experimented with the King
&gt;James Bible corpus will probably have seen for themselves that
&gt;infrequent words like "kirioth" seem to crop up in very odd places.
&gt;
&gt;This suggests rephrasing the question as follows. In a corpus with m
&gt;tokens belonging to n &lt; m types, how many of the types do you expect
&gt;to occur with frequency greater than some "stability threshold"? I
&gt;believe that the range 10 to 30 is a pretty good guess for this
&gt;threshold, but a guess it remains. One should be able to do a bit of
&gt;counting and comparing with Zipf's law to firm up the mathematics of
&gt;this suggestion.
&gt;
&gt;Am I answering the right question?
&gt;Best wishes,
&gt;Dominic
&gt;
&gt;On Oct 4, 2005, at 2:46 AM, Arooj Asghar wrote:
&gt;
&gt;&gt;Hi,
&gt;&gt;
&gt;&gt;I've wanted to index a large corpus, with almost 170K words. The
&gt;&gt;default 20K doesn't work well for this. Is there a way of telling
&gt;&gt;what the optimal number of rows for a corpus of n words should
&gt;&gt;be?
&gt;&gt;
&gt;&gt;Thanks,
&gt;&gt;Arooj
&gt;&gt;
&gt;&gt;_______________________________________________
&gt;&gt;infomap-nlp-users mailing list
&gt;&gt;inf...@li...
&gt;&gt;https://lists.sourceforge.net/lists/listinfo/infomap-nlp-users
</BLOCKQUOTE></div></html>