From: Arooj A. <aro...@ho...> - 2005-10-05 11:12:33
|
<html><div style='background-color:'><DIV class=RTE> <P>Dear Dominic,</P> <P>Thanks for your reply. You definitely are answering the right question and it certainly is very helpful. Just to elaborate on things a little more...</P> <P>>i. is space a concern?<BR>No, not really. </P> <P>>ii. how reliable do you want the results to be?<BR>Well, I want them to be reliable enough. When I said 170k words, I meant 170k types. So as a general rule, types with a frequency greater than 10 should be added to the wordvector, right?</P> <P>I haven't completely understood the SVD part of infomap. Could you please explain why this problem of "infrequent words cropping up in odd places" occurs?</P> <P>Thanks,<BR>Arooj<BR></P></DIV> <DIV></DIV> <BLOCKQUOTE style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #a0c6e5 2px solid; MARGIN-RIGHT: 0px"><FONT style="FONT-SIZE: 11px; FONT-FAMILY: tahoma,sans-serif"> <HR color=#a0c6e5 SIZE=1> <DIV></DIV>From: <I>Dominic Widdows <wi...@ma...></I><BR>To: <I>"Arooj Asghar" <aro...@ho...></I><BR>CC: <I>inf...@li...</I><BR>Subject: <I>Re: [infomap-nlp-users] Number of ROWS for a large corpus</I><BR>Date: <I>Tue, 4 Oct 2005 10:51:12 -0400</I><BR>>Dear Arooj,<BR>><BR>>Two questions I can think of that affect your question are<BR>>i. is space a concern?<BR>>and<BR>>ii. how reliable do you want the results to be?<BR>><BR>>The second question is affected crucially by the token frequency of <BR>>the types that are being indexed. 
In my (ad hoc) experience, by the <BR>>time you have about 20 or 30 occurrences of a word type, things are <BR>>looking reasonably stable. If you have fewer than 10, things are <BR>>looking very hit and miss. Those who have experimented with the King <BR>>James Bible corpus will probably have seen for themselves that <BR>>infrequent words like "kirioth" seem to crop up in very odd places.<BR>><BR>>This suggests rephrasing the question as follows. In a corpus with m <BR>>tokens belonging to n < m types, how many of the types do you expect <BR>>to occur with frequency greater than some "stability threshold"? I <BR>>believe that the range 10 to 30 is a pretty good guess for this <BR>>threshold, but a guess it remains. One should be able to do a bit of <BR>>counting and comparing with Zipf's law to firm up the mathematics of <BR>>this suggestion.<BR>><BR>>Am I answering the right question?<BR>>Best wishes,<BR>>Dominic<BR>><BR>>On Oct 4, 2005, at 2:46 AM, Arooj Asghar wrote:<BR>><BR>>>Hi,<BR>>> <BR>>>I've wanted to index a large corpus, with almost 170K words. The <BR>>>default 20K doesn't work well for this. Is there a way of telling <BR>>>what the optimal number of rows for a corpus of n words should <BR>>>be?<BR>>> <BR>>>Thanks,<BR>>>Arooj<BR>>> <BR>>>_______________________________________________ infomap-nlp-users <BR>>>mailing list inf...@li... <BR>>>https://lists.sourceforge.net/lists/listinfo/infomap-nlp-users<BR></FONT></BLOCKQUOTE></div></html> |
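The "bit of counting" suggested in the quoted reply could look roughly like the sketch below. It is only an illustration and not part of Infomap: the function names `count_stable_types` and `zipf_stable_estimate`, the crude regex tokenizer, and the choice of an idealised Zipf distribution with exponent 1 are all assumptions made here. The first function counts how many types in a text occur at least `threshold` times; the second estimates the same number from m tokens and n types alone, assuming the type of rank r occurs about C / r times with C = m / H_n.

```python
from collections import Counter
import re


def count_stable_types(text, threshold=10):
    """Return (tokens, types, stable_types) for a text.

    A type is counted as "stable" when it occurs at least `threshold`
    times. The tokenizer is a deliberately crude assumption: lowercase
    runs of letters and apostrophes.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    freqs = Counter(tokens)
    stable = sum(1 for count in freqs.values() if count >= threshold)
    return len(tokens), len(freqs), stable


def zipf_stable_estimate(m, n, threshold=10):
    """Estimate the number of types with frequency >= threshold.

    Assumes an ideal Zipf law with exponent 1: the type of rank r
    occurs about C / r times, where C = m / H_n and H_n is the n-th
    harmonic number. Real corpora deviate from this.
    """
    harmonic = sum(1.0 / r for r in range(1, n + 1))
    c = m / harmonic
    # Ranks r with C / r >= threshold are those with r <= C / threshold.
    return min(n, int(c / threshold))


if __name__ == "__main__":
    sample = "the cat saw the dog and the dog saw the cat"
    print(count_stable_types(sample, threshold=2))
```

Running the first function over the actual corpus with thresholds of 10, 20 and 30 gives a direct count of how many rows would hold reasonably frequent types; the Zipf estimate is only a rough cross-check on that count.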