From: Arooj A. <aro...@ho...> - 2005-10-04 06:47:13
|
Hi,

I've wanted to index a large corpus with almost 170K words. The default 20K doesn't work well for this. Is there a way of telling what the optimal number of rows for a corpus of 'n' words should be?

Thanks,
Arooj |
From: Dominic W. <wi...@ma...> - 2005-10-04 14:51:21
|
Dear Arooj,

Two questions I can think of that affect your question are:
i. is space a concern?
and
ii. how reliable do you want the results to be?

The second question is affected crucially by the token frequency of the types that are being indexed. In my (ad hoc) experience, by the time you have about 20 or 30 occurrences of a word type, things are looking reasonably stable. If you have fewer than 10, things are looking very hit and miss. Those who have experimented with the King James Bible corpus will probably have seen for themselves that infrequent words like "kirioth" seem to crop up in very odd places.

This suggests rephrasing the question as follows. In a corpus with m tokens belonging to n < m types, how many of the types do you expect to occur with frequency greater than some "stability threshold"? I believe that the range 10 to 30 is a pretty good guess for this threshold, but a guess it remains. One should be able to do a bit of counting and comparing with Zipf's law to firm up the mathematics of this suggestion.

Am I answering the right question?
Best wishes,
Dominic

On Oct 4, 2005, at 2:46 AM, Arooj Asghar wrote:
> Hi,
>
> I've wanted to index a large corpus with almost 170K words. The
> default 20K doesn't work well for this. Is there a way of telling what
> the optimal number of rows for a corpus of 'n' words should be?
>
> Thanks,
> Arooj |
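[Editor's note: the "bit of counting and comparing with Zipf's law" that Dominic suggests can be sketched as follows. This is an illustrative sketch only, not part of the Infomap software: the function name, the assumed token count of two million, and the classical Zipf form f(r) ≈ m / (r · H_n) are all assumptions.]

```python
from math import log

EULER_MASCHERONI = 0.5772156649

def types_above_threshold(m_tokens, n_types, threshold):
    """Estimate how many of n_types word types occur at least
    `threshold` times in a corpus of m_tokens tokens, assuming a
    classical Zipf distribution: the rank-r type has frequency
    f(r) ~ m / (r * H_n), where H_n is the n-th harmonic number."""
    # Approximate H_n for large n: H_n ~ ln(n) + Euler-Mascheroni
    h_n = log(n_types) + EULER_MASCHERONI
    # f(r) >= threshold  <=>  r <= m / (threshold * H_n)
    return min(n_types, int(m_tokens / (threshold * h_n)))

# Example: 170K types; assume ~2M tokens and Dominic's threshold of 20.
# Only a small fraction of the types clear the stability threshold.
print(types_above_threshold(2_000_000, 170_000, 20))
```

On these assumed numbers, fewer than 10K of the 170K types would clear a threshold of 20, which is one way of seeing why a row count far below the full type count can still be the sensible choice.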
From: Arooj A. <aro...@ho...> - 2005-10-05 11:12:33
|
Dear Dominic,

Thanks for your reply. You definitely are answering the right question and it certainly is very helpful. Just to elaborate a little more...

> i. is space a concern?
No, not really.

> ii. how reliable do you want the results to be?
Well, I want them to be reliable enough. When I said 170K words, I meant 170K types. So as a general rule, types with a frequency of greater than 10 should be added to the word vectors, right?

I haven't completely understood the SVD part of Infomap. Could you please explain why this problem of "infrequent words cropping up in odd places" occurs?

Thanks,
Arooj |
From: Dominic W. <wi...@ma...> - 2005-10-06 21:35:47
|
> > ii. how reliable do you want the results to be?
> Well, I want them to be reliable enough. When I said 170K words, I
> meant 170K types. So as a general rule, types with a frequency of
> greater than 10 should be added to the word vectors, right?

This is probably not a bad rule of thumb. I may go for 20 rather than 10 as a first guess.

> I haven't completely understood the SVD part of Infomap. Could you
> please explain why this problem of "infrequent words cropping up in
> odd places" occurs?

This is more to do with the problem of sparse data in general than anything specifically to do with SVD. If you only have 3 or 4 occurrences of a word, it may occur with very atypical usages. Once you have 30 or 40 occurrences, you have a much better chance of having a representative sample of the available meanings.

In theory at least, SVD actually helps this situation because it makes your matrix less sparse. Though if your sample occurrences are a skewed sample to begin with, I don't really see how projecting down to fewer dimensions is likely to make your sample less skewed.

Best wishes,
Dominic |
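[Editor's note: Dominic's point that SVD makes the matrix less sparse can be seen on a toy example. This is an illustrative sketch with made-up random data, not Infomap code: a rank-k reconstruction of a mostly-zero matrix has almost no exactly-zero entries.]

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "cooccurrence" matrix: Poisson(0.1) makes ~90% of entries zero
A = rng.poisson(0.1, size=(50, 50)).astype(float)

# Rank-k truncated SVD reconstruction
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 10
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Fraction of (near-)zero entries before and after
sparsity = lambda M: float(np.mean(np.isclose(M, 0, atol=1e-9)))
print(f"zeros before SVD: {sparsity(A):.2%}")
print(f"zeros after rank-{k} reconstruction: {sparsity(A_k):.2%}")
```

The reconstruction fills in nonzero values for word pairs that never literally cooccurred, which is the densifying effect Dominic describes; it says nothing, of course, about whether those filled-in values are based on a representative sample.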
From: Rahul J. <rj...@ya...> - 2005-10-18 00:12:30
|
Hi:

A few days ago I saw a question on this list relating to a large corpus (a large number of words). I have different but relevant questions:

Is there a known limit for the number of files in a multi-document corpus, expecting optimal performance? Is there a known break-down point?

Does the uniformity (or lack of it) in the sizes of individual documents in a corpus affect the quality of the model?

Thanks!
Rahul. |
From: Dominic W. <wi...@ma...> - 2005-10-22 12:49:03
|
Dear Rahul,

Sorry for not getting back to you sooner.

> Is there a known limit for number of files in a
> multi-document corpus, expecting optimal performance?
> Is there a known break-down point?

I know of no maximum number of files for a multi-document corpus, as far as the Infomap software is concerned. However, this isn't because we've stretched the system and found that it doesn't break; it's because we haven't really used it for this very much. I've only ever built a couple of multi-document models, and have had sporadic reports of this functionality not working at all. If we were embarking on a new, well-resourced project, we'd look into this straight away.

> Does the uniformity (or lack of it) in the sizes of
> individual documents in a corpus affect the quality
> model?

It certainly matters much less than in a standard search / LSA engine, because the Infomap software relies on cooccurrence within a fixed-width window rather than cooccurrence within a document.

This is discussed properly in Ch. 6 of Geometry and Meaning. There's a sketch on the web at http://infomap.stanford.edu/book/chapters/chapter6.html but of course, finding yourself a copy of the book would be more useful ;-)

Best wishes,
Dominic |
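[Editor's note: the fixed-width window scheme Dominic contrasts with per-document cooccurrence can be sketched roughly as follows. The function name and parameters are illustrative, not the actual Infomap implementation.]

```python
from collections import Counter

def window_cooccurrences(tokens, pre=2, post=2):
    """Count cooccurrences of each token with its neighbours in a
    fixed-width window (`pre` tokens before, `post` tokens after),
    rather than with every other token in the document."""
    counts = Counter()
    for i, w in enumerate(tokens):
        lo = max(0, i - pre)
        hi = min(len(tokens), i + post + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

doc = "the cat sat on the mat".split()
cooc = window_cooccurrences(doc, pre=2, post=2)
print(cooc[("cat", "sat")])   # neighbours within the window
print(cooc[("cat", "mat")])   # too far apart: never counted
```

Because only the window width matters, a 10-word document and a 10,000-word document contribute counts of the same character, which is one way of seeing why document-size uniformity matters less here than in a whole-document LSA model.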
From: Rahul J. <rj...@ya...> - 2005-10-26 06:11:24
|
Hi Dominic,

> because the Infomap software relies on cooccurrence
> within a fixed-width window rather than cooccurrence
> within a document.

Does "cooccurrence within a window" mean "a window within the same document"? In other words, the window doesn't span documents.

Is the width set using PRE_CONTEXT_SIZE and POST_CONTEXT_SIZE?

Thanks,
Rahul. |
From: Scott C. <ced...@gm...> - 2005-10-26 06:47:34
|
Hi Rahul,

On 10/26/05, Rahul Joshi <rj...@ya...> wrote:
> Does the "cooccurence within a window" mean "a window
> within the same document"?

Yes.

> In other words, the window doesn't span documents.

Correct — the window doesn't span documents.

> Is the width set using: PRE_CONTEXT_SIZE and
> POST_CONTEXT_SIZE?

Yes.

Scott |
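[Editor's note: Scott's answers can be illustrated with a sketch in which the counting loop restarts at each document boundary, so the window never spans documents. The PRE_CONTEXT_SIZE / POST_CONTEXT_SIZE names are borrowed from the thread but used here as plain variables, not as the actual Infomap configuration mechanism.]

```python
from collections import Counter

PRE_CONTEXT_SIZE = 2   # tokens before the focus word
POST_CONTEXT_SIZE = 2  # tokens after the focus word

def corpus_cooccurrences(documents):
    """Accumulate window cooccurrences one document at a time, so
    the context window never crosses a document boundary."""
    counts = Counter()
    for tokens in documents:  # the window restarts with each document
        for i, w in enumerate(tokens):
            lo = max(0, i - PRE_CONTEXT_SIZE)
            hi = min(len(tokens), i + POST_CONTEXT_SIZE + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(w, tokens[j])] += 1
    return counts

docs = ["alpha beta".split(), "gamma delta".split()]
cooc = corpus_cooccurrences(docs)
# "beta" ends doc 1 and "gamma" starts doc 2; they are adjacent in a
# concatenated token stream, but are never counted as cooccurring.
print(cooc[("beta", "gamma")])
```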