From: Dominic W. <wi...@ma...> - 2005-10-06 21:35:47
> >ii. how reliable do you want the results to be?
> Well, I want them to be reliable enough. When I said 170k words, I
> meant 170k types. So as a general rule, types with a frequency of
> greater than 10 should be added to the wordvector, right?

This is probably not a bad rule of thumb. I may go for 20 rather than
10 as a first guess.

> I haven't completely understood the SVD part of infomap. Could you
> please explain why this problem of "infrequent words cropping up in
> odd places" occurs?

This is more to do with the problem of sparse data in general than
anything specifically to do with SVD. If you only have 3 or 4
occurrences of a word, it may occur with very atypical usages. Once you
have 30 or 40 occurrences, you have a much better chance of having a
representative sample of the available meanings.

In theory at least, SVD actually helps this situation because it makes
your matrix less sparse. Though if your sample occurrences are a skewed
sample to begin with, I don't really see how projecting down to fewer
dimensions is likely to make your sample less skewed.

Best wishes,
Dominic

> Thanks,
> Arooj

>> From: Dominic Widdows <wi...@ma...>
>> To: "Arooj Asghar" <aro...@ho...>
>> CC: i...@li...
>> Subject: Re: [infomap-nlp-users] Number of ROWS for a large corpus
>> Date: Tue, 4 Oct 2005 10:51:12 -0400
>>
>> >Dear Arooj,
>> >
>> >Two questions I can think of that affect your question are
>> >i. is space a concern?
>> >and
>> >ii. how reliable do you want the results to be?
>> >
>> >The second question is affected crucially by the token frequency of
>> >the types that are being indexed. In my (ad hoc) experience, by the
>> >time you have about 20 or 30 occurrences of a word type, things are
>> >looking reasonably stable. If you have fewer than 10, things are
>> >looking very hit and miss. Those who have experimented with the King
>> >James Bible corpus will probably have seen for themselves that
>> >infrequent words like "kirioth" seem to crop up in very odd places.
>> >
>> >This suggests rephrasing the question as follows. In a corpus with m
>> >tokens belonging to n < m types, how many of the types do you expect
>> >to occur with frequency greater than some "stability threshold"? I
>> >believe that the range 10 to 30 is a pretty good guess for this
>> >threshold, but a guess it remains. One should be able to do a bit of
>> >counting and comparing with Zipf's law to firm up the mathematics of
>> >this suggestion.
>> >
>> >Am I answering the right question?
>> >Best wishes,
>> >Dominic
>> >
>> >On Oct 4, 2005, at 2:46 AM, Arooj Asghar wrote:
>> >
>> >>Hi,
>> >>
>> >>I've wanted to index a large corpus, with almost 170K words. The
>> >>default 20K doesn't work well for this. Is there a way of telling
>> >>what the optimal number of rows for a corpus of 'n' words should
>> >>be?
>> >>
>> >>Thanks,
>> >>Arooj
>> >>
>> >>_______________________________________________
>> >>infomap-nlp-users mailing list
>> >>inf...@li...
>> >>https://lists.sourceforge.net/lists/listinfo/infomap-nlp-users
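
The "bit of counting" suggested in this thread could be sketched roughly as follows: count how many word types occur at least as often as a chosen stability threshold, and use that count as a first guess for how many rows are worth indexing. This is only an illustration and not part of Infomap. The function names, the regex tokenizer and the use of "at least" rather than "greater than" are all assumptions made for the example, and Infomap's own tokenization may well give different counts.

```python
import re
from collections import Counter


def count_stable_types(text, threshold=20):
    """Return (total types, types with frequency >= threshold).

    Hypothetical helper: tokenization here is a crude lowercase regex,
    which is an assumption and not how Infomap tokenizes a corpus.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    freqs = Counter(tokens)
    stable = sum(1 for count in freqs.values() if count >= threshold)
    return len(freqs), stable


def stable_types_by_threshold(text, thresholds=(10, 20, 30)):
    """Map each candidate threshold to its number of 'stable' types."""
    return {t: count_stable_types(text, t)[1] for t in thresholds}


if __name__ == "__main__":
    # Toy corpus: two frequent types, one middling, one rare.
    sample = "the cat " * 25 + "dog " * 12 + "kirioth " * 3
    total, stable = count_stable_types(sample, threshold=20)
    print(f"{stable} of {total} types occur at least 20 times")
    print(stable_types_by_threshold(sample))
```

Running this over the real corpus with thresholds of 10, 20 and 30 would show how sensitive the row count is to the choice of threshold, which is the range discussed above.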