From: Dominic W. <dwi...@cs...> - 2005-05-17 23:48:45
Dear Vladimir,

I'm glad you asked this question: you have certainly hit on one of the rules of thumb in the infomap code that we really just learned from experience.

The infomap vector-building algorithm is described in some detail at http://infomap.stanford.edu/book/ (Chapters 5 and 6). However, I don't think that this, or any other published description, covers the decision we made not to multiply by singular values.

When we did multiply by singular values, we found that many cosine similarities were very high, in the 0.98 to 0.99 region, and so not very discriminating. This also meant that vector negation was very chaotic: it removed practically the whole of the word you were interested in and sent your query way off course. So we just removed the scaling factor, effectively treating all latent axes as having equivalent importance (at least, I think this is the upshot of the decision).
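To see the effect numerically, here is a minimal standalone sketch (this is not the infomap code; the two word vectors and the singular values are made up for illustration). It computes the cosine between two words first from the rows of U', then from the rows of U'S':

    /* Toy comparison of cosine similarity with and without
     * singular-value scaling.  All numbers are invented for
     * illustration; they are not infomap data. */
    #include <stdio.h>
    #include <math.h>

    #define SINGVALS 4

    /* made-up rows of U' for two words */
    static double u1[SINGVALS] = {  0.51,  0.02, -0.31,  0.18 };
    static double u2[SINGVALS] = {  0.49, -0.05,  0.27, -0.22 };
    /* made-up singular values; the first dominates, as is typical */
    static double singvals[SINGVALS] = { 9.0, 2.1, 1.4, 0.8 };

    static double cosine(const double *a, const double *b, int n)
    {
        double dot = 0.0, na = 0.0, nb = 0.0;
        int i;
        for (i = 0; i < n; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (sqrt(na) * sqrt(nb));
    }

    int main(void)
    {
        double s1[SINGVALS], s2[SINGVALS];
        int i;

        /* rows of U'S': scale each coordinate by its singular value */
        for (i = 0; i < SINGVALS; i++) {
            s1[i] = u1[i] * singvals[i];
            s2[i] = u2[i] * singvals[i];
        }

        printf("unscaled (rows of U'):   %f\n", cosine(u1, u2, SINGVALS));
        printf("scaled   (rows of U'S'): %f\n", cosine(s1, s2, SINGVALS));
        return 0;
    }

On these made-up numbers the unscaled cosine comes out around 0.33, while the scaled one is about 0.98: the dominant first axis swamps everything else, which is exactly the crowding effect described above.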
It would be good to have a better understanding of why this happens and what a truly "principled" method would be. In practice it's quite hard to find encouragement for such a study, because the data is noisy and improvements are marginal, leaving little hope of detecting any measurably "best" configuration.

The strategy for our research was always something along the lines of "get it working and show that it does interesting things". Given that LSA vector models aren't configured to accept annotated training data, you can always find a supervised machine learning method that outperforms infomap on currently recognised tasks, so any papers we wrote claiming that we'd found a principled "optimal" way of doing things would meet with the criticism that LSA is not optimal compared with other methods. This certainly spurred us to search for innovation rather than perfection.

Sorry I can't give you a better answer right now!

Best wishes,
Dominic

On Tue, 17 May 2005 vl...@sp... wrote:

> Hello,
>
> I want to ask about one thing in the Infomap algorithm.
>
> Let matrix A be the cooccurrence matrix, where the i,j-th element is the
> cooccurrence count for word i within a window of content-bearing word j
> throughout the corpus. If the rows of matrix A are normalized, the cosine
> similarity between two words (the i-th and the j-th) can be computed as
> the dot product of the i-th and j-th row vectors of A, i.e. the i,j-th
> element of the matrix A(A^T). After singular value decomposition
> A = US(V^T), keeping only the k largest singular values in S gives the
> matrix A' = U'S'(V'^T) as an approximation of A. So, since the columns of
> V' are orthonormal (V'^T V' = I),
> A'(A'^T) = U'S'(V'^T)(U'S'(V'^T))^T = U'S'(U'S')^T
> is an approximation of A(A^T), and the i-th row of the matrix U'S' should
> be the vector in WordSpace for the i-th word. I'm missing the reason why
> the Infomap algorithm uses only the rows of U' as vectors in WordSpace.
>
> The multiplication of U' by S' can be done by this commented-out part of
> the code in encode_wordvec.c:
>
>    /* Scale the vector if you want to
>
>       Commented out by Dominic, June 2001, because the first
>       principal component only really seems to say that all the
>       data is in the positive quadrant; declare i if you want to
>       use these two lines of code.  (Also, comment back in the
>       code to declare the singvals array and to open SINGVAL_FILE
>       and read its contents into that array.)
>
>       for( i=0; i<SINGVALS; i++)
>           vector[i] *= singvals[i]; */
>
> Is it true that the "first principal component only really seems to say
> that all the data is in the positive quadrant"?
>
> I will be very happy if someone could clarify the infomap algorithm to me.
>
> Best regards,
> Vladimir Repisky