From: <vl...@sp...> - 2005-05-17 22:33:07
Hello,

I want to ask about one thing in the Infomap algorithm. Let A be the cooccurrence matrix whose i,j-th element is the cooccurrence count of word i within a window of content-bearing word j throughout the corpus. If the rows of A are normalized, the cosine similarity between two words (the i-th and j-th) can be computed as the dot product of the i-th and j-th row vectors of A, which is the i,j-th element of the matrix AA^T.

After the Singular Value Decomposition A = USV^T, keeping only the k largest singular values in S gives the matrix A' = U'S'V'^T as an approximation of A. Then

    A'A'^T = U'S'V'^T (U'S'V'^T)^T = U'S'(U'S')^T

is an approximation of AA^T, so the i-th row of the matrix U'S' should be the WordSpace vector for the i-th word.

What I am missing is why the Infomap algorithm uses only the rows of U' as vectors in WordSpace. The multiplication of U' by S' could be done by this commented-out part of the code in encode_wordvec.c:

    /* Scale the vector if you want to

       Commented out by Dominic, June 2001, because the first
       principal component only really seems to say that all the
       data is in the positive quadrant; declare i if you want to
       use these two lines of code.  (Also, comment back in the
       code to declare the singvals array and to open SINGVAL_FILE
       and read its contents into that array.)

       for( i=0; i<SINGVALS; i++)
           vector[i] *= singvals[i];
    */

Is it true that the "first principal component only really seems to say that all the data is in the positive quadrant"?

I will be very happy if someone could clarify the Infomap algorithm to me.

Best regards,
Vladimir Repisky