From: <vl...@sp...> - 2005-05-17 22:33:07
Hello,

I want to ask about one thing in the Infomap algorithm. Let A be the cooccurrence matrix whose i,j-th element is the cooccurrence count of word i within a window of content-bearing word j throughout the corpus. If the rows of A are normalized, the cosine similarity between two words (the i-th and j-th) can be computed as the dot product of the i-th and j-th row vectors of A, which is the i,j-th element of the matrix AA^T.

After the Singular Value Decomposition A = USV^T, keeping only the k largest singular values in S gives the matrix A' = U'S'V'^T as an approximation of A. Then

    A'A'^T = U'S'V'^T (U'S'V'^T)^T = U'S'(U'S')^T

is an approximation of AA^T, so the i-th row of the matrix U'S' should be the WordSpace vector for the i-th word.

What I am missing is why the Infomap algorithm uses only the rows of U' as vectors in WordSpace. The multiplication of U' by S' could be done by this commented-out part of the code in encode_wordvec.c:

    /* Scale the vector if you want to

       Commented out by Dominic, June 2001, because the first
       principal component only really seems to say that all the
       data is in the positive quadrant; declare i if you want to
       use these two lines of code.  (Also, comment back in the
       code to declare the singvals array and to open SINGVAL_FILE
       and read its contents into that array.)

       for( i=0; i<SINGVALS; i++)
           vector[i] *= singvals[i];
    */

Is it true that the "first principal component only really seems to say that all the data is in the positive quadrant"?

I will be very happy if someone could clarify the Infomap algorithm to me.

Best regards,
Vladimir Repisky