From: Dominic W. <dwi...@cs...> - 2005-05-17 23:48:45
Dear Vladimir,

I'm glad you asked this question: you have certainly hit on one of the rules of thumb in the infomap code that we really just learned from experience.

The infomap vector-building algorithm is described in some detail at http://infomap.stanford.edu/book/ (Chapters 5 and 6). However, I don't think that this, or any other published description, covers the decision we made not to multiply by singular values.

When we did multiply by singular values, we found that many cosine similarities were very high, in the 0.98 to 0.99 region, and so not very discriminating. This also meant that vector negation was very chaotic: it removed practically the whole of the word you were interested in and sent your query way off course. So we just removed the scaling factor, effectively treating all latent axes as having equivalent importance (at least, I think this is the upshot of the decision).
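To see the effect numerically, here is a minimal standalone sketch (this is not the infomap code; the two word vectors and the singular values are made up for illustration). It computes the cosine between two words first from the rows of U', then from the rows of U'S':

    /* Toy comparison of cosine similarity with and without
     * singular-value scaling.  All numbers are invented for
     * illustration; they are not infomap data. */
    #include <stdio.h>
    #include <math.h>

    #define SINGVALS 4

    /* made-up rows of U' for two words */
    static double u1[SINGVALS] = {  0.51,  0.02, -0.31,  0.18 };
    static double u2[SINGVALS] = {  0.49, -0.05,  0.27, -0.22 };
    /* made-up singular values; the first dominates, as is typical */
    static double singvals[SINGVALS] = { 9.0, 2.1, 1.4, 0.8 };

    static double cosine(const double *a, const double *b, int n)
    {
        double dot = 0.0, na = 0.0, nb = 0.0;
        int i;
        for (i = 0; i < n; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (sqrt(na) * sqrt(nb));
    }

    int main(void)
    {
        double s1[SINGVALS], s2[SINGVALS];
        int i;

        /* rows of U'S': scale each coordinate by its singular value */
        for (i = 0; i < SINGVALS; i++) {
            s1[i] = u1[i] * singvals[i];
            s2[i] = u2[i] * singvals[i];
        }

        printf("unscaled (rows of U'):   %f\n", cosine(u1, u2, SINGVALS));
        printf("scaled   (rows of U'S'): %f\n", cosine(s1, s2, SINGVALS));
        return 0;
    }

On these made-up numbers the unscaled cosine comes out around 0.33, while the scaled one is about 0.98: the dominant first axis swamps everything else, which is exactly the crowding effect described above.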
It would be good to have a better understanding of why this happens and what a truly "principled" method would be. In practice it's quite hard to find encouragement for such a study, because the data is noisy and improvements are marginal, leaving little hope of detecting any measurably "best" configuration.

The strategy for our research was always something along the lines of "get it working and show that it does interesting things". Given that LSA vector models aren't configured to accept annotated training data, you can always find a supervised machine learning method that outperforms infomap on currently recognised tasks, so any papers we wrote claiming that we'd found a principled "optimal" way of doing things would meet with the criticism that LSA is not optimal compared with other methods. This certainly spurred us to search for innovation rather than perfection.

Sorry I can't give you a better answer right now!

Best wishes,
Dominic

On Tue, 17 May 2005 vl...@sp... wrote:

> Hello,
>
> I want to ask about one thing in the Infomap algorithm.
>
> Let matrix A be the cooccurrence matrix, where the i,j-th element is the
> cooccurrence count for word i within a window of content-bearing word j
> throughout the corpus. If the rows of matrix A are normalized, the cosine
> similarity between two words (the i-th and the j-th) can be computed as
> the dot product of the i-th and j-th row vectors of A, i.e. the i,j-th
> element of the matrix A(A^T). After singular value decomposition
> A = US(V^T), keeping only the k largest singular values in S gives the
> matrix A' = U'S'(V'^T) as an approximation of A. So, since the columns of
> V' are orthonormal (V'^T V' = I),
> A'(A'^T) = U'S'(V'^T)(U'S'(V'^T))^T = U'S'(U'S')^T
> is an approximation of A(A^T), and the i-th row of the matrix U'S' should
> be the vector in WordSpace for the i-th word. I'm missing the reason why
> the Infomap algorithm uses only the rows of U' as vectors in WordSpace.
>
> The multiplication of U' by S' can be done by this commented-out part of
> the code in encode_wordvec.c:
>
>    /* Scale the vector if you want to
>
>       Commented out by Dominic, June 2001, because the first
>       principal component only really seems to say that all the
>       data is in the positive quadrant; declare i if you want to
>       use these two lines of code.  (Also, comment back in the
>       code to declare the singvals array and to open SINGVAL_FILE
>       and read its contents into that array.)
>
>       for( i=0; i<SINGVALS; i++)
>           vector[i] *= singvals[i]; */
>
> Is it true that the "first principal component only really seems to say
> that all the data is in the positive quadrant"?
>
> I will be very happy if someone could clarify the infomap algorithm to me.
>
> Best regards,
> Vladimir Repisky