From: Beate D. <do...@im...> - 2005-05-09 09:19:11
Hi Vladimir,

> I want to ask if I have correctly understood the source code: is the
> associate command for word association only reading vectors from the
> 'wordvec.bin' file and returning the top words sorted by greatest cosine
> similarity to the query word vector? And if the query word is a
> non-stopword from the 'dic' file, is the vector for this word the same as
> the corresponding vector stored in the 'wordvec.bin' file?

Yes, that's exactly how it works.

> I assume that the words in 'wordvec.bin' are the normalized vectors from
> the 'left' file, where the vectors are successively associated with the
> words in the 'dic' file. Is this right?

This is right, except that the dictionary also contains the stopwords. So the
vectors from the "left" file are successively associated with the
non-stopwords (value of the third column is 0) in "dic".

To check where things go wrong, you could retrieve the vectors of word1 and
word2 using "associate -q ..." and check whether the retrieved vectors are
consistent with the vectors you obtain. Further, you can compute the cosine
similarity of word1 and word2 using "compare_words.pl <options> word1 word2"
and see if you get the same result with your own technique.

Good luck!

Best wishes,
Beate
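P.S. In case it helps, here is a minimal Python sketch of the cosine-similarity
check described above. It assumes you have the two vectors available as plain
lists of floats (the values below are just placeholders, not real data); since
the stored vectors are normalized, the cosine should reduce to a simple dot
product, but computing the norms explicitly guards against any scaling
differences.

import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Placeholder vectors -- substitute the values you retrieved for
# word1 and word2 (e.g. via "associate -q ...").
vec1 = [0.12, -0.03, 0.08]
vec2 = [0.10, -0.01, 0.07]

print("cosine(word1, word2) = %.4f" % cosine(vec1, vec2))

The result should agree with what "compare_words.pl <options> word1 word2"
reports for the same pair.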