Hi Vladimir,
> I want to ask if I've correctly understood the source code: the
> associate command for word association only reads vectors
> from the 'wordvec.bin' file and returns the top words with the greatest
> cosine similarity to the query word vector, sorted? And if the query word
> is a non-stopword from the 'dic' file, then the vector for this word is
> the same as the corresponding vector stored in the 'wordvec.bin' file?
Yes, that's exactly how it works.
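To make the lookup-and-rank step concrete, here is a minimal sketch in plain Python. The in-memory dict of word vectors and the function name are assumptions for illustration only, not how associate actually stores or names things:

```python
import math

def rank_by_cosine(query_vec, word_vectors, k=10):
    """Return the k words whose vectors have the greatest cosine
    similarity to query_vec, in descending order of similarity.
    word_vectors is a hypothetical dict mapping word -> vector."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    scored = [(w, cosine(query_vec, v)) for w, v in word_vectors.items()]
    scored.sort(key=lambda item: -item[1])  # highest similarity first
    return scored[:k]
```

Note that for unit-normalized vectors the cosine reduces to a plain dot product, so the norms in the helper are only needed if your vectors are not normalized.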
> I assume that the words in 'wordvec.bin' are normalized vectors from the
> 'left' file, where the vectors are successively associated with the words
> in the 'dic' file. Is this right?
This is right, except that the dictionary also contains the stopwords. So
the vectors from the "left" file are successively associated with the
non-stopwords (value of the third column is 0) in "dic".
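For illustration, that pairing could be sketched as follows. The whitespace-separated three-column layout of "dic" and the list-of-vectors representation of "left" are assumptions made for the sketch, not the actual on-disk formats:

```python
def pair_vectors_with_words(dic_lines, vectors):
    """Hypothetical sketch: walk the 'dic' entries in order and hand the
    next vector from the 'left' file to each non-stopword (third column
    == 0), skipping the stopword entries entirely."""
    non_stopwords = []
    for line in dic_lines:
        cols = line.split()
        if cols[2] == "0":  # assumed: third column 0 marks a non-stopword
            non_stopwords.append(cols[0])
    # zip pairs vectors with non-stopwords in order of appearance
    return dict(zip(non_stopwords, vectors))
```

The key point the sketch captures is that stopword lines consume no vector, so the i-th vector in "left" belongs to the i-th *non-stopword* in "dic", not the i-th dictionary line.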
To check where things go wrong, you could retrieve the vectors of word1
and word2 using "associate -q ..." and check whether they match the
vectors you extract yourself.
Further, you can compute the cosine similarity of word1 and word2 using
"compare_words.pl <options> word1 word2" and see if you get the same
result using your own technique.
Good luck!
Best wishes,
Beate