[infomap-nlp-devel] Re: [infomap-nlp-users] Document vectors


Dear Menno, dear Dominic,

I added an option -i to associate that lets the user tell the program
whether the query consists of words (-i w) or of documents (-i d).
Together with the -q option, it is now possible to return document vectors.
You can check out the new code via cvs on sourceforge.
In order to try out the new option you'll first have to rebuild your model
from the beginning (the new option relies on a database produced by
prepare_corpus).

In case of a single-file corpus, "associate -i d" expects document IDs as
input; in case of a multi-file corpus, it expects document names.

The default is "-i w", which makes associate behave as before, e.g.
  associate -w -i w -m ... -c ... word1 .. wordk
will return words which are similar to word1 .. wordk.
  associate -w -i d -m ... -c ... doc_id1 .. doc_idk
will return words which are similar to the documents corresponding to
doc_id1 .. doc_idk, and
  associate -q -i d -m ... -c ... doc_id1 .. doc_idk
will return the average over the document vectors corresponding to doc_id1
.. doc_idk.

So if you wanted to look up the document vectors of a whole bunch of docs
you'd have to call "associate -q -i d" for each of the docs (i.e. a loop).
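
Such a loop could be sketched in Python as follows. This is only a
minimal sketch under stated assumptions: the helper names build_command
and document_vectors are made up for illustration, the values for -m and
-c are passed through unchanged (they are whatever you normally give to
associate), and the output of "associate -q" is returned as raw text
because its exact format is not described here.

```python
import subprocess


def build_command(doc, m_value, c_value, binary="associate"):
    """Build the argument list for one 'associate -q -i d' call.

    doc is a document ID (single-file corpus) or a document name
    (multi-file corpus). m_value and c_value are placeholders for the
    values you normally pass to -m and -c.
    """
    return [binary, "-q", "-i", "d", "-m", m_value, "-c", c_value, str(doc)]


def document_vectors(docs, m_value, c_value, run=subprocess.run):
    """Call associate once per document and collect what it prints.

    Returns a dict mapping each document to the raw text written to
    standard output. Parsing that text is left to the caller, since the
    output format is an unknown here.
    """
    vectors = {}
    for doc in docs:
        result = run(
            build_command(doc, m_value, c_value),
            capture_output=True,  # requires Python 3.7 or later
            text=True,
            check=True,  # raise if associate exits with an error
        )
        vectors[doc] = result.stdout
    return vectors
```

Calling associate once per document matters because passing several
documents in a single call returns their average rather than one vector
per document.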

Could you please check whether the new option works properly? If you
have suggestions for a more convenient way of getting document vectors, or
any other suggestions for improvement, let me know. Could you also read the
new man pages and check whether they are comprehensible?
Once you have tested the code, we can post a new release together with the
other recent changes.

I also changed print_doc. Until now, it expected document IDs as input.
Since "associate -d" returns document names rather than document IDs for
a multi-file corpus, I thought it would make more sense to pass document
names to print_doc in that case. For a single-file corpus, however,
print_doc still expects document IDs (since this is what "associate -d"
returns in this case). Does that make sense?

Best wishes,
Beate

On Thu, 6 May 2004, Dominic Widdows wrote:

>
>> There must be an easier way, but I think not many people will be
>> interested in the raw document vectors (or am I wrong)?
>
>Hi Menno,
>
>It sounds like your work-around to get the document vectors is pretty
>effective, though as you say there should be an easier way.
>
>For word and query vectors there's an "associate -q" option which simply
>prints out the query vector rather than performing a search. One way I've
>often used to get document vectors is simply to pass the whole document as
>an argument to "associate -q", which is pretty unsatisfactory though it
>does have the benefit that you can get document vectors for textfiles that
>weren't in your original corpus.
>
>If the "associate -q" option were combined with the "associate_doc"
>function Beate described, this would solve the problem properly, and I
>could see benefits to making this available (e.g. for work on document
>clustering). It sounds as though you've already got a workable solution,
>but if enough other people on the list express an interest we should look
>into it.
>
>I'm delighted to hear about people using the infomap software as part of a
>richer and more complex system of features - I'd be interested to hear
>more about your work whenever you are ready.
>
>Best wishes,
>Dominic