From: Beate D. <do...@IM...> - 2004-05-11 13:37:28
Dear Menno, dear Dominic,

I added an option -i to associate which lets a user tell the program whether the query consists of words (-i w) or of documents (-i d). Together with the -q option, it's now possible to return document vectors. You can check out the new code via cvs on sourceforge. To try out the new option you'll first have to rebuild your model from the beginning (the new option relies on a database produced by prepare_corpus).

In case of a single-file corpus, "associate -i d" expects document IDs as input; in case of a multi-file corpus, it expects document names. The default is "-i w", which corresponds to associate as before, e.g.

    associate -w -i w -m ... -c ... word1 .. wordk

will return words which are similar to word1 .. wordk,

    associate -w -i d -m ... -c ... doc_id1 .. doc_idk

will return words which are similar to the documents corresponding to doc_id1 .. doc_idk, and

    associate -q -i d -m ... -c ... doc_id1 .. doc_idk

will return the average over the document vectors corresponding to doc_id1 .. doc_idk. So if you wanted to look up the document vectors of a whole bunch of docs, you'd have to call "associate -q -i d" once for each doc (i.e. in a loop).

Could you please check whether the new option works properly? If you have suggestions for a more convenient way of getting document vectors, or any other suggestions for improvement, let me know. And could you please read the new man pages and check whether they are comprehensible? Once you have tested the code, we can post a new release together with the other recent changes.

Another thing which I changed is print_doc. So far, it expected document IDs as input. Since "associate -d", in case of a multi-file corpus, returns document names rather than document IDs, I thought it'd make more sense to pass document names to print_doc if it's a multi-file corpus. For a single-file corpus, however, print_doc still expects doc IDs (since this is what "associate -d" returns in this case).
Does that make sense?

Best wishes,
Beate

On Thu, 6 May 2004, Dominic Widdows wrote:

>
>> There must be an easier way, but I think not many people will be
>> interested in the raw document vectors (or am I wrong)?
>
>Hi Menno,
>
>It sounds like your work-around to get the document vectors is pretty
>effective, though as you say there should be an easier way.
>
>For word and query vectors there's an "associate -q" option which simply
>prints out the query vector rather than performing a search. One way I've
>often used to get document vectors is simply to pass the whole document as
>an argument to "associate -q", which is pretty unsatisfactory though it
>does have the benefit that you can get document vectors for textfiles that
>weren't in your original corpus.
>
>If the "associate -q" option was combined with the "associate_doc"
>function Beate described, this would solve the problem properly, and I
>could see benefits to making this available (eg. for work on document
>clustering). It sounds as though you've already got a workable solution,
>but if enough other people on the list express an interest we should look
>into it.
>
>I'm delighted to hear about people using the infomap software as part of a
>richer and more complex system of features - I'd be interested to hear
>more about your work whenever you are ready.
>
>Best wishes,
>Dominic
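
The per-document loop mentioned in the message (one "associate -q -i d" call per document) might look roughly like the shell function below. This is only a sketch: the function name doc_vectors, the ASSOCIATE variable, and the "# <doc>" separator line are illustrative assumptions, not part of the infomap tools. The values for -m and -c are passed through unchanged, so supply whatever you normally give associate for those options.

```shell
#!/bin/sh
# Sketch only: doc_vectors, ASSOCIATE and the "# <doc>" separator are
# hypothetical names chosen for this example, not part of infomap.

# Name of the associate binary; override it if associate is not on PATH.
ASSOCIATE=${ASSOCIATE:-associate}

# usage: doc_vectors M_VALUE C_VALUE doc1 [doc2 ...]
#   M_VALUE, C_VALUE: the values you normally pass to -m and -c
#   doc1 ...: document IDs (single-file corpus) or document names
#             (multi-file corpus), as "-i d" expects
doc_vectors() {
    if [ "$#" -lt 3 ]; then
        echo "usage: doc_vectors M_VALUE C_VALUE doc1 [doc2 ...]" >&2
        return 2
    fi
    m_value=$1
    c_value=$2
    shift 2
    for doc in "$@"; do
        # Separator line so each vector can be matched to its document.
        printf '# %s\n' "$doc"
        # One call per document: passing several documents in one call
        # would return their average rather than individual vectors.
        "$ASSOCIATE" -q -i d -m "$m_value" -c "$c_value" "$doc" || return 1
    done
}
```

After sourcing the file, a call such as doc_vectors M_VALUE C_VALUE 3 7 would run associate twice, once for document 3 and once for document 7. Calling associate separately for each document matters because, as described above, giving several documents to a single "associate -q -i d" call yields their averaged vector.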