From: Beate D. <do...@IM...> - 2004-10-27 16:28:28
|
Hi!

> My question, however, concerns if it
> would be possible to do document to document comparisons. Could somebody
> here provide me with info on this?

To answer Mich's question: It is already possible to use documents as queries. By specifying the option "-i d" (for "input is document"), "associate" expects document identifiers rather than words as query "terms". In the case of a single-file corpus, document identifiers are the offsets in the corpus (the numbers enclosed in <f> and </f> in the wordlist file), and in the case of a multiple-file corpus, in which each file constitutes a document, the identifiers are the names of the files.

E.g.,

  associate -d -i d ... <doc_1> .. <doc_k> NOT <doc_k+1> .. <doc_k+n>

will return documents which are similar to doc_1 .. doc_k and dissimilar to doc_k+1 .. doc_k+n, where the doc_i are the identifiers of the corresponding documents.

To actually compare two documents, you can use "associate -q -i d ..." for each of the two documents to obtain their vector representations, and then simply compute the scalar product.

Best,
Beate
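The comparison described above (fetch each document's vector with "associate -q -i d", then take the scalar product) could be sketched roughly as follows. This is only a sketch under assumptions: the two vectors are taken to be already parsed into lists of floats, since the output format of "associate -q" is not shown in this thread, and the function names and sample vectors are made up for illustration. The cosine variant is an addition, useful if the vectors are not of unit length.

```python
import math


def scalar_product(u, v):
    """Return the scalar (dot) product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Scalar product divided by both vector lengths (0.0 if either is zero)."""
    norm_u = math.sqrt(scalar_product(u, u))
    norm_v = math.sqrt(scalar_product(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return scalar_product(u, v) / (norm_u * norm_v)


# Hypothetical 3-dimensional vectors standing in for two "associate -q" results.
doc_a = [1.0, 2.0, 2.0]
doc_b = [2.0, 4.0, 4.0]
print(scalar_product(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_b))
```

If the vectors returned by the tool are already normalized, the plain scalar product and the cosine give the same number, so the second function only matters when they are not.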
From: Mich <pse...@zo...> - 2004-10-27 19:27:37
|
Hello Beate, Dominic,

Thank you both for your kind and helpful answers. I've tried what you suggest, but it seems the associate command disagrees with me. I have a single-file corpus, called 'many', and another document (a text file in the directory) called 'aow' (from Sun Tzu's Art of War). Since the model already works ("associate -t -c many war" returns entries such as 'barbaric' and 'pointless' as highly similar), I was wondering what the problem could be.

When I try

  associate -d -c many -i aow.txt

I had hoped the program would return a list of similarity indices between the text file and the corpus. Since aow.txt is a document within the corpus, ideally the program would return an index of 1.000 with one document in the corpus. If I had to guess, Homer's Odyssey would score high too, as I see it as remotely similar (they are both related to war).

Anyway, when I try this in whatever order, I keep getting as output:

  "Bad option: -i"

When I just leave -i out, it returns

Am I using a different version (I think I have the latest)? If you could help me, could you please state the syntax using the example I just gave? Thank you.

Dominic: thanks for your suggestions as well. If I could have this program in a Windows environment, I could easily program something as you suggested, but although I often wish otherwise, I am not a programmer. And what experience I do have is totally unrelated to Unix environments. However, it seems that cygwin compiles a few .exe binary files, among them associate. Do you think it is possible to use these from within Windows (cmd)? Is there a particular reason no binaries of infomap-nlp were included?

Hopefully I haven't offended anyone by bringing Microsoft software into this discussion!

Cheers,
Mich
From: Beate D. <do...@IM...> - 2004-10-28 11:27:46
|
Hi Mich,

The "-i" option of associate has to be followed by either "w" (words) or "d" (documents). So the correct way to call associate when using a document as a query is:

  associate -d -w <model_dir> -c <corpus> -i d <doc_id>

where doc_id is the document identifier of one of the documents in the corpus. Please note that the option "-i d" works only if the document <doc_id> is part of the corpus which you used for building the model, and then doc_id has to be the offset of the document in the corpus (the number enclosed in <f></f> in the "wordlist" file).

As far as I understand, however, you would like to compare text files to the corpus which are not part of the corpus itself. In this case, as Dominic suggested, you can simply hand over the complete text stream to associate, i.e.

  associate -d w <model_dir> -c <corpus> -i w <word1> <word2> <word3> ...

where the <wordi> form the text stream of your query file.

Best,
Beate
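Handing a whole text file to associate as a stream of query words could be prepared along these lines. This is a sketch, not part of infomap-nlp: the helper names are hypothetical, the tokenization (lowercase, runs of a-z only) is an assumption that may not match how the model's own vocabulary was built, and the base command simply combines options quoted in this thread ("-d", "-c many", "-i w"), so it may need adjusting for a real installation. The code only builds the argument list and does not run anything.

```python
import re


def words_from_text(text):
    """Lowercase the text and return its alphabetic word tokens in order."""
    return re.findall(r"[a-z]+", text.lower())


def build_query_command(base_command, text):
    """Append the words of `text` to a caller-supplied associate command."""
    return list(base_command) + words_from_text(text)


# The base command reuses only options quoted in this thread; adjust as needed.
base = ["associate", "-d", "-c", "many", "-i", "w"]
sample = "All warfare is based on deception."
print(build_query_command(base, sample))
```

For a real query file, the text would come from reading the file (for example aow.txt) instead of the sample string, and the resulting list could be passed to subprocess.run. Lowercasing every token also means the text never produces the uppercase "NOT" that the earlier example uses to introduce negative terms.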
From: Mich <pse...@zo...> - 2004-10-28 15:42:51
|
Hello Beate,

Thursday, October 28, 2004, 1:27:41 PM, you wrote:

> Hi Mich,
> The "-i" option of associate has to be followed by either "w" (words) or
> "d" (documents). So the correct way to call associate when using a
> document as a query is:
> associate -d -w <model_dir> -c <corpus> -i d <doc_id>
> where doc_id is the document identifier of one of the documents in the
> corpus. Please note that the option "-i d" works only, if the document
> <doc_id> is part of the corpus which you used for building the model, and
> then doc_id has to be the offset of the document in the corpus (the number
> inclosed in <f></f> in the "wordlist" file).
> As far as I understand, you would however like to compare text files to
> the corpus which are not part of the corpus itself.

Of course, it should be possible to temporarily create a new model, and then compare the latest addition to the other documents, right? I wouldn't know how to automate this process though, so that wouldn't really be useful (unless this new addition would always be the last, or maybe even the 'highest', identifier in the wordlist file).

When going for some sort of automatic grading, this would mean I'd build a model from doc_id_1 .. doc_id_i, which would all be 'relevant, maybe even perfect' answers to the question asked, plus a single student's answer to the question, doc_id_i+1. After that, the 'estimated grade' would be a function of the similarities between doc_id_i+1 and the rest. I am able to program something like this in Pascal (yes, some people still use that!), IF I could somehow make an executable of infomap_build (associate.exe seems to work...).

> In this case, as Dominic suggested, you can simply hand
> over the complete text stream to associate, i.e.

This is maybe easier, though I would have to know how to treat the text. Which words should be left out (such as 'the'), etc.? Maybe you can help me with this, Dominic?

> associate -d w <model_dir> -c <corpus> -i w <word1> <word2> <word3> ...
> where the <wordi> form the text stream of your query file.

Thanks, that should keep me from typing it wrong several times in a row!

Cheers,
Mich
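The grading idea above, where the estimated grade is some function of the similarities between the student's answer and the model answers, might look like this in its simplest form. Everything here is an assumption for illustration: the function name, the choice of the mean as the combining function, the clamping to the range 0 to 1, and the 10-point scale are not taken from infomap-nlp or from this thread. The similarity scores are assumed to have been obtained already, one per model answer.

```python
def estimated_grade(similarities, max_grade=10.0):
    """Map similarity scores against the model answers to a grade.

    The mean similarity is clamped to [0, 1] and scaled to `max_grade`.
    """
    if not similarities:
        raise ValueError("need at least one similarity score")
    mean = sum(similarities) / len(similarities)
    clamped = min(1.0, max(0.0, mean))
    return round(clamped * max_grade, 1)


# Hypothetical similarities between one student answer and three model answers.
scores = [0.5, 0.75, 1.0]
print(estimated_grade(scores))  # prints 7.5
```

Using the maximum instead of the mean would reward an answer that closely matches any single model answer; which of the two fits better would have to be checked against grades given by hand.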