Re[3]: [infomap-nlp-users] document x document

Hello Beate,

Thursday, October 28, 2004, 1:27:41 PM, you wrote:

> Hi Mich,

> The "-i" option of associate has to be followed by either "w" (words) or
> "d" (documents). So the correct way to call associate when using a 
> document as a query is:

>     associate -d -w <model_dir> -c <corpus> -i d <doc_id>

> where doc_id is the document identifier of one of the documents in the
> corpus. Please note that the option "-i d" works only if the document
> <doc_id> is part of the corpus which you used for building the model, and
> then doc_id has to be the offset of the document in the corpus (the number
> enclosed in <f></f> in the "wordlist" file).

> As far as I understand, you would however like to compare text files to
> the corpus which are not part of the corpus itself.

Of course, it should be possible to temporarily create a new model
and then compare the latest addition to the other documents, right? I
wouldn't know how to automate this process though, so that wouldn't
really be useful (unless the new addition always ends up as the last,
or maybe even the 'highest', identifier in the wordlist file).

For some sort of automatic grading, this would mean I'd build a model
from doc_id_1 to doc_id_i, which would all be 'relevant, maybe even
perfect' answers to the question asked, plus a single student's answer
to the question, doc_id_i+1. After that, the 'estimated grade' would be
a function of the similarities between doc_id_i+1 and the rest. I am
able to program something like this in Pascal (yes, some people still
use that!), if I could somehow make an executable of infomap_build
(associate.exe seems to work...).
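
To make the 'function of the similarities' idea concrete, here is a rough
sketch in Python rather than Pascal. Everything in it is my own assumption,
not something Infomap provides: the name estimated_grade, the 0-10 grade
scale, the choice to average the top three scores, and the assumption that
the similarity scores fall roughly between 0 and 1.

```python
def estimated_grade(similarities, max_grade=10.0, top_k=3):
    """Turn similarity scores into a grade (hypothetical scoring rule).

    similarities: scores between the student's answer and each model
                  answer, assumed to lie roughly in [0, 1].
    max_grade:    top of the grading scale (assumption: 0 to 10).
    top_k:        how many of the best-matching model answers to average.
    """
    if not similarities:
        raise ValueError("need at least one similarity score")

    # Only the closest model answers count, so a student is not
    # penalised for differing from model answers that took another route.
    best = sorted(similarities, reverse=True)[:top_k]
    mean_best = sum(best) / len(best)

    # Clamp to [0, 1] in case the scores fall outside the assumed range.
    mean_best = max(0.0, min(1.0, mean_best))

    return round(max_grade * mean_best, 1)


if __name__ == "__main__":
    # Made-up scores, purely for illustration.
    print(estimated_grade([0.9, 0.8, 0.7, 0.1]))
```

Whether averaging the best few matches is a sensible rule would have to be
checked against answers that were graded by hand.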

> In this case, as Dominic suggested, you can simply hand
> over the complete text stream to associate, i.e.

This is maybe easier, though I would have to know how to treat the
text. Which words should be left out (such as 'the'), etc.? Maybe you
can help me with this, Dominic?
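
The kind of preprocessing I have in mind would look something like the
sketch below. The stopword list is a tiny placeholder I made up, not the
list Infomap itself uses, and the function name query_words is hypothetical;
lowercasing and dropping everything except letters are assumptions as well.

```python
import re

# Placeholder stopword list (assumption): a real one would be much longer
# and should match whatever was used when the model was built.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}


def query_words(text, stopwords=STOPWORDS):
    """Split text into lowercase words and drop the stopwords."""
    # Keep runs of letters only, so punctuation and digits are discarded.
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in stopwords]


if __name__ == "__main__":
    # The resulting words would be passed as <word1> <word2> ... in the query.
    print(" ".join(query_words("The cat sat in the hat.")))
```

The remaining words would then form the text stream handed to associate.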

>     associate -d w <model_dir> -c <corpus> -i w <word1> <word2> <word3> ...

> where the <wordi> form the text stream of your query file.

Thanks, that should keep me from typing it wrong several times in a
row!

Cheers,

Mich