From: Dominic W. <dwi...@cs...> - 2004-04-23 00:18:36
Dear Beate,

Thanks for putting together such a comprehensive set of instructions for
Shuji to use the bilingual code. Would anyone have any objection to putting
this up on the SourceForge website (maybe called "infomap-bilingual"),
along with the instructions? We can say it's strictly a beta version. This
way, if anyone does take it upon themselves to use the code and get it as
robust as the main infomap code, they have the option.

Any thoughts?

Dominic

On Tue, 20 Apr 2004, Beate Dorow wrote:

> Dear Shuji,
>
> Here is the tarball of the Bilingual Infomap code.
> Unfortunately, it's not as convenient to use as the monolingual model on
> sourceforge.
> I prepared a very small example corpus (it's a tiny fraction of the
> Canadian-Hansard), so that you can see what you'll have to change in order
> to build a model from your own corpus. Since the tarball is already big, I
> put the corpus on the web, so you can download it separately from there:
> http://infomap.stanford.edu/shuji.
> The results on this example corpus are quite bad due to its small size. In
> particular, looking for documents related to a query may not return any
> document, because the similarity is below a threshold. This shouldn't
> bother you, however, and the results on your own corpus should be a
> lot better!
>
> I added two directories to the main directory, "corpora" and "data", which
> are normally located somewhere else. You can specify their location
> in the Makefile of the main directory by changing the CORPUS_DIR,
> DATA_PATH and DATA_DIR variables.
>
> To build a model from the example corpus, you have to do the following:
>
> * Download the example corpus from the web, unpack it and put it in the
>   BiLing/corpora directory.
>
> * Go into the BiLing/search directory and create the following symbolic
>   links:
>
>     ln -f -s ../preprocessing/utils.c
>     ln -f -s ../preprocessing/utils.h
>     ln -f -s ../preprocessing/list.c
>     ln -f -s ../preprocessing/list.h
>
>   (I couldn't get the Makefile to do this automatically, so for now,
>   you'll have to do it by hand.)
>
> * Go back to the main "BiLing" directory and run "make data". You'll
>   then have to change into the preprocessing directory and run
>   "encode_wordvec" and then "count_artvec". "make data" is supposed
>   to build the model in one go, but this is another bug which has to be
>   resolved at some point.
>
> * Move all the produced model files from data/working_data to
>   data/finished_data.
>
> For searching the model, go into the search directory. To look for similar
> *words*, run "associate -w"; to look for similar documents, use
> "associate -d".
> E.g., to look for *English* words which are similar to "health", the
> complete associate command looks like this:
>
>     associate -w -l A A health
>
> (the "A" stands for language A, which is English)
>
> If you are instead interested in *French* words associated with "health",
> use:
>
>     associate -w -l B A health
>
> Or, to look for English documents similar to the French word "santé",
> type:
>
>     associate -d -l A B santé
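>
> (Putting the example-corpus steps together, the whole run looks roughly
> like this. The "./" prefixes and the wildcard move are shorthand, not the
> literal commands; adjust them to however the binaries are invoked on your
> machine:)
>
>     # from the top-level BiLing directory, with the example corpus
>     # already downloaded and unpacked into BiLing/corpora:
>
>     cd search                         # links the search code needs
>     ln -f -s ../preprocessing/utils.c
>     ln -f -s ../preprocessing/utils.h
>     ln -f -s ../preprocessing/list.c
>     ln -f -s ../preprocessing/list.h
>     cd ..
>
>     make data                         # stops before the last two steps
>     cd preprocessing
>     ./encode_wordvec                  # run these two by hand
>     ./count_artvec
>     cd ..
>
>     # move the finished model into place
>     mv data/working_data/* data/finished_data/
>
>     # query the model
>     cd search
>     ./associate -w -l A A health      # English words similar to "health"
>     ./associate -w -l B A health      # French words related to "health"
>     ./associate -d -l A B santé      # English documents similar to "santé"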
>
> To build your own bilingual model, you'll have to do the following:
>
> * Suppose your corpus is called "reuters". Add directories named
>   "reuters" to both the corpora and the data/working_data and
>   data/finished_data directories. Copy your corpus (consisting of
>   documents and their translations) into the corpora/reuters
>   directory, together with two stoplists, one for each language, which
>   you put in a directory corpora/reuters/lists.
>   Now suppose that your corpus consists of English and German documents,
>   the former ending in ".eng", the latter in ".ger". The names of the
>   stoplists then have to be "stopeng.list" and "stopger.list".
>
> * Check that your corpus is in the proper format: documents which are
>   translations of each other have the same filename stem and differ only
>   in prefixes and suffixes which indicate which language a file is
>   written in. In case your documents are big and you want to use smaller
>   units for counting co-occurrences, sentence id tags (<s id=...>) can be
>   used (but are not required) to divide each file into smaller chunks
>   (see the P.S. at the bottom for a small example). A document and its
>   translation have to have the same number of sentences, and sentences
>   which are translations of each other have the same id.
>
> * You'll then have to create a file named "reutersNames2.txt" in which
>   you list all the stems of the corpus files together with the number of
>   sentences contained in each file. There is a perl script
>   "count_sentences.pl" in the "BiLing/corpora/Canadian-Hansard" directory
>   which, after a bit of customization, you can use to build this file
>   automatically.
>
> * Then edit the Makefile and change the variables (e.g. corpus name,
>   corpus directory, prefixes, suffixes, ...) so that they fit your
>   situation.
>
> * Now run "make data" to build the model. Then change into the
>   preprocessing directory and run "encode_wordvec" and then
>   "count_artvec". The model files are all put in data/working_data, and
>   you'll now have to move them to data/finished_data.
>
> * Change into the search directory. "associate ..." should now work
>   for your corpus. In case you build models from different corpora, you
>   can use the "-c" option of "associate" to specify which corpus you want
>   to query.
>
> There are two things to note: The bilingual code still uses the old
> my_isalpha procedure in preprocessing/utils.c to decide which characters
> to read during preprocessing, and there is only one my_isalpha function
> for both languages. Depending on your corpus, you may have to include
> characters other than the ones specified in my_isalpha.
>
> I added the "read column labels from file" option to the bilingual code
> as well. So if, instead of taking the most frequent words, you prefer to
> read the column labels from a file, you will also have to change the line
>
>     $(MY_DIR_STEM)/preprocessing/count_wordvec
>
> in the main Makefile to
>
>     $(MY_DIR_STEM)/preprocessing/count_wordvec -col_label_file (name_of_your_col_label_file)
>
> Column labels are assumed to be language A words.
>
> I know this is a lot of info at once, and I am sorry it's not more
> convenient at the moment. I hope you are successful in building your own
> model, and I am happy to help in case there are problems.
>
> Best wishes,
> Beate
>
> On Mon, 19 Apr 2004, Shuji Yamaguchi wrote:
>
> > Dear Beate,
> >
> > Yes, I would appreciate it, as I do not have a reply back from Stanley
> > (and Emma told me I should count on him for sapir).
> > Could you please send it to me via mail to my CSLI account (which has a
> > larger quota),
> >     sh...@cs...
> > ?
> > I assume it would be around 300 kb in size, judging from a gzip file of
> > version 0.8.3.
> >
> > Many thanks for your time and support.
> > Regards, Shuji
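>
> P.S. In case the sentence id format is unclear, a pair of aligned files
> might look roughly like this. The filename stem "trade01" and the
> sentence text are made up, and the exact tag format count_sentences.pl
> expects may differ slightly, so check the script:
>
>     --- trade01.eng ---
>     <s id=1>
>     The committee discussed the trade agreement.
>     <s id=2>
>     The motion was adopted.
>
>     --- trade01.ger ---
>     <s id=1>
>     Der Ausschuss diskutierte das Handelsabkommen.
>     <s id=2>
>     Der Antrag wurde angenommen.
>
> Both files share the stem "trade01", contain the same number of
> sentences, and sentences which are translations of each other carry the
> same id.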