From: Shuji Y. <yam...@ya...> - 2004-04-21 05:24:59
Hi all,

I wonder whether some of you could review and validate my bootstrapping use of Infomap for bilingual text alignment. I have described it in detail below. I especially wonder whether my steps 5)-1 and 5)-2 are a workable (if not authentic) short-cut for calculating document vectors of new, additional documents under an existing model.

My bootstrapping use of Infomap
----------------------------------------------

Given a comparable English and Japanese news corpus from Reuters (which I work for), my task is to find the English original of a given Japanese translated news story. Roughly speaking, there are about ten times as many English stories as Japanese ones. I use Infomap as follows to narrow down the candidate English originals for a Japanese translation.

1) Initially, 120 bilingual news pairs are identified more or less manually and used as an initial "training" bilingual corpus.

2) Each pair of stories is merged into a single text file, and all of the pairs are fed into Infomap to produce a pseudo-bilingual training model. (NB: I have not yet used the unreleased bilingual InfoMap. I convert each Japanese 2-byte character into a special transliterated hex string to get around Infomap's current assumption of single-byte (8-bit) characters; a rough sketch of this transliteration appears after this list. I have also modified count_wordvec.c locally in my copy of Infomap so that a whole bilingual file falls into one "context window" for the co-occurrence analysis.)

3) Now a few thousand English stories (reported on a particular date) are taken from the rest of the corpus and fed into Infomap to create another, English-only monolingual model. Some of these English stories are the originals of a few hundred Japanese translated stories from the same date. (NB: A small percentage of the originals may actually have been reported on the previous date because of the time difference, but this is ignored for the moment.)

4) My basic idea is to calculate the document vectors of all the English stories from 3) and of a given Japanese translation under the bilingual training model created in 2), compare their similarities, look into the few English stories with the highest similarity scores, and select the real original from among them.

5) To make the best use of the Infomap software, I have been realizing the idea of 4) as follows:

5)-1. Replace the word vector files (i.e. wordvec.bin, word2offset.dir/pag, offset2word.dir/pag) and the dictionary file (dic) of the English model obtained in 3) with the ones from the bilingual training model obtained in 2).

5)-2. Recalculate the document vector files (artvec.bin, art2offset.dir/pag, offset2art.dir/pag) of the English model with the count_artvec command. I suppose this calculates the document vectors under the bilingual model because of the word vector file replacement in 5)-1.

5)-3. Treat the given Japanese translation as a long query and calculate its vector with my slightly modified version of the "associate -d" command (which also accepts the filename of the Japanese translation), run against the English model carrying the bilingual word vectors from 5)-1 and the document vectors recalculated in 5)-2.

5)-4. The associate command nicely lists the English news documents in order of similarity to the Japanese translation used as the query, so I look at the list and examine the highest-ranked ones to find the real original.

6) By repeating 5)-3 and 5)-4 over the few hundred Japanese translations, I can add further correct pairs (say 10-20) to the initial set of pairs and go through steps 2) - 5) again.
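As mentioned in the note to 2), the Japanese text is transliterated so that Infomap's ASCII tokenizer sees ordinary "words". The following is only a simplified sketch of that idea, assuming an EUC-JP-style encoding in which both bytes of a Japanese character have the high bit set; the exact token format I actually use differs in detail, and the function name is just illustrative.

    #include <stdio.h>

    /* Sketch only: rewrite every non-ASCII byte as two hex digits so that
     * a Japanese 2-byte character becomes a plain ASCII string Infomap's
     * tokenizer can treat as a normal word; ASCII passes through as is.  */
    static void transliterate(const char *in, char *out, size_t outsz)
    {
        size_t n = 0;

        while (*in != '\0' && n + 3 < outsz) {
            unsigned char c = (unsigned char)*in++;
            if (c & 0x80)                     /* byte of a 2-byte character */
                n += (size_t)sprintf(out + n, "%02x", c);
            else
                out[n++] = (char)c;
        }
        out[n] = '\0';
    }

    int main(void)
    {
        char buf[64];

        /* four EUC-JP bytes, i.e. one two-character Japanese word */
        transliterate("\xc6\xfc\xcb\xdc", buf, sizeof buf);
        printf("%s\n", buf);                  /* prints "c6fccbdc" */
        return 0;
    }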
I hope this will gradually improve the bilingual model as the number of pairs grows. I can then use the sufficiently improved bilingual model for CLIR and other interesting tasks.

---------------( end of my bootstrapping use of Infomap )------------------------------------------------

I have looked into count_artvec.c to confirm whether 5)-1 and 5)-2 would still work fine, but I am not sure whether I fully understand the following code within process_region(), which I think is the key to whether my irregular usage is still all right:

    /* Add the vectors up */
    while (cursor <= region_out) {
        /* If this is a row label... */
        if ((row = ((env->word_array)[int_buffer[cursor]]).row) >= 0)
            for (i = 0; i < singvals; i++)
                tmpvector[i] += (env->matrix)[row][i];
        cursor++;
    }

My casual walk-through of the code suggests that the word_array lookup in the IF statement above will still work with the word indices in int_buffer[] coming from the English-only news, and that it will give the document vector of an English story under the bilingual model. But I am not very confident about this.

Sorry for the long mail, but I would really appreciate your kind review and advice.

Best regards,
Shuji
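P.S. To show how I currently read that loop, here is a stripped-down, self-contained restatement of what I believe process_region() is doing. The type and variable names are simplified stand-ins of my own, not Infomap's actual declarations. If this reading is right, a document vector is simply the sum of the matrix rows of those words that have a row label in the (now bilingual) dictionary, and words without a row are silently skipped, which is why I hope the English-only documents still come out as vectors in the bilingual space. Please correct me if this picture is wrong.

    #include <stdio.h>

    #define SINGVALS 3              /* reduced dimensionality, tiny for the demo */

    /* Simplified stand-in for Infomap's per-word entry: the row of the
     * word matrix this word occupies, or -1 if it has no row label.     */
    struct word_entry { int row; };

    /* My reading of the summation in process_region(): walk the word
     * indices of one region (document) and add up the matrix rows of the
     * words the dictionary knows about; words without a row add nothing. */
    static void sum_region(const int *int_buffer, int region_in, int region_out,
                           const struct word_entry *word_array,
                           float matrix[][SINGVALS], float *tmpvector)
    {
        int cursor, i, row;

        for (i = 0; i < SINGVALS; i++)
            tmpvector[i] = 0.0f;

        for (cursor = region_in; cursor <= region_out; cursor++) {
            row = word_array[int_buffer[cursor]].row;
            if (row >= 0)                       /* this word has a row label */
                for (i = 0; i < SINGVALS; i++)
                    tmpvector[i] += matrix[row][i];
        }
    }

    int main(void)
    {
        /* Two dictionary words with rows 0 and 1, plus a word (index 2)
         * with no row, e.g. a word missing from the bilingual dictionary
         * after the file replacement of 5)-1.                            */
        struct word_entry word_array[3] = { {0}, {1}, {-1} };
        float matrix[2][SINGVALS]      = { {1, 0, 0}, {0, 1, 0} };
        int int_buffer[4]              = { 0, 2, 1, 0 };   /* a 4-word "document" */
        float docvec[SINGVALS];
        int i;

        sum_region(int_buffer, 0, 3, word_array, matrix, docvec);
        for (i = 0; i < SINGVALS; i++)
            printf("%g ", docvec[i]);                      /* prints "2 1 0" */
        printf("\n");
        return 0;
    }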