From: Dominic W. <dwi...@cs...> - 2004-04-22 00:16:07
Dear Shuji,

Thanks for this message, and for trying out the infomap software in such a creative fashion. Please do not hesitate to ask our opinions on such matters - I'm sure we would all be delighted if our work could be put to positive use with AlertNet. My main regret is that I might not have sufficient time or expertise to give as much help as I would like, but I will gladly contribute where I can.

Here are some suggestions of possible pitfalls - I don't know if any of them will actually occur. It sounds as though your approach is a very promising way of building a cross-lingual system from a comparable corpus with some seed-alignment, a development we've wished for for some time.

In your creation of an initial bilingual model (steps 1 and 2), is it possible that any Japanese and English words will be accidentally represented by the same strings of 8-bit characters, or is there some way of avoiding this possibility?

Now, your specific stage:

> 5)-1. Replace word vector files (i.e. wordvec.bin, word2offset.dir/pag,
> offset2word.dir/pag) and the dictionary file (dic) in the English model
> obtained in 3) by the ones from the bilingual training model obtained in
> 2).

If I understand it correctly, you'd be replacing the vectors and dictionary for a larger English collection with a much smaller set determined just from the aligned pairs? Then in 5)-2, you compute document vectors for the English collection using vectors from this smaller model, and in 5)-3 you test this by seeing if a query built from a Japanese document retrieves its English counterpart? And if this works well, you can feed in other Japanese documents and treat their best matches as potential translations, increasing the size of the aligned set. Please let me know if this interpretation is correct, and if so how well it works - I definitely think it's worth a try.

One worry I have is that using such a small training set will give very little information about most words - many won't appear, and all those that appear with unit frequency within the same document will be mapped to exactly the same vector. But it will at least be relatively easy to test, by comparing English searches over the larger model to those within the small model.

An alternative to using the vectors from the small aligned model might be to use the larger English model to get term vectors for the Japanese words in the aligned documents (by averaging the vectors of the documents these terms appear in). But you'd still have the problem that two Japanese words of unit frequency appearing in the same documents would be mapped to the same vector.
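To make that alternative concrete, here is a rough sketch of the averaging step. The names (japanese_term_vector, english_doc_vectors, DIM) are invented for illustration; this is not code from the infomap package, only the shape of the computation.

/* Sketch: build a vector for a Japanese term from the larger English model
 * by averaging the document vectors of the English halves of the aligned
 * pairs in which that term occurs.  All names here are illustrative. */

#include <string.h>

#define DIM 100   /* dimensionality of the reduced space (singular values kept) */

/* english_doc_vectors: document vectors from the larger English model
 * (the kind of data count_artvec writes to artvec.bin, read back into memory).
 * doc_ids/num_docs: the English halves of the aligned pairs in which the
 * Japanese term occurs. */
void japanese_term_vector(const float english_doc_vectors[][DIM],
                          const int *doc_ids, int num_docs,
                          float term_vector[DIM])
{
    int i, d;

    memset(term_vector, 0, DIM * sizeof(float));
    for (d = 0; d < num_docs; d++)
        for (i = 0; i < DIM; i++)
            term_vector[i] += english_doc_vectors[doc_ids[d]][i];

    if (num_docs > 0)
        for (i = 0; i < DIM; i++)
            term_vector[i] /= (float) num_docs;

    /* The caveat above still applies: two Japanese terms that each occur
     * once, in the same aligned pair, receive exactly the same doc_ids
     * list and therefore exactly the same vector. */
}

If the stored vectors are kept at unit length, normalising the averaged result as well would keep it on the same footing when comparing by cosine similarity.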
If it doesn't work well with documents, another way might be to select fairly unambiguous names (such as "Iraq", "Dow Jones stock exchange", "UNICEF") and artificially treat the English and Japanese versions of these names as identical content-bearing words, now that Beate has enabled users to choose these words for themselves. Can you get a list of English/Japanese term pairs like this fairly easily? I remember you presented a PowerPoint slide once with a few of these - were they drawn from a larger collection?

Please let me know how you get on.

Best wishes,
Dominic

On Tue, 20 Apr 2004, Shuji Yamaguchi wrote:

> Hi all,
>
> I wonder whether some of you could review and validate my bootstrapping use
> of Infomap for bilingual text alignment. I have described it in detail
> below. I especially wonder whether my steps 5)-1 and 5)-2 are a (not
> authentic, but still doable) short-cut for calculating the document vectors
> of new additional documents under an existing model.
>
> My bootstrapping use of Infomap
> ----------------------------------------------
> Given a comparable English and Japanese news corpus from Reuters (which I
> work for), my task is to find the English original news item for a given
> Japanese translated news item. Roughly speaking, there are about 10 times
> as many English news items as Japanese ones.
>
> I use Infomap as follows to narrow down the candidate English originals for
> a Japanese translation.
>
> 1) Initially, 120 bilingual news pairs are identified rather manually and
> used as an initial "training" bilingual corpus.
>
> 2) Each pair of news items is merged into a single text file. All of the
> pairs are fed into Infomap to come up with a pseudo-bilingual training
> model. (NB: I have not yet used the unreleased bilingual InfoMap. I have
> converted each Japanese 2-byte character into a special transliterated hex
> string to get around the current limitation of the 8-bit-per-character
> assumption in Infomap. I have also modified count_wordvec.c locally in my
> copy of Infomap so that a whole bilingual file falls into a single "context
> window" for co-occurrence analysis.)
>
> 3) Now a few thousand English news items (reported on a particular date)
> are taken out of the rest of the corpus and fed into Infomap to create
> another English-only monolingual model. Some of these English news items
> are the originals of a few hundred Japanese translated news items from the
> same date. (NB: Actually, a small percentage of the originals may have been
> reported on the previous date because of the time difference, but this is
> ignored for the moment.)
>
> 4) My basic idea is to calculate the document vectors for all of the
> English news items and a given Japanese translation from 3) above under
> the bilingual training model created in 2) above, to compare their
> similarity, to look into the few English news items with the highest
> similarity scores and to select the real original from among them.
>
> 5) In order to make the best use of the Infomap software, I have been doing
> the following to implement the idea in 4) above:
>
> 5)-1. Replace the word vector files (i.e. wordvec.bin, word2offset.dir/pag,
> offset2word.dir/pag) and the dictionary file (dic) in the English model
> obtained in 3) with the ones from the bilingual training model obtained in
> 2).
>
> 5)-2. Recalculate the document vector files (artvec.bin, art2offset.dir/pag,
> offset2art.dir/pag) of the English model with the count_artvec command. I
> suppose this calculates the document vectors under the bilingual model
> because of the word vector file replacement in 5)-1.
>
> 5)-3. Treat the given Japanese translation as a long query and calculate
> its vector with my slightly modified version of the "associate -d" command
> (which also accepts the filename of the Japanese translation), running
> against the English model with the bilingual word vectors from 5)-1 and the
> document vectors recalculated in 5)-2 above.
>
> 5)-4. The associate command nicely lists the English news documents in
> similarity order for the Japanese translation used as the query, so that I
> can look at the list and examine the highest-ranked ones to find the real
> original.
>
> 6) By repeating 5)-3 and 5)-4 over the few hundred Japanese translations, I
> can add further correct pairs (say 10-20) to the initial set and go through
> steps 2) - 5) again. I hope this will gradually improve the bilingual model
> with a growing number of pairs. I can then use the sufficiently improved
> bilingual model for CLIR and other interesting tasks.
>
> --------------- end of my bootstrapping use of Infomap ---------------
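As an aside on steps 5)-3 and 5)-4 above: the whole pipeline comes down to a nearest-neighbour search in the shared space. Below is a minimal sketch of that search. The names (cosine, best_english_match, DIM) and the use of cosine similarity are illustrative assumptions rather than the actual associate code, and the query vector is assumed to be built by summing the bilingual word vectors of the translation's transliterated tokens.

#include <math.h>
#include <stdio.h>

#define DIM 100   /* dimensionality of the reduced space */

/* Cosine similarity between two vectors in the reduced space. */
static float cosine(const float *a, const float *b)
{
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    int i;
    for (i = 0; i < DIM; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return (na > 0.0f && nb > 0.0f) ? dot / (sqrtf(na) * sqrtf(nb)) : 0.0f;
}

/* query: vector for the Japanese translation treated as a long query
 * (assumed here to be the sum of its tokens' bilingual word vectors).
 * docvecs: the English document vectors recalculated in step 5)-2.
 * Sorting the full candidate list, as inspected in 5)-4, is omitted;
 * this just reports the single highest-scoring document. */
void best_english_match(const float *query,
                        const float docvecs[][DIM], int num_docs)
{
    int d, best = -1;
    float score, best_score = -2.0f;

    for (d = 0; d < num_docs; d++) {
        score = cosine(query, docvecs[d]);
        if (score > best_score) {
            best_score = score;
            best = d;
        }
    }
    printf("best candidate: document %d (cosine %.3f)\n", best, best_score);
}

In practice 5)-4 keeps the whole ranked list rather than a single best match, since the real original is picked out by inspection.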
> I have looked into count_artvec.c to confirm whether 5)-1 and 5)-2 would
> still work fine, but I am not sure whether I fully understand the following
> code within process_region(), which I think is the key to whether my
> irregular usage is still all right.
>
>   /* Add the vectors up */
>   while( cursor <= region_out) {
>     /* If this is a row label... */
>     if( ( row = ((env->word_array)[int_buffer[cursor]]).row) >= 0)
>       for( i=0; i < singvals; i++)
>         tmpvector[i] += (env->matrix)[row][i];
>     cursor++;
>   }
>
> My casual walk through the code suggests that the word_array lookup in the
> if statement above will still work fine with the words in int_buffer[] from
> the English-only news, and that it would give the document vector for the
> English news under the bilingual model. But I am not very confident about
> it.
>
> Feeling sorry for such a long mail, but I would really appreciate your kind
> review and advice.
>
> Best regards,
> Shuji
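Regarding the process_region() fragment quoted above: the loop itself just adds up rows of env->matrix for whichever entries of int_buffer[] map to a row, so whether the resulting document vector lives in the bilingual space comes down to whether the word IDs in int_buffer[] and the swapped-in word_array/matrix refer to the same dictionary, which is exactly the point Shuji asks about. A stripped-down sketch of the computation itself, with invented names (document_vector, word_row, DIM) standing in for the Infomap structures:

/* Stripped-down sketch of what the quoted loop computes: a document
 * vector as the sum of the matrix rows of the region's recognised words.
 * word_row[] plays the role of word_array[...].row, mapping a word ID to
 * a row of the reduced matrix, or a negative value if the word has no
 * vector.  All names are invented for illustration. */

#define DIM 100   /* number of singular values kept (singvals) */

void document_vector(const int *word_ids, int num_tokens,
                     const int *word_row, const float matrix[][DIM],
                     float docvec[DIM])
{
    int t, i, row;

    for (i = 0; i < DIM; i++)
        docvec[i] = 0.0f;

    for (t = 0; t < num_tokens; t++) {
        row = word_row[word_ids[t]];
        if (row >= 0)                   /* word has a vector: add its row */
            for (i = 0; i < DIM; i++)
                docvec[i] += matrix[row][i];
        /* entries with row < 0 contribute nothing to the sum */
    }
}

Either way, entries with a negative row drop out of the sum, so if the dictionary really is the one built from the 120 pairs, each English document is represented only by the handful of its words that the small bilingual model covers - the same concern about the small training set raised in the reply above.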