From: Shuji Y. <yam...@ya...> - 2004-04-30 18:15:10
Dear Dominic,

Thank you for your advice, and sorry for not responding to you sooner. My replies to your points are as follows.

> "In your creation of an initial bilingual model (steps 1 and 2), is it
> possible that any Japanese and English words will be accidentally
> represented by the same strings of 8 byte characters, or is there some
> way of avoiding this possibility?"

A good point. I had noted it, so my program appends an underscore '_' to the transliterated hex string (i.e. tr/0-9a-f/g-v/ in Perl) for each Japanese word. This gets around the clash problem you kindly pointed out, assuming that few English words naturally begin with '_'.

> "Please let me know if this interpretation is correct, and if so how well
> it works - I definitely think it's worth a try. One worry I have is using
> such a small training set will give very little information about most
> words - many won't appear, and all those that appear with unit frequency
> within the same document will be mapped to exactly the same vector. But it
> will at least be relatively easy to test, by comparing English searches
> over the larger model to those within the small model."

Thank you for reading my lengthy explanation; yes, your interpretation is correct. I have now gone through 5-6 cycles of the bootstrapping process, and the results are mixed. Even with the small bilingual model built from 200 pairs, the document similarity ranking gives a good indication of the original English news item for a Japanese translation covering the same or a similar event, especially when the news comes from a small editing bureau. It does not perform well (accuracy below 10-20%), however, at identifying the original English item when it was filed from a large editing bureau, e.g. in London or New York. This is because the large bureaux often file more than 5-10 items on a single event during a day, with slightly different angles and timings, which is not the case for the small bureaux. The small word-vector model is not fine-grained enough to pinpoint the exact original among 5-10 similar items.

> "another way might be to select fairly unambiguous names (such as "Iraq",
> "Dow Jones stock exchange", "UNICEF") and artificially treat the English
> and Japanese versions of these names as identical content-bearing words,
> now that Beate has enabled the users to choose these words for themselves."

This sounds like a very interesting approach. I will explore it further.

Best regards,
Shuji

-----Original Message-----
From: inf...@li... [mailto:inf...@li...] On Behalf Of Dominic Widdows
Sent: Wednesday, April 21, 2004 5:15 PM
To: Shuji Yamaguchi
Cc: inf...@li...
Subject: Re: [infomap-nlp-devel] Use of Infomap for bootstrapping text alignment. Wondering whether someone could review my method.

Dear Shuji,

Thanks for this message, and for trying out the infomap software in such a creative fashion. Please do not hesitate to ask our opinions on such matters - I'm sure we would all be delighted if our work could be put to positive use with AlertNet. My main regret is that I might not have sufficient time or expertise to give as much help as I would like, but I will gladly contribute where I can.

Here are some suggestions of possible pitfalls - I don't know if any of them will actually occur. It sounds as though your approach is a very promising way of building a cross-lingual system from a comparable corpus with some seed alignment, a development we have wished for for some time.
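To make the clash-avoidance concrete, here is a minimal stand-alone C sketch of the kind of encoding Shuji describes in his reply at the top of this message: each byte of a Japanese word becomes two hex digits, the hex digits 0-9a-f are shifted into the range g-v (the tr/0-9a-f/g-v/ step), and a '_' marker keeps the result from colliding with any natural English word. The function name and the example bytes are invented for the sketch; they are not taken from Shuji's script or from the Infomap sources.

    #include <stdio.h>

    /* Sketch only: encode a Japanese word (given as raw bytes) as an ASCII
     * token.  Each byte becomes two hex digits, the hex digits are shifted
     * from 0-9a-f to g-v, and a '_' marker is added so that the token
     * cannot clash with a natural English word. */
    static void encode_token(const unsigned char *word, size_t len,
                             char *out, size_t outsize)
    {
        static const char shifted[] = "ghijklmnopqrstuv"; /* 0..15 -> 'g'..'v' */
        size_t i, pos = 0;

        if (outsize == 0)
            return;
        if (pos + 1 < outsize)
            out[pos++] = '_';                     /* the underscore marker */
        for (i = 0; i < len && pos + 2 < outsize; i++) {
            out[pos++] = shifted[word[i] >> 4];   /* high nibble */
            out[pos++] = shifted[word[i] & 0x0f]; /* low nibble  */
        }
        out[pos] = '\0';
    }

    int main(void)
    {
        /* Example input: four bytes standing in for a two-character
         * Japanese word (made up for the sketch, not a real corpus token). */
        const unsigned char jp[] = { 0xc6, 0xfc, 0xcb, 0xdc };
        char token[64];

        encode_token(jp, sizeof jp, token, sizeof token);
        printf("%s\n", token);                    /* prints "_smvssrts" */
        return 0;
    }

Whether the marker is prefixed or appended makes no difference to clash avoidance; the sketch puts it at the front, which matches the remark above that few English words begin with '_'.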
In your creation of an initial bilingual model (steps 1 and 2), is it possible that any Japanese and English words will be accidentally represented by the same strings of 8 byte characters, or is there some way of avoiding this possibility?

Now, your specific stage:

> 5)-1. Replace word vector files (i.e. wordvec.bin, word2offset.dir/pag,
> offset2word.dir/pag) and the dictionary file (dic) in the English model
> obtained in 2) by the ones from the bilingual training model obtained in
> 1).

If I understand it correctly, you'd be replacing the vectors and dictionary for a larger English collection with a much smaller set determined just from the aligned pairs? Then in 5)-2, you compute document vectors for the English collection using vectors from this smaller model, and in 5)-3 you test this by seeing whether a query built from a Japanese document retrieves its English counterpart? And if this works well, you can feed in other Japanese documents and treat their best matches as potential translations, increasing the size of the aligned set.

Please let me know if this interpretation is correct, and if so how well it works - I definitely think it's worth a try. One worry I have is that using such a small training set will give very little information about most words - many won't appear, and all those that appear with unit frequency within the same document will be mapped to exactly the same vector. But it will at least be relatively easy to test, by comparing English searches over the larger model to those within the small model.

An alternative to using the vectors from the small aligned model might be to use the larger English model to get term vectors for the Japanese words in the aligned documents (by averaging the vectors of the documents these terms appear in). But you'd still have the problem that two Japanese words of unit frequency appearing in the same documents would be mapped to the same vector.

If it doesn't work well with documents, another way might be to select fairly unambiguous names (such as "Iraq", "Dow Jones stock exchange", "UNICEF") and artificially treat the English and Japanese versions of these names as identical content-bearing words, now that Beate has enabled the users to choose these words for themselves. Can you get a list of English/Japanese term-pairs like this fairly easily? I remember you presented a PowerPoint slide once with a few of these - were they drawn from a larger collection?

Please let me know how you get on.

Best wishes,
Dominic

On Tue, 20 Apr 2004, Shuji Yamaguchi wrote:

> Hi all,
>
> I wonder whether some of you could review and validate my bootstrapping
> use of Infomap for bilingual text alignment. I have described it in detail
> below. I especially wonder whether my steps 5)-1 and 5)-2 are a (not
> authentic, but still doable) short-cut for calculating document vectors of
> newly added documents under an existing model.
>
> My bootstrapping use of Infomap
> ----------------------------------------------
> Given a comparable English and Japanese news corpus from Reuters (which I
> work for), my task is to find the original English news item for a given
> translated Japanese news item. Roughly speaking, there are 10 times as many
> English news items as Japanese ones.
>
> I use Infomap as follows to narrow down the candidate English originals
> for a Japanese translation.
>
> 1) Initially, 120 bilingual news pairs are identified rather manually and
> used as an initial "training" bilingual corpus.
>
> 2) Each pair of news items is merged into a single text file. All of the
> pairs are fed into Infomap to come up with a pseudo-bilingual training
> model. (NB: I have not yet used the unreleased bilingual Infomap. I have
> converted each 2-byte Japanese character into a special transliterated hex
> string to get around the current 8 byte-per-character assumption in
> Infomap. I have also modified count_wordvec.c locally in my copy of Infomap
> so that a whole bilingual file falls into one "context window" for
> co-occurrence analysis.)
>
> 3) Now a few thousand English news items (reported on a particular date)
> are taken out of the rest of the corpus and fed into Infomap to create
> another English-only monolingual model. Some of these English items are the
> originals of a few hundred Japanese translated items from the same date.
> (NB: Actually a small percentage of the originals may have been reported on
> the previous date due to the time difference, but this is ignored for the
> moment.)
>
> 4) My basic idea is to calculate the document vectors for all of the
> English items and a given Japanese translation in 3) above under the
> bilingual training model created in 2) above, to compare their similarity,
> to look at the few English items with the highest similarity scores, and to
> select the real original from among them.
>
> 5) To make the best use of the Infomap software, I have been doing the
> following to implement the idea in 4) above:
>
> 5)-1. Replace the word vector files (i.e. wordvec.bin, word2offset.dir/pag,
> offset2word.dir/pag) and the dictionary file (dic) in the English model
> obtained in 2) by the ones from the bilingual training model obtained in 1).
>
> 5)-2. Recalculate the document vector files (artvec.bin, art2offset.dir/pag,
> offset2art.dir/pag) of the English model with the count_artvec command. I
> suppose this calculates document vectors under the bilingual model because
> of the word vector file replacement in 5)-1.
>
> 5)-3. Treat the given Japanese translation as a long query and calculate
> its vector with my slightly modified version of the "associate -d" command
> (which accepts a filename of the Japanese translation as well), running
> against the English model with the bilingual word vectors created in the
> 5)-2 step above.
>
> 5)-4. The associate command nicely lists the English news documents in
> similarity order for the Japanese translation as query, so I look at the
> list and examine the highest-ranked ones to find the real original.
>
> 6) By repeating 5)-3 and 5)-4 over the few hundred Japanese translations,
> I can add further correct pairs (say 10-20) to the initial set of pairs and
> go through steps 2) - 5) again. I hope this will gradually improve the
> bilingual model with a growing number of pairs. I can then use the
> sufficiently improved bilingual model for CLIR and other interesting tasks.
>
> ---------------( end of my bootstrapping use of Infomap )---------------
>
> I have looked into count_artvec.c to confirm whether 5)-1 and 5)-2 would
> still work fine, but I am not sure whether I fully understand the following
> code within process_region(), which I think is the key to whether my
> irregular usage is still all right.
>
> /* Add the vectors up */
> while( cursor <= region_out) {
>   /* If this is a row label... */
>   if( ( row = ((env->word_array)[int_buffer[cursor]]).row) >= 0)
>     for( i=0; i < singvals; i++)
>       tmpvector[i] += (env->matrix)[row][i];
>   cursor++;
> }
>
> My casual walk-through of the code suggests that the word_array in the IF
> statement above will still work fine with the words in int_buffer[] from an
> English-only news item, and that it would give the document vector for the
> English news under the bilingual model. But I am not very confident about it.
>
> Sorry for such a long mail, but I would really appreciate your kind review
> and advice.
>
> Best regards, Shuji
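On the process_region() excerpt quoted above: the loop looks each token of a document up in word_array and adds its reduced word vector only when the lookup succeeds (row >= 0), so tokens missing from the dictionary simply contribute nothing. On that reading, after dic and wordvec.bin are replaced with the bilingual versions (step 5)-1), count_artvec builds each English document vector from whatever bilingual vocabulary the document contains (step 5)-2), and the associate step then ranks those vectors against the query vector built from a Japanese translation (steps 5)-3 and 5)-4). The sketch below mirrors that logic with simplified stand-ins; the data structures, names and numbers are invented for illustration and are not the actual Infomap code or API.

    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    #define SINGVALS 3   /* reduced dimensionality; tiny here for the sketch */

    /* A toy dictionary-plus-word-vector table standing in for dic/wordvec.bin. */
    struct word_entry {
        const char *word;
        double vec[SINGVALS];
    };

    static const struct word_entry dict[] = {
        { "_smvssrts",  { 0.9, 0.1, 0.0 } },  /* an encoded Japanese token */
        { "earthquake", { 0.8, 0.2, 0.1 } },
        { "market",     { 0.1, 0.9, 0.2 } },
    };

    /* Return the row of a token in the dictionary, or -1 if it is unknown --
     * the same convention as the "row >= 0" test in process_region(). */
    static int lookup_row(const char *token)
    {
        size_t i;
        for (i = 0; i < sizeof dict / sizeof dict[0]; i++)
            if (strcmp(dict[i].word, token) == 0)
                return (int)i;
        return -1;
    }

    /* Fold a tokenised document into the reduced space: sum the vectors of
     * the tokens the dictionary knows about and skip all the rest. */
    static void document_vector(const char *tokens[], size_t ntokens,
                                double out[SINGVALS])
    {
        size_t t;
        int row, i;

        for (i = 0; i < SINGVALS; i++)
            out[i] = 0.0;
        for (t = 0; t < ntokens; t++)
            if ((row = lookup_row(tokens[t])) >= 0)
                for (i = 0; i < SINGVALS; i++)
                    out[i] += dict[row].vec[i];
    }

    /* Cosine similarity, used here to rank an English document vector
     * against the query vector built from a Japanese translation. */
    static double cosine(const double a[SINGVALS], const double b[SINGVALS])
    {
        double dot = 0.0, na = 0.0, nb = 0.0;
        int i;

        for (i = 0; i < SINGVALS; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        if (na == 0.0 || nb == 0.0)
            return 0.0;
        return dot / (sqrt(na) * sqrt(nb));
    }

    int main(void)
    {
        /* A Japanese "query" document and one English candidate, both already
         * tokenised; unknown tokens are silently ignored, as in count_artvec. */
        const char *japanese_doc[] = { "_smvssrts", "unseen-token" };
        const char *english_doc[]  = { "earthquake", "market", "unseen-token" };
        double q[SINGVALS], d[SINGVALS];

        document_vector(japanese_doc, 2, q);
        document_vector(english_doc,  3, d);
        printf("similarity = %.3f\n", cosine(q, d));
        return 0;
    }

On this reading, English words that never occur in the aligned pairs are simply skipped, so the ranking in steps 5)-3 and 5)-4 is driven entirely by the vocabulary of the small bilingual dictionary - which is consistent with Shuji's walk-through of the loop, and also with Dominic's worry that such a small training set gives very little information about most words.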