From: Shuji Y. <yam...@ya...> - 2004-04-21 05:24:59
Hi all,
I wonder whether some of you could review and validate my bootstrapping use of Infomap for bilingual text alignment. I have described it in detail below. I especially wonder whether steps 5)-1 and 5)-2 are a (perhaps unorthodox but still workable) short-cut for calculating document vectors of new, additional documents under an existing model.
My bootstrapping use of Infomap
----------------------------------------------
Given a comparable English and Japanese news corpus from Reuters (which I work for), my task is to find the original English news story for a given translated Japanese story. Roughly speaking, there are about 10 times as many English stories as Japanese ones.
I use Infomap as follows to narrow down the candidate English originals for a given Japanese translation.
1) Initially, 120 bilingual news pairs are identified more or less manually and used as an initial bilingual "training" corpus.
2) Each pair of stories is merged into a single text file. All of the pairs are fed into Infomap to produce a pseudo-bilingual training model. (NB: I have not yet used the unreleased bilingual InfoMap. I transliterate each 2-byte Japanese character into a special hex string to get around Infomap's current single-byte-per-character assumption; a rough sketch of this transliteration is given right after this outline. I have also modified count_wordvec.c locally in my copy of Infomap so that a whole bilingual file falls into one "context window" for the co-occurrence analysis.)
3) Now a few thousand English stories (reported on a particular date) are taken out of the rest of the corpus and fed into Infomap to create a separate, English-only monolingual model. Some of these English stories are the originals of a few hundred Japanese translated stories from the same date. (NB: A small percentage of the originals may actually have been reported on the previous date because of the time difference, but this is ignored for the moment.)
4) My basic idea is to calculate the document vectors for all of the English stories from 3) and for a given Japanese translation under the bilingual training model created in 2), compare their similarity, look at the few English stories with the highest similarity scores, and select the real original from among them. (A toy sketch of this document-vector-and-similarity idea appears near the end of this mail.)
5) To make the best use of the Infomap software, I have been doing the following to implement the idea in 4) above:
5)-1. Replace the word vector files (i.e. wordvec.bin, word2offset.dir/pag, offset2word.dir/pag) and the dictionary file (dic) in the English model obtained in 3) with the ones from the bilingual training model obtained in 2).
5)-2. Recalculate the document vector files (artvec.bin, art2offset.dir/pag, offset2art.dir/pag) of the English model with the count_artvec command. I suppose this calculates the document vectors under the bilingual model because of the word vector file replacement in 5)-1.
5)-3. Treat the given Japanese translation as a long query and calculate its vector with my slightly modified version of the "associate -d" command (which also accepts the filename of the Japanese translation), run against the English model carrying the bilingual word vectors from 5)-1 and the document vectors recalculated in 5)-2 above.
5)-4. The associate command then nicely lists the English news documents in order of similarity to the Japanese translation used as the query, so I can look at the list and examine the highest-ranked ones to find the real original.
6) By repeating 5)-3 and 5)-4 over the few hundred Japanese translations, I can add further correct pairs (say 10-20) to the initial set and go through steps 2) - 5) again. I hope this will gradually improve the bilingual model as the number of pairs grows. I can then use the sufficiently improved bilingual model for CLIR and other interesting tasks.
----------------( end of my bootstrapping use of Infomap )----------------
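As for the transliteration mentioned in step 2, here is a minimal sketch of the kind of per-character conversion I mean. The high-bit test and the "j%02x%02x" token format are made up for illustration rather than copied from my code (and assume an EUC-JP-like encoding in which each Japanese character is a 2-byte sequence with the high bit set in both bytes), but they show the idea of turning each Japanese character into an ordinary ASCII token that the unmodified Infomap tokenizer can digest:

  #include <stdio.h>

  /* Illustration only: turn each 2-byte Japanese character into an
   * ASCII hex token such as "jb0a1"; plain ASCII passes through. */
  static void transliterate(const unsigned char *in, FILE *out)
  {
      while (*in) {
          if ((in[0] & 0x80) && (in[1] & 0x80)) {
              fprintf(out, " j%02x%02x ", in[0], in[1]);
              in += 2;
          } else {
              fputc(*in++, out);
          }
      }
  }

  int main(void)
  {
      /* 0xB0 0xA1 is one EUC-JP character; it comes out as the token "jb0a1" */
      const unsigned char sample[] = { 0xB0, 0xA1, ' ', 'n', 'e', 'w', 's', 0 };
      transliterate(sample, stdout);
      putchar('\n');
      return 0;
  }

One token per character (as in this sketch) keeps the change to Infomap itself minimal, at the cost of ignoring Japanese word segmentation.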
I have looked into count_artvec.c to confirm whether 5)-1 and 5)-2 would still work, but I am not sure that I fully understand the following code within process_region(), which I think is the key to whether my irregular usage is still all right.
  /* Add the vectors up */
  while( cursor <= region_out) {
    /* If this is a row label... */
    if( ( row = ((env->word_array)[int_buffer[cursor]]).row) >= 0)
      for( i=0; i < singvals; i++)
        tmpvector[i] += (env->matrix)[row][i];
    cursor++;
  }
My casual walk-through of the code suggests that the word_array lookup in the if statement above will still work with the words in int_buffer[] coming from the English-only news, and that it would give the document vector for each English story under the bilingual model (words that are not row labels in the bilingual model simply fail the row >= 0 test and are skipped). But I am not very confident about this.
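To make my reading of that loop concrete, here is a toy, self-contained paraphrase of how I understand the whole calculation after the swap in 5)-1: a document (or query) vector is the sum of the bilingual row vectors of those of its words that are row labels in the bilingual model, other words being skipped, and the candidates are then ranked by cosine similarity. The names, the tiny dictionary and the numbers below are made up for illustration and are not Infomap's own:

  #include <math.h>
  #include <stdio.h>
  #include <string.h>

  #define SINGVALS 3                        /* toy dimensionality */

  /* toy stand-ins for dic/wordvec.bin: a word's row is its index here */
  static const char *dic[] = { "reuters", "tokyo", "jb0a1" };
  static const float matrix[3][SINGVALS] = {
      { 0.9f, 0.1f, 0.0f },
      { 0.2f, 0.8f, 0.1f },
      { 0.1f, 0.7f, 0.3f },
  };

  static int lookup_row(const char *token)
  {
      int r;
      for (r = 0; r < 3; r++)
          if (strcmp(dic[r], token) == 0)
              return r;
      return -1;                            /* not a row label */
  }

  /* document vector = sum of the row vectors of the words found in the
   * model; other words just fail the row >= 0 test, as in the loop above */
  static void doc_vector(const char **tokens, int n, float *vec)
  {
      int t, i, row;
      memset(vec, 0, SINGVALS * sizeof *vec);
      for (t = 0; t < n; t++)
          if ((row = lookup_row(tokens[t])) >= 0)
              for (i = 0; i < SINGVALS; i++)
                  vec[i] += matrix[row][i];
  }

  /* cosine similarity, which I believe is how associate ranks documents */
  static float cosine(const float *a, const float *b)
  {
      float dot = 0, na = 0, nb = 0;
      int i;
      for (i = 0; i < SINGVALS; i++) {
          dot += a[i] * b[i];
          na  += a[i] * a[i];
          nb  += b[i] * b[i];
      }
      return (na > 0 && nb > 0) ? dot / (sqrtf(na) * sqrtf(nb)) : 0.0f;
  }

  int main(void)
  {
      const char *english[]  = { "reuters", "tokyo", "someunknownword" };
      const char *japanese[] = { "jb0a1" };
      float e[SINGVALS], j[SINGVALS];
      doc_vector(english, 3, e);
      doc_vector(japanese, 1, j);
      printf("similarity = %f\n", cosine(e, j));
      return 0;
  }

If that paraphrase is roughly right, then an English word that is not a row label in the (still small) bilingual model simply contributes nothing to its document vector, which is the behaviour I am relying on in 5)-2; please correct me if I have misread the code.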
Sorry for this long mail, but I would really appreciate your kind review and advice.
Best regards, Shuji