From: Shuji Y. <yam...@ya...> - 2004-04-30 18:15:10
Dear Dominic,

Thank you for your advice, and sorry for not responding to you sooner. My replies to your points are as follows.

> "In your creation of an initial bilingual model (steps 1 and 2), is it
> possible that any Japanese and English words will be accidentally
> represented by the same strings of 8 byte characters, or is there some
> way of avoiding this possibility?"

A good point. I had noted it, so my program appends an underscore '_' to the transliterated hex string (i.e. tr/0-9a-f/g-v/ in Perl) for each Japanese word. This gets around the clash problem you kindly pointed out, assuming that few English words naturally begin with '_'.

> "Please let me know if this interpretation is correct, and if so how well
> it works - I definitely think it's worth a try. One worry I have is using
> such a small training set will give very little information about most
> words - many won't appear, and all those that appear with unit frequency
> within the same document will be mapped to exactly the same vector. But it
> will at least be relatively easy to test, by comparing English searches
> over the larger model to those within the small model."

Thank you for reading my lengthy explanation; yes, your interpretation is correct. I have now gone through 5-6 cycles of the bootstrapping process, and the results are mixed. Even with the small bilingual model built from 200 pairs, the document similarity ranking gives a good indication of the original English news item for a Japanese translation covering the same or a similar event, especially when the news comes from a small editing bureau. It does not perform well (accuracy below 10-20%), however, at identifying the original English item when it was filed from a large editing bureau, e.g. in London or New York. This is because the large bureaux often file more than 5-10 items on a single event during a day, with slightly different angles and timings, which is not the case for the small bureaux. The small word-vector model is not fine-grained enough to pinpoint the exact original among 5-10 similar items.

> "another way might be to select fairly unambiguous names (such as "Iraq",
> "Dow Jones stock exchange", "UNICEF") and artificially treat the English
> and Japanese versions of these names as identical content-bearing words,
> now that Beate has enabled the users to choose these words for themselves."

This sounds like a very interesting approach. I will explore it further.

Best regards,
Shuji

-----Original Message-----
From: inf...@li... [mailto:inf...@li...] On Behalf Of Dominic Widdows
Sent: Wednesday, April 21, 2004 5:15 PM
To: Shuji Yamaguchi
Cc: inf...@li...
Subject: Re: [infomap-nlp-devel] Use of Infomap for bootstrapping text alignment. Wondering whether someone could review my method.

Dear Shuji,

Thanks for this message, and for trying out the infomap software in such a creative fashion. Please do not hesitate to ask our opinions on such matters - I'm sure we would all be delighted if our work could be put to positive use with AlertNet. My main regret is that I might not have sufficient time or expertise to give as much help as I would like, but I will gladly contribute where I can.

Here are some suggestions of possible pitfalls - I don't know if any of them will actually occur. It sounds as though your approach is a very promising way of building a cross-lingual system from a comparable corpus with some seed alignment, a development we have wished for for some time.
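To make the clash-avoidance concrete, here is a minimal stand-alone C sketch of the kind of encoding Shuji describes in his reply at the top of this message: each byte of a Japanese word becomes two hex digits, the hex digits 0-9a-f are shifted into the range g-v (the tr/0-9a-f/g-v/ step), and a '_' marker keeps the result from colliding with any natural English word. The function name and the example bytes are invented for the sketch; they are not taken from Shuji's script or from the Infomap sources.

    #include <stdio.h>

    /* Sketch only: encode a Japanese word (given as raw bytes) as an ASCII
     * token.  Each byte becomes two hex digits, the hex digits are shifted
     * from 0-9a-f to g-v, and a '_' marker is added so that the token
     * cannot clash with a natural English word. */
    static void encode_token(const unsigned char *word, size_t len,
                             char *out, size_t outsize)
    {
        static const char shifted[] = "ghijklmnopqrstuv"; /* 0..15 -> 'g'..'v' */
        size_t i, pos = 0;

        if (outsize == 0)
            return;
        if (pos + 1 < outsize)
            out[pos++] = '_';                     /* the underscore marker */
        for (i = 0; i < len && pos + 2 < outsize; i++) {
            out[pos++] = shifted[word[i] >> 4];   /* high nibble */
            out[pos++] = shifted[word[i] & 0x0f]; /* low nibble  */
        }
        out[pos] = '\0';
    }

    int main(void)
    {
        /* Example input: four bytes standing in for a two-character
         * Japanese word (made up for the sketch, not a real corpus token). */
        const unsigned char jp[] = { 0xc6, 0xfc, 0xcb, 0xdc };
        char token[64];

        encode_token(jp, sizeof jp, token, sizeof token);
        printf("%s\n", token);                    /* prints "_smvssrts" */
        return 0;
    }

Whether the marker is prefixed or appended makes no difference to clash avoidance; the sketch puts it at the front, which matches the remark above that few English words begin with '_'.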
In your creation of an initial bilingual model (steps 1 and 2), is it possible that any Japanese and English words will be accidentally represented by the same strings of 8 byte characters, or is there some way of avoiding this possibility?

Now, your specific stage:

> 5)-1. Replace word vector files (i.e. wordvec.bin, word2offset.dir/pag,
> offset2word.dir/pag) and the dictionary file (dic) in the English model
> obtained in 2) by the ones from the bilingual training model obtained in
> 1).

If I understand it correctly, you'd be replacing the vectors and dictionary for a larger English collection with a much smaller set determined just from the aligned pairs? Then in 5)-2, you compute document vectors for the English collection using vectors from this smaller model, and in 5)-3 you test this by seeing whether a query built from a Japanese document retrieves its English counterpart? And if this works well, you can feed in other Japanese documents and treat their best matches as potential translations, increasing the size of the aligned set.

Please let me know if this interpretation is correct, and if so how well it works - I definitely think it's worth a try. One worry I have is that using such a small training set will give very little information about most words - many won't appear, and all those that appear with unit frequency within the same document will be mapped to exactly the same vector. But it will at least be relatively easy to test, by comparing English searches over the larger model to those within the small model.

An alternative to using the vectors from the small aligned model might be to use the larger English model to get term vectors for the Japanese words in the aligned documents (by averaging the vectors of the documents these terms appear in). But you'd still have the problem that two Japanese words of unit frequency appearing in the same documents would be mapped to the same vector.

If it doesn't work well with documents, another way might be to select fairly unambiguous names (such as "Iraq", "Dow Jones stock exchange", "UNICEF") and artificially treat the English and Japanese versions of these names as identical content-bearing words, now that Beate has enabled the users to choose these words for themselves. Can you get a list of English/Japanese term-pairs like this fairly easily? I remember you presented a PowerPoint slide once with a few of these - were they drawn from a larger collection?

Please let me know how you get on.

Best wishes,
Dominic

On Tue, 20 Apr 2004, Shuji Yamaguchi wrote:

> Hi all,
>
> I wonder whether some of you could review and validate my bootstrapping
> use of Infomap for bilingual text alignment. I have described it in detail
> below. I especially wonder whether my steps 5)-1 and 5)-2 are a (not
> authentic, but still doable) short-cut for calculating document vectors of
> newly added documents under an existing model.
>
> My bootstrapping use of Infomap
> ----------------------------------------------
> Given a comparable English and Japanese news corpus from Reuters (which I
> work for), my task is to find the original English news item for a given
> translated Japanese news item. Roughly speaking, there are 10 times as many
> English news items as Japanese ones.
>
> I use Infomap as follows to narrow down the candidate English originals
> for a Japanese translation.
>
> 1) Initially, 120 bilingual news pairs are identified rather manually and
> used as an initial "training" bilingual corpus.
>
> 2) Each pair of news items is merged into a single text file. All of the
> pairs are fed into Infomap to come up with a pseudo-bilingual training
> model. (NB: I have not yet used the unreleased bilingual Infomap. I have
> converted each 2-byte Japanese character into a special transliterated hex
> string to get around the current 8 byte-per-character assumption in
> Infomap. I have also modified count_wordvec.c locally in my copy of Infomap
> so that a whole bilingual file falls into one "context window" for
> co-occurrence analysis.)
>
> 3) Now a few thousand English news items (reported on a particular date)
> are taken out of the rest of the corpus and fed into Infomap to create
> another English-only monolingual model. Some of these English items are the
> originals of a few hundred Japanese translated items from the same date.
> (NB: Actually a small percentage of the originals may have been reported on
> the previous date due to the time difference, but this is ignored for the
> moment.)
>
> 4) My basic idea is to calculate the document vectors for all of the
> English items and a given Japanese translation in 3) above under the
> bilingual training model created in 2) above, to compare their similarity,
> to look at the few English items with the highest similarity scores, and to
> select the real original from among them.
>
> 5) To make the best use of the Infomap software, I have been doing the
> following to implement the idea in 4) above:
>
> 5)-1. Replace the word vector files (i.e. wordvec.bin, word2offset.dir/pag,
> offset2word.dir/pag) and the dictionary file (dic) in the English model
> obtained in 2) by the ones from the bilingual training model obtained in 1).
>
> 5)-2. Recalculate the document vector files (artvec.bin, art2offset.dir/pag,
> offset2art.dir/pag) of the English model with the count_artvec command. I
> suppose this calculates document vectors under the bilingual model because
> of the word vector file replacement in 5)-1.
>
> 5)-3. Treat the given Japanese translation as a long query and calculate
> its vector with my slightly modified version of the "associate -d" command
> (which accepts a filename of the Japanese translation as well), running
> against the English model with the bilingual word vectors created in the
> 5)-2 step above.
>
> 5)-4. The associate command nicely lists the English news documents in
> similarity order for the Japanese translation as query, so I look at the
> list and examine the highest-ranked ones to find the real original.
>
> 6) By repeating 5)-3 and 5)-4 over the few hundred Japanese translations,
> I can add further correct pairs (say 10-20) to the initial set of pairs and
> go through steps 2) - 5) again. I hope this will gradually improve the
> bilingual model with a growing number of pairs. I can then use the
> sufficiently improved bilingual model for CLIR and other interesting tasks.
>
> ---------------( end of my bootstrapping use of Infomap )---------------
>
> I have looked into count_artvec.c to confirm whether 5)-1 and 5)-2 would
> still work fine, but I am not sure whether I fully understand the following
> code within process_region(), which I think is the key to whether my
> irregular usage is still all right.
>
> /* Add the vectors up */
> while( cursor <= region_out) {
>   /* If this is a row label... */
>   if( ( row = ((env->word_array)[int_buffer[cursor]]).row) >= 0)
>     for( i=0; i < singvals; i++)
>       tmpvector[i] += (env->matrix)[row][i];
>   cursor++;
> }
>
> My casual walk-through of the code suggests that the word_array in the IF
> statement above will still work fine with the words in int_buffer[] from an
> English-only news item, and that it would give the document vector for the
> English news under the bilingual model. But I am not very confident about it.
>
> Sorry for such a long mail, but I would really appreciate your kind review
> and advice.
>
> Best regards, Shuji
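On the process_region() excerpt quoted above: the loop looks each token of a document up in word_array and adds its reduced word vector only when the lookup succeeds (row >= 0), so tokens missing from the dictionary simply contribute nothing. On that reading, after dic and wordvec.bin are replaced with the bilingual versions (step 5)-1), count_artvec builds each English document vector from whatever bilingual vocabulary the document contains (step 5)-2), and the associate step then ranks those vectors against the query vector built from a Japanese translation (steps 5)-3 and 5)-4). The sketch below mirrors that logic with simplified stand-ins; the data structures, names and numbers are invented for illustration and are not the actual Infomap code or API.

    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    #define SINGVALS 3   /* reduced dimensionality; tiny here for the sketch */

    /* A toy dictionary-plus-word-vector table standing in for dic/wordvec.bin. */
    struct word_entry {
        const char *word;
        double vec[SINGVALS];
    };

    static const struct word_entry dict[] = {
        { "_smvssrts",  { 0.9, 0.1, 0.0 } },  /* an encoded Japanese token */
        { "earthquake", { 0.8, 0.2, 0.1 } },
        { "market",     { 0.1, 0.9, 0.2 } },
    };

    /* Return the row of a token in the dictionary, or -1 if it is unknown --
     * the same convention as the "row >= 0" test in process_region(). */
    static int lookup_row(const char *token)
    {
        size_t i;
        for (i = 0; i < sizeof dict / sizeof dict[0]; i++)
            if (strcmp(dict[i].word, token) == 0)
                return (int)i;
        return -1;
    }

    /* Fold a tokenised document into the reduced space: sum the vectors of
     * the tokens the dictionary knows about and skip all the rest. */
    static void document_vector(const char *tokens[], size_t ntokens,
                                double out[SINGVALS])
    {
        size_t t;
        int row, i;

        for (i = 0; i < SINGVALS; i++)
            out[i] = 0.0;
        for (t = 0; t < ntokens; t++)
            if ((row = lookup_row(tokens[t])) >= 0)
                for (i = 0; i < SINGVALS; i++)
                    out[i] += dict[row].vec[i];
    }

    /* Cosine similarity, used here to rank an English document vector
     * against the query vector built from a Japanese translation. */
    static double cosine(const double a[SINGVALS], const double b[SINGVALS])
    {
        double dot = 0.0, na = 0.0, nb = 0.0;
        int i;

        for (i = 0; i < SINGVALS; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        if (na == 0.0 || nb == 0.0)
            return 0.0;
        return dot / (sqrt(na) * sqrt(nb));
    }

    int main(void)
    {
        /* A Japanese "query" document and one English candidate, both already
         * tokenised; unknown tokens are silently ignored, as in count_artvec. */
        const char *japanese_doc[] = { "_smvssrts", "unseen-token" };
        const char *english_doc[]  = { "earthquake", "market", "unseen-token" };
        double q[SINGVALS], d[SINGVALS];

        document_vector(japanese_doc, 2, q);
        document_vector(english_doc,  3, d);
        printf("similarity = %.3f\n", cosine(q, d));
        return 0;
    }

On this reading, English words that never occur in the aligned pairs are simply skipped, so the ranking in steps 5)-3 and 5)-4 is driven entirely by the vocabulary of the small bilingual dictionary - which is consistent with Shuji's walk-through of the loop, and also with Dominic's worry that such a small training set gives very little information about most words.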