RE: [infomap-nlp-devel] Re: my_isalpha(). What else should I change to make InfoMap capable of handling multibyte characters?

Beate,
Yes, a tokenizer is needed outside of InfoMap to process corpora in languages
like Japanese, where words are written without spaces between them. I have
installed ChaSen and plan to use it for Japanese. For other such languages I
will look for similar tokenization tools.

Scott,
I have subscribed to the infomap-nlp-devel list.
I have skimmed through some Unicode sites and found the ones listed below
informative. Some of them include small examples.

On second thought, however, it may be quicker and more straightforward
to write a program that converts each Japanese character into a sequence of
ASCII letters, e.g. by writing its internal encoding in hexadecimal but using
the letters 'a' to 'p' in place of the usual digits 0-f, and converting back
afterwards. InfoMap would then be able to handle a Japanese word as just
another sequence of letters, though this would double the length of the word
representation within InfoMap. The obvious drawback is that Japanese words
cannot be read in InfoMap's direct output, which has to be converted back
before it shows meaningful characters.
If you can think of any other pitfalls in this sort of method, please let me
know.
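
To make the idea concrete, here is a minimal sketch in C. It assumes the
"internal encoding" is the raw UTF-8 byte sequence, and the function names
bytes_to_letters() and letters_to_bytes() are hypothetical, not part of
InfoMap. Each byte becomes two letters from 'a' to 'p' (high nibble first),
which is where the doubling in length comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encode n bytes as 2*n letters in 'a'..'p', high nibble first.
 * out must have room for 2*n + 1 chars (including the terminator). */
static void bytes_to_letters(const unsigned char *in, size_t n, char *out)
{
    size_t i;

    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        out[2 * i]     = (char)('a' + (in[i] >> 4));
        out[2 * i + 1] = (char)('a' + (in[i] & 0x0F));
    }
    out[2 * n] = '\0';
}

/* Decode a letter string produced by bytes_to_letters().
 * out must have room for strlen(in) / 2 bytes.
 * Returns the number of bytes written, or -1 if the input has odd
 * length or contains a character outside 'a'..'p'. */
static long letters_to_bytes(const char *in, unsigned char *out)
{
    size_t len, i;

    assert(in != NULL && out != NULL);
    len = strlen(in);
    if (len % 2 != 0)
        return -1;
    for (i = 0; i < len; i += 2) {
        int hi = in[i] - 'a';
        int lo = in[i + 1] - 'a';
        if (hi < 0 || hi > 15 || lo < 0 || lo > 15)
            return -1;
        out[i / 2] = (unsigned char)((hi << 4) | lo);
    }
    return (long)(len / 2);
}
```

With this mapping, the three UTF-8 bytes E6 97 A5 of a single Japanese
character would appear to InfoMap as the six-letter token "ogjhkf", and
letters_to_bytes() would turn that token back into the original bytes.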

Unicode sites
-------------------
http://www.cl.cam.ac.uk/~mgk25/unicode.html
  A good introductory site. The following sections are particularly useful for
making InfoMap UTF-8 capable.
  http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
  http://www.cl.cam.ac.uk/~mgk25/unicode.html#c
  Among the approaches discussed in these sections, we should probably aim for
the "hard-wired" and "hard conversion" approaches, even though they would not
be extensible to other multibyte encodings like EUC.

ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
  This is another useful site. The section below talks about how to modify C
programs.
  ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO-6.html

http://www.unix-systems.org/version2/whatsnew/login_mse.html
  A useful guide to the distinction between multibyte and wide-character
encodings.

Many thanks for your support.
Regards, Shuji

-----Original Message-----
From: inf...@li...
[mailto:inf...@li...] On Behalf Of Beate Dorow
Sent: Friday, March 12, 2004 8:10 AM
To: Scott James Cederberg
Cc: Shuji Yamaguchi; inf...@li...
Subject: Re: [infomap-nlp-devel] Re: my_isalpha(). What else should I change
to make InfoMap capable of handling multibyte characters?

Dear Shuji, Scott,

I think first of all, we'll need to detect word boundaries. This is
straightforward for the European languages where words are simply
separated by spaces, but probably not so easy for Japanese. I saw that
the old infomap folks used ChaSen, a tool for detecting word boundaries in
Japanese, when they did cross-lingual IR on a parallel corpus of
Japanese-English patent abstracts.

Do you have a tool at hand which detects the boundaries of Japanese words,
Shuji?

Best wishes,
Beate

On Thu, 11 Mar 2004, Scott James Cederberg wrote:

>Hi Shuji,
>
>   I will certainly give you access to CVS when it is ready.  You may
>   want to subscribe to inf...@li... to
>   make sure you receive all relevant announcements.
>
>   I've read about what UTF-8 is, but I've never used it in programs.
>   If you have C code (or pointers to C code) using UTF-8, please let
>   me know because I'd like to take a look.
>
>   What I do know is that UTF-8 characters can consist of a variable
>   number of bytes (from one to six, but I think generally only from
>   one to three).  Thus my_isalpha() (which is defined in lib/utils.c)
>   would need a different prototype.  For instance, it could take an
>   array of bytes ("char" datatype) and an argument telling it how
>   many bytes are in the array.  Or it could just take an array of
>   bytes without knowing its size and determine it by decoding the
>   UTF-8 (where the first byte encodes how many bytes are in the
>   character).
>
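
As a rough illustration of the prototype change described above, the sketch
below takes an array of bytes plus a length. The names utf8_seq_len() and
my_isalpha_utf8() are hypothetical, and the real my_isalpha() in lib/utils.c
may look quite different. It follows the current UTF-8 definition (RFC 3629),
which limits sequences to four bytes, and it assumes, purely for illustration,
that every well-formed non-ASCII character counts as a letter.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence that starts with lead byte b,
 * or 0 if b cannot start a sequence (continuation or invalid byte). */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80)
        return 1;                      /* 0xxxxxxx: plain ASCII */
    if (b >= 0xC2 && b <= 0xDF)
        return 2;                      /* 110xxxxx */
    if (b >= 0xE0 && b <= 0xEF)
        return 3;                      /* 1110xxxx */
    if (b >= 0xF0 && b <= 0xF4)
        return 4;                      /* 11110xxx */
    return 0;
}

/* Hypothetical multibyte variant of my_isalpha(): bytes points at one
 * character and n says how many bytes it occupies.  Returns nonzero if
 * the character should be treated as part of a word. */
static int my_isalpha_utf8(const unsigned char *bytes, size_t n)
{
    size_t len, i;

    assert(bytes != NULL);
    if (n == 0)
        return 0;
    len = utf8_seq_len(bytes[0]);
    if (len == 0 || len != n)
        return 0;                      /* malformed or wrong length */
    if (len == 1)
        return isalpha(bytes[0]) != 0; /* ASCII: same test as before */
    for (i = 1; i < len; i++)
        if ((bytes[i] & 0xC0) != 0x80)
            return 0;                  /* not a continuation byte */
    return 1;  /* assumption: any non-ASCII character is a letter */
}
```

The second option in the paragraph above, passing no size at all, would
amount to calling utf8_seq_len() on the first byte inside the function
instead of comparing it with n.
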
>   Unfortunately, the code for tokenization would also need to be
>   changed to work with UTF-8 characters.  The next_token() function
>   in preprocessing/tokenizer.c would need to be changed, for
>   starters.  Right now it steps through an array of C "chars";
>   probably it should instead call a function that returns the next
>   UTF-8 character from the input stream.  Calls to strlen() and
>   strncmp() and other C string functions would also need to be
>   replaced with UTF-8 aware functions.  (Presumably there is a
>   library of such functions available.)
>
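
The two changes mentioned above, a function that returns the next UTF-8
character and a UTF-8 aware replacement for strlen(), could look roughly like
the sketch below. utf8_next() and utf8_strlen() are hypothetical names rather
than functions from InfoMap or from any particular library, and the decoder is
deliberately simplified: it does not reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one character from the range [*p, end) and advance *p past it.
 * Returns the code point, or -1 on truncated or malformed input, in
 * which case *p advances by a single byte so the caller can resync. */
static long utf8_next(const unsigned char **p, const unsigned char *end)
{
    const unsigned char *s = *p;
    long cp;
    size_t extra, i;

    assert(s < end);
    if (s[0] < 0x80)                { cp = s[0];        extra = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; extra = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; extra = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; extra = 3; }
    else { *p = s + 1; return -1; }      /* stray continuation byte */

    if ((size_t)(end - s) <= extra) {    /* sequence is cut off */
        *p = s + 1;
        return -1;
    }
    for (i = 1; i <= extra; i++) {
        if ((s[i] & 0xC0) != 0x80) { *p = s + 1; return -1; }
        cp = (cp << 6) | (s[i] & 0x3F);  /* append 6 payload bits */
    }
    *p = s + extra + 1;
    return cp;
}

/* Count characters instead of bytes in a NUL-terminated UTF-8 string
 * by skipping continuation bytes (10xxxxxx). */
static size_t utf8_strlen(const char *str)
{
    size_t n = 0;

    for (; *str != '\0'; str++)
        if (((unsigned char)*str & 0xC0) != 0x80)
            n++;
    return n;
}
```

For a string of two three-byte Japanese characters, strlen() reports 6 while
utf8_strlen() reports 2, which is the kind of mismatch that would otherwise
throw off token lengths in next_token().
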
>   We could create a separate CVS branch for this line of development
>   (to be merged in later), since it's quite important and multiple
>   people might be able to contribute.  I can set that up once we have
>   our CVS house in order.
>
>                                                        Scott
>
>On Thu, Mar 11, 2004 at 06:07:19AM -0800, Shuji Yamaguchi wrote:
>> Hi Scott, Beate,
>>
>> As Beate wrote on my_isalpha(), I note it does not accept non-ASCII
>> characters from its outset.
>>
>> Are there any other parts of InfoMap I should take a closer look at and, if
>> necessary, change to make it capable of handling Japanese and other
>> multibyte characters?  I think I will have to do so by trial and error, but
>> if you could give me guidance it would streamline my process.
>>
>> I plan to use UTF-8 as the encoding. I hope that my changes will be
>> transparent to ASCII and can be brought back into the main release if we
>> want to. I would appreciate it if I could have access to CVS when it is
>> ready.
>>
>> Regards, Shuji
>
>
>_______________________________________________
>infomap-nlp-devel mailing list
>inf...@li...
>https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel
>
