[infomap-nlp-devel] Re: my_isalpha(). What else should I change to make InfoMap capable of handling multibyte characters?

Hi Shuji,

   I will certainly give you access to CVS when it is ready.  You may
   want to subscribe to inf...@li... to
   make sure you receive all relevant announcements.

   I've read about what UTF-8 is, but I've never used it in programs.
   If you have C code (or pointers to C code) using UTF-8, please let
   me know because I'd like to take a look.

   What I do know is that UTF-8 characters can consist of a variable
   number of bytes (one to four under RFC 3629; the original design
   allowed up to six, and most text, Japanese included, needs only
   one to three).  Thus my_isalpha() (which is defined in lib/utils.c)
   would need a different prototype.  For instance, it could take an
   array of bytes ("char" datatype) and an argument telling it how
   many bytes are in the array.  Or it could just take an array of
   bytes without knowing its size and determine it by decoding the
   UTF-8 (where the first byte encodes how many bytes are in the
   character).

   Unfortunately, the code for tokenization would also need to be
   changed to work with UTF-8 characters.  The next_token() function
   in preprocessing/tokenizer.c would need to be changed, for
   starters.  Right now it steps through an array of C "chars";
   probably it should instead call a function that returns the next
   UTF-8 character from the input stream.  Calls to strlen() and
   strncmp() and other C string functions would also need to be
   replaced with UTF-8 aware functions.  (Presumably there is a
   library of such functions available.)
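
   As a rough illustration of what next_token() could call instead
   of stepping through bytes, here is a sketch of a "next character"
   function and a character-counting replacement for strlen().  Both
   utf8_next() and utf8_strlen() are hypothetical names, not part of
   InfoMap or of any particular library, and the sketch assumes the
   input is a NUL-terminated buffer rather than a stream.

```c
#include <stddef.h>

/* Decode the UTF-8 character at *pp and advance *pp past it.
 * Returns the code point, 0 at the terminating NUL (without
 * advancing), or -1 for a malformed sequence (advancing one byte).
 * Overlong forms and surrogates are not rejected in this sketch. */
static long utf8_next(const char **pp)
{
    const unsigned char *p = (const unsigned char *)*pp;
    long cp;
    int extra, i;

    if (p[0] == 0) return 0;
    if (p[0] < 0x80)                { cp = p[0];        extra = 0; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; extra = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; extra = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; extra = 3; }
    else { *pp += 1; return -1; }

    for (i = 1; i <= extra; i++) {
        /* A NUL or any non-continuation byte ends the sequence early. */
        if ((p[i] & 0xC0) != 0x80) { *pp += 1; return -1; }
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    *pp += extra + 1;
    return cp;
}

/* Number of characters (not bytes) in s: count every byte that is
 * not a 10xxxxxx continuation byte. */
static size_t utf8_strlen(const char *s)
{
    size_t n = 0;
    for (; *s; s++)
        if (((unsigned char)*s & 0xC0) != 0x80) n++;
    return n;
}
```

   Byte-wise comparison still works for testing whether two UTF-8
   strings are equal; the risk with strncmp() is that a byte count
   can cut a multibyte character in half.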

   We could create a separate CVS branch for this line of development
   (to be merged in later), since it's quite important and multiple
   people might be able to contribute.  I can set that up once we have
   our CVS house in order.

                                                        Scott

On Thu, Mar 11, 2004 at 06:07:19AM -0800, Shuji Yamaguchi wrote:
> Hi Scott, Beate,
> 
> As Beate wrote about my_isalpha(), I note that it has not accepted
> non-ASCII characters from the outset.
> 
> Are there any other parts of InfoMap I should give a closer look and if
> necessary change for making it capable of handling Japanese and other
> multibyte characters?  I think I have to do so by trial and error, but if
> you could give me guidance it would streamline my process.
> 
> I plan to use UTF-8 as the encoding. I hope that my changes would be
> transparent to ASCII and could be brought back to the main release if we
> want to. I would appreciate it if I could have access to CVS when it is ready.
> 
> Regards, Shuji
