From: Scott J. C. <ced...@cs...> - 2004-03-09 19:40:36
Dominic and Shuji,

I'm CC'ing this reply to infomap-nlp-devel, because in theory the sort of discussion touched off by Dominic's message below (about how to add this feature) should take place there.

I'm not familiar with where and how count_wordvec chooses the content-bearing words, but I think the easiest thing would be to modularize the part where it does that (e.g. into a separate function), and then create another function that instead reads the content-bearing words from a file. Which function was called could be controlled by a command-line option.

I've already got a bit of a backlog of reported but unfixed bugs; I'm hoping to dig my way out from under that by the end of the week. Hopefully next week I would then have time to add this feature.

If anyone else wants to take it on, please let me know.

Scott

On Fri, Mar 05, 2004 at 07:00:19PM -0800, Dominic Widdows wrote:
> Hi Scott,
>
> I know we talked about this in the past - is it doable, or shall we tell
> people it's on the back burner?
>
> As far as I can tell, it's just a question of putting a different list of
> words into memory and telling the count_wordvec program to look there.
> Which could be a total can of worms in C.
>
> Best wishes,
> Dominic
>
> ---------- Forwarded message ----------
> Date: Fri, 5 Mar 2004 18:53:17 -0800
> From: Shuji Yamaguchi <yam...@ya...>
> To: inf...@li...
> Subject: [infomap-nlp-users] Infomap. Can I choose and feed
> "content-bearing words" to "count_wordvec"?
>
> Hi InfoMap admin and users,
>
> I wonder whether I could choose the "content-bearing words" myself and feed
> them into the pre-processing of InfoMap.
> count_wordvec appears to be the program that does it. According to its
> man page, the content words are chosen from the ones in "ranking 50-1049".
> Is there any way to customize this by use of options and/or parameters?
>
> Thank you for your support.
> Regards, Shuji
>
> Shuji Yamaguchi,
> Fellow, Reuters Digital Vision Program, CSLI, Stanford.

--
Scott Cederberg
Researcher

Infomap Project
Computational Semantics Lab
Center for the Study of Language and Information (CSLI)
Stanford University

http://infomap.stanford.edu/
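As a concrete illustration of the split described above, a minimal sketch of the "read the content-bearing words from a file" half might look like the following. The function name, the one-word-per-line file format, and the 1000-word cap are assumptions made for the sketch, not the actual count_wordvec code; the existing frequency-based selection would remain the default path.

/*
 * Hypothetical sketch of reading content-bearing words from a file,
 * one word per line.  Not infomap code; names and format are
 * assumptions for illustration only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORD_LEN 256
#define MAX_CONTENT_WORDS 1000   /* default scheme uses ranks 50-1049 */

/* Read up to max_words words into words[]; returns the number read,
   or -1 if the file cannot be opened. */
static int read_content_words(const char *path, char **words, int max_words)
{
    FILE *fp = fopen(path, "r");
    char buf[MAX_WORD_LEN];
    int n = 0;

    if (fp == NULL)
        return -1;
    while (n < max_words && fgets(buf, sizeof(buf), fp) != NULL) {
        buf[strcspn(buf, "\r\n")] = '\0';   /* strip trailing newline */
        if (buf[0] != '\0')
            words[n++] = strdup(buf);
    }
    fclose(fp);
    return n;
}

int main(int argc, char **argv)
{
    char *words[MAX_CONTENT_WORDS];
    int i, n;

    /* A real integration would branch here on a command-line option:
       with no word file given, fall back to the existing
       frequency-based selection inside count_wordvec. */
    if (argc < 2) {
        fprintf(stderr, "usage: %s content_words.txt\n", argv[0]);
        return 1;
    }
    n = read_content_words(argv[1], words, MAX_CONTENT_WORDS);
    if (n < 0) {
        perror(argv[1]);
        return 1;
    }
    for (i = 0; i < n; i++)
        printf("content word %d: %s\n", i, words[i]);
    return 0;
}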
From: Beate D. <do...@IM...> - 2004-03-10 17:47:56
Dear Scott,

I am busy writing lately, but I don't mind adding this feature. Do you think it would be early enough if I did it over the weekend?

It's the initialize_column_indices routine (in dict.c) which picks the column labels. I remember that we did earlier experiments with picking the top words according to tf-idf as column labels rather than the most frequent ones.

I think it wouldn't be a big deal to hand initialize_column_indices a Boolean variable $FROM_FILE which indicates whether the column indices should be computed or read from a file. We could let a user "turn on" this variable by adding an option -cols_from_file to infomap-build, which passes the value to initialize_column_indices via count_wordvec.c. Does that make sense?

Best wishes,
Beate

On Tue, 9 Mar 2004, Scott James Cederberg wrote:
> I'm not familiar with where and how count_wordvec chooses the
> content-bearing words, but I think the easiest thing would be to
> modularize the part where it does that (e.g. into a separate function),
> and then create another function that instead reads the content-bearing
> words from a file. [...]
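And a rough sketch of how such a -cols_from_file option might be threaded down to initialize_column_indices, roughly as Beate suggests. The option name and the routine name come from the message above; the modified signature, the option parsing, and the stub body are assumptions (the real routine lives in dict.c and picks the column labels from corpus counts).

/*
 * Sketch of passing a -cols_from_file option through to
 * initialize_column_indices.  The stub body only reports what it
 * was asked to do; it is not the real dict.c routine.
 */
#include <stdio.h>
#include <string.h>

static int initialize_column_indices(int from_file, const char *col_file)
{
    if (from_file)
        printf("would read column labels from %s\n", col_file);
    else
        printf("would pick column labels by corpus frequency\n");
    return 0;
}

int main(int argc, char **argv)
{
    int from_file = 0;
    const char *col_file = NULL;
    int i;

    /* infomap-build would pass something like
       "-cols_from_file my_words.txt" down to count_wordvec. */
    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-cols_from_file") == 0 && i + 1 < argc) {
            from_file = 1;
            col_file = argv[++i];
        }
    }
    return initialize_column_indices(from_file, col_file);
}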
From: Scott J. C. <ced...@cs...> - 2004-03-10 23:18:17
Beate,

Thanks for your help! What you describe sounds like a reasonable approach.

Unfortunately, I need to do some housekeeping with our CVS repository before it can be changed by multiple people without making a mess. I am planning to do that by the end of the week, and I'll get back to you.

Scott

On Wed, Mar 10, 2004 at 06:37:49PM +0100, Beate Dorow wrote:
> I am busy writing lately, but I don't mind adding this feature. Do you
> think it would be early enough if I did it over the weekend? [...]

--
Scott Cederberg
Researcher

Infomap Project
Computational Semantics Lab
Center for the Study of Language and Information (CSLI)
Stanford University

http://infomap.stanford.edu/
From: Shuji Y. <yam...@ya...> - 2004-03-11 14:17:39
Hi Scott, Beate,

As Beate wrote regarding my_isalpha(), I note that it does not accept non-ASCII characters as it stands.

Are there any other parts of InfoMap I should look at more closely, and if necessary change, to make it capable of handling Japanese and other multibyte characters? I expect I will have to do some of this by trial and error, but any guidance you could give me would streamline the process.

I plan to use UTF-8 as the encoding. I hope that my changes will be transparent to ASCII and can be brought back into the main release if we want to. I would appreciate access to CVS when it is ready.

Regards, Shuji

-----Original Message-----
From: Scott James Cederberg [mailto:ced...@cs...]
Sent: Wednesday, March 10, 2004 3:08 PM
To: Beate Dorow
Cc: inf...@li...; yam...@ya...
Subject: Re: [infomap-nlp-devel] Re: [infomap-nlp-users] Infomap. Can I choose and feed "content-bearing words" to "count_wordvec"? (fwd)

[...]
From: Scott J. C. <ced...@cs...> - 2004-03-11 20:35:50
On the bright side, any code that works with UTF-8 will automatically work with ASCII, since ASCII characters are valid UTF-8 characters.

Scott
From: Scott J. C. <ced...@cs...> - 2004-03-11 20:34:17
Hi Shuji,

I will certainly give you access to CVS when it is ready. You may want to subscribe to inf...@li... to make sure you receive all relevant announcements.

I've read about what UTF-8 is, but I've never used it in programs. If you have C code (or pointers to C code) using UTF-8, please let me know, because I'd like to take a look.

What I do know is that UTF-8 characters can consist of a variable number of bytes (from one to six, but I think generally only from one to three). Thus my_isalpha() (which is defined in lib/utils.c) would need a different prototype. For instance, it could take an array of bytes ("char" datatype) and an argument telling it how many bytes are in the array. Or it could just take an array of bytes without knowing its size and determine the size by decoding the UTF-8 (where the first byte encodes how many bytes are in the character).

Unfortunately, the tokenization code would also need to be changed to work with UTF-8 characters. The next_token() function in preprocessing/tokenizer.c would need to be changed, for starters. Right now it steps through an array of C "chars"; probably it should instead call a function that returns the next UTF-8 character from the input stream. Calls to strlen() and strncmp() and other C string functions would also need to be replaced with UTF-8-aware functions. (Presumably there is a library of such functions available.)

We could create a separate CVS branch for this line of development (to be merged in later), since it's quite important and multiple people might be able to contribute. I can set that up once we have our CVS house in order.

Scott

On Thu, Mar 11, 2004 at 06:07:19AM -0800, Shuji Yamaguchi wrote:
> As Beate wrote regarding my_isalpha(), I note that it does not accept
> non-ASCII characters as it stands.
>
> Are there any other parts of InfoMap I should look at more closely, and
> if necessary change, to make it capable of handling Japanese and other
> multibyte characters? [...]
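As a rough sketch of the helpers Scott describes, the code below works out a UTF-8 character's byte count from its lead byte and reads one character at a time from a stream. It is not infomap code; the function names and error handling are assumptions. (RFC 3629, published in late 2003, restricts UTF-8 sequences to at most four bytes, which is why the sketch stops there.)

/*
 * Minimal UTF-8 helpers: sequence length from the lead byte, and a
 * "next character" reader of the kind next_token() could call.
 */
#include <stdio.h>

/* Number of bytes in the UTF-8 sequence starting with lead byte b,
   or 0 if b cannot start a sequence (e.g. it is a continuation byte). */
static int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;            /* 0xxxxxxx: plain ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx              */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx              */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx              */
    return 0;
}

/* Read the next UTF-8 character from fp into buf (at least 5 bytes).
   Returns the number of bytes read, 0 on EOF, -1 on malformed input. */
static int next_utf8_char(FILE *fp, unsigned char *buf)
{
    int c = getc(fp);
    int len, i;

    if (c == EOF)
        return 0;
    buf[0] = (unsigned char) c;
    len = utf8_seq_len(buf[0]);
    if (len == 0)
        return -1;
    for (i = 1; i < len; i++) {
        c = getc(fp);
        if (c == EOF || (c & 0xC0) != 0x80)   /* must be 10xxxxxx */
            return -1;
        buf[i] = (unsigned char) c;
    }
    buf[len] = '\0';
    return len;
}

int main(void)
{
    unsigned char ch[5];
    int len;

    /* Echo stdin one UTF-8 character at a time, reporting its length. */
    while ((len = next_utf8_char(stdin, ch)) > 0)
        printf("%s (%d byte%s)\n", (char *) ch, len, len == 1 ? "" : "s");
    return len < 0 ? 1 : 0;
}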
From: Beate D. <do...@IM...> - 2004-03-12 16:21:17
Dear Shuji, Scott,

I think that first of all we'll need to detect word boundaries. This is straightforward for the European languages, where words are simply separated by spaces, but probably not so easy for Japanese. I saw that the old infomap folks used ChaSen, a tool for detecting word boundaries in Japanese, when they did cross-lingual IR on a parallel corpus of Japanese-English patent abstracts.

Do you have a tool at hand which detects the boundaries of Japanese words, Shuji?

Best wishes,
Beate

On Thu, 11 Mar 2004, Scott James Cederberg wrote:
> Unfortunately, the tokenization code would also need to be changed to
> work with UTF-8 characters. The next_token() function in
> preprocessing/tokenizer.c would need to be changed, for starters. [...]
From: Shuji Y. <yam...@ya...> - 2004-03-15 10:03:54
Beate,

Yes, a tokenizer is needed outside of InfoMap to process corpora in languages like Japanese, where words are run together. I have installed and plan to use ChaSen for Japanese. For other such languages I will look for similar tokenization tools.

Scott,

I have subscribed to the infomap-nlp-devel list.

I have skimmed through some Unicode sites and found the ones listed below informative. Some of them include small examples.

I have had second thoughts, however: it may be quicker and more straightforward to write a program which converts each Japanese character to a string of letters (e.g. by mapping its internal encoding, in hexadecimal, to the characters 'a' through 'p' instead of the usual 0-f, and vice versa). InfoMap would then be able to handle a 'Japanese' word as just another sequence of letters, though it would double the length of the word representation within InfoMap. The obvious drawback is that you cannot read a Japanese word directly in InfoMap's output; it has to be converted back before it can be displayed as a meaningful character.

If you can think of any other pitfalls in this sort of method, please let me know.

Unicode sites
-------------------

http://www.cl.cam.ac.uk/~mgk25/unicode.html
A good introductory site. The following sections are particularly useful for making InfoMap UTF-8 capable:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
http://www.cl.cam.ac.uk/~mgk25/unicode.html#c
Among the approaches discussed in this section, we should probably aim for the "hard-wired" and "hard conversion" approaches, even though they would not be extensible to other multibyte encodings like EUC.

ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
Another useful site. The section below talks about how to modify C programs:
ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO-6.html

http://www.unix-systems.org/version2/whatsnew/login_mse.html
A useful guide to the distinction between multibyte and wide-character encodings.

Many thanks for your support.

Regards, Shuji

-----Original Message-----
From: inf...@li... [mailto:inf...@li...] On Behalf Of Beate Dorow
Sent: Friday, March 12, 2004 8:10 AM
To: Scott James Cederberg
Cc: Shuji Yamaguchi; inf...@li...
Subject: Re: [infomap-nlp-devel] Re: my_isalpha(). What else should I change to make InfoMap capable of handling multibyte characters?

[...]
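A minimal sketch of the transliteration Shuji describes, assuming the "internal encoding" being mapped is the word's UTF-8 byte sequence: each non-ASCII byte is split into two hex nibbles and each nibble is written as a letter from 'a' to 'p'. The function name and the choice to pass ASCII bytes through unchanged are assumptions made for illustration.

/*
 * Byte-to-letter transliteration sketch: non-ASCII bytes become two
 * letters in 'a'..'p' (hex nibbles), so the word looks alphabetic.
 */
#include <stdio.h>

/* Encode src into dst; every non-ASCII byte becomes two letters,
   doubling that part of the word's length. */
static void encode_word(const unsigned char *src, char *dst)
{
    while (*src != '\0') {
        if (*src < 0x80) {
            *dst++ = (char) *src++;        /* pass ASCII through   */
        } else {
            *dst++ = 'a' + (*src >> 4);    /* high nibble -> a..p  */
            *dst++ = 'a' + (*src & 0x0F);  /* low nibble  -> a..p  */
            src++;
        }
    }
    *dst = '\0';
}

int main(void)
{
    /* UTF-8 bytes of a Japanese word ("Nihongo") as an example. */
    const unsigned char word[] = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";
    char encoded[64];

    encode_word(word, encoded);
    printf("%s\n", encoded);   /* prints "ogjhkfogjmkmoikkjo" */
    return 0;
}

Decoding is just the reverse mapping (pairs of 'a'-'p' letters back to bytes), which is what InfoMap's output would have to go through before it could be displayed as readable Japanese again.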
From: Scott J. C. <ced...@cs...> - 2004-03-16 17:50:48
Shuji,

Thanks for the pointers to the Unicode sites.

On Mon, Mar 15, 2004 at 02:03:48AM -0800, Shuji Yamaguchi wrote:
> I have had second thoughts, however: it may be quicker and more
> straightforward to write a program which converts each Japanese character
> to a string of letters (e.g. by mapping its internal encoding, in
> hexadecimal, to the characters 'a' through 'p' instead of the usual 0-f,
> and vice versa). [...]
>
> If you can think of any other pitfalls in this sort of method, please let
> me know.

So you're saying that any (alphabetic) Japanese character would be represented by a unique string of alphabetic ASCII characters? That's an interesting approach. I'd like to think it over a little more before offering other comments.

I'm sorry for the delay in making CVS available. I'll post to the infomap-nlp-devel list when it's ready.

Scott