From: Scott J. C. <ced...@cs...> - 2004-03-09 19:40:36
Dominic and Shuji,

I'm CC'ing this reply to infomap-nlp-devel, because in theory the sort of discussion touched off by Dominic's message below (about how to add this feature) should take place there.

I'm not familiar with where and how count_wordvec chooses the content-bearing words, but I think the easiest thing would be to modularize the part where it does that (e.g. into a separate function), and then create another function that instead reads the content-bearing words from a file. Which function was called could be controlled by a command-line option.

I've already got a bit of a backlog of reported but unfixed bugs; I'm hoping to dig my way out from under that by the end of the week. Hopefully next week I would then have time to add this feature.

If anyone else wants to take it on, please let me know.

Scott

On Fri, Mar 05, 2004 at 07:00:19PM -0800, Dominic Widdows wrote:
> Hi Scott,
>
> I know we talked about this in the past - is it doable, or shall we tell
> people it's on the back burner?
>
> As far as I can tell, it's just a question of putting a different list of
> words into memory and telling the count_wordvec program to look there.
> Which could be a total can of worms in C.
>
> Best wishes,
> Dominic
>
> ---------- Forwarded message ----------
> Date: Fri, 5 Mar 2004 18:53:17 -0800
> From: Shuji Yamaguchi <yam...@ya...>
> To: inf...@li...
> Subject: [infomap-nlp-users] Infomap. Can I choose and feed
> "content-bearing words" to "count_wordvec"?
>
> Hi InfoMap admin and users,
>
> I wonder whether I could choose the "content-bearing words" myself and feed
> them into the pre-processing of InfoMap.
> count_wordvec appears to be the program that does it. According to its
> man page, the content words are chosen from the ones in "ranking 50-1049".
> Is there any way to customize this by use of options and/or parameters?
>
> Thank you for your support.
> Regards, Shuji
>
> Shuji Yamaguchi,
> Fellow, Reuters Digital Vision Program, CSLI, Stanford.

--
Scott Cederberg
Researcher

Infomap Project
Computational Semantics Lab
Center for the Study of Language and Information (CSLI)
Stanford University

http://infomap.stanford.edu/
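As a concrete illustration of the split described above, a minimal sketch of the "read the content-bearing words from a file" half might look like the following. The function name, the one-word-per-line file format, and the 1000-word cap are assumptions made for the sketch, not the actual count_wordvec code; the existing frequency-based selection would remain the default path.

/*
 * Hypothetical sketch of reading content-bearing words from a file,
 * one word per line.  Not infomap code; names and format are
 * assumptions for illustration only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORD_LEN 256
#define MAX_CONTENT_WORDS 1000   /* default scheme uses ranks 50-1049 */

/* Read up to max_words words into words[]; returns the number read,
   or -1 if the file cannot be opened. */
static int read_content_words(const char *path, char **words, int max_words)
{
    FILE *fp = fopen(path, "r");
    char buf[MAX_WORD_LEN];
    int n = 0;

    if (fp == NULL)
        return -1;
    while (n < max_words && fgets(buf, sizeof(buf), fp) != NULL) {
        buf[strcspn(buf, "\r\n")] = '\0';   /* strip trailing newline */
        if (buf[0] != '\0')
            words[n++] = strdup(buf);
    }
    fclose(fp);
    return n;
}

int main(int argc, char **argv)
{
    char *words[MAX_CONTENT_WORDS];
    int i, n;

    /* A real integration would branch here on a command-line option:
       with no word file given, fall back to the existing
       frequency-based selection inside count_wordvec. */
    if (argc < 2) {
        fprintf(stderr, "usage: %s content_words.txt\n", argv[0]);
        return 1;
    }
    n = read_content_words(argv[1], words, MAX_CONTENT_WORDS);
    if (n < 0) {
        perror(argv[1]);
        return 1;
    }
    for (i = 0; i < n; i++)
        printf("content word %d: %s\n", i, words[i]);
    return 0;
}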
From: Beate D. <do...@IM...> - 2004-03-10 17:47:56
Dear Scott,

I am busy writing lately, but I don't mind adding this feature. Do you think it would be early enough if I did it over the weekend?

It's the initialize_column_indices routine (in dict.c) which picks the column labels. I remember that we did earlier experiments with picking the top words according to tf-idf as column labels rather than the most frequent ones.

I think it wouldn't be a big deal to hand initialize_column_indices a Boolean variable $FROM_FILE which indicates whether the column indices should be computed or read from a file. We could let a user "turn on" this variable by adding an option -cols_from_file to infomap-build, which passes the value to initialize_column_indices via count_wordvec.c. Does that make sense?

Best wishes,
Beate

On Tue, 9 Mar 2004, Scott James Cederberg wrote:
> I'm not familiar with where and how count_wordvec chooses the
> content-bearing words, but I think the easiest thing would be to
> modularize the part where it does that (e.g. into a separate function),
> and then create another function that instead reads the content-bearing
> words from a file. [...]
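And a rough sketch of how such a -cols_from_file option might be threaded down to initialize_column_indices, roughly as Beate suggests. The option name and the routine name come from the message above; the modified signature, the option parsing, and the stub body are assumptions (the real routine lives in dict.c and picks the column labels from corpus counts).

/*
 * Sketch of passing a -cols_from_file option through to
 * initialize_column_indices.  The stub body only reports what it
 * was asked to do; it is not the real dict.c routine.
 */
#include <stdio.h>
#include <string.h>

static int initialize_column_indices(int from_file, const char *col_file)
{
    if (from_file)
        printf("would read column labels from %s\n", col_file);
    else
        printf("would pick column labels by corpus frequency\n");
    return 0;
}

int main(int argc, char **argv)
{
    int from_file = 0;
    const char *col_file = NULL;
    int i;

    /* infomap-build would pass something like
       "-cols_from_file my_words.txt" down to count_wordvec. */
    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-cols_from_file") == 0 && i + 1 < argc) {
            from_file = 1;
            col_file = argv[++i];
        }
    }
    return initialize_column_indices(from_file, col_file);
}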
From: Scott J. C. <ced...@cs...> - 2004-03-10 23:18:17
Beate,

Thanks for your help! What you describe sounds like a reasonable approach.

Unfortunately, I need to do some housekeeping with our CVS repository before it can be changed by multiple people without making a mess. I am planning to do that by the end of the week, and I'll get back to you.

Scott

On Wed, Mar 10, 2004 at 06:37:49PM +0100, Beate Dorow wrote:
> I am busy writing lately, but I don't mind adding this feature. Do you
> think it would be early enough if I did it over the weekend? [...]

--
Scott Cederberg
Researcher

Infomap Project
Computational Semantics Lab
Center for the Study of Language and Information (CSLI)
Stanford University

http://infomap.stanford.edu/
From: Shuji Y. <yam...@ya...> - 2004-03-11 14:17:39
Hi Scott, Beate,

As Beate wrote regarding my_isalpha(), I note that it does not accept non-ASCII characters as it stands.

Are there any other parts of InfoMap I should look at more closely, and if necessary change, to make it capable of handling Japanese and other multibyte characters? I expect I will have to do some of this by trial and error, but any guidance you could give me would streamline the process.

I plan to use UTF-8 as the encoding. I hope that my changes will be transparent to ASCII and can be brought back into the main release if we want to. I would appreciate access to CVS when it is ready.

Regards, Shuji

-----Original Message-----
From: Scott James Cederberg [mailto:ced...@cs...]
Sent: Wednesday, March 10, 2004 3:08 PM
To: Beate Dorow
Cc: inf...@li...; yam...@ya...
Subject: Re: [infomap-nlp-devel] Re: [infomap-nlp-users] Infomap. Can I choose and feed "content-bearing words" to "count_wordvec"? (fwd)

[...]
From: Scott J. C. <ced...@cs...> - 2004-03-11 20:35:50
On the bright side, any code that works with UTF-8 will automatically work with ASCII, since ASCII characters are valid UTF-8 characters.

Scott
From: Scott J. C. <ced...@cs...> - 2004-03-11 20:34:17
Hi Shuji,

I will certainly give you access to CVS when it is ready. You may want to subscribe to inf...@li... to make sure you receive all relevant announcements.

I've read about what UTF-8 is, but I've never used it in programs. If you have C code (or pointers to C code) using UTF-8, please let me know, because I'd like to take a look.

What I do know is that UTF-8 characters can consist of a variable number of bytes (from one to six, but I think generally only from one to three). Thus my_isalpha() (which is defined in lib/utils.c) would need a different prototype. For instance, it could take an array of bytes ("char" datatype) and an argument telling it how many bytes are in the array. Or it could just take an array of bytes without knowing its size and determine the size by decoding the UTF-8 (where the first byte encodes how many bytes are in the character).

Unfortunately, the tokenization code would also need to be changed to work with UTF-8 characters. The next_token() function in preprocessing/tokenizer.c would need to be changed, for starters. Right now it steps through an array of C "chars"; probably it should instead call a function that returns the next UTF-8 character from the input stream. Calls to strlen() and strncmp() and other C string functions would also need to be replaced with UTF-8-aware functions. (Presumably there is a library of such functions available.)

We could create a separate CVS branch for this line of development (to be merged in later), since it's quite important and multiple people might be able to contribute. I can set that up once we have our CVS house in order.

Scott

On Thu, Mar 11, 2004 at 06:07:19AM -0800, Shuji Yamaguchi wrote:
> As Beate wrote regarding my_isalpha(), I note that it does not accept
> non-ASCII characters as it stands.
>
> Are there any other parts of InfoMap I should look at more closely, and
> if necessary change, to make it capable of handling Japanese and other
> multibyte characters? [...]
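As a rough sketch of the helpers Scott describes, the code below works out a UTF-8 character's byte count from its lead byte and reads one character at a time from a stream. It is not infomap code; the function names and error handling are assumptions. (RFC 3629, published in late 2003, restricts UTF-8 sequences to at most four bytes, which is why the sketch stops there.)

/*
 * Minimal UTF-8 helpers: sequence length from the lead byte, and a
 * "next character" reader of the kind next_token() could call.
 */
#include <stdio.h>

/* Number of bytes in the UTF-8 sequence starting with lead byte b,
   or 0 if b cannot start a sequence (e.g. it is a continuation byte). */
static int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;            /* 0xxxxxxx: plain ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx              */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx              */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx              */
    return 0;
}

/* Read the next UTF-8 character from fp into buf (at least 5 bytes).
   Returns the number of bytes read, 0 on EOF, -1 on malformed input. */
static int next_utf8_char(FILE *fp, unsigned char *buf)
{
    int c = getc(fp);
    int len, i;

    if (c == EOF)
        return 0;
    buf[0] = (unsigned char) c;
    len = utf8_seq_len(buf[0]);
    if (len == 0)
        return -1;
    for (i = 1; i < len; i++) {
        c = getc(fp);
        if (c == EOF || (c & 0xC0) != 0x80)   /* must be 10xxxxxx */
            return -1;
        buf[i] = (unsigned char) c;
    }
    buf[len] = '\0';
    return len;
}

int main(void)
{
    unsigned char ch[5];
    int len;

    /* Echo stdin one UTF-8 character at a time, reporting its length. */
    while ((len = next_utf8_char(stdin, ch)) > 0)
        printf("%s (%d byte%s)\n", (char *) ch, len, len == 1 ? "" : "s");
    return len < 0 ? 1 : 0;
}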
From: Beate D. <do...@IM...> - 2004-03-12 16:21:17
Dear Shuji, Scott,

I think that first of all we'll need to detect word boundaries. This is straightforward for the European languages, where words are simply separated by spaces, but probably not so easy for Japanese. I saw that the old infomap folks used ChaSen, a tool for detecting word boundaries in Japanese, when they did cross-lingual IR on a parallel corpus of Japanese-English patent abstracts.

Do you have a tool at hand which detects the boundaries of Japanese words, Shuji?

Best wishes,
Beate

On Thu, 11 Mar 2004, Scott James Cederberg wrote:
> Unfortunately, the tokenization code would also need to be changed to
> work with UTF-8 characters. The next_token() function in
> preprocessing/tokenizer.c would need to be changed, for starters. [...]
From: Shuji Y. <yam...@ya...> - 2004-03-15 10:03:54
Beate,

Yes, a tokenizer is needed outside of InfoMap to process corpora in languages like Japanese, where words are run together. I have installed and plan to use ChaSen for Japanese. For other such languages I will look for similar tokenization tools.

Scott,

I have subscribed to the infomap-nlp-devel list.

I have skimmed through some Unicode sites and found the ones listed below informative. Some of them include small examples.

I have had second thoughts, however: it may be quicker and more straightforward to write a program which converts each Japanese character to a string of letters (e.g. by mapping its internal encoding, in hexadecimal, to the characters 'a' through 'p' instead of the usual 0-f, and vice versa). InfoMap would then be able to handle a 'Japanese' word as just another sequence of letters, though it would double the length of the word representation within InfoMap. The obvious drawback is that you cannot read a Japanese word directly in InfoMap's output; it has to be converted back before it can be displayed as a meaningful character.

If you can think of any other pitfalls in this sort of method, please let me know.

Unicode sites
-------------------

http://www.cl.cam.ac.uk/~mgk25/unicode.html
A good introductory site. The following sections are particularly useful for making InfoMap UTF-8 capable:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
http://www.cl.cam.ac.uk/~mgk25/unicode.html#c
Among the approaches discussed in this section, we should probably aim for the "hard-wired" and "hard conversion" approaches, even though they would not be extensible to other multibyte encodings like EUC.

ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
Another useful site. The section below talks about how to modify C programs:
ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO-6.html

http://www.unix-systems.org/version2/whatsnew/login_mse.html
A useful guide to the distinction between multibyte and wide-character encodings.

Many thanks for your support.

Regards, Shuji

-----Original Message-----
From: inf...@li... [mailto:inf...@li...] On Behalf Of Beate Dorow
Sent: Friday, March 12, 2004 8:10 AM
To: Scott James Cederberg
Cc: Shuji Yamaguchi; inf...@li...
Subject: Re: [infomap-nlp-devel] Re: my_isalpha(). What else should I change to make InfoMap capable of handling multibyte characters?

[...]
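A minimal sketch of the transliteration Shuji describes, assuming the "internal encoding" being mapped is the word's UTF-8 byte sequence: each non-ASCII byte is split into two hex nibbles and each nibble is written as a letter from 'a' to 'p'. The function name and the choice to pass ASCII bytes through unchanged are assumptions made for illustration.

/*
 * Byte-to-letter transliteration sketch: non-ASCII bytes become two
 * letters in 'a'..'p' (hex nibbles), so the word looks alphabetic.
 */
#include <stdio.h>

/* Encode src into dst; every non-ASCII byte becomes two letters,
   doubling that part of the word's length. */
static void encode_word(const unsigned char *src, char *dst)
{
    while (*src != '\0') {
        if (*src < 0x80) {
            *dst++ = (char) *src++;        /* pass ASCII through   */
        } else {
            *dst++ = 'a' + (*src >> 4);    /* high nibble -> a..p  */
            *dst++ = 'a' + (*src & 0x0F);  /* low nibble  -> a..p  */
            src++;
        }
    }
    *dst = '\0';
}

int main(void)
{
    /* UTF-8 bytes of a Japanese word ("Nihongo") as an example. */
    const unsigned char word[] = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";
    char encoded[64];

    encode_word(word, encoded);
    printf("%s\n", encoded);   /* prints "ogjhkfogjmkmoikkjo" */
    return 0;
}

Decoding is just the reverse mapping (pairs of 'a'-'p' letters back to bytes), which is what InfoMap's output would have to go through before it could be displayed as readable Japanese again.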
From: Scott J. C. <ced...@cs...> - 2004-03-16 17:50:48
Shuji,

Thanks for the pointers to the Unicode sites.

On Mon, Mar 15, 2004 at 02:03:48AM -0800, Shuji Yamaguchi wrote:
> I have had second thoughts, however: it may be quicker and more
> straightforward to write a program which converts each Japanese character
> to a string of letters (e.g. by mapping its internal encoding, in
> hexadecimal, to the characters 'a' through 'p' instead of the usual 0-f,
> and vice versa). [...]
>
> If you can think of any other pitfalls in this sort of method, please let
> me know.

So you're saying that any (alphabetic) Japanese character would be represented by a unique string of alphabetic ASCII characters? That's an interesting approach. I'd like to think it over a little more before offering other comments.

I'm sorry for the delay in making CVS available. I'll post to the infomap-nlp-devel list when it's ready.

Scott