You can subscribe to this list here.

| | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2004 | (1) | | (11) | (16) | (1) | | (1) | (1) | | | | |
| 2005 | | | | | (1) | (1) | | | | | | |
| 2006 | | | | | | | | | (3) | (5) | | |
| 2007 | | (1) | | | | | | | | | | |
From: Viktor T. <v....@ed...> - 2004-04-19 16:57:27
|
Yes, finally I have uploaded the changes. It took me a while because I wanted to document it, so I extended the manpages. (Nothing to the tutorial, though.) *Everything* should work like before out of the box. Please check this if you can with a clean temp checkout and compile, etc.

Best
Viktor

On Thu, 15 Apr 2004 11:40:14 -0700 (PDT), Dominic Widdows <dwi...@cs...> wrote:
> Dear Viktor,
>
> Did you manage to commit your changes to the infomap code to SourceForge at
> all?
>
> Best wishes,
> Dominic
>
> On Thu, 8 Apr 2004, Viktor Tron wrote:
>
>> Hello Dominic
>> I am viktron on SourceForge, if you want to add me,
>> and then I can commit changes.
>> Or maybe you want me to add changes to the documentation as well.
>> But then again, that makes sense only if a proper
>> conception is crystallized concerning what we want the tokenization
>> to do.
>> BTW, do you know Colin Bannard?
>> Best
>> Viktor
|
From: Thierry D. <dec...@df...> - 2004-04-14 07:01:52
|
Dear Authors,

Thanks for submitting your tools to the ACL Registry. You can check the entry at http://registry.dfki.de/show.php3?f_system=384

Best Regards,
The registry team
|
From: Declerck <dec...@df...> - 2004-04-13 09:53:42
|
Dear Authors,

Thanks for submitting your tools to the ACL Registry. Please check your entry at: http://registry.dfki.de/show.php3?f_system=384

Best Regards,
The registry team

PS Your tools will also soon be listed at lt-world (www.lt-world.org)

--
Thierry Declerck, Project leader at the Saarland University & Senior Consultant at DFKI GmbH, Language Technology Lab
Stuhlsatzenhausweg 3, D-66123 Saarbruecken
Tel: +49 (0)681 302 5358 Fax: +49 (0)681 302 5338
|
From: Dominic W. <dwi...@cs...> - 2004-04-08 15:40:30
|
Dear Viktor,

Thanks so much for doing all of this and documenting the changes for the list. I agree that the my_isalpha function was long overdue for an overhaul. It sounds like your changes are much more far-reaching than just this, though, and should enable the software to be much more language-general. For example, we've been hoping to enable support for Japanese, and it sounds like this will be possible now?

It definitely makes more sense to specify what characters you want the tokenizer to treat as alphabetic in a separate file.

I'd definitely like to incorporate these changes into the software - would the best way be to add you to the project admins on SourceForge and allow you to commit the changes? If you sign up for an account at https://sourceforge.net/ (or if you have one already) we can add you as a project developer with the necessary permissions.

Again, thanks so much for the feedback and the contributions.

Best wishes,
Dominic
|
From: Viktor T. <v....@ed...> - 2004-04-08 14:12:03
|
Hello all,

Your software is great, but praises should be on the user list :-). I subscribed to this list now because I'd like to suggest some changes to 0.8.4.

If you are interested I can send you the tarball, or work it out with docs etc. and commit it in CVS.

Story and summary of changes are below.
Cheers
Viktor

It all started out yesterday. I wanted to use infomap on a Hungarian corpus. I soon figured out why things went wrong already at the tokenization step.

The problem was: utils.c, lines 46--53

    /* This is a somewhat radical approach, in that it assumes
       ASCII for efficiency and will *break* with other character
       encodings. */
    int my_isalpha( int c) { // configured to let underscore through for POS and tilde for indexing compounds
      return( ( c > 64 && c < 91) || ( c > 96 && c < 123) || ( c == '_') || c == '~');
    }

This function is used by the tokenizer to determine which are the non-word (breaking) characters. It treats 8-bit characters above 128 as non-word (breaking) characters. These characters happen to constitute a crucial part of most languages other than English, usually encoded in ISO-8859-X with X>1.

It is not so much a 'radical approach', as someone appropriately described it, as something that makes the program entirely English-specific, entirely unnecessarily. So I set out to fix it.

The whole alpha test should be done directly by the tokenizer. This function actually says how to segment a stream of strings, which is an extremely important *meaningful* part of the tokenizer, not an auxiliary function like my_fopen, etc. Fortunately my_isalpha is indeed only used by tokenizer.c.

To handle all this correctly, I introduced an extra resource file containing a string of legitimate characters considered valid in words. All other characters are treated as breaking characters by the tokenizer and are skipped.

The resource file is read in by initialize_tokenizer (appropriately together with the corpus filenames file) and used to initialize an array (details below). Lookup in this array can then conveniently replace all uses of the previous my_isalpha test.

This should give sufficiently flexible and charset-independent control over simple text-based tokenization, which means it can be properly multilingual software. Well, I checked and it worked for my Hungarian data.

I also have further ideas for very simple extensions which would tokenize already tokenized (e.g. XML) files directly. With this in place, the valid_chars solution would just be one of two major tokenization modes. Also: read-in doesn't seem to be optimized (characters of a line are scanned over twice). Since with large corpora this takes up a great deal of time, we might want to consider rewriting it.

Details of the changes (nothing in the documentation yet):

utils.{c,h}:
  The function my_isalpha no longer exists; it is superseded by a more configurable method in the tokenizer.

tokenizer.{c,h}:
  Introduced an int array valid_chars[256]: for a character c, valid_chars[c] is nonzero iff c is a valid word character; if it is 0, c is treated as breaking (and skipped) by the tokenizer.
  initialize_tokenizer: now also initializes valid_chars by reading from a file passed as an extra argument.

prepare_corpus.c:
  Modified the invocation of initialize_tokenizer accordingly; added parsing code for the extra option '-chfile'.

For proper invocation of prepare_corpus, Makefile.data.in and infomap-build.in needed to be modified, and for proper configuration/installation, some further changes:

admin/valid_chars.en:
  New file: contains the valid chars that exactly replicate the chars accepted as non-breaking by the now obsolete my_isalpha (utils.c), i.e.: (c > 64 && c < 91) || (c > 96 && c < 123) || (c == '_') || (c == '~')

admin/default-params.in:
  line 13: added default value VALID_CHARS_FILE="@pkgdatadir@/valid_chars.en"

admin/Makefile:
  line 216: added the default valid chars file 'valid_chars.en' to the EXTRA_DIST list, to be copied into the central data directory.

admin/Makefile.data.in:
  lines 119-125: quotes supplied for all arguments (the lack of quotes caused the build procedure to stop already at invoking prepare-corpus if some filenames were empty, rather than reaching the point where it could report what is missing, if the missing value is a problem at all).
  line 125: added a line for valid_chars.

admin/infomap-build.in:
  line 113: added a line to dump the value of VALID_CHARS_FILE.
  line 44: 'cat' corrected to 'echo' (sorry, I see somebody spotted this this morning). This line dumps overriding command-line settings (-D option) to an extra parameter file which is then sourced; cat expected the actual setting strings (such as "STOPLIST_FILE=my_stop_list") to be filenames.

+------------------------------------------------------------------+
|Viktor Tron                                         v....@ed...|
|3fl Rm8 2 Buccleuch Pl EH8 9LW Edinburgh       Tel +44 131 650 4414|
|European Postgraduate College                www.coli.uni-sb.de/egk|
|School of Informatics                      www.informatics.ed.ac.uk|
|Theoretical and Applied Linguistics               www.ling.ed.ac.uk|
| @ University of Edinburgh, UK                         www.ed.ac.uk|
|Dept of Computational Linguistics                www.coli.uni-sb.de|
| @ Saarland University (Saarbruecken, Germany)  www.uni-saarland.de|
|use LINUX and FREE Software                           www.linux.org|
+------------------------------------------------------------------+
|
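The valid_chars mechanism Viktor describes is easy to picture with a small stand-alone sketch. The code below is not the actual tokenizer.c patch: the function names, the resource-file format (a single line listing the word characters), and the command-line interface are assumptions made for this example. It only illustrates the lookup-table idea that replaces my_isalpha: every byte listed in the resource file is a word character, and every other byte breaks a token.

```c
#include <stdio.h>
#include <string.h>

/* Lookup table: valid_chars[c] is nonzero iff byte c is a word character.
   All other bytes are treated as breaking characters by the tokenizer. */
static int valid_chars[256];

/* Fill valid_chars from a resource file containing the legitimate word
   characters (e.g. the contents of a file like valid_chars.en). */
static int init_valid_chars(const char *chfile)
{
    FILE *fp = fopen(chfile, "r");
    int c;

    if (fp == NULL)
        return -1;
    memset(valid_chars, 0, sizeof(valid_chars));
    while ((c = fgetc(fp)) != EOF) {
        if (c != '\n' && c != '\r')
            valid_chars[(unsigned char) c] = 1;
    }
    fclose(fp);
    return 0;
}

/* Print the tokens of one line, treating every byte not listed in the
   resource file as a breaking character and skipping it. */
static void tokenize_line(const char *line)
{
    const char *p = line;
    const char *start;

    while (*p != '\0') {
        while (*p != '\0' && !valid_chars[(unsigned char) *p])
            p++;                          /* skip breaking characters */
        start = p;
        while (*p != '\0' && valid_chars[(unsigned char) *p])
            p++;                          /* consume one token */
        if (p > start)
            printf("%.*s\n", (int) (p - start), start);
    }
}

int main(int argc, char **argv)
{
    char buf[4096];

    if (argc < 2 || init_valid_chars(argv[1]) != 0) {
        fprintf(stderr, "usage: %s <valid_chars_file>\n", argv[0]);
        return 1;
    }
    while (fgets(buf, sizeof(buf), stdin) != NULL)
        tokenize_line(buf);
    return 0;
}
```

Because the table is indexed by unsigned byte values, a resource file written in ISO-8859-2 (for example) makes accented Hungarian letters word characters without any change to the code, which is the point of the patch.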
From: Scott J. C. <ced...@cs...> - 2004-03-17 17:34:53
|
Hello Infomap NLP users and developers,

I'm pleased to announce that we have migrated the project's CVS repository to SourceForge. Those of you interested in working on the code should use this version (everyone else should stick to the official releases, and can disregard the rest of this message).

To do anonymous checkout, use the following commands:

$ cvs -d:pserver:ano...@cv...:/cvsroot/infomap-nlp login
Password: (hit return)
$ cvs -d:pserver:ano...@cv...:/cvsroot/infomap-nlp co infomap-nlp

(Note that these directions are also available on the SourceForge project summary page.)

Those listed as developers on the SourceForge project should use the "-d:ext:use...@cv......" SSH-tunneled checkout instead, which will allow commit access.

We welcome patches against the CVS version of the code, which should be sent to inf...@li...; please send any problems or questions related to CVS to that list as well.

Scott
|
From: Scott J. C. <ced...@cs...> - 2004-03-16 23:24:10
|
Hey guys,

After some unanticipated headaches, I think I've got CVS on sapir working to the point that you should be able to checkout, configure, and compile without needing any Autotools (Autoconf, Automake, and the like). Can those of you with sapir accounts please try out the new repository (an explicit list of commands follows this message)? Let me know if you have any troubles.

My next goal is migrating CVS to SourceForge, after which anonymous CVS read access will be available to all. I'll keep this list posted.

Scott

Commands (please try on various machines; the CVS root directory is /sapir/s1/semlab/CVSROOT):

$ cvs checkout -r pub-rel-autotools infomap
$ cd infomap
$ ./configure
$ make
|
From: Scott J. C. <ced...@cs...> - 2004-03-16 17:50:48
|
Shuji,

Thanks for the pointers to Unicode sites.

On Mon, Mar 15, 2004 at 02:03:48AM -0800, Shuji Yamaguchi wrote:
> I have however a 2nd thought that it may be quicker and more straightforward
> to write a program which converts a Japanese character to an alphabet (e.g.
> by mapping an internal encoding in hexadecimal to 'a' to 'p' characters,
> instead of the regular 0-f characters, and vice versa). InfoMap will then be
> able to handle a 'Japanese' word as another sequence of alphabetic characters,
> though it would double the length of word representation within InfoMap.
> Obviously it has a drawback that you cannot read a Japanese word in the
> direct outputs from InfoMap, which have to be converted back to be shown as
> a meaningful character.
> If you can think of any other pitfalls in this sort of method, please let me
> know.

So you're saying that any (alphabetic) Japanese character would be represented by a unique string of alphabetic ASCII characters? That's an interesting approach. I'd like to think it over a little more before offering other comments.

I'm sorry for the delays on making CVS available. I'll post to the infomap-nlp-devel list when it's ready.

Scott
|
From: Shuji Y. <yam...@ya...> - 2004-03-15 10:03:54
|
Beate,

Yes, a tokenizer is needed outside of Infomap to process corpora of languages like Japanese, where words are connected to each other. I installed and plan to use ChaSen for Japanese. For other such languages I will find similar tokenization tools.

Scott,

I have started my subscription to the infomap-nlp-devel list.

I have skimmed through some of the Unicode sites and found the ones below informative. Some of the sites include small examples.

I have, however, a second thought that it may be quicker and more straightforward to write a program which converts a Japanese character to an alphabetic string (e.g. by mapping an internal encoding in hexadecimal to 'a' to 'p' characters, instead of the regular 0-f characters, and vice versa). InfoMap will then be able to handle a 'Japanese' word as just another sequence of alphabetic characters, though it would double the length of word representation within InfoMap. Obviously it has the drawback that you cannot read a Japanese word in the direct outputs from InfoMap, which have to be converted back to be shown as a meaningful character.

If you can think of any other pitfalls in this sort of method, please let me know.

Unicode sites
-------------------
http://www.cl.cam.ac.uk/~mgk25/unicode.html
Good introductory site. The following sections are particularly useful for making Infomap UTF-8 capable:
http://www.cl.cam.ac.uk/~mgk25/unicode.html#mod
http://www.cl.cam.ac.uk/~mgk25/unicode.html#c
Among the approaches discussed in this section, we should probably aim for the "hard-wired" and "hard conversion" approaches, even though they would not be extensible to other multibyte encodings like EUC.

ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
This is another useful site. The section below talks about how to modify C programs:
ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO-6.html

http://www.unix-systems.org/version2/whatsnew/login_mse.html
Useful guide on the distinction between multibyte and wide-character encodings.

Many thanks for your support.

Regards, Shuji
|
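Shuji's workaround (spelling each byte of a multibyte-encoded word as two ASCII letters in the range 'a'-'p', one letter per hex nibble, so that Infomap only ever sees alphabetic tokens) can be sketched as below. This is a hypothetical illustration, not Infomap code: the function names are invented for the example, and the sample input merely stands in for the bytes of a Japanese word in whatever encoding is used.

```c
#include <stdio.h>
#include <string.h>

/* Encode every byte of 'in' as two letters in 'a'..'p' (one per hex
   nibble), so the result is purely ASCII-alphabetic.  The output buffer
   must hold at least 2*strlen(in)+1 bytes; note the representation is
   twice as long as the input, as the mailing-list message points out. */
static void nibble_encode(const unsigned char *in, char *out)
{
    while (*in != '\0') {
        *out++ = (char) ('a' + (*in >> 4));    /* high nibble -> 'a'..'p' */
        *out++ = (char) ('a' + (*in & 0x0f));  /* low nibble  -> 'a'..'p' */
        in++;
    }
    *out = '\0';
}

/* Reverse the mapping to recover the original byte sequence, so results
   coming back out of Infomap can be shown as real characters again. */
static void nibble_decode(const char *in, unsigned char *out)
{
    while (in[0] != '\0' && in[1] != '\0') {
        *out++ = (unsigned char) (((in[0] - 'a') << 4) | (in[1] - 'a'));
        in += 2;
    }
    *out = '\0';
}

int main(void)
{
    /* "word" stands in for the bytes of a Japanese word in any encoding. */
    const unsigned char word[] = "word";
    char encoded[2 * sizeof(word) + 1];
    unsigned char decoded[sizeof(word)];

    nibble_encode(word, encoded);
    nibble_decode(encoded, decoded);
    printf("encoded: %s\ndecoded: %s\n", encoded, (char *) decoded);
    return 0;
}
```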
From: Beate D. <do...@IM...> - 2004-03-12 16:21:17
|
Dear Shuji, Scott,

I think first of all we'll need to detect word boundaries. This is straightforward for the European languages, where words are simply separated by spaces, but probably not so easy for Japanese. I saw that the old infomap folks used ChaSen, a tool for detecting word boundaries in Japanese, when they did cross-lingual IR on a parallel corpus of Japanese-English patent abstracts.

Do you have a tool at hand which detects the boundaries of Japanese words, Shuji?

Best wishes,
Beate
|
From: Scott J. C. <ced...@cs...> - 2004-03-11 20:35:50
|
On the bright side, any code that works with UTF-8 will automatically work with ASCII, since ASCII characters are valid UTF-8 characters.

Scott
|
From: Scott J. C. <ced...@cs...> - 2004-03-11 20:34:17
|
Hi Shuji,

I will certainly give you access to CVS when it is ready. You may want to subscribe to inf...@li... to make sure you receive all relevant announcements.

I've read about what UTF-8 is, but I've never used it in programs. If you have C code (or pointers to C code) using UTF-8, please let me know, because I'd like to take a look.

What I do know is that UTF-8 characters can consist of a variable number of bytes (from one to six, but I think generally only from one to three). Thus my_isalpha() (which is defined in lib/utils.c) would need a different prototype. For instance, it could take an array of bytes ("char" datatype) and an argument telling it how many bytes are in the array. Or it could just take an array of bytes without knowing its size and determine it by decoding the UTF-8 (where the first byte encodes how many bytes are in the character).

Unfortunately, the code for tokenization would also need to be changed to work with UTF-8 characters. The next_token() function in preprocessing/tokenizer.c would need to be changed, for starters. Right now it steps through an array of C "chars"; probably it should instead call a function that returns the next UTF-8 character from the input stream. Calls to strlen() and strncmp() and other C string functions would also need to be replaced with UTF-8-aware functions. (Presumably there is a library of such functions available.)

We could create a separate CVS branch for this line of development (to be merged in later), since it's quite important and multiple people might be able to contribute. I can set that up once we have our CVS house in order.

Scott
|
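A minimal sketch of the UTF-8 primitives Scott mentions (reading from the first byte how many bytes the next character occupies, and handing that character back to the caller) might look like the following. The helper names are hypothetical, not the actual next_token() interface; the sketch assumes well-formed input and uses the modern limit of four bytes per character.

```c
#include <stdio.h>

/* Return the number of bytes in the UTF-8 sequence that starts with byte
   'b', or 0 if 'b' cannot start a sequence (e.g. a continuation byte
   of the form 10xxxxxx). */
static int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;             /* 0xxxxxxx: plain ASCII */
    if ((b & 0xe0) == 0xc0) return 2;   /* 110xxxxx */
    if ((b & 0xf0) == 0xe0) return 3;   /* 1110xxxx */
    if ((b & 0xf8) == 0xf0) return 4;   /* 11110xxx */
    return 0;
}

/* Copy the next UTF-8 character from 's' into 'out' (nul-terminated) and
   return the number of bytes consumed; 0 at end of string or on a
   malformed/truncated sequence. */
static int next_utf8_char(const char *s, char out[5])
{
    int len = utf8_seq_len((unsigned char) s[0]);
    int i;

    if (s[0] == '\0' || len == 0)
        return 0;
    for (i = 0; i < len; i++) {
        if (s[i] == '\0')
            return 0;                   /* truncated sequence */
        out[i] = s[i];
    }
    out[len] = '\0';
    return len;
}

int main(void)
{
    const char *text = "se\xc3\xb1or";  /* "senor" with n-tilde, in UTF-8 */
    char ch[5];
    int n;

    while ((n = next_utf8_char(text, ch)) > 0) {
        printf("%d byte(s): %s\n", n, ch);
        text += n;
    }
    return 0;
}
```

A tokenizer loop built on a function like next_utf8_char would then replace the byte-at-a-time stepping that next_token() does today, and a multibyte-aware "is this a word character" test would operate on the whole sequence rather than on single bytes.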
From: Shuji Y. <yam...@ya...> - 2004-03-11 14:17:39
|
Hi Scott, Beate,

As Beate wrote regarding my_isalpha(), I note it does not accept non-ASCII characters from the outset.

Are there any other parts of InfoMap I should give a closer look and, if necessary, change to make it capable of handling Japanese and other multibyte characters? I think I have to do this by trial and error, but if you could give me guidance it would streamline my process.

I plan to use UTF-8 as the encoding. I hope that my changes would be transparent to ASCII and could be brought back to the main release if we want to. I would appreciate it if I could have access to CVS when it is ready.

Regards, Shuji
|
From: Scott J. C. <ced...@cs...> - 2004-03-10 23:18:17
|
Beate,

Thanks for your help! What you describe sounds like a reasonable approach.

Unfortunately, I need to do some housekeeping with our CVS repository before it can be changed by multiple people without making a mess. I am planning to do that by the end of the week, and I'll get back to you.

Scott
|
From: Beate D. <do...@IM...> - 2004-03-10 17:47:56
|
Dear Scott,

I am busy writing lately, but I don't mind adding this feature. Do you think it would be early enough if I did it during the weekend?

It's the initialize_column_indices routine (in dict.c) which picks the column labels. I remember that we did earlier experiments with picking the top words according to tf-idf as column labels rather than the most frequent ones.

I think it wouldn't be a big deal to hand over a Boolean variable $FROM_FILE to initialize_column_indices which indicates whether column indices should be computed or read from a file. We could let a user "turn on" this variable by adding an option -cols_from_file to infomap-build which passes the value to initialize_column_indices via count_wordvec.c. Does that make sense?

Best wishes,
Beate
|
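The file-driven alternative Beate outlines can be pictured with a short sketch. This is not the actual dict.c or count_wordvec.c code: the function name, the one-word-per-line file format, and the 1000-column limit are assumptions made for the example. It only shows the shape of reading the content-bearing (column) words from a file instead of picking them by frequency rank; the real routine would additionally map each word to its dictionary entry.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read up to 'max_cols' content-bearing words, one per line, from
   'filename'.  Returns the number of words read, or -1 on error. */
static int read_column_words(const char *filename, char **cols, int max_cols)
{
    FILE *fp = fopen(filename, "r");
    char buf[256];
    int n = 0;

    if (fp == NULL)
        return -1;
    while (n < max_cols && fgets(buf, sizeof(buf), fp) != NULL) {
        buf[strcspn(buf, "\r\n")] = '\0';   /* strip the newline */
        if (buf[0] == '\0')
            continue;                       /* skip blank lines */
        cols[n] = malloc(strlen(buf) + 1);
        if (cols[n] == NULL)
            break;
        strcpy(cols[n], buf);
        n++;
    }
    fclose(fp);
    return n;
}

int main(int argc, char **argv)
{
    enum { MAX_COLS = 1000 };               /* cf. the default "ranking 50-1049" */
    char *cols[MAX_COLS];
    int i, n;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <column_words_file>\n", argv[0]);
        return 1;
    }
    n = read_column_words(argv[1], cols, MAX_COLS);
    if (n < 0) {
        perror(argv[1]);
        return 1;
    }
    for (i = 0; i < n; i++) {
        printf("column %d: %s\n", i, cols[i]);
        free(cols[i]);
    }
    return 0;
}
```

In the proposal above, initialize_column_indices would call something like read_column_words only when the -cols_from_file flag (surfaced through infomap-build) is set, and otherwise fall back to the existing frequency-based selection.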
From: Scott J. C. <ced...@cs...> - 2004-03-09 19:40:36
|
Dominic and Shuji,

I'm CC'ing this reply to infomap-nlp-devel, because in theory the sort of discussion touched off by Dominic's message below (about how to add this feature) should take place there.

I'm not familiar with where and how count_wordvec chooses the content-bearing words, but I think the easiest thing would be to modularize the part where it does that (e.g. into a separate function), and then create another function that instead reads content-bearing words from a file. Which function was called could be controlled by a command-line option.

I've already got a bit of a backlog of reported but unfixed bugs; I'm hoping to dig my way out from under that by the end of the week. Hopefully next week I would then have time to add this feature.

If anyone else wants to take it on, please let me know.

Scott

On Fri, Mar 05, 2004 at 07:00:19PM -0800, Dominic Widdows wrote:
> Hi Scott,
>
> I know we talked about this in the past - is it doable or shall we tell
> people it's on the back burner?
>
> As far as I can tell, it's just a question of putting a different list of
> words into memory and telling the count_wordvec program to look there.
> Which could be a total can of worms in C.
>
> Best wishes,
> Dominic
>
> ---------- Forwarded message ----------
> Date: Fri, 5 Mar 2004 18:53:17 -0800
> From: Shuji Yamaguchi <yam...@ya...>
> To: inf...@li...
> Subject: [infomap-nlp-users] Infomap. Can I choose and feed
> "content-bearing words" to "count_wordvec"?
>
> Hi InfoMap admin and users,
>
> I wonder whether I could choose the "content-bearing words" myself and feed
> them into the pre-processing of InfoMap.
> count_wordvec appears to be the program that does it. According to its
> man page, the content words are chosen from the ones in "ranking 50-1049".
> Is there any way to customize this by use of options and/or parameters?
>
> Thank you for your support.
> Regards, Shuji
>
> Shuji Yamaguchi,
> Fellow, Reuters Digital Vision Program, CSLI, Stanford.

--
Scott Cederberg
Researcher

Infomap Project
Computational Semantics Lab
Center for the Study of Language and Information (CSLI)
Stanford University

http://infomap.stanford.edu/
|
From: Scott J. C. <ced...@cs...> - 2004-01-10 00:48:00
|
Hello Infomap NLP software developers,

I hope this list will prove a valuable tool in our struggle to coordinate the efforts of our massive worldwide development team.

Scott

--
Scott Cederberg
Researcher

Infomap Project
Computational Semantics Lab
Center for the Study of Language and Information (CSLI)
Stanford University

http://infomap.stanford.edu/
|