From: Viktor T. <v....@ed...> - 2004-04-08 14:12:03
Hello all,

Your software is great, but praise belongs on the user list :-). I have now subscribed to this list, because I want to suggest some changes to 0.8.4. If you are interested, I can send you the tarball, or work it out with docs etc. and commit it in CVS.

The story and a summary of the changes are below.

Cheers
Viktor

It all started out yesterday. I wanted to use infomap on a Hungarian corpus, and I soon figured out why things went wrong already at the tokenization step.

The problem was utils.c, lines 46--53:

    /* This is a somewhat radical approach, in that it assumes
       ASCII for efficiency and will *break* with other character
       encodings. */
    int my_isalpha( int c) {
      /* configured to let underscore through for POS
         and tilde for indexing compounds */
      return( ( c > 64 && c < 91) || ( c > 96 && c < 123)
              || ( c == '_') || c == '~');
    }

This function is used by the tokenizer to determine which characters are non-word (breaking) characters. It treats all 8-bit characters of 128 and above as breaking characters, yet precisely these characters constitute a crucial part of most languages other than English, which are usually encoded in ISO-8859-X with X > 1.

So this is not merely a 'radical approach', as the comment in the code appropriately describes it; it actually makes the program entirely English-specific, entirely unnecessarily. So I set out to fix it.

The whole alpha test should be done directly by the tokenizer. This function actually says how to segment a stream of characters, which is an extremely important *meaningful* part of the tokenizer, not an auxiliary function like my_fopen etc. Fortunately, my_isalpha is indeed used only by tokenizer.c.

To handle all this correctly, I introduced an extra resource file containing a string of the characters considered valid in words. All other characters are considered breaking characters by the tokenizer and are skipped.

The resource file is read in by initialize_tokenizer (appropriately, together with the corpus filenames file) and used to initialize a lookup array (details below). Lookups in this array can then conveniently replace all uses of the previous my_isalpha test.

This should give sufficiently flexible and charset-independent control over simple text-based tokenization, which means the program can be properly multilingual software. I checked, and it worked for my Hungarian material.

I also have ideas for some very simple extensions that would tokenize already-tokenized (e.g. XML) files directly. With these in place, the valid_chars solution would just be one of two major tokenization modes. Also, the read-in does not seem optimized to me (the characters of a line are scanned over twice); since with large corpora this takes up a great deal of time, we might want to consider rewriting it.
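In outline, the scheme works like this (a minimal sketch only; the identifiers below are illustrative, not necessarily the ones in the committed code). The resource file is simply the valid characters written out, so for English it would presumably be a single line like

    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_~

and the tokenizer marks every byte it finds there:

    /* sketch -- hypothetical names; the real code lives in tokenizer.c */
    #include <stdio.h>
    #include <string.h>

    static int valid_chars[256]; /* nonzero iff the byte is a word character */

    void init_valid_chars(const char *chfile) {
      FILE *fp = fopen(chfile, "r");
      int c;
      memset(valid_chars, 0, sizeof(valid_chars));
      if (fp == NULL) return;    /* the real code should report the error */
      while ((c = fgetc(fp)) != EOF)
        if (c != '\n' && c != '\r')
          valid_chars[(unsigned char) c] = 1;
      fclose(fp);
    }

    /* replacing the old test:  if (my_isalpha(c)) ...
       becomes:                 if (valid_chars[(unsigned char) c]) ...  */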
Details of the changes (nothing in the documentation yet):

utils.{c,h}:
  The function my_isalpha no longer exists; it is superseded by a more
  configurable method in the tokenizer.

tokenizer.{c,h}:
  Introduced an int array valid_chars[256] for lookup: for a character c,
  valid_chars[c] is nonzero iff c is a valid word character; if it is 0,
  c is considered breaking (and skipped) by the tokenizer.
  initialize_tokenizer: now also initializes valid_chars by reading from
  a file passed as an extra argument.

prepare_corpus.c:
  Modified the invocation of initialize_tokenizer accordingly; added
  parsing code for the extra option '-chfile'.

For proper invocation of prepare_corpus, Makefile.data.in and
infomap-build.in needed to be modified, and for proper
configuration/installation some further changes were needed:

admin/valid_chars.en:
  New file: contains the valid chars that exactly replicate the characters
  accepted as non-breaking by the now obsolete my_isalpha (utils.c), i.e.
  (c > 64 && c < 91) || (c > 96 && c < 123) || (c == '_') || (c == '~').

admin/default-params.in:
  line 13: added the default value
    VALID_CHARS_FILE="@pkgdatadir@/valid_chars.en"

admin/Makefile:
  line 216: added the default valid-chars file 'valid_chars.en' to the
  EXTRA_DIST list, to be copied into the central data directory.

admin/Makefile.data.in:
  lines 119-125: supplied quotes for all arguments. (The lack of quotes
  caused the build procedure to stop already when invoking prepare_corpus
  if some filenames were empty, rather than reaching the point where it
  could tell what was missing, if the missing file was a problem at all.)
  line 125: added a line for valid_chars.

admin/infomap-build.in:
  line 113: added a line to dump the value of VALID_CHARS_FILE.
  line 44: 'cat' corrected to 'echo' (sorry, I see somebody spotted this
  this morning). This line dumps overriding command-line settings (-D
  option) to an extra parameter file, which is then sourced; cat treated
  the actual setting strings (such as "STOPLIST_FILE=my_stop_list") as
  filenames.

+------------------------------------------------------------------+
|Viktor Tron                                            v....@ed...|
|3fl Rm8 2 Buccleuch Pl EH8 9LW Edinburgh      Tel +44 131 650 4414|
|European Postgraduate College               www.coli.uni-sb.de/egk|
|School of Informatics                     www.informatics.ed.ac.uk|
|Theoretical and Applied Linguistics              www.ling.ed.ac.uk|
| @ University of Edinburgh, UK                         www.ed.ac.uk|
|Dept of Computational Linguistics               www.coli.uni-sb.de|
| @ Saarland University (Saarbruecken, Germany) www.uni-saarland.de|
|use LINUX and FREE Software                          www.linux.org|
+------------------------------------------------------------------+
From: Dominic W. <dwi...@cs...> - 2004-04-08 15:40:30
Dear Viktor,

Thanks so much for doing all of this and for documenting the changes for the list. I agree that the my_isalpha function was long overdue for an overhaul. It sounds like your changes are much more far-reaching than just this, though, and should make the software much more language-general. For example, we've been hoping to add support for Japanese, and it sounds like this will be possible now?

It definitely makes more sense to specify the characters you want the tokenizer to treat as alphabetic in a separate file.

I'd definitely like to incorporate these changes into the software - would the best way be to add you to the project admins on SourceForge and allow you to commit the changes? If you sign up for an account at https://sourceforge.net/ (or if you have one already), we can add you as a project developer with the necessary permissions.

Again, thanks so much for the feedback and the contributions.
Best wishes,
Dominic
From: Viktor T. <v....@ed...> - 2004-04-19 16:57:27
Yes, I have finally uploaded the changes.

It took me a while because I wanted to document it, so I extended the man pages. (Nothing added to the tutorial, though.)

*Everything* should work as before, out of the box. Please check this if you can, with a clean temporary checkout and compile, etc.

Best
Viktor

On Thu, 15 Apr 2004 11:40:14 -0700 (PDT), Dominic Widdows <dwi...@cs...> wrote:

> Dear Viktor,
>
> Did you manage to commit your changes to the infomap code to SourceForge
> at all?
>
> Best wishes,
> Dominic
>
> On Thu, 8 Apr 2004, Viktor Tron wrote:
>
>> Hello Dominic,
>> I am viktron on SourceForge, if you want to add me;
>> then I can commit the changes.
>> Or maybe you want me to add changes to the documentation as well?
>> But then again, that makes sense only once a proper conception has
>> crystallized concerning what we want the tokenization to do.
>> BTW, do you know Colin Bannard?
>> Best
>> Viktor
From: Dominic W. <dwi...@cs...> - 2004-04-19 17:19:34
Thanks so much, Viktor. I'll check out your changes this afternoon and try my luck :)
From: Dominic W. <dwi...@cs...> - 2004-04-19 23:49:29
Dear All,

I checked out Viktor's changes, and the new valid_chars file seems to work really well. I don't know whether it will work for Japanese as well?

Scott - did you manage to track down Beate's problem with getting a new version called 0.8.4? I think we should definitely get the changes we've made released.

Beate - do you think you might be able to update the man pages to explain the COL_LABELS_FROM_FILE functionality?

Thanks to everyone for what you've done so far.
Best wishes,
Dominic
From: Viktor T. <v....@ed...> - 2004-04-20 08:56:06
On Mon, 19 Apr 2004 16:49:10 -0700 (PDT), Dominic Widdows <dwi...@cs...> wrote:

> I checked out Viktor's changes and the new valid_chars file seems to work
> really well. I don't know if it will work for Japanese as well?

Well, if the text is encoded in some 8-bit extension of ASCII, you can always compile your own valid_chars file, but I guess Unicode eventually seems inevitable...

V
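PS: for a single-byte encoding, compiling such a valid_chars file can mostly be automated. A sketch of how one might do it (my assumption, not part of the committed code) is to run the following under the target locale, e.g. hu_HU.ISO8859-2, provided that locale is installed:

    /* sketch: write a valid_chars line for the active 8-bit locale */
    #include <ctype.h>
    #include <locale.h>
    #include <stdio.h>

    int main(void) {
      int c;
      setlocale(LC_CTYPE, "");  /* pick up the locale from the environment */
      for (c = 1; c < 256; c++)
        /* keep letters, plus '_' (POS tags) and '~' (compounds) as before */
        if (isalpha(c) || c == '_' || c == '~')
          putchar(c);
      putchar('\n');
      return 0;
    }

Redirect the output to a file and pass that as VALID_CHARS_FILE.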
From: Beate D. <do...@IM...> - 2004-04-20 05:22:39
Dear Dominic,

I have already changed the man pages to explain the COL_LABELS_FROM_FILE functionality, but let me know if it needs to be explained in more detail.

Cheers,
Beate
From: Dominic W. <dwi...@cs...> - 2004-04-20 07:55:21
> I changed the man pages already to explain the COL_LABELS_FROM_FILE
> functionality, but let me know if it needs to be explained in more detail.

Sorry, Beate, I hadn't seen this before.

One question, though: is there a default location for the COL_LABEL_FILE? It can be set in the default-params file, and since it's a special option I guess it doesn't need a default (since the default is not to have one). Does this sound reasonable?

Best wishes,
Dominic
From: Beate D. <do...@IM...> - 2004-04-20 09:23:41
Hi Dominic,

> One question, though: is there a default location for the COL_LABEL_FILE?
> It can be set in the default-params file, and since it's a special option
> I guess it doesn't need a default (since the default is not to have one).
> Does this sound reasonable?

The infomap-build script initializes the variable COL_LABELS_FROM_FILE to 0 and COL_LABEL_FILE to "". So unless the user sets these variables otherwise (via the -D option of infomap-build or via the parameter file), column labels are "computed" automatically, just as before.

I think you are right: since a column label file is not necessary for the code, a default location probably doesn't make much sense. What we could do is initialize the variables in analogy to the stoplist file:

    COL_LABELS_FROM_FILE=0
    COL_LABEL_FILE="@pkgdatadir@/col.labels"

If the user then sets COL_LABELS_FROM_FILE to 1, column labels will be read from the default location. It may, however, confuse the user that COL_LABEL_FILE is not empty even though the Boolean variable is set to 0. What do you think?

Sleep well,
Beate