From: Viktor T. <v....@ed...> - 2004-04-19 16:57:27
Yes, finally I have uploaded the changes. It took me a while because I
wanted to document it, so I extended the manpages. (Nothing in the
tutorial, though.) *Everything* should work like before out of the box.
Please check this if you can, with a clean temp checkout, compile, etc.

Best
Viktor

On Thu, 15 Apr 2004 11:40:14 -0700 (PDT), Dominic Widdows
<dwi...@cs...> wrote:
>
> Dear Viktor,
>
> Did you manage to commit your changes to the infomap code to SourceForge
> at all?
>
> Best wishes,
> Dominic
>
> On Thu, 8 Apr 2004, Viktor Tron wrote:
>
>> Hello Dominic,
>> I am viktron on SourceForge, if you want to add me,
>> and then I can commit changes.
>> Or maybe you want me to add changes to the documentation as well.
>> But then again, that makes sense only if a proper
>> conception has crystallized concerning what we want the tokenization
>> to do.
>> BTW, do you know Colin Bannard?
>> Best
>> Viktor
>>
>>
>> Quoting Dominic Widdows <dwi...@cs...>:
>>
>> >
>> > Dear Viktor,
>> >
>> > Thanks so much for doing all of this and documenting the changes for
>> > the list. I agree that the my_isalpha function was long overdue for
>> > an overhaul. It sounds like your changes are much more far-reaching
>> > than just this, though, and should enable the software to be much
>> > more language-general. For example, we've been hoping to enable
>> > support for Japanese, and it sounds like this will be possible now?
>> >
>> > It definitely makes more sense to specify which characters you want
>> > the tokenizer to treat as alphabetic in a separate file.
>> >
>> > I'd definitely like to incorporate these changes into the software -
>> > would the best way be to add you to the project admins on SourceForge
>> > and allow you to commit the changes? If you sign up for an account at
>> > https://sourceforge.net/ (or if you have one already)
>> > we can add you as a project developer with the necessary permissions.
>> >
>> > Again, thanks so much for the feedback and the contributions.
>> > Best wishes,
>> > Dominic
>> >
>> > On Thu, 8 Apr 2004, Viktor Tron wrote:
>> >
>> > > Hello all,
>> > >
>> > > Your software is great, but praise belongs on the user list :-).
>> > > I have subscribed to this list now because I want to suggest some
>> > > changes to 0.8.4.
>> > >
>> > > If you are interested, I can send you the tarball, or work it out
>> > > with docs etc. and commit it in CVS.
>> > >
>> > > The story and a summary of the changes are below.
>> > > Cheers
>> > > Viktor
>> > >
>> > > It all started out yesterday. I wanted to use infomap on a
>> > > Hungarian corpus. I soon figured out why things went wrong already
>> > > at the tokenization step.
>> > >
>> > > The problem was:
>> > > utils.c
>> > > lines 46--53
>> > >
>> > > /* This is a somewhat radical approach, in that it assumes
>> > >    ASCII for efficiency and will *break* with other character
>> > >    encodings. */
>> > > int my_isalpha( int c) {
>> > >   // configured to let underscore through for POS
>> > >   // and tilde for indexing compounds
>> > >   return( ( c > 64 && c < 91) || ( c > 96 && c < 123) ||
>> > >           ( c == '_') || ( c == '~'));
>> > > }
>> > >
>> > > This function is used by the tokenizer to determine which
>> > > characters are non-word (breaking) characters.
>> > > It treats all 8-bit characters above 127 as non-word (breaking)
>> > > characters. These characters happen to constitute a crucial part
>> > > of most languages other than English, which are usually encoded
>> > > in an ISO-8859-X encoding with X > 1.
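>> > > Concretely, every accented letter falls outside the accepted
>> > > ranges. An illustrative, untested snippet (not part of the
>> > > codebase) with a Hungarian example:
>> > >
>> > > #include <stdio.h>
>> > >
>> > > /* same test as in utils.c above */
>> > > static int my_isalpha( int c) {
>> > >   return( ( c > 64 && c < 91) || ( c > 96 && c < 123) ||
>> > >           ( c == '_') || ( c == '~'));
>> > > }
>> > >
>> > > int main( void) {
>> > >   /* 0xF5 is LATIN SMALL LETTER O WITH DOUBLE ACUTE in ISO-8859-2,
>> > >      the Hungarian letter in "tőke"; the test rejects it. */
>> > >   unsigned char c = 0xF5;
>> > >   printf( "my_isalpha(0x%X) = %d\n", (unsigned) c, my_isalpha( c));
>> > >   return 0;    /* prints my_isalpha(0xF5) = 0 */
>> > > }
>> > >
>> > > So the tokenizer splits every word at every accented letter.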
>> > >
>> > > It is not that it is a 'radical approach', as someone aptly
>> > > described it; it actually makes the program entirely
>> > > English-specific, entirely unnecessarily.
>> > > So I set out to fix it.
>> > >
>> > > The whole alpha test should be done directly by the tokenizer.
>> > > This function actually says how to segment a stream of characters
>> > > into strings, which is an extremely important *meaningful* part of
>> > > the tokenizer, not an auxiliary function like my_fopen, etc.
>> > > Fortunately, my_isalpha is indeed only used by tokenizer.c.
>> > >
>> > > To handle all this correctly, I introduced an extra resource file
>> > > containing a string of the characters considered valid in words.
>> > > All other characters are treated as breaking characters by the
>> > > tokenizer and are skipped.
>> > >
>> > > The resource file is read in by initialize_tokenizer
>> > > (appropriately, together with the corpus filenames file) and used
>> > > to initialize an array (details below). Lookups in this array can
>> > > then conveniently replace all uses of the previous my_isalpha test.
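>> > > Schematically, the initialization works something like this (a
>> > > sketch of the idea, not the literal committed code; the resource
>> > > file is read byte by byte and each byte it contains is marked
>> > > valid):
>> > >
>> > > #include <stdio.h>
>> > >
>> > > int valid_chars[256];  /* nonzero iff the byte is a word character */
>> > >
>> > > /* Mark every byte occurring in the valid-chars file as a word
>> > >    character; everything else stays 0 (breaking). */
>> > > int read_valid_chars( const char *filename) {
>> > >   FILE *f = fopen( filename, "r");
>> > >   int c;
>> > >   if ( f == NULL) return 0;
>> > >   while ( ( c = fgetc( f)) != EOF)
>> > >     if ( c != '\n') valid_chars[c] = 1;
>> > >   fclose( f);
>> > >   return 1;
>> > > }
>> > >
>> > > The tokenizer then simply tests valid_chars[(unsigned char) c]
>> > > wherever it previously called my_isalpha( c).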
>> > >
>> > > This should give sufficiently flexible and charset-independent
>> > > control over simple text-based tokenization, which means it can
>> > > be properly multilingual software.
>> > > Well, I checked, and it worked for my Hungarian stuff.
>> > >
>> > > I certainly have further ideas for very simple extensions which
>> > > would handle already-tokenized (e.g. XML) files directly.
>> > > With this in place, the solution with valid_chars would just be
>> > > one of two major tokenization modes.
>> > > Also: the read-in does not seem to be optimized (the characters of
>> > > a line are scanned over twice). Since with large corpora this
>> > > takes up a great deal of time, we might want to consider rewriting
>> > > it.
>> > >
>> > >
>> > > Details of the changes (nothing in the documentation yet):
>> > >
>> > > utils.{c,h}:
>> > > the function my_isalpha no longer exists, superseded by a
>> > > more configurable method in the tokenizer
>> > >
>> > > tokenizer.{c,h}:
>> > > introduced an int array valid_chars[256] for lookup:
>> > > for a character c, valid_chars[c] is nonzero iff c is a valid
>> > > word character;
>> > > if it is 0, c is considered breaking (and skipped) by the
>> > > tokenizer (see the sketch after this list)
>> > >
>> > > initialize_tokenizer: now also initializes valid_chars by
>> > > reading from a file passed as an extra argument
>> > >
>> > > prepare_corpus.c:
>> > > modified the invocation of initialize_tokenizer accordingly;
>> > > added parsing code for the extra option '-chfile'
>> > >
>> > > For proper invocation of prepare_corpus, Makefile.data.in and
>> > > infomap-build.in needed to be modified, and for proper
>> > > configuration/installation, some further changes:
>> > >
>> > > admin/valid_chars.en:
>> > > new file: contains the valid chars that exactly replicate the
>> > > chars accepted as non-breaking by the now obsolete my_isalpha
>> > > (utils.c), i.e.:
>> > > ( c > 64 && c < 91) || ( c > 96 && c < 123) || ( c == '_') ||
>> > > ( c == '~')
>> > >
>> > > admin/default-params.in:
>> > > line 13: added the default value
>> > > VALID_CHARS_FILE="@pkgdatadir@/valid_chars.en"
>> > >
>> > > admin/Makefile:
>> > > line 216: added the default valid chars file 'valid_chars.en' to
>> > > the EXTRA_DIST list, to be copied into the central data directory
>> > >
>> > > admin/Makefile.data.in:
>> > > lines 119-125: quotes supplied for all arguments
>> > > (the lack of quotes caused the build procedure to stop already at
>> > > the invocation of prepare_corpus if some filenames were empty,
>> > > rather than reaching the point where it could tell what is
>> > > missing, if the missing file is a problem at all)
>> > > line 125: added a line for valid_chars
>> > >
>> > > admin/infomap-build.in:
>> > > line 113: added a line to dump the value of VALID_CHARS_FILE
>> > >
>> > > line 44: 'cat' corrected to 'echo' (sorry, I see somebody spotted
>> > > this this morning);
>> > > this dumps overriding command-line settings (the -D option) to an
>> > > extra parameter file which is then sourced;
>> > > cat expected the actual setting strings (such as
>> > > "STOPLIST_FILE=my_stop_list") to be filenames
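>> > >
>> > > As promised above, a sketch of how the tokenizer loop uses the
>> > > table (schematic, with hypothetical names; the real tokenizer.c
>> > > also deals with buffering, token length limits, etc.):
>> > >
>> > > extern int valid_chars[256];  /* filled by initialize_tokenizer */
>> > >
>> > > /* Copy the next token into tok and return the resume position,
>> > >    or NULL when the buffer is exhausted. */
>> > > const char *next_token( const char *p, char *tok, int max) {
>> > >   int i = 0;
>> > >   while ( *p && !valid_chars[(unsigned char) *p])  /* skip breaking */
>> > >     p++;
>> > >   if ( !*p) return NULL;
>> > >   while ( *p && valid_chars[(unsigned char) *p] && i < max - 1)
>> > >     tok[i++] = *p++;  /* collect word characters */
>> > >   tok[i] = '\0';
>> > >   return p;
>> > > }
>> > >
>> > > Whether a byte breaks a token is then entirely a matter of what
>> > > the valid_chars file contains, not of hard-wired ASCII ranges.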
>> >
>>
>>
>> +------------------------------------------------------------------+
>> |Viktor Tron                                         v....@ed...|
>> |3fl Rm8. 2 Buccleuch Place Edinburgh          Tel +44 131 650 4414|
>> |European Postgraduate College               www.coli.uni-sb.de/egk|
>> |School of Informatics                     www.informatics.ed.ac.uk|
>> |Theoretical and Applied Linguistics              www.ling.ed.ac.uk|
>> | @ University of Edinburgh, UK                        www.ed.ac.uk|
>> |Dept of Computational Linguistics               www.coli.uni-sb.de|
>> | @ Saarland University (Saarbruecken, Germany) www.uni-saarland.de|
>> |use LINUX and FREE Software                          www.linux.org|
>> +------------------------------------------------------------------+