| From: Dominic W. <dwi...@cs...> - 2004-04-19 23:49:29 |
| Dear All,

I checked out Viktor's changes and the new valid_chars file seems to work really well. I don't know if it will work for Japanese as well?

Scott - did you manage to track down Beate's problem with getting a new version called 0.8.4? I think we should definitely get the changes we've made released.

Beate - do you think you might be able to update the man pages to explain the COL_LABELS_FROM_FILE functionality?

Thanks to everyone for what you've done so far.

Best wishes,
Dominic

On Mon, 19 Apr 2004, Viktor Tron wrote:

> Yes, I have finally uploaded the changes.
>
> It took me a while because I wanted to document it, so I extended the manpages.
> (Nothing to the tutorial, though.)
>
> *Everything* should work like before out of the box.
>
> Please check this if you can with a clean temp checkout and compile, etc.
>
> Best
> Viktor
>
> On Thu, 15 Apr 2004 11:40:14 -0700 (PDT), Dominic Widdows <dwi...@cs...> wrote:
>
> > Dear Viktor,
> >
> > Did you manage to commit your changes to the infomap code to SourceForge at all?
> >
> > Best wishes,
> > Dominic
> >
> > On Thu, 8 Apr 2004, Viktor Tron wrote:
> >
> >> Hello Dominic
> >> I am viktron on SourceForge, if you want to add me,
> >> and then I can commit changes.
> >> Or maybe you want me to add changes to the documentation as well.
> >> But then again, that makes sense only if a proper
> >> conception is crystallized concerning what we want the tokenization
> >> to do.
> >> BTW, do you know Colin Bannard?
> >> Best
> >> Viktor
> >>
> >> Quoting Dominic Widdows <dwi...@cs...>:
> >>
> >> > Dear Viktor,
> >> >
> >> > Thanks so much for doing all of this and documenting the changes for the
> >> > list. I agree that the my_isalpha function was long overdue an overhaul.
> >> > It sounds like your changes are much more far reaching than just this,
> >> > though, and should enable the software to be much more language-general.
> >> > For example, we've been hoping to enable support for Japanese and it
> >> > sounds like this will be possible now?
> >> >
> >> > It definitely makes more sense to specify what characters you want the
> >> > tokenizer to treat as alphabetic in a separate file.
> >> >
> >> > I'd definitely like to incorporate these changes to the software - would
> >> > the best way be to add you to the project admins on SourceForge and allow
> >> > you to commit the changes? If you sign up for an account at
> >> > https://sourceforge.net/ (or if you have one already)
> >> > we can add you as a project developer with the necessary permissions.
> >> >
> >> > Again, thanks so much for the feedback and the contributions.
> >> > Best wishes,
> >> > Dominic
> >> >
> >> > On Thu, 8 Apr 2004, Viktor Tron wrote:
> >> >
> >> > > Hello all,
> >> > >
> >> > > Your software is great, but praises should be on the user list :-).
> >> > > I subscribed to the list now, because I want to suggest some changes to 0.8.4.
> >> > >
> >> > > If you are interested, I can send you the tarball, or work it out with docs etc.
> >> > > and commit in cvs.
> >> > >
> >> > > Story and summary of changes are below.
> >> > > Cheers
> >> > > Viktor
> >> > >
> >> > > It all started out yesterday. I wanted to use infomap on a
> >> > > Hungarian corpus. I soon figured out why things went wrong already at
> >> > > the tokenization step.
> >> > >
> >> > > The problem was:
> >> > > utils.c
> >> > > lines 46--53
> >> > >
> >> > > /* This is a somewhat radical approach, in that it assumes
> >> > >    ASCII for efficiency and will *break* with other character
> >> > >    encodings. */
> >> > > int my_isalpha( int c) { // configured to let underscore through for POS and tilda for indexing compounds
> >> > >   return( ( c > 64 && c < 91) || ( c > 96 && c < 123) || ( c == '_') || c == '~');
> >> > > }
> >> > >
> >> > > This function is used by the tokenizer to determine which are the non-word
> >> > > (breaking) characters.
> >> > > It treats all 8-bit characters with codes of 128 and above as non-word
> >> > > (breaking) characters. These characters happen to constitute a crucial part
> >> > > of most languages other than English,
> >> > > usually encoded in ISO-8859-X coding with X>1.
> >> > >
> >> > > It is not so much a 'radical approach', as someone appropriately
> >> > > described it; it actually makes the program English-specific,
> >> > > entirely unnecessarily.
> >> > > So I set out to fix it.
> >> > >
> >> > > The whole alpha test should be done directly by the tokenizer. This
> >> > > function actually
> >> > > says how to segment a stream of strings, which is an extremely important
> >> > > *meaningful* part of the tokenizer, not an auxiliary function like
> >> > > my_fopen, etc. Fortunately my_isalpha is indeed only used by
> >> > > tokenizer.c.
> >> > >
> >> > > To correctly handle all this, I introduced an extra resource file
> >> > > containing
> >> > > a string of legitimate characters considered valid in words.
> >> > > All other characters will be considered as breaking characters by the
> >> > > tokenizer
> >> > > and are skipped.
> >> > >
> >> > > The resource file is read in by initialize_tokenizer (appropriately
> >> > > together with the corpus filenames file) and used to initialize
> >> > > an array (details below). Then lookup from this array can conveniently
> >> > > replace
> >> > > all uses of the previous my_isalpha test.
> >> > >
> >> > > This should give sufficiently flexible and charset-independent control
> >> > > over simple text-based tokenization, which means it can be properly
> >> > > multilingual software.
> >> > > Well, I checked and it worked for my Hungarian stuff.
> >> > >
> >> > > I also have further ideas for very simple extensions which would perform
> >> > > tokenization of already tokenized (e.g. xml) files directly.
> >> > > With this in place the solution with valid_chars would just be
> >> > > one of the two major tokenization modes.
> >> > > Also: read-in doesn't seem to me to be optimized (characters of a line are
> >> > > scanned over twice). Since with large corpora this takes up a great deal
> >> > > of time, we might want to consider rewriting it.
> >> > >
> >> > >
> >> > > Details of the changes:
> >> > > nothing in the documentation yet.
> >> > >
> >> > > utils.{c,h}:
> >> > > function my_isalpha no longer exists, superseded by
> >> > > more configurable method in tokenizer
> >> > >
> >> > > tokenizer.{c,h}:
> >> > > introduced an int array: valid_chars[256] to look up;
> >> > > for a character c, valid_chars[c] is nonzero iff it is a valid
> >> > > word-character;
> >> > > if it is 0, it is considered as breaking (and skipped) by the tokenizer
> >> > >
> >> > > initialize_tokenizer: now also initializes valid_chars by
> >> > > reading from a file passed as an extra argument
> >> > >
> >> > > prepare_corpus.c:
> >> > > modified invocation of initialize_tokenizer accordingly
> >> > > added parsing code for extra option '-chfile'
> >> > >
> >> > > For proper invocation of prepare_corpus, Makefile.data.in and
> >> > > infomap-build.in
> >> > > needed to be modified, and for proper configuration/installation, some
> >> > > further changes:
> >> > >
> >> > > admin/valid_chars.en:
> >> > > new file: contains the valid chars that exactly replicate the chars
> >> > > accepted as non-breaking by the now obsolete my_isalpha (utils.c)
> >> > > I.e.: (c > 64 && c < 91) || (c > 96 && c < 123) || (c == '_') || (c == '~')
> >> > >
> >> > > admin/default-params.in:
> >> > > line 13: added default value
> >> > > VALID_CHARS_FILE="@pkgdatadir@/valid_chars.en"
> >> > >
> >> > > admin/Makefile:
> >> > > line 216: added default valid chars file 'valid_chars.en' to EXTRA_DIST
> >> > > list
> >> > > to be copied into central data directory
> >> > >
> >> > > admin/Makefile.data.in:
> >> > > line 119-125: quotes supplied for all arguments
> >> > > (lack of quotes caused the build procedure to stop already at
> >> > > invoking prepare-corpus if some filenames were empty,
> >> > > rather than reaching the point where it could tell what is missing,
> >> > > if it is a problem at all that it is missing.)
> >> > > line 125: added line for valid_chars
> >> > >
> >> > > admin/infomap-build.in:
> >> > > line 113: added line to dump value of VALID_CHARS_FILE
> >> > >
> >> > > line 44: 'cat' corrected to 'echo' (sorry, I see somebody spotted this this
> >> > > morning)
> >> > > this dumps overriding command line settings (-D option) to an extra
> >> > > parameter
> >> > > file which is then sourced.
> >> > > cat expected actual setting strings (such as "STOPLIST_FILE=my_stop_list")
> >> > > to be filenames
> >> > >
> >> > > +------------------------------------------------------------------+
> >> > > |Viktor Tron v....@ed...|
> >> > > |3fl Rm8 2 Buccleuch Pl EH8 9LW Edinburgh Tel +44 131 650 4414|
> >> > > |European Postgraduate College www.coli.uni-sb.de/egk|
> >> > > |School of Informatics www.informatics.ed.ac.uk|
> >> > > |Theoretical and Applied Linguistics www.ling.ed.ac.uk|
> >> > > | @ University of Edinburgh, UK www.ed.ac.uk|
> >> > > |Dept of Computational Linguistics www.coli.uni-sb.de|
> >> > > | @ Saarland University (Saarbruecken, Germany) www.uni-saarland.de|
> >> > > |use LINUX and FREE Software www.linux.org|
> >> > > +------------------------------------------------------------------+
> >> > >
> >> > > _______________________________________________
> >> > > infomap-nlp-devel mailing list
> >> > > inf...@li...
> >> > > https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel
|
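For readers following the thread, the table-lookup tokenization that Viktor describes might look roughly like the sketch below. It is only an illustration and not the code committed to the Infomap repository: the valid_chars[256] array and its "nonzero iff valid word character" meaning come from the message, while the function names set_valid_chars, load_valid_chars and next_token, their signatures, and the decision to ignore line-break bytes in the resource file are assumptions made here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* valid_chars[c] is nonzero iff byte c may appear inside a word;
   every other byte is treated as breaking and skipped. */
static int valid_chars[256];

/* Hypothetical helper: mark each of the n bytes in chars as a word character. */
void set_valid_chars(const unsigned char *chars, size_t n)
{
    assert(chars != NULL || n == 0);
    memset(valid_chars, 0, sizeof valid_chars);
    for (size_t i = 0; i < n; i++)
        valid_chars[chars[i]] = 1;
}

/* Hypothetical loader for a resource file such as valid_chars.en.
   Assumption: line-break bytes in the file are not word characters.
   Returns 0 on success, -1 if the file cannot be opened. */
int load_valid_chars(const char *path)
{
    FILE *fp = fopen(path, "rb");
    int c;

    if (fp == NULL)
        return -1;
    memset(valid_chars, 0, sizeof valid_chars);
    /* fgetc yields each byte as a value in 0..255, so it indexes the table safely. */
    while ((c = fgetc(fp)) != EOF)
        if (c != '\n' && c != '\r')
            valid_chars[c] = 1;
    fclose(fp);
    return 0;
}

/* Copy the next token starting at *cursor into buf (truncating if needed),
   advance *cursor past it, and return the stored length (0 = no more tokens). */
size_t next_token(const char **cursor, char *buf, size_t bufsize)
{
    const unsigned char *p = (const unsigned char *)*cursor;
    size_t len = 0;

    while (*p != '\0' && !valid_chars[*p])   /* skip breaking bytes */
        p++;
    while (*p != '\0' && valid_chars[*p]) {  /* collect word bytes */
        if (len + 1 < bufsize)
            buf[len++] = (char)*p;
        p++;
    }
    if (bufsize > 0)
        buf[len] = '\0';
    *cursor = (const char *)p;
    return len;
}
```

Indexing through an unsigned char pointer matters in a sketch like this: on platforms where plain char is signed, a byte such as 0xE9 (e-acute in ISO-8859-1 and ISO-8859-2) would otherwise become a negative array index. With a table that contains the lowercase ASCII letters plus that byte, the input "caf\xe9 au_lait" would be split on the space only if '_' is also in the table, which is the kind of per-language control the separate resource file is meant to give.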