From: Dominic W. <dwi...@cs...> - 2004-04-19 17:19:34
Thanks so much, Victor. I'll check out your changes this afternoon and try my luck :)

On Mon, 19 Apr 2004, Viktor Tron wrote:

> Yes, finally I have uploaded the changes.
>
> It took me a while because I wanted to document it, so I extended the
> manpages. (Nothing to the tutorial, though.)
>
> *Everything* should work like before, out of the box.
>
> Please check this if you can with a clean temp checkout and compile, etc.
>
> Best
> Viktor
>
> On Thu, 15 Apr 2004 11:40:14 -0700 (PDT), Dominic Widdows
> <dwi...@cs...> wrote:
>
> > Dear Viktor,
> >
> > Did you manage to commit your changes to the infomap code to
> > SourceForge at all?
> >
> > Best wishes,
> > Dominic
> >
> > On Thu, 8 Apr 2004, Viktor Tron wrote:
> >
> >> Hello Dominic,
> >> I am viktron on SourceForge, if you want to add me,
> >> and then I can commit changes.
> >> Or maybe you want me to add changes to the documentation as well.
> >> But then again, that makes sense only if a proper conception is
> >> crystallized concerning what we want the tokenization to do.
> >> BTW, do you know Colin Bannard?
> >> Best
> >> Viktor
> >>
> >> Quoting Dominic Widdows <dwi...@cs...>:
> >>
> >> > Dear Viktor,
> >> >
> >> > Thanks so much for doing all of this and documenting the changes
> >> > for the list. I agree that the my_isalpha function was long
> >> > overdue for an overhaul.
> >> > It sounds like your changes are much more far-reaching than just
> >> > this, though, and should enable the software to be much more
> >> > language-general. For example, we've been hoping to enable support
> >> > for Japanese, and it sounds like this will be possible now?
> >> >
> >> > It definitely makes more sense to specify what characters you want
> >> > the tokenizer to treat as alphabetic in a separate file.
> >> >
> >> > I'd definitely like to incorporate these changes into the software -
> >> > would the best way be to add you to the project admins on
> >> > SourceForge and allow you to commit the changes? If you sign up
> >> > for an account at https://sourceforge.net/ (or if you have one
> >> > already) we can add you as a project developer with the necessary
> >> > permissions.
> >> >
> >> > Again, thanks so much for the feedback and the contributions.
> >> > Best wishes,
> >> > Dominic
> >> >
> >> > On Thu, 8 Apr 2004, Viktor Tron wrote:
> >> >
> >> > > Hello all,
> >> > >
> >> > > Your software is great, but praises should be on the user list :-).
> >> > > I subscribed to the list now, because I suggest some changes to 0.8.4.
> >> > >
> >> > > If you are interested I can send you the tarball, or work it out
> >> > > with docs etc. and commit it in CVS.
> >> > >
> >> > > Story and summary of changes are below.
> >> > > Cheers
> >> > > Viktor
> >> > >
> >> > > It all started out yesterday. I wanted to use infomap on a
> >> > > Hungarian corpus. I soon figured out why things went wrong
> >> > > already at the tokenization step.
> >> > >
> >> > > The problem was utils.c, lines 46--53:
> >> > >
> >> > > /* This is a somewhat radical approach, in that it assumes
> >> > >    ASCII for efficiency and will *break* with other character
> >> > >    encodings. */
> >> > > /* Configured to let underscore through for POS tags and tilde
> >> > >    for indexing compounds. */
> >> > > int my_isalpha( int c) {
> >> > >   return ( ( c > 64 && c < 91) || ( c > 96 && c < 123)
> >> > >            || ( c == '_') || ( c == '~'));
> >> > > }
> >> > >
> >> > > This function is used by the tokenizer to determine which are
> >> > > the non-word (breaking) characters.
> >> > > It views 8-bit chars above 128 as non-word (breaking)
> >> > > characters. These characters happen to constitute a crucial
> >> > > part of most languages other than English, usually encoded in
> >> > > ISO-8859-X with X>1.
> >> > >
> >> > > It is not so much that it is a 'radical approach', as someone
> >> > > appropriately described it, but that it actually makes the
> >> > > program entirely English-specific, entirely unnecessarily.
> >> > > So I set out to fix it.
> >> > >
> >> > > The whole alpha test should be done directly by the tokenizer.
> >> > > This function actually says how to segment a stream of strings,
> >> > > which is an extremely important *meaningful* part of the
> >> > > tokenizer, not an auxiliary function like my_fopen, etc.
> >> > > Fortunately, my_isalpha is indeed only used by tokenizer.c.
> >> > >
> >> > > To handle all this correctly, I introduced an extra resource
> >> > > file containing a string of legitimate characters considered
> >> > > valid in words. All other characters will be considered
> >> > > breaking characters by the tokenizer and are skipped.
> >> > >
> >> > > The resource file is read in by initialize_tokenizer
> >> > > (appropriately, together with the corpus filenames file) and
> >> > > used to initialize an array (details below). Then lookup in
> >> > > this array can conveniently replace all uses of the previous
> >> > > my_isalpha test.
> >> > >
> >> > > This should give sufficiently flexible and charset-independent
> >> > > control over simple text-based tokenization, which means it can
> >> > > be proper multilingual software.
> >> > > Well, I checked, and it worked for my Hungarian stuff.
> >> > >
> >> > > I certainly have further ideas for very simple extensions which
> >> > > would perform tokenization of already tokenized (e.g. XML)
> >> > > files directly.
> >> > > With this in place, the solution with valid_chars would just be
> >> > > one of the two major tokenization modes.
> >> > > Also: read-in doesn't seem to me to be optimized (characters of
> >> > > a line are scanned over twice). Since with large corpora this
> >> > > takes up a great deal of time, we might want to consider
> >> > > rewriting it.
> >> > >
> >> > > Details of the changes (nothing in the documentation yet):
> >> > >
> >> > > utils.{c,h}:
> >> > > function my_isalpha no longer exists, superseded by a more
> >> > > configurable method in the tokenizer
> >> > >
> >> > > tokenizer.{c,h}:
> >> > > introduced an int array valid_chars[256] for lookup:
> >> > > for a character c, valid_chars[c] is nonzero iff c is a valid
> >> > > word character; if it is 0, the character is considered
> >> > > breaking (and skipped) by the tokenizer
> >> > >
> >> > > initialize_tokenizer: now also initializes valid_chars by
> >> > > reading from a file passed as an extra argument
> >> > >
> >> > > prepare_corpus.c:
> >> > > modified the invocation of initialize_tokenizer accordingly;
> >> > > added parsing code for the extra option '-chfile'
> >> > >
> >> > > For proper invocation of prepare_corpus, Makefile.data.in and
> >> > > infomap-build.in needed to be modified, and for proper
> >> > > configuration/installation, some further changes:
> >> > >
> >> > > admin/valid_chars.en:
> >> > > new file: contains the valid chars that exactly replicate the
> >> > > chars accepted as non-breaking by the now obsolete my_isalpha
> >> > > (utils.c), i.e.:
> >> > > ( c > 64 && c < 91) || ( c > 96 && c < 123) || ( c == '_') || ( c == '~')
> >> > >
> >> > > admin/default-params.in:
> >> > > line 13: added default value
> >> > > VALID_CHARS_FILE="@pkgdatadir@/valid_chars.en"
> >> > >
> >> > > admin/Makefile:
> >> > > line 216: added the default valid chars file 'valid_chars.en'
> >> > > to the EXTRA_DIST list, to be copied into the central data
> >> > > directory
> >> > >
> >> > > admin/Makefile.data.in:
> >> > > lines 119-125: quotes supplied for all arguments
> >> > > (lack of quotes caused the build procedure to stop already at
> >> > > invoking prepare-corpus if some filenames were empty, rather
> >> > > than reaching the point where it could tell what is missing, if
> >> > > it is a problem at all that it is missing)
> >> > > line 125: added a line for valid_chars
> >> > >
> >> > > admin/infomap-build.in:
> >> > > line 113: added a line to dump the value of VALID_CHARS_FILE
> >> > >
> >> > > line 44: 'cat' corrected to 'echo' (sorry, I see somebody
> >> > > spotted this this morning);
> >> > > this dumps overriding command line settings (-D option) to an
> >> > > extra parameter file which is then sourced.
> >> > > cat expected the actual setting strings (such as
> >> > > "STOPLIST_FILE=my_stop_list") to be filenames
> >> > >
> >> > > +------------------------------------------------------------------+
> >> > > |Viktor Tron v....@ed...|
> >> > > |3fl Rm8 2 Buccleuch Pl EH8 9LW Edinburgh Tel +44 131 650 4414|
> >> > > |European Postgraduate College www.coli.uni-sb.de/egk|
> >> > > |School of Informatics www.informatics.ed.ac.uk|
> >> > > |Theoretical and Applied Linguistics www.ling.ed.ac.uk|
> >> > > | @ University of Edinburgh, UK www.ed.ac.uk|
> >> > > |Dept of Computational Linguistics www.coli.uni-sb.de|
> >> > > | @ Saarland University (Saarbruecken, Germany) www.uni-saarland.de|
> >> > > |use LINUX and FREE Software www.linux.org|
> >> > > +------------------------------------------------------------------+
> >> > >
> >> > > -------------------------------------------------------
> >> > > This SF.Net email is sponsored by: IBM Linux Tutorials
> >> > > Free Linux tutorial presented by Daniel Robbins, President and
> >> > > CEO of GenToo technologies.
> >> > > Learn everything from fundamentals to system administration.
> >> > > http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
> >> > > _______________________________________________
> >> > > infomap-nlp-devel mailing list
> >> > > inf...@li...
> >> > > https://lists.sourceforge.net/lists/listinfo/infomap-nlp-devel
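The valid_chars mechanism the thread describes - a 256-entry lookup table filled from a resource file listing the characters that count as word characters - can be sketched in C roughly as follows. This is a minimal illustration under the assumptions stated in the thread, not the actual infomap-nlp source; the function names initialize_valid_chars and is_word_char are hypothetical stand-ins for the logic added to tokenizer.c.

```c
#include <stdio.h>
#include <string.h>

/* valid_chars[c] is nonzero iff the (unsigned) byte c is a valid word
   character.  All other bytes are treated as breaking characters by
   the tokenizer and skipped.  Because the table covers all 256 byte
   values, characters above 127 (e.g. ISO-8859-2 accented letters used
   in Hungarian) can be declared valid simply by listing them in the
   chars file, unlike the old hard-coded my_isalpha() test. */
int valid_chars[256];

/* Fill the table from a resource file such as admin/valid_chars.en,
   which contains a string of the legitimate word characters.  Returns
   1 on success, 0 if the file could not be opened.  A trailing
   newline in the file is ignored (assumption: the chars file is a
   plain text file that may end with a newline). */
int initialize_valid_chars(const char *chfile)
{
    FILE *fp = fopen(chfile, "r");
    int c;

    if (fp == NULL)
        return 0;
    memset(valid_chars, 0, sizeof(valid_chars));
    while ((c = fgetc(fp)) != EOF) {
        if (c != '\n' && c != '\r')   /* don't mark line endings as valid */
            valid_chars[(unsigned char) c] = 1;
    }
    fclose(fp);
    return 1;
}

/* Replacement for the old my_isalpha() test: a plain table lookup,
   charset-independent and configurable per language. */
int is_word_char(int c)
{
    return valid_chars[(unsigned char) c];
}
```

A file reproducing the old English-only behaviour would simply list `A`-`Z`, `a`-`z`, `_`, and `~`; a Hungarian setup would add the ISO-8859-2 accented letters to the same file, with no code change.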