From: Viktor T. <v....@ed...> - 2004-04-19 16:57:27
Yes, finally I have uploaded the changes. It took me a while because I
wanted to document it, so I extended the manpages. (Nothing in the
tutorial, though.) *Everything* should work like before out of the box.
Please check this if you can, with a clean temp checkout, compile, etc.

Best
Viktor

On Thu, 15 Apr 2004 11:40:14 -0700 (PDT), Dominic Widdows
<dwi...@cs...> wrote:
>
> Dear Viktor,
>
> Did you manage to commit your changes to the infomap code to SourceForge
> at all?
>
> Best wishes,
> Dominic
>
> On Thu, 8 Apr 2004, Viktor Tron wrote:
>
>> Hello Dominic,
>> I am viktron on SourceForge, if you want to add me,
>> and then I can commit changes.
>> Or maybe you want me to add changes to the documentation as well.
>> But then again, that makes sense only if a proper
>> conception has crystallized concerning what we want the tokenization
>> to do.
>> BTW, do you know Colin Bannard?
>> Best
>> Viktor
>>
>>
>> Quoting Dominic Widdows <dwi...@cs...>:
>>
>> >
>> > Dear Viktor,
>> >
>> > Thanks so much for doing all of this and documenting the changes for
>> > the list. I agree that the my_isalpha function was long overdue for
>> > an overhaul. It sounds like your changes are much more far-reaching
>> > than just this, though, and should enable the software to be much
>> > more language-general. For example, we've been hoping to enable
>> > support for Japanese, and it sounds like this will be possible now?
>> >
>> > It definitely makes more sense to specify which characters you want
>> > the tokenizer to treat as alphabetic in a separate file.
>> >
>> > I'd definitely like to incorporate these changes into the software -
>> > would the best way be to add you to the project admins on SourceForge
>> > and allow you to commit the changes? If you sign up for an account at
>> > https://sourceforge.net/ (or if you have one already)
>> > we can add you as a project developer with the necessary permissions.
>> >
>> > Again, thanks so much for the feedback and the contributions.
>> > Best wishes,
>> > Dominic
>> >
>> > On Thu, 8 Apr 2004, Viktor Tron wrote:
>> >
>> > > Hello all,
>> > >
>> > > Your software is great, but praise belongs on the user list :-).
>> > > I have subscribed to this list now because I want to suggest some
>> > > changes to 0.8.4.
>> > >
>> > > If you are interested, I can send you the tarball, or work it out
>> > > with docs etc. and commit it in CVS.
>> > >
>> > > The story and a summary of the changes are below.
>> > > Cheers
>> > > Viktor
>> > >
>> > > It all started out yesterday. I wanted to use infomap on a
>> > > Hungarian corpus. I soon figured out why things went wrong already
>> > > at the tokenization step.
>> > >
>> > > The problem was:
>> > > utils.c
>> > > lines 46--53
>> > >
>> > > /* This is a somewhat radical approach, in that it assumes
>> > >    ASCII for efficiency and will *break* with other character
>> > >    encodings. */
>> > > int my_isalpha( int c) {
>> > >   // configured to let underscore through for POS
>> > >   // and tilde for indexing compounds
>> > >   return( ( c > 64 && c < 91) || ( c > 96 && c < 123) ||
>> > >           ( c == '_') || ( c == '~'));
>> > > }
>> > >
>> > > This function is used by the tokenizer to determine which
>> > > characters are non-word (breaking) characters.
>> > > It treats all 8-bit characters above 127 as non-word (breaking)
>> > > characters. These characters happen to constitute a crucial part
>> > > of most languages other than English, which are usually encoded
>> > > in an ISO-8859-X encoding with X > 1.
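>> > > Concretely, every accented letter falls outside the accepted
>> > > ranges. An illustrative, untested snippet (not part of the
>> > > codebase) with a Hungarian example:
>> > >
>> > > #include <stdio.h>
>> > >
>> > > /* same test as in utils.c above */
>> > > static int my_isalpha( int c) {
>> > >   return( ( c > 64 && c < 91) || ( c > 96 && c < 123) ||
>> > >           ( c == '_') || ( c == '~'));
>> > > }
>> > >
>> > > int main( void) {
>> > >   /* 0xF5 is LATIN SMALL LETTER O WITH DOUBLE ACUTE in ISO-8859-2,
>> > >      the Hungarian letter in "tőke"; the test rejects it. */
>> > >   unsigned char c = 0xF5;
>> > >   printf( "my_isalpha(0x%X) = %d\n", (unsigned) c, my_isalpha( c));
>> > >   return 0;    /* prints my_isalpha(0xF5) = 0 */
>> > > }
>> > >
>> > > So the tokenizer splits every word at every accented letter.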
>> > >
>> > > It is not that it is a 'radical approach', as someone aptly
>> > > described it; it actually makes the program entirely
>> > > English-specific, entirely unnecessarily.
>> > > So I set out to fix it.
>> > >
>> > > The whole alpha test should be done directly by the tokenizer.
>> > > This function actually says how to segment a stream of characters
>> > > into strings, which is an extremely important *meaningful* part of
>> > > the tokenizer, not an auxiliary function like my_fopen, etc.
>> > > Fortunately, my_isalpha is indeed only used by tokenizer.c.
>> > >
>> > > To handle all this correctly, I introduced an extra resource file
>> > > containing a string of the characters considered valid in words.
>> > > All other characters are treated as breaking characters by the
>> > > tokenizer and are skipped.
>> > >
>> > > The resource file is read in by initialize_tokenizer
>> > > (appropriately, together with the corpus filenames file) and used
>> > > to initialize an array (details below). Lookups in this array can
>> > > then conveniently replace all uses of the previous my_isalpha test.
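>> > > Schematically, the initialization works something like this (a
>> > > sketch of the idea, not the literal committed code; the resource
>> > > file is read byte by byte and each byte it contains is marked
>> > > valid):
>> > >
>> > > #include <stdio.h>
>> > >
>> > > int valid_chars[256];  /* nonzero iff the byte is a word character */
>> > >
>> > > /* Mark every byte occurring in the valid-chars file as a word
>> > >    character; everything else stays 0 (breaking). */
>> > > int read_valid_chars( const char *filename) {
>> > >   FILE *f = fopen( filename, "r");
>> > >   int c;
>> > >   if ( f == NULL) return 0;
>> > >   while ( ( c = fgetc( f)) != EOF)
>> > >     if ( c != '\n') valid_chars[c] = 1;
>> > >   fclose( f);
>> > >   return 1;
>> > > }
>> > >
>> > > The tokenizer then simply tests valid_chars[(unsigned char) c]
>> > > wherever it previously called my_isalpha( c).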
>> > >
>> > > This should give sufficiently flexible and charset-independent
>> > > control over simple text-based tokenization, which means it can
>> > > be properly multilingual software.
>> > > Well, I checked, and it worked for my Hungarian stuff.
>> > >
>> > > I certainly have further ideas for very simple extensions which
>> > > would handle already-tokenized (e.g. XML) files directly.
>> > > With this in place, the solution with valid_chars would just be
>> > > one of two major tokenization modes.
>> > > Also: the read-in does not seem to be optimized (the characters of
>> > > a line are scanned over twice). Since with large corpora this
>> > > takes up a great deal of time, we might want to consider rewriting
>> > > it.
>> > >
>> > >
>> > > Details of the changes (nothing in the documentation yet):
>> > >
>> > > utils.{c,h}:
>> > > the function my_isalpha no longer exists, superseded by a
>> > > more configurable method in the tokenizer
>> > >
>> > > tokenizer.{c,h}:
>> > > introduced an int array valid_chars[256] for lookup:
>> > > for a character c, valid_chars[c] is nonzero iff c is a valid
>> > > word character;
>> > > if it is 0, c is considered breaking (and skipped) by the
>> > > tokenizer (see the sketch after this list)
>> > >
>> > > initialize_tokenizer: now also initializes valid_chars by
>> > > reading from a file passed as an extra argument
>> > >
>> > > prepare_corpus.c:
>> > > modified the invocation of initialize_tokenizer accordingly;
>> > > added parsing code for the extra option '-chfile'
>> > >
>> > > For proper invocation of prepare_corpus, Makefile.data.in and
>> > > infomap-build.in needed to be modified, and for proper
>> > > configuration/installation, some further changes:
>> > >
>> > > admin/valid_chars.en:
>> > > new file: contains the valid chars that exactly replicate the
>> > > chars accepted as non-breaking by the now obsolete my_isalpha
>> > > (utils.c), i.e.:
>> > > ( c > 64 && c < 91) || ( c > 96 && c < 123) || ( c == '_') ||
>> > > ( c == '~')
>> > >
>> > > admin/default-params.in:
>> > > line 13: added the default value
>> > > VALID_CHARS_FILE="@pkgdatadir@/valid_chars.en"
>> > >
>> > > admin/Makefile:
>> > > line 216: added the default valid chars file 'valid_chars.en' to
>> > > the EXTRA_DIST list, to be copied into the central data directory
>> > >
>> > > admin/Makefile.data.in:
>> > > lines 119-125: quotes supplied for all arguments
>> > > (the lack of quotes caused the build procedure to stop already at
>> > > the invocation of prepare_corpus if some filenames were empty,
>> > > rather than reaching the point where it could tell what is
>> > > missing, if the missing file is a problem at all)
>> > > line 125: added a line for valid_chars
>> > >
>> > > admin/infomap-build.in:
>> > > line 113: added a line to dump the value of VALID_CHARS_FILE
>> > >
>> > > line 44: 'cat' corrected to 'echo' (sorry, I see somebody spotted
>> > > this this morning);
>> > > this dumps overriding command-line settings (the -D option) to an
>> > > extra parameter file which is then sourced;
>> > > cat expected the actual setting strings (such as
>> > > "STOPLIST_FILE=my_stop_list") to be filenames
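>> > >
>> > > As promised above, a sketch of how the tokenizer loop uses the
>> > > table (schematic, with hypothetical names; the real tokenizer.c
>> > > also deals with buffering, token length limits, etc.):
>> > >
>> > > extern int valid_chars[256];  /* filled by initialize_tokenizer */
>> > >
>> > > /* Copy the next token into tok and return the resume position,
>> > >    or NULL when the buffer is exhausted. */
>> > > const char *next_token( const char *p, char *tok, int max) {
>> > >   int i = 0;
>> > >   while ( *p && !valid_chars[(unsigned char) *p])  /* skip breaking */
>> > >     p++;
>> > >   if ( !*p) return NULL;
>> > >   while ( *p && valid_chars[(unsigned char) *p] && i < max - 1)
>> > >     tok[i++] = *p++;  /* collect word characters */
>> > >   tok[i] = '\0';
>> > >   return p;
>> > > }
>> > >
>> > > Whether a byte breaks a token is then entirely a matter of what
>> > > the valid_chars file contains, not of hard-wired ASCII ranges.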
>> >
>>
>>
>> +------------------------------------------------------------------+
>> |Viktor Tron                                         v....@ed...|
>> |3fl Rm8. 2 Buccleuch Place Edinburgh          Tel +44 131 650 4414|
>> |European Postgraduate College               www.coli.uni-sb.de/egk|
>> |School of Informatics                     www.informatics.ed.ac.uk|
>> |Theoretical and Applied Linguistics              www.ling.ed.ac.uk|
>> | @ University of Edinburgh, UK                        www.ed.ac.uk|
>> |Dept of Computational Linguistics               www.coli.uni-sb.de|
>> | @ Saarland University (Saarbruecken, Germany) www.uni-saarland.de|
>> |use LINUX and FREE Software                          www.linux.org|
>> +------------------------------------------------------------------+