From: Viktor T. <v....@ed...> - 2004-04-08 14:12:03
Hello all,
Your software is great, but praise belongs on the user list :-).
I have now subscribed to this list because I would like to suggest some changes for 0.8.4.
If you are interested, I can send you the tarball, or work it out with docs etc.
and commit it to CVS.
The story and a summary of the changes are below.
Cheers
Viktor
It all started yesterday, when I wanted to use infomap on a Hungarian
corpus. Things went wrong already at the tokenization step, and I soon
figured out why.
The problem was in utils.c, lines 46--53:

/* This is a somewhat radical approach, in that it assumes
   ASCII for efficiency and will *break* with other character
   encodings. */
int my_isalpha( int c) {
  /* configured to let underscore through for POS
     and tilde for indexing compounds */
  return( ( c > 64 && c < 91) || ( c > 96 && c < 123) || ( c == '_') || ( c == '~'));
}
This function is used by the tokenizer to determine which characters are
non-word (breaking) characters. It treats all 8-bit characters above 127
as breaking characters. These characters happen to constitute a crucial
part of most languages other than English, which are usually encoded in
ISO-8859-X with X > 1. So it is not merely a 'radical approach', as the
comment aptly puts it; it actually makes the program entirely
English-specific, entirely unnecessarily.
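Concretely (a minimal reproduction of the test; the byte value below is
from ISO-8859-1/2, not from the infomap sources):

```c
/* The original test from utils.c: only ASCII letters, underscore
   and tilde count as word characters. */
int my_isalpha(int c) {
    return (c > 64 && c < 91)      /* 'A'..'Z' */
        || (c > 96 && c < 123)     /* 'a'..'z' */
        || (c == '_')              /* underscore, for POS tags */
        || (c == '~');             /* tilde, for compounds */
}
```

Every ISO-8859-2 letter, e.g. 0xF6 ('o' with umlaut), fails this test,
so each accented letter inside a Hungarian word becomes a token boundary.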
So I set out to fix it.
The whole alpha test should be done directly by the tokenizer. This
function actually says how to segment a stream of characters into
strings, which is an extremely important *meaningful* part of the
tokenizer, not an auxiliary function like my_fopen, etc. Fortunately,
my_isalpha is indeed only used by tokenizer.c.
To handle all this correctly, I introduced an extra resource file
containing a string of the characters considered valid in words.
All other characters are treated as breaking characters by the
tokenizer and are skipped.
The resource file is read in by initialize_tokenizer (appropriately
together with the corpus filenames file) and used to initialize an
array (details below). Lookups in this array can then conveniently
replace all uses of the previous my_isalpha test.
This should give sufficiently flexible and charset-independent control
over simple text-based tokenization, which means infomap can become
properly multilingual software.
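As a sketch, the scheme looks roughly like this (the names follow the
description above; the code actually committed may differ in details):

```c
#include <string.h>

static int valid_chars[256];   /* nonzero iff the byte is a word character */

/* Mark every byte of the given string as valid; all other bytes
   remain breaking characters. */
void set_valid_chars(const char *chars) {
    const unsigned char *p;
    memset(valid_chars, 0, sizeof valid_chars);
    for (p = (const unsigned char *)chars; *p; p++)
        valid_chars[*p] = 1;
}

/* Table lookup that replaces the old my_isalpha() test. */
int is_valid_char(int c) {
    return valid_chars[(unsigned char)c];
}
```

Because the table has one slot per byte value, any 8-bit encoding
(ISO-8859-2 included) works with no code change, only a different
resource file.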
Well, I checked and it worked for my Hungarian stuff.
I also have further ideas for very simple extensions that would
tokenize already tokenized (e.g. XML) files directly. With this in
place, the valid_chars solution would just be one of two major
tokenization modes.
Also: the read-in does not seem to be optimized (the characters of a
line are scanned over twice). Since this takes up a great deal of time
with large corpora, we might want to consider rewriting it.
Details of the changes:
nothing in the documentation yet.
utils.{c,h}:
the function my_isalpha no longer exists; it is superseded by a
more configurable method in the tokenizer
tokenizer.{c,h}:
introduced an int array valid_chars[256] for lookup:
for a character c, valid_chars[c] is nonzero iff c is a valid
word character;
if it is 0, c is treated as breaking (and skipped) by the tokenizer
initialize_tokenizer now also initializes valid_chars by
reading from a file passed as an extra argument
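The read-in step might look like the following (a sketch only; the
actual signature and error handling of initialize_tokenizer live in
tokenizer.c, and the function name here is illustrative):

```c
#include <stdio.h>
#include <stdlib.h>

static int valid_chars[256];

/* Fill valid_chars from a resource file whose content is simply a
   string of the bytes that count as word characters. */
void read_valid_chars(const char *filename) {
    FILE *fp = fopen(filename, "r");
    int c;
    if (fp == NULL) {
        fprintf(stderr, "can't open valid chars file %s\n", filename);
        exit(EXIT_FAILURE);
    }
    for (c = 0; c < 256; c++)
        valid_chars[c] = 0;
    while ((c = fgetc(fp)) != EOF)
        if (c != '\n')        /* a trailing newline is not a word char */
            valid_chars[c] = 1;
    fclose(fp);
}
```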
prepare_corpus.c:
modified invocation of initialize_tokenizer accordingly
added parsing code for extra option '-chfile'
For proper invocation of prepare_corpus, Makefile.data.in and
infomap-build.in
needed to be modified, and proper configuration/installation required
some further changes:
admin/valid_chars.en:
new file: contains the valid chars that exactly replicate the chars
accepted as non-breaking by the now obsolete my_isalpha (utils.c),
i.e. ( c > 64 && c < 91) || ( c > 96 && c < 123) || ( c == '_') ||
( c == '~')
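Spelled out as file content, that character set is simply (assuming a
plain one-line format; the committed file may be laid out differently):

```text
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_~
```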
admin/default-params.in:
line 13: added default value
VALID_CHARS_FILE="@pkgdatadir@/valid_chars.en"
admin/Makefile:
line 216: added the default valid chars file 'valid_chars.en' to the
EXTRA_DIST list,
to be copied into the central data directory
admin/Makefile.data.in:
lines 119--125: quotes supplied for all arguments
(the lack of quotes caused the build procedure to stop already when
invoking prepare-corpus if some filenames were empty,
rather than reaching the point where it could tell what is missing,
if the missing value is a problem at all)
line 125: added a line for valid_chars
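The quoting issue is the usual shell word-splitting behaviour: an
unquoted empty variable vanishes from the argument list entirely, so
later arguments shift position (a generic illustration, not the actual
Makefile.data.in lines):

```shell
#!/bin/sh
count() { echo $#; }   # prints the number of arguments received

EMPTY=""
count $EMPTY foo       # unquoted: the empty argument vanishes -> prints 1
count "$EMPTY" foo     # quoted: the empty argument survives   -> prints 2
```

With quotes, prepare-corpus receives the empty argument in its proper
position and can report exactly which setting is missing.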
admin/infomap-build.in:
line 113: added a line to dump the value of VALID_CHARS_FILE
line 44: 'cat' corrected to 'echo' (sorry, I see someone spotted this
this morning);
this line dumps overriding command-line settings (the -D option) to an
extra parameter
file which is then sourced;
cat treated the actual setting strings (such as
"STOPLIST_FILE=my_stop_list") as filenames
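For the record, the difference (a generic sketch of the -D mechanism as
described above, not the actual infomap-build.in lines):

```shell
#!/bin/sh
setting="STOPLIST_FILE=my_stop_list"

# Wrong: cat treats the string as a *filename* and fails to open it:
#   cat "$setting" > params.sh
# Right: echo writes the string itself into the parameter file:
echo "$setting" > params.sh

. ./params.sh               # source the file to apply the override
echo "$STOPLIST_FILE"       # -> my_stop_list
```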
+------------------------------------------------------------------+
|Viktor Tron v....@ed...|
|3fl Rm8 2 Buccleuch Pl EH8 9LW Edinburgh Tel +44 131 650 4414|
|European Postgraduate College www.coli.uni-sb.de/egk|
|School of Informatics www.informatics.ed.ac.uk|
|Theoretical and Applied Linguistics www.ling.ed.ac.uk|
| @ University of Edinburgh, UK www.ed.ac.uk|
|Dept of Computational Linguistics www.coli.uni-sb.de|
| @ Saarland University (Saarbruecken, Germany) www.uni-saarland.de|
|use LINUX and FREE Software www.linux.org|
+------------------------------------------------------------------+