Czech language tokenizer and segmenter News
Status: Pre-Alpha
Brought to you by: kveton
Finished support for external definitions of rules; PDT compatibility achieved again.
I'm studying Unicode issues and the possibility of migrating from iswalpha() and related functions to Unicode character properties.
This is the latest release before migrating to external definitions of the non-trivial tokenizer and segmenter rules.
Achieved final compatibility with PDT tokenization. Fixed an end-of-document bug.
Includes a comparison of the tokenizer with PDT 2.0.
The trivial tokenizer is working and is almost PDT-compliant.