Tokenization and segmentation for the Czech language.
The main change is the external definition of rules, which allows the binary executable to work with multiple sets of rules (e.g. for multiple languages) without recompilation.
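As an illustration of why externally defined rules remove the need for recompilation, here is a minimal Python sketch. It is not the project's implementation: the rule-file format (one `NAME<TAB>regex` pair per line, `#` for comments) and the names `load_rules`, `tokenize`, and `EXAMPLE_RULES` are assumptions made up for this example.

```python
import re

def load_rules(lines):
    """Parse rule lines of the (hypothetical) form NAME<TAB>REGEX."""
    rules = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, pattern = line.split("\t", 1)
        rules.append((name, re.compile(pattern)))
    return rules

def tokenize(text, rules):
    """Apply the first matching rule at each position; whitespace separates tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for name, rx in rules:
            m = rx.match(text, pos)
            if m and m.end() > pos:
                tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            # No rule matched: emit the single character as an unknown token.
            tokens.append(("UNKNOWN", text[pos]))
            pos += 1
    return tokens

# Hypothetical rule set; in practice this would be read from a file at startup,
# so a different language only needs a different file, not a new build.
EXAMPLE_RULES = [
    "# name<TAB>regex",
    "NUMBER\t\\d+",
    "WORD\t\\w+",
    "PUNCT\t[^\\w\\s]",
]

rules = load_rules(EXAMPLE_RULES)
result = tokenize("Praha 2009.", rules)
```

Swapping in another list of rule lines changes the tokenizer's behaviour while the code above stays untouched, which is the point of the change described in this release.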
Finished support for external definitions of rules; achieved PDT compatibility again.
I'm studying Unicode issues and the possibility of migrating from iswalpha etc. to Unicode properties.
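To show what classification by Unicode property looks like, as opposed to a locale-dependent call such as iswalpha, here is a small Python sketch using the standard `unicodedata` module. The function names are hypothetical and this is not the project's code; it only illustrates testing the General_Category of each code point.

```python
import unicodedata

def is_letter(ch):
    """True if the code point's General_Category is a letter (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")

def letter_runs(text):
    """Return maximal runs of letters, independent of the process locale."""
    # Normalize to NFC so that accented Czech letters are single code points
    # rather than a base letter followed by a combining mark.
    text = unicodedata.normalize("NFC", text)
    runs, current = [], []
    for ch in text:
        if is_letter(ch):
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs

words = letter_runs("Žluťoučký kůň, 42!")
```

Because the decision depends only on the Unicode Character Database, the result does not change with the locale settings of the machine running the tokenizer.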
Bitwise interpretation of token type. The Roman numerals flag is set directly when the token is created. Upper/lower/titlecase is a token property. The segmenter processes all rules at every position (unlike the tokenizer). LSSEQ introduced. lc() check supported.
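The bitwise token type and the case property can be pictured with the following Python sketch. The flag names, the `Token` class, and the Roman-numeral pattern are assumptions chosen for illustration and do not reflect the project's actual constants.

```python
import re
from dataclasses import dataclass
from enum import IntFlag

class TokenType(IntFlag):
    # Hypothetical flag values; each type occupies one bit so types can be combined.
    WORD = 1
    NUMBER = 2
    PUNCT = 4
    ROMAN = 8

# Strict uppercase Roman numerals; the pattern also matches "", so callers must
# reject empty strings themselves.
ROMAN_RE = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")

@dataclass
class Token:
    text: str
    type: TokenType
    case: str  # "upper", "lower", "title", or "other"

def case_of(text):
    if text.isupper():
        return "upper"
    if text.islower():
        return "lower"
    if text.istitle():
        return "title"
    return "other"

def make_token(text):
    """Build a token; the ROMAN bit is decided here, at creation time."""
    if text.isdigit():
        flags = TokenType.NUMBER
    elif text.isalpha():
        flags = TokenType.WORD
        if ROMAN_RE.fullmatch(text):
            flags |= TokenType.ROMAN
    else:
        flags = TokenType.PUNCT
    return Token(text, flags, case_of(text))

tok = make_token("XIV")
is_roman = bool(tok.type & TokenType.ROMAN)
```

A rule can then test a single bit (`tok.type & TokenType.ROMAN`) without caring which other bits are set, while case is read from a separate field rather than encoded in the type.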
Last release before migrating to external definitions of non-trivial tokenizer and segmenter rules.
See czechtok-0_3
<ul> <li>ct_numerals changed to ct_pdtcompat.</li> <li>Removed PDT compat rules for decimal commas and space-separated numbers.</li> <li>Fixed end-of-doc bugs.</li> <li>In PDT-like output, fixed a difference in &lt;f&gt; vs. &lt;d&gt; markup.</li> <li>-n changed to --no_pdt.</li> </ul>
Achieved final compatibility with PDT tokenization. Fixed an end-of-doc bug.