From: Jarek <lo...@ro...> - 2010-11-20 11:56:34
Hi All,

I have converted sentence tokenization for German to SRX rules. My aim was to stay as close to the original tokenizer as possible, so that the code passes the regression tests and can be improved later (some improved rules can be copied from other languages, more rules can be added, etc.). The current SRX ruleset passes the LanguageTool unit tests. I also compared the results of the current German tokenizer with SRX on a large text (Crime and Punishment, 1.5 MB of plain text, link at the bottom), adding a couple of unit test cases while fixing bugs, and the results are now almost identical. As a bonus, tokenization with the SRX rules is about 20 times faster than with the legacy code (1.5 sec. vs. 30 sec. for the mentioned book).

Most of the differences are in favor of SRX (spaces end up at the end of a sentence instead of at the beginning of the next one - I think this is because the segment tool handles sentence boundaries more robustly than the legacy mechanism), but there is one case where I am not sure. What was the intention behind this rule? Does it treat every full stop after a blank character (i.e. not directly after a word character) as no sentence break?

  <!-- e.g. "Das ist . so." - assume one sentence. -->
  <rule break="no">
    <beforebreak>\s([\.!?]{1,3}|…)['|"|«|\)|\]|\}]?\s</beforebreak>
    <afterbreak/>
  </rule>

The difference from the original tokenizer is as follows (# denotes a sentence boundary). Please tell me which result you think is correct (or better).

Input:

  Aber, wissen Sie, die Scherereien mit dem Umzug! ... Ich habe außerdem gerade als Advokat eine sehr wichtige Sache im Senat zu erledigen.

Result with legacy code:

  #Aber, wissen Sie, die Scherereien mit dem Umzug! #... #Ich habe außerdem gerade als Advokat eine sehr wichtige Sache im Senat zu erledigen. #

Result with SRX:

  #Aber, wissen Sie, die Scherereien mit dem Umzug! #... Ich habe außerdem gerade als Advokat eine sehr wichtige Sache im Senat zu erledigen.
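For anyone who wants to experiment with this rule outside LanguageTool, here is a minimal sketch in Python. The pattern was translated by hand from the <beforebreak> expression above, and the end-of-string anchoring (imitating how SRX matches the text that ends at a candidate break point) is my own assumption, not part of the attached ruleset:

```python
import re

# Approximation of the <beforebreak> pattern of the no-break rule above,
# translated by hand to Python regex syntax. In SRX the pattern is matched
# against the text ending at a candidate break point, which "$" imitates here.
NO_BREAK_BEFORE = re.compile(r"""\s([.!?]{1,3}|\u2026)['"«)\]}]?\s$""")

def suppresses_break(text_before_candidate) -> bool:
    """True if the no-break rule would veto a break right after this text."""
    return NO_BREAK_BEFORE.search(text_before_candidate) is not None

# Isolated full stop after a blank: the rule matches, so no break.
print(suppresses_break("Das ist . "))    # True
# "..." also follows a blank here, which is why SRX glues "... Ich habe"
# into one sentence in the example above.
print(suppresses_break("Umzug! ... "))   # True
# Ordinary sentence end (full stop directly after a word): no match.
print(suppresses_break("Das ist so. "))  # False
```

This makes the behaviour of the rule easy to probe: any punctuation run that is preceded by whitespace is treated as non-terminating, including the ellipsis case above.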
#

Currently, to retain maximum backwards compatibility with the legacy code, I forced the behaviour to be the same as before by not allowing this rule to apply directly after an end of sentence:

  <rule break="no">
    <beforebreak>(?<!([\.!?]{1,3}|…)['|"|«|\)|\]|\}]?\s{0,3})\s{1,3}([\.!?]{1,3}|…)['|"|«|\)|\]|\}]?\s</beforebreak>
    <afterbreak></afterbreak>
  </rule>

However, this violates the SRX standard and the LanguageTool guidelines for writing SRX rules (http://languagetool.wikidot.com/customizing-sentence-segmentation-in-srx-rules) by using a lookbehind construct - lookbehind should not be used because it is supported only by Java regexes, not by ICU regexes.

Another question: how else can these rules be tested? Does anyone have a standard set of texts that a new tokenizer should be tested against?

I am attaching the modified SRX file, the updated unit tests, and a piece of code that (together with the diff command) can be used to compare tokenization results. You can also download the test book (http://pub.loomchild.rootnode.net/crime_de.txt.zip) and the differences as diff output (http://pub.loomchild.rootnode.net/crime_de.diff.zip).

Thanks,
Jarek Lipski
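P.S. The kind of comparison described above can also be reproduced without the external diff command. A sketch under my own assumptions (the function names and the hand-made sentence lists are mine; the attached script works on the tokenizers' real output instead):

```python
import difflib

def boundary_lines(sentences):
    # One sentence per line, prefixed with '#' like the boundary notation above.
    return ["#" + s for s in sentences]

def compare_tokenizations(legacy, srx):
    """Return unified-diff lines between two sentence lists."""
    return list(difflib.unified_diff(
        boundary_lines(legacy), boundary_lines(srx),
        fromfile="legacy", tofile="srx", lineterm=""))

# The example from the discussion above, as hand-made sentence lists.
legacy = [
    "Aber, wissen Sie, die Scherereien mit dem Umzug!",
    "...",
    "Ich habe außerdem gerade als Advokat eine sehr wichtige Sache im Senat zu erledigen.",
]
srx = [
    "Aber, wissen Sie, die Scherereien mit dem Umzug!",
    "... Ich habe außerdem gerade als Advokat eine sehr wichtige Sache im Senat zu erledigen.",
]

# Prints only the sentences where the two tokenizers disagree.
for line in compare_tokenizations(legacy, srx):
    print(line)
```

Running this shows exactly the disagreement discussed above: the legacy tokenizer emits "..." as its own sentence, while SRX attaches it to the following one.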