From: Jarek <lo...@ro...> - 2010-11-20 11:56:34
Hi All,

I have converted sentence tokenization for German to SRX rules. My aim was to stay as close to the original tokenizer as possible, so that the code passes the regression tests and can be improved later (some improved rules can be copied from other languages, more rules can be added, etc.). The current SRX ruleset passes the LanguageTool unit tests. I also compared the results of the current German tokenizer with SRX on a large text (Crime and Punishment, 1.5 MB of plain text, link at the bottom), adding a couple of unit test cases while fixing bugs, and the results are now almost identical. As a bonus, tokenization with the SRX rules is about 20 times faster than with the legacy code (1.5 sec. vs. 30 sec. for the mentioned book).

Most of the differences are in favor of SRX (spaces end up at the end of a sentence instead of at the beginning of the next one - I think this is because the segment tool handles sentence boundaries more robustly than the legacy mechanism), but there is one case where I am not sure. What was the intention behind this rule? Does it treat every full stop after a blank character (i.e. not directly after a word character) as no sentence break?

  <!-- e.g. "Das ist . so." - assume one sentence. -->
  <rule break="no">
    <beforebreak>\s([\.!?]{1,3}|…)['|"|«|\)|\]|\}]?\s</beforebreak>
    <afterbreak/>
  </rule>

The difference from the original tokenizer is as follows (# denotes a sentence boundary). Please tell me which result you think is correct (or better).

Input:

  Aber, wissen Sie, die Scherereien mit dem Umzug! ... Ich habe außerdem gerade als Advokat eine sehr wichtige Sache im Senat zu erledigen.

Result with legacy code:

  #Aber, wissen Sie, die Scherereien mit dem Umzug! #... #Ich habe außerdem gerade als Advokat eine sehr wichtige Sache im Senat zu erledigen. #

Result with SRX:

  #Aber, wissen Sie, die Scherereien mit dem Umzug! #... Ich habe außerdem gerade als Advokat eine sehr wichtige Sache im Senat zu erledigen.
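For anyone who wants to experiment with this rule outside LanguageTool, here is a minimal sketch in Python. The pattern was translated by hand from the <beforebreak> expression above, and the end-of-string anchoring (imitating how SRX matches the text that ends at a candidate break point) is my own assumption, not part of the attached ruleset:

```python
import re

# Approximation of the <beforebreak> pattern of the no-break rule above,
# translated by hand to Python regex syntax. In SRX the pattern is matched
# against the text ending at a candidate break point, which "$" imitates here.
NO_BREAK_BEFORE = re.compile(r"""\s([.!?]{1,3}|\u2026)['"«)\]}]?\s$""")

def suppresses_break(text_before_candidate) -> bool:
    """True if the no-break rule would veto a break right after this text."""
    return NO_BREAK_BEFORE.search(text_before_candidate) is not None

# Isolated full stop after a blank: the rule matches, so no break.
print(suppresses_break("Das ist . "))    # True
# "..." also follows a blank here, which is why SRX glues "... Ich habe"
# into one sentence in the example above.
print(suppresses_break("Umzug! ... "))   # True
# Ordinary sentence end (full stop directly after a word): no match.
print(suppresses_break("Das ist so. "))  # False
```

This makes the behaviour of the rule easy to probe: any punctuation run that is preceded by whitespace is treated as non-terminating, including the ellipsis case above.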
#

Currently, to retain maximum backwards compatibility with the legacy code, I forced the behaviour to be the same as before by not allowing this rule to apply directly after an end of sentence:

  <rule break="no">
    <beforebreak>(?<!([\.!?]{1,3}|…)['|"|«|\)|\]|\}]?\s{0,3})\s{1,3}([\.!?]{1,3}|…)['|"|«|\)|\]|\}]?\s</beforebreak>
    <afterbreak></afterbreak>
  </rule>

However, this violates the SRX standard and the LanguageTool guidelines for writing SRX rules (http://languagetool.wikidot.com/customizing-sentence-segmentation-in-srx-rules) by using a lookbehind construct - lookbehind should not be used because it is supported only by Java regexes, not by ICU regexes.

Another question: how else can these rules be tested? Does anyone have a standard set of texts that a new tokenizer should be tested against?

I am attaching the modified SRX file, the updated unit tests, and a piece of code that (together with the diff command) can be used to compare tokenization results. You can also download the test book (http://pub.loomchild.rootnode.net/crime_de.txt.zip) and the differences as diff output (http://pub.loomchild.rootnode.net/crime_de.diff.zip).

Thanks,
Jarek Lipski
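P.S. The kind of comparison described above can also be reproduced without the external diff command. A sketch under my own assumptions (the function names and the hand-made sentence lists are mine; the attached script works on the tokenizers' real output instead):

```python
import difflib

def boundary_lines(sentences):
    # One sentence per line, prefixed with '#' like the boundary notation above.
    return ["#" + s for s in sentences]

def compare_tokenizations(legacy, srx):
    """Return unified-diff lines between two sentence lists."""
    return list(difflib.unified_diff(
        boundary_lines(legacy), boundary_lines(srx),
        fromfile="legacy", tofile="srx", lineterm=""))

# The example from the discussion above, as hand-made sentence lists.
legacy = [
    "Aber, wissen Sie, die Scherereien mit dem Umzug!",
    "...",
    "Ich habe außerdem gerade als Advokat eine sehr wichtige Sache im Senat zu erledigen.",
]
srx = [
    "Aber, wissen Sie, die Scherereien mit dem Umzug!",
    "... Ich habe außerdem gerade als Advokat eine sehr wichtige Sache im Senat zu erledigen.",
]

# Prints only the sentences where the two tokenizers disagree.
for line in compare_tokenizations(legacy, srx):
    print(line)
```

Running this shows exactly the disagreement discussed above: the legacy tokenizer emits "..." as its own sentence, while SRX attaches it to the following one.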