From: Marcin M. <lis...@wp...> - 2010-11-29 21:43:50
Hi all,

I committed the rules to the repository. It seems that only Danish and
Czech are still using the old sentence tokenizer. I think we could get rid
of them as well for the next release. SRX splitters are easier to maintain
and much faster (around 20 times). To answer Daniel's query: yes, the
abbreviations are on different lines only for clarity. The SRX tokenizer
glues them into a single regex anyway while reading from the XML.

Regards
Marcin

2010/11/28 Jarek <lo...@ro...>

> Hi All,
>
> I have converted sentence tokenization for German to SRX rules. My aim was
> to stay as close to the original tokenizer as possible, so the code will
> pass the regression tests and can be improved later (some improved rules
> can be copied from other languages, more rules can be added, etc.).
>
> The current SRX ruleset passes the LanguageTool unit tests. I also
> compared the results of the current German tokenizer with SRX on a large
> text (Crime and Punishment, 1.5 MB of plain text, link at the bottom),
> adding a couple of unit test cases in the process of bugfixing, and they
> are now almost identical. As a bonus, tokenization with SRX rules is about
> 20 times faster than with the legacy code (1.5 sec. versus 30 sec. for the
> mentioned book). Most of the differences are in favor of SRX (spaces end
> up at the end of a sentence instead of at the beginning - I think that's
> because the segment tool handles sentence boundaries more robustly than
> the legacy mechanism), but there is one case where I am not sure.
>
> What was the intention behind this rule? Does it treat every full stop
> after a blank character (i.e. not directly after a word character) as no
> sentence break?
>
> <!-- e.g. "Das ist . so." - assume one sentence. -->
> <rule break="no">
>   <beforebreak>\s([\.!?]{1,3}|…)['|"|«|\)|\]|\}]?\s</beforebreak>
>   <afterbreak/>
> </rule>
>
> The difference from the original tokenizer is as follows (# denotes a
> sentence boundary). Please tell me which one you think is correct (or
> better).
> Input:
> Aber, wissen Sie, die Scherereien mit dem Umzug! ... Ich habe außerdem
> gerade als Advokat eine sehr wichtige Sache im Senat zu erledigen.
>
> Result with legacy code:
> #Aber, wissen Sie, die Scherereien mit dem Umzug! #... #Ich habe außerdem
> gerade als Advokat eine sehr wichtige Sache im Senat zu erledigen. #
>
> Result with SRX:
> #Aber, wissen Sie, die Scherereien mit dem Umzug! #... Ich habe außerdem
> gerade als Advokat eine sehr wichtige Sache im Senat zu erledigen. #
>
> Currently, to retain maximum backwards compatibility with the legacy
> code, I forced the behaviour to be the same as before by not allowing
> this rule to apply directly after an end of sentence:
>
> <rule break="no">
>   <beforebreak>(?<!([\.!?]{1,3}|…)['|"|«|\)|\]|\}]?\s{0,3})\s{1,3}([\.!?]{1,3}|…)['|"|«|\)|\]|\}]?\s</beforebreak>
>   <afterbreak></afterbreak>
> </rule>
>
> But this violates the SRX standard and the LanguageTool guidelines for
> writing SRX rules
> (http://languagetool.wikidot.com/customizing-sentence-segmentation-in-srx-rules)
> by using a lookbehind construct - it should not be used because it is
> only supported by Java regexes, not by ICU regexes.
>
> Another question: how else can these rules be tested? Does someone have a
> standard set of texts that a new tokenizer should be tested against?
>
> I am attaching an SRX file containing only the German rules (to reduce
> email size), updated unit tests, and a piece of code that (together with
> the diff command) can be used to compare tokenization results. You can
> also download the test book
> (http://pub.loomchild.rootnode.net/crime_de.txt.zip) and the differences
> as diff output (http://pub.loomchild.rootnode.net/crime_de.diff.zip).
>
> Thanks,
> Jarek Lipski
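The "gluing" Marcin describes - abbreviation alternatives kept on separate
lines in the XML for readability, then joined into a single regex at load
time - can be sketched like this. This is a hypothetical illustration
(the `glue` method and the sample list are mine, not LanguageTool's actual
code), but it shows the principle: join the literals with `|` into one
alternation and compile once.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class GlueAbbreviations {
    // Hypothetical sketch: quote each abbreviation so dots are literal,
    // then join the entries into a single alternation "\b(a|b|c)\.$".
    static Pattern glue(List<String> abbreviations) {
        String joined = abbreviations.stream()
            .map(Pattern::quote)                  // treat "z.B" literally
            .collect(Collectors.joining("|", "\\b(", ")\\.$"));
        return Pattern.compile(joined);
    }

    public static void main(String[] args) {
        Pattern p = glue(List.of("z.B", "Dr", "usw"));
        System.out.println(p.matcher("Dr.").find());   // true: known abbreviation
        System.out.println(p.matcher("Haus.").find()); // false: ordinary word
    }
}
```

Keeping one alternative per XML line and compiling a single pattern gives
the maintainability of a list with the speed of one regex pass, which is
consistent with the ~20x speedup reported above.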
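To answer the question about what the quoted `beforebreak` pattern
actually matches: a quick check with `java.util.regex` (which is what
Java SRX implementations use; the class name here is mine) confirms the
reading that any punctuation run preceded by whitespace - rather than by a
word character - suppresses the break. Note that inside the character
class `['|"|«|\)|\]|\}]` the `|` characters are literals, not alternation.

```java
import java.util.regex.Pattern;

public class BeforeBreakDemo {
    // The beforebreak pattern from the quoted rule, verbatim:
    // whitespace, a run of 1-3 sentence-ending marks (or an ellipsis),
    // an optional closing quote/bracket, then whitespace.
    static final Pattern BEFORE_BREAK = Pattern.compile(
        "\\s([\\.!?]{1,3}|…)['|\"|«|\\)|\\]|\\}]?\\s");

    public static void main(String[] args) {
        // "Das ist . so." - the full stop follows a blank, so the
        // rule matches and the break is suppressed (one sentence).
        System.out.println(BEFORE_BREAK.matcher("Das ist . so.").find());    // true

        // Here every full stop directly follows a word character,
        // so the pattern never matches and breaks are kept.
        System.out.println(BEFORE_BREAK.matcher("Das ist gut. So.").find()); // false
    }
}
```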
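Jarek's backwards-compatible variant can also be exercised directly in
Java (class name mine, pattern copied from the quoted rule). Java accepts
the lookbehind because every quantifier inside it is bounded (`{1,3}`,
`{0,3}`); as the email notes, ICU-based SRX implementations do not support
lookbehind, which is why the rule violates the guidelines even though it
works here. The demo shows it restoring the legacy break after "..." while
leaving the "Das ist . so." behaviour intact.

```java
import java.util.regex.Pattern;

public class LookbehindDemo {
    // The negative lookbehind blocks the no-break rule when the
    // punctuation run itself follows an end of sentence.
    static final Pattern NO_BREAK = Pattern.compile(
        "(?<!([\\.!?]{1,3}|…)['|\"|«|\\)|\\]|\\}]?\\s{0,3})"
        + "\\s{1,3}([\\.!?]{1,3}|…)['|\"|«|\\)|\\]|\\}]?\\s");

    public static void main(String[] args) {
        // "..." directly after "Umzug!" follows a sentence end: the
        // lookbehind suppresses the rule, so the break survives,
        // matching the legacy "#... #Ich" result.
        System.out.println(NO_BREAK.matcher("Umzug! ... Ich").find());  // false

        // A lone "." inside a sentence still triggers the no-break rule.
        System.out.println(NO_BREAK.matcher("Das ist . so.").find());   // true
    }
}
```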