Lucene tokenizers can behave differently depending on a compatibility version parameter, such as LUCENE_29 for version 2.9. The aggressiveness of stemming, and probably other things as well, can depend on this parameter.
Different behaviors may be more or less useful for OmegaT's purposes, so it would be convenient to be able to select a particular Lucene version.
In this RFE I will add "Tokenizer Behavior" selectors to the project property window next to the Tokenizer selectors. Each tokenizer can have its own preferred version (default is LUCENE_CURRENT), but if the user overrides this then their selection is remembered on a per-tokenizer basis.
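As a rough illustration of the lookup order described above (user override first, then the tokenizer's own preferred version, then LUCENE_CURRENT), here is a minimal sketch. The class and method names are hypothetical and are not taken from OmegaT's code; tokenizers are identified by plain strings purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not OmegaT's actual implementation.
final class TokenizerBehaviorPrefs {
    // Used when neither the user nor the tokenizer specifies anything.
    static final String FALLBACK = "LUCENE_CURRENT";

    // Version each tokenizer prefers by default, keyed by tokenizer name.
    private final Map<String, String> preferred = new HashMap<>();
    // Versions explicitly chosen by the user, remembered per tokenizer.
    private final Map<String, String> userOverrides = new HashMap<>();

    void setPreferred(String tokenizer, String behavior) {
        preferred.put(tokenizer, behavior);
    }

    void setUserOverride(String tokenizer, String behavior) {
        userOverrides.put(tokenizer, behavior);
    }

    // Resolution order: user override, then tokenizer preference, then fallback.
    String behaviorFor(String tokenizer) {
        String chosen = userOverrides.get(tokenizer);
        if (chosen != null) {
            return chosen;
        }
        return preferred.getOrDefault(tokenizer, FALLBACK);
    }
}
```

Because overrides are keyed by tokenizer, switching a project to another tokenizer and back would bring the earlier selection back rather than resetting it.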
Behaviors can also be specified on the command line via the flags --ITokenizerBehavior and --ITokenizerTargetBehavior. The values must be one of (as of Lucene 3.6.2):
LUCENE_20, LUCENE_21, LUCENE_22, LUCENE_23, LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31, LUCENE_32, LUCENE_33, LUCENE_34, LUCENE_35, LUCENE_36, LUCENE_CURRENT
The exact meaning of each version differs by tokenizer. At the moment the only way to determine this is to inspect the Lucene source code (example: GermanAnalyzer).
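For illustration, a value passed to --ITokenizerBehavior or --ITokenizerTargetBehavior could be checked against the Lucene 3.6.2 list along these lines. This is only a sketch: the helper class is hypothetical, and falling back to LUCENE_CURRENT when the flag is absent is an assumption based on the default mentioned earlier. Real code would more likely rely on Lucene's own version constants than on a hard-coded set of strings.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper; not OmegaT's actual implementation.
final class TokenizerBehaviorFlag {
    // Assumed fallback when the flag is not given.
    static final String DEFAULT = "LUCENE_CURRENT";

    // Allowed values as of Lucene 3.6.2, in the order listed in this RFE.
    static final Set<String> ALLOWED = new LinkedHashSet<>(Arrays.asList(
            "LUCENE_20", "LUCENE_21", "LUCENE_22", "LUCENE_23", "LUCENE_24",
            "LUCENE_29", "LUCENE_30", "LUCENE_31", "LUCENE_32", "LUCENE_33",
            "LUCENE_34", "LUCENE_35", "LUCENE_36", "LUCENE_CURRENT"));

    private TokenizerBehaviorFlag() {
    }

    // Returns the validated value, or DEFAULT if the flag is missing or blank.
    static String parse(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT;
        }
        String value = raw.trim();
        if (!ALLOWED.contains(value)) {
            throw new IllegalArgumentException(
                    "Unknown tokenizer behavior: " + value + "; expected one of " + ALLOWED);
        }
        return value;
    }
}
```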
Volunteers who wish to help document this are welcome. Please provide a list of relevant versions (i.e. versions in which a major behavior change occurred), and a very short description of that change. E.g. for German:
These labels will be added to the behavior selector dropdowns.
The tokenizer's default behavior is also marked with a "*" prefix in the selector.
For French, the different algorithms are:
2.0 = Porter
3.1 = Snowball
3.6 = UniNE
Didier
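To make the labelling idea concrete, here is a small sketch that combines the French list above with the "*" convention for the default entry. The class name and the exact label format ("VERSION (algorithm)") are assumptions for illustration only, and the default version is passed in as a parameter because it had not been decided at this point in the thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the label format is an assumption, not OmegaT's actual UI text.
final class BehaviorLabels {
    private BehaviorLabels() {
    }

    // French stemming algorithms by Lucene version, as listed in this thread.
    static Map<String, String> frenchAlgorithms() {
        Map<String, String> algorithms = new LinkedHashMap<>();
        algorithms.put("LUCENE_20", "Porter");
        algorithms.put("LUCENE_31", "Snowball");
        algorithms.put("LUCENE_36", "UniNE");
        return algorithms;
    }

    // Builds one dropdown label per version; the default one is prefixed with "*".
    static List<String> dropdownLabels(Map<String, String> descriptions, String defaultVersion) {
        List<String> labels = new ArrayList<>();
        for (Map.Entry<String, String> entry : descriptions.entrySet()) {
            String prefix = entry.getKey().equals(defaultVersion) ? "*" : "";
            labels.add(prefix + entry.getKey() + " (" + entry.getValue() + ")");
        }
        return labels;
    }
}
```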
Thanks, Didier.
Do you have an opinion as to which version should be default? I would suggest the highest version using the best algorithm (e.g. if you like Porter, it would be 3.0; if Snowball, 3.5).
I replied on the development mailing list.
I have not set any default (i.e., LUCENE_CURRENT is the default), because I don't have any preference (we need the opinion of translators who translate from French). Porter was not bad, but was by no means perfect, which means Snowball or UniNE might be better.
Didier
Implemented in the released version 3.0.2 of OmegaT.
Didier