#866 Allow user to adjust tokenizer behavior

3.0
closed-fixed
5
2013-05-20
2013-05-20
No

Lucene tokenizers can behave differently depending on a compatibility version parameter (for example, LUCENE_29 for version 2.9). The aggressiveness of stemming, and probably other things as well, can depend on this parameter.

Different behaviors may be more or less useful for OmegaT's purposes, so it would be convenient to be able to select a particular Lucene version.

In this RFE I will add "Tokenizer Behavior" selectors to the project property window next to the Tokenizer selectors. Each tokenizer can have its own preferred version (default is LUCENE_CURRENT), but if the user overrides this then their selection is remembered on a per-tokenizer basis.
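The lookup order described above (user override first, then the tokenizer's preferred version, then LUCENE_CURRENT) could be sketched roughly as follows. This is only an illustration: the class BehaviorPreferences, its method names, and the tokenizer name used as a key are hypothetical and are not taken from the OmegaT code base.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not an actual OmegaT class. */
class BehaviorPreferences {
    /** Fallback when neither the tokenizer nor the user specifies a version. */
    static final String DEFAULT_BEHAVIOR = "LUCENE_CURRENT";

    // Preferred version declared by each tokenizer, keyed by tokenizer name.
    private final Map<String, String> preferred = new HashMap<>();
    // Version chosen by the user, remembered separately for each tokenizer.
    private final Map<String, String> overrides = new HashMap<>();

    void setPreferred(String tokenizer, String behavior) {
        preferred.put(tokenizer, behavior);
    }

    void setOverride(String tokenizer, String behavior) {
        overrides.put(tokenizer, behavior);
    }

    /** User override wins, then the tokenizer's preference, then the default. */
    String effectiveBehavior(String tokenizer) {
        String chosen = overrides.get(tokenizer);
        if (chosen != null) {
            return chosen;
        }
        String own = preferred.get(tokenizer);
        return own != null ? own : DEFAULT_BEHAVIOR;
    }
}
```

Keeping overrides in their own map means that switching to another tokenizer and back does not lose the user's earlier choice.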

Behaviors can also be specified on the command line via the flags --ITokenizerBehavior and --ITokenizerTargetBehavior. Each value must be one of the following (as of Lucene 3.6.2):

  • LUCENE_20
  • LUCENE_21
  • LUCENE_22
  • LUCENE_23
  • LUCENE_24
  • LUCENE_29
  • LUCENE_30
  • LUCENE_31
  • LUCENE_32
  • LUCENE_33
  • LUCENE_34
  • LUCENE_35
  • LUCENE_36
  • LUCENE_CURRENT
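As a rough sketch, a command-line value could be checked against the list above like this. The class BehaviorFlag and its parse method are hypothetical, and the `--flag=value` form is an assumption made for the example; this ticket does not state how OmegaT actually separates the flag from its value.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical validator for illustration; not an actual OmegaT class. */
class BehaviorFlag {
    /** The values listed in this ticket (as of Lucene 3.6.2). */
    static final Set<String> ALLOWED = new LinkedHashSet<>(Arrays.asList(
            "LUCENE_20", "LUCENE_21", "LUCENE_22", "LUCENE_23", "LUCENE_24",
            "LUCENE_29", "LUCENE_30", "LUCENE_31", "LUCENE_32", "LUCENE_33",
            "LUCENE_34", "LUCENE_35", "LUCENE_36", "LUCENE_CURRENT"));

    /**
     * Returns the value given for the flag, or null if the flag is absent.
     * Assumes the form --flag=value. Throws if the value is not in ALLOWED.
     */
    static String parse(String[] args, String flag) {
        String prefix = "--" + flag + "=";
        for (String arg : args) {
            if (arg.startsWith(prefix)) {
                String value = arg.substring(prefix.length());
                if (!ALLOWED.contains(value)) {
                    throw new IllegalArgumentException(
                            "Unknown behavior '" + value + "'; expected one of " + ALLOWED);
                }
                return value;
            }
        }
        return null;
    }
}
```

For example, under that assumption, `BehaviorFlag.parse(args, "ITokenizerBehavior")` returns "LUCENE_31" when args contains `--ITokenizerBehavior=LUCENE_31`.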

The exact meaning of each version differs by tokenizer. At the moment the only way to determine this is to inspect the Lucene source code (example: GermanAnalyzer).

Volunteers who wish to help document this are welcome. Please provide a list of relevant versions (i.e. versions in which a major behavior change occurred), and a very short description of that change. E.g. for German:

  • 2.0 = Caumanns
  • 3.1 = Snowball
  • 3.6 = UniNE

These labels will be added to the behavior selector dropdowns.
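Since each listed version marks where a behavior change occurred, the label for any selectable version can be taken from the nearest listed version at or below it. The following is a minimal sketch of that lookup using the German example above; the class BehaviorLabels, its method, and the choice to resolve LUCENE_CURRENT to the newest listed entry are assumptions made for illustration, not OmegaT's implementation.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical label lookup for illustration; not an actual OmegaT class. */
class BehaviorLabels {
    /** German example from this ticket, keyed by version digits (31 = 3.1). */
    static final TreeMap<Integer, String> GERMAN = new TreeMap<>();
    static {
        GERMAN.put(20, "Caumanns");
        GERMAN.put(31, "Snowball");
        GERMAN.put(36, "UniNE");
    }

    /**
     * Returns the label in effect for a constant such as "LUCENE_35",
     * or null if the constant predates every listed change.
     */
    static String labelFor(String constant, TreeMap<Integer, String> changes) {
        if (constant.equals("LUCENE_CURRENT")) {
            // Assumption: "current" resolves to the newest listed change.
            return changes.lastEntry().getValue();
        }
        int version = Integer.parseInt(constant.substring("LUCENE_".length()));
        // Nearest listed change at or below the requested version.
        Map.Entry<Integer, String> entry = changes.floorEntry(version);
        return entry == null ? null : entry.getValue();
    }
}
```

With the German table, LUCENE_30 maps to Caumanns and LUCENE_35 to Snowball, because the next change is not listed until 3.1 and 3.6 respectively.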

Discussion

  • Didier Briel

    Didier Briel - 2013-05-20
    • Group: future --> 3.0
     
  • Aaron Madlon-Kay

    Also, the default behavior for the tokenizer is prefixed with "*" in the selector.

     
  • Didier Briel

    Didier Briel - 2013-05-20

    For French, the different algorithms are:
    2.0 = Porter
    3.1 = Snowball
    3.6 = UniNE

    Didier

     
  • Aaron Madlon-Kay

    Thanks, Didier.

    Do you have an opinion as to which version should be default? I would suggest the highest version using the best algorithm (e.g. if you like Porter, it would be 3.0; if Snowball, 3.5).

     
    • Didier Briel

      Didier Briel - 2013-05-20

      I replied in the development mailing list.

      I have not set any default (i.e., LUCENE_CURRENT is the default), because I don't have any preference (we need the opinion of translators translating from French). Porter was not bad, but was by no means perfect, which means Snowball or UniNE might be better.

      Didier

       
  • Didier Briel

    Didier Briel - 2013-05-20
    • labels: --> OmegaT Application
    • status: open-fixed --> closed-fixed
     
  • Didier Briel

    Didier Briel - 2013-05-20

    Implemented in the released version 3.0.2 of OmegaT.

    Didier

     
