#991 Strip special strings before passing text to spellchecker script


Right now, the spellchecker produces too many false positives to be useful for UI translations. This is mainly because OmegaT doesn't strip any special strings before passing the words to the tokenizer/spellchecker. Without this stripping we end up with thousands of errors like 'nSomething' (because the text contains 'Something\nSomething') or 'some' and 'thing' (because the text contains 'Some&thing'), and so on.

What I mean by special strings is:
- any user-specified tags (Options - Tag validation - custom tags re)
- escape sequences, i.e. \n, \r, \t etc.
- mnemonic chars, i.e. &, _, ~ etc., used to specify access keys


  • cienislaw

    cienislaw - 2014-05-07

    Custom tags are covered:

         * Breaks a string into word-only tokens. Numbers, tags, and other non-word
         * tokens are NOT included in the result. Stemming must NOT be used.
         * This method used to tokenize string to check spelling and to switch case.
         * There is no sense to cache results.
        Token[] tokenizeWordsForSpelling(String str);

    As for the other suggestions, removing those characters and escape sequences is no problem. Can you provide a complete list of the mnemonic chars you want removed?

  • khagaroth

    khagaroth - 2014-05-07

    &, _ and ~ should cover 99% of all cases. I have also seen combinations of _& or ~&, but those are rare.

    As for the custom tags: despite the quoted comment, they still seem to be included in the spellcheck. I have set \n as a custom tag (it's correctly highlighted and not editable, so it is detected as a tag), and it still ends up being spellchecked as part of the next word, i.e. \nSomething is spellchecked as nSomething.

  • cienislaw

    cienislaw - 2014-05-07

    Hmm, strange. I will have to look into it more closely, as it works very well for marking spelling errors in the editor.

    your requests are implemented in /trunk.

  • Didier Briel

    Didier Briel - 2014-05-08

    "your requests are implemented in /trunk."

    Are you sure there isn't some sort of confusion here? I see changes in spellcheck.groovy, but I don't see any change in OmegaT's internal spell checking, where I would be most interested in seeing at least '&' ignored (for instance, when translating OmegaT). For \n, etc., that's less important, because I usually use segmentation rules.

    I don't see khagaroth having talked about the spell checking script.


  • khagaroth

    khagaroth - 2014-05-08

    Actually I did mean the spellchecking script mostly, but the inline spellchecking is important as well.

    Now for the changes in spellcheck.groovy: they don't work at all. Worse, they break things that previously kind of worked, i.e. <b>Someting now ends up as bSometing, whereas previously the tag wasn't ignored completely as it should be, but at least it was spellchecked on its own as an orphan 'b'. It looks like the stripping order is wrong. It should first remove complete tags and only then any remaining special strings/chars. Right now it seems to strip chars like < > \ etc. first, and then it obviously doesn't see any tags that use these chars.
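
    The ordering concern raised here can be reproduced with a small Java sketch. The tag pattern and the character set below are assumptions made for illustration only; they are not taken from spellcheck.groovy, and the following reply disputes that ordering was the actual cause.

```java
import java.util.regex.Pattern;

public class StripOrderDemo {
    // Hypothetical patterns, for illustration only (not the script's own).
    static final Pattern TAG = Pattern.compile("</?[a-zA-Z][a-zA-Z0-9]*/?>");
    static final Pattern SPECIAL_CHARS = Pattern.compile("[<>\\\\&~]");

    // Characters first: the angle brackets vanish, so the tag pattern can no longer match.
    public static String charsThenTags(String s) {
        String noChars = SPECIAL_CHARS.matcher(s).replaceAll("");
        return TAG.matcher(noChars).replaceAll(" ").trim();
    }

    // Whole tags first, leftover special characters second.
    public static String tagsThenChars(String s) {
        String noTags = TAG.matcher(s).replaceAll(" ");
        return SPECIAL_CHARS.matcher(noTags).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String text = "<b>Someting";
        System.out.println(charsThenTags(text)); // bSometing
        System.out.println(tagsThenChars(text)); // Someting
    }
}
```

    With characters stripped first, the 'b' from the tag is glued to the next word; with tags stripped first, only the word itself reaches the spellchecker.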

  • Kos Ivantsov

    Kos Ivantsov - 2014-05-08

    I've been puzzled too, because the request led me to think it's going to affect the Editor. With that said, I do want to thank cienislaw for a great GUI-configurable script!

  • cienislaw

    cienislaw - 2014-05-08

    There is nothing wrong with the stripping order; removal of OmegaT tags and user-defined tags simply isn't working yet. That part works in the internal spellchecker, so I have not touched it yet. I agree that with removal of escaped and mnemonic chars active, the script can produce worse results than before. Just turn off those features in the script or via the GUI, and/or edit the list of removed strings in the script code. I'm traveling now and can't address tag stripping, but I will add it during the weekend.

  • cienislaw

    cienislaw - 2014-05-14

    After closer checking I have to change my previous statement: custom tags are not covered in spellchecking, neither in OmegaT itself nor in script v0.2 and earlier. Actually, PatternConsts.getCustomTagPattern() is called only once in Tag Validation and indirectly in PatternConsts.getPlaceholderPattern() at project load in RealProject. A brief look at the tokenizers shows that DefaultTokenizer and the Lucene-based ones are aware of OmegaT-like tags. That does not mean what I've stated above is 100% true. This is a new field of interest for me in OmegaT, and I surely have not found all the information.

    You are probably asking yourself: OK, enough of this mumbo jumbo, what about making the script do what I want? The cleanup part now looks like this:
    1. Remove fragments defined for removal in Tag Validation.
    2. Remove OmegaT tag groups which are preceded and followed by a lowercase Unicode letter. These are tags which split proper words in half, turning them into errors.
    3. Replace all other OmegaT tags with a space. (This is currently disabled, as tokenizers can handle such tags properly.)
    4. Replace custom tags with a space.
    5. Replace escaped chars with a space.
    6. Remove mnemonic chars.

    Steps 2 and 3 (or just step 3) could be upgraded to cover OmegaT tags, normal and simple printf variables, and simple Java messages. With some more work I think they could include custom tags as well. And here is the tricky part: these steps are based on my experience, and what is good for me doesn't have to be good for you. So even with a fancy GUI where you can turn some features on and off, there can be situations where modifying the script's source code will be a must. I've cut the mnemonic chars down to just &~, because the set contained too many chars. Most chars from the previous set are handled properly by the tokenizers and are used inside a single word only in very rare situations. Of course you can add more if needed; just be aware that adding too many, especially in combination with other switches, may produce 'wrong' results. For example, for the source W.E.N.D.I.G.0, removing . produces the non-existent word WENDIG (the digit is removed), which is not what I wanted.
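
    The six cleanup steps listed earlier in this post can be sketched in Java roughly as follows. This is not the code from spellcheck.groovy: the OmegaT tag regex, the class and method names, and the [[CR]] custom tag pattern are all assumptions for illustration, and step 3 is applied here even though the post says it is currently disabled in the script.

```java
import java.util.regex.Pattern;

public class SpellcheckCleanup {
    // Assumed shape of OmegaT tags such as <x11/>, <b0>, </b0>.
    static final String OMEGAT_TAG = "</?[a-zA-Z]+[0-9]+/?>";
    // Step 2: tag groups sitting between two lowercase letters.
    static final Pattern TAG_INSIDE_WORD =
            Pattern.compile("(?<=\\p{Ll})(?:" + OMEGAT_TAG + ")+(?=\\p{Ll})");
    static final Pattern ANY_OMEGAT_TAG = Pattern.compile(OMEGAT_TAG);
    // Step 5: a literal backslash followed by one escape letter.
    static final Pattern ESCAPED_CHARS = Pattern.compile("\\\\[abfnrtv]");
    // Step 6: the reduced mnemonic set mentioned in the post.
    static final Pattern MNEMONICS = Pattern.compile("[&~]");

    // removePattern and customTags stand in for the user's Tag Validation
    // settings (hypothetical parameters); either may be null.
    public static String clean(String text, Pattern removePattern, Pattern customTags) {
        String s = text;
        if (removePattern != null) {
            s = removePattern.matcher(s).replaceAll("");      // step 1
        }
        s = TAG_INSIDE_WORD.matcher(s).replaceAll("");         // step 2: glue the word
        s = ANY_OMEGAT_TAG.matcher(s).replaceAll(" ");         // step 3
        if (customTags != null) {
            s = customTags.matcher(s).replaceAll(" ");         // step 4
        }
        s = ESCAPED_CHARS.matcher(s).replaceAll(" ");          // step 5
        s = MNEMONICS.matcher(s).replaceAll("");               // step 6
        return s;
    }

    public static void main(String[] args) {
        // "\\n" in Java source is a literal backslash followed by 'n'.
        String src = "Wypus<x11/>zczam proto&typ W.E.N.D.I.G.0. %s \\ntest\\n [[CR]] (jeszcze)";
        Pattern custom = Pattern.compile("\\[\\[CR\\]\\]"); // assumed custom tag
        System.out.println(clean(src, null, custom));
    }
}
```

    The order matters in two places: step 2 has to run before step 3 so that a tag inside a word glues the halves instead of splitting them, and custom tags such as \n have to be handled before the generic escape-sequence step if they need different treatment.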

    I don't know yet how to carry the script's behavior over to OmegaT, but removal or replacement by a space is certainly not an option. The target string's length and the position of its substrings cannot be changed, and replacing with a space just gives us more tokens when in some situations the strings before and after should be glued together. If this isn't handled, markers and other things will simply break.

    If someone is curious how tokenization works for different tokenizers, below is an example:

    SourceText pl-PL: Wypus<x11/>zczam proto&typ W.E.N.D.I.G.0. %s \ntest\n [[CR]] (jeszcze)

    DefaultTokenizer: Wypus zczam proto&typ W.E.N.D.I.G s ntest n CR jeszcze

    HunspellTokenizer: Wypus zczam proto typ W.E.N.D.I.G s ntest n CR jeszcze

    As you can see above, both tokenizers know how to handle <tag/>. Default leaves & in place, while Hunspell breaks at it. Both break at % and \, which gives two lone tokens 's' and 'n', and both glue the n from \n to test. Single letters always pass the spellcheck, so typical printf variables (%d, %s and so on) will not be a problem either. Any of []{}() are also break points. Hunspell and the other Lucene tokenizers can probably be tweaked to mimic DefaultTokenizer's behavior for & and other important mnemonic chars, which means that in OmegaT those chars could be removed after tokenization but before the word is spellchecked. That way it won't cause problems with markers.
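
    The idea in the last sentence above (strip mnemonics per token, after tokenization, so that offsets into the original text stay valid for markers) might look like the following sketch. Tok and the WORD pattern are hypothetical stand-ins and are not OmegaT's Token class or any of its tokenizers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MnemonicAwareTokens {
    // Hypothetical token: offset and length always refer to the ORIGINAL text.
    public static final class Tok {
        public final int offset;
        public final int length;
        public final String wordToCheck;

        Tok(int offset, int length, String wordToCheck) {
            this.offset = offset;
            this.length = length;
            this.wordToCheck = wordToCheck;
        }
    }

    // Stand-in tokenizer: letter runs that may carry & or ~ (at least one letter).
    static final Pattern WORD = Pattern.compile("[\\p{L}&~]*\\p{L}[\\p{L}&~]*");

    public static List<Tok> tokenize(String text) {
        List<Tok> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            // Only the copy handed to the spellchecker loses the mnemonic chars;
            // the recorded span in the source text is left untouched.
            String cleaned = m.group().replaceAll("[&~]", "");
            tokens.add(new Tok(m.start(), m.end() - m.start(), cleaned));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Tok t : tokenize("proto&typ &File")) {
            System.out.println(t.offset + "+" + t.length + " -> " + t.wordToCheck);
        }
    }
}
```

    Because the text itself is never modified, a marker drawn from offset and length still underlines the whole of proto&typ, while the spellchecker only ever sees prototyp.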

    Last edit: cienislaw 2014-05-14
  • khagaroth

    khagaroth - 2014-05-16
    ESCAPED_CHARACTERS_REGEX = "\\[abfnrtv]"

    should be

    ESCAPED_CHARACTERS_REGEX = "\\\\[abfnrtv]"
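
    The difference between the two literals comes from double escaping: the string literal consumes one level of backslashes and the regex engine another. A small Java check (the class and method names are made up for this example; Groovy double-quoted strings are assumed to follow the same escaping rules):

```java
import java.util.regex.Pattern;

public class EscapedCharsRegexDemo {
    // Regex seen by the engine: \[abfnrtv]  -> matches the literal text "[abfnrtv]"
    public static final String BROKEN = "\\[abfnrtv]";
    // Regex seen by the engine: \\[abfnrtv] -> a backslash followed by one of a b f n r t v
    public static final String FIXED = "\\\\[abfnrtv]";

    public static String strip(String regex, String text) {
        return Pattern.compile(regex).matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // The text holds a literal backslash followed by 'n', as in a UI resource file.
        String text = "Something\\nSomething";
        System.out.println(strip(BROKEN, text)); // unchanged: Something\nSomething
        System.out.println(strip(FIXED, text));  // Something Something
    }
}
```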
  • cienislaw

    cienislaw - 2014-05-16

    You are right, thanks. I had \n in my custom tags and overlooked this while testing. Fixed in /trunk.

  • cienislaw

    cienislaw - 2014-05-16

    Backreferences in the OmegaT tag removal also weren't working properly. I need to triple-check everything next time. Fixed in /trunk.

  • Didier Briel

    Didier Briel - 2014-05-19
    • summary: Strip special strings before pasing text to spellchecker --> Strip special strings before pasing text to spellchecker script
    • assigned_to: cienislaw
    • Group: future --> 3.1
  • Didier Briel

    Didier Briel - 2014-05-19

    Implemented in SVN (/trunk).


  • Didier Briel

    Didier Briel - 2014-05-19
    • status: open --> open-fixed
  • Didier Briel

    Didier Briel - 2014-05-23


    Fixed in the released version 3.1.1 update 1 of OmegaT.


    Last edit: Didier Briel 2014-05-23
  • Didier Briel

    Didier Briel - 2014-05-23
    • status: open-fixed --> closed-fixed
  • Didier Briel

    Didier Briel - 2014-12-02
    • summary: Strip special strings before pasing text to spellchecker script --> Strip special strings before passing text to spellchecker script
