Users have complained that when the source language is Chinese, the fuzzy match display shows differences too coarsely: a single differing character can cause large chunks of text to be shown as different.
The issue is the following: tokenizing for display purposes requires "verbatim" tokenization, meaning (in pseudo-Python) "".join(tokens) == original_string; in OmegaT, ITokenizer.tokenizeAllExactly() must obey this property. Most tokenizers cannot do this in any mode (they swallow whitespace, punctuation, etc.), so they fall back to a BreakIterator-based solution that tokenizes on character-type boundaries.
For space-delimited languages (or Japanese, which has frequent character-type boundaries) this works OK, but it fails hard on Chinese.
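To make the failure mode concrete, here is a minimal sketch (not OmegaT's actual fallback code; the class name is hypothetical) of a character-type-boundary tokenizer that preserves the verbatim property:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: split where the Unicode character type changes, keeping
    // every span, so concatenating the tokens reproduces the original string.
    public final class CharTypeTokenizerSketch {
        public static List<String> tokenizeAllExactly(String text) {
            List<String> tokens = new ArrayList<>();
            if (text.isEmpty()) {
                return tokens;
            }
            int first = text.codePointAt(0);
            int prevType = Character.getType(first);
            int start = 0;
            int i = Character.charCount(first);
            while (i < text.length()) {
                int cp = text.codePointAt(i);
                int type = Character.getType(cp);
                if (type != prevType) {
                    tokens.add(text.substring(start, i)); // emit span at a type boundary
                    start = i;
                    prevType = type;
                }
                i += Character.charCount(cp);
            }
            tokens.add(text.substring(start)); // final span
            return tokens;
        }

        public static void main(String[] args) {
            String zh = "价格是100元";
            List<String> tokens = tokenizeAllExactly(zh);
            // Verbatim property: concatenating the tokens reproduces the input.
            System.out.println(String.join("", tokens).equals(zh)); // true
            System.out.println(tokens); // [价格是, 100, 元]
        }
    }

Since a Chinese sentence is mostly one long run of the same character type, the whole run comes back as a single token, and one differing character inside it marks the entire chunk as changed.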
An easy solution is to have Chinese tokenizers do a unigram tokenization for tokenizeAllExactly(). This would be too granular for most languages, but I believe it to be acceptable for Chinese.
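A minimal sketch of such a unigram tokenization, using plain strings in place of whatever token type the real tokenizeAllExactly() returns; iterating by code point keeps surrogate pairs intact, so the verbatim property still holds:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: one token per code point, so every Chinese character can
    // be diffed individually while "".join(tokens) == original still holds.
    public final class UnigramTokenizerSketch {
        public static List<String> tokenizeAllExactly(String text) {
            List<String> tokens = new ArrayList<>();
            for (int i = 0; i < text.length(); ) {
                int len = Character.charCount(text.codePointAt(i)); // 2 for surrogate pairs
                tokens.add(text.substring(i, i + len));
                i += len;
            }
            return tokens;
        }
    }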
tokenizeAllExactly() example for Lucene(Smart)ChineseTokenizer: [screenshot attachment]
The same example with the default fallback (BreakIterator): [screenshot attachment]
I have a patch ready for this; I am waiting for the next beta branch.
I think that for English words (including numbers) in between Chinese characters, it may not be necessary to tokenize in such small units, i.e. character by character. I don't know whether this factor should be considered in the Chinese tokenizer; if so, judging whether each character is English or Chinese would be necessary.
True, it's not necessary, but it doesn't hurt either, and it would be much more complicated to have different behavior for Chinese and non-Chinese tokens.
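For illustration only, a sketch of what that more complicated mixed behavior might look like: keep runs of Latin letters and digits whole, and unigram everything else. keepInRun() is a hypothetical helper; the per-character script test it needs is exactly the extra complexity the uniform unigram approach avoids.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the rejected alternative, not of what was committed.
    public final class MixedScriptTokenizerSketch {
        private static boolean keepInRun(int cp) {
            // Assumed heuristic: group Latin letters and ASCII digits into runs.
            return Character.UnicodeScript.of(cp) == Character.UnicodeScript.LATIN
                    || (cp >= '0' && cp <= '9');
        }

        public static List<String> tokenizeAllExactly(String text) {
            List<String> tokens = new ArrayList<>();
            int runStart = -1; // start of the current Latin/digit run, or -1
            for (int i = 0; i < text.length(); ) {
                int cp = text.codePointAt(i);
                int len = Character.charCount(cp);
                if (keepInRun(cp)) {
                    if (runStart < 0) {
                        runStart = i; // open a new run
                    }
                } else {
                    if (runStart >= 0) {
                        tokens.add(text.substring(runStart, i)); // close the run
                        runStart = -1;
                    }
                    tokens.add(text.substring(i, i + len)); // unigram token
                }
                i += len;
            }
            if (runStart >= 0) {
                tokens.add(text.substring(runStart)); // trailing run
            }
            return tokens;
        }
    }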
Thanks, Aaron, for the quick feedback.
This is addressed in trunk, r7103.
Implemented in the released version 3.4 of OmegaT.
Didier