
#1077 Improve fuzzy match difference granularity in Chinese

Milestone: 3.4
Status: closed-fixed
Labels: None
Priority: 5
Updated: 2015-04-22
Created: 2015-03-25
Private: No

Users have complained that when the source language is Chinese, the fuzzy match display shows differences too coarsely: a single differing character can cause large chunks of text to be shown as different.

The issue is the following: tokenizing for display purposes requires "verbatim" tokenization (meaning, in pseudo-Python: "".join(tokens) == original_string; in OmegaT, ITokenizer.tokenizeAllExactly() must obey this property). Most tokenizers cannot do this in any mode (they swallow whitespace, punctuation, etc.), so they fall back to a BreakIterator-based solution that tokenizes on character-type boundaries.
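To make the verbatim requirement concrete, here is a minimal, self-contained Java sketch of a character-type fallback tokenizer. The names (CharTypeTokenizerSketch, tokenizeByCharType) are hypothetical, and this is an illustration of the idea, not OmegaT's actual implementation:

    import java.util.ArrayList;
    import java.util.List;

    public class CharTypeTokenizerSketch {
        // Group consecutive chars of the same Unicode category into one token.
        // (Char-based for brevity; a real implementation would walk code
        // points to handle surrogate pairs.)
        static List<String> tokenizeByCharType(String text) {
            List<String> tokens = new ArrayList<>();
            int start = 0;
            for (int i = 1; i <= text.length(); i++) {
                if (i == text.length()
                        || Character.getType(text.charAt(i)) != Character.getType(text.charAt(i - 1))) {
                    tokens.add(text.substring(start, i));
                    start = i;
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            String original = "我们都在同一个地球上(英文当中说“a planet”)生活";
            List<String> tokens = tokenizeByCharType(original);
            // The verbatim property: joining the tokens restores the input exactly.
            if (!String.join("", tokens).equals(original)) {
                throw new AssertionError("tokenizer is not verbatim");
            }
            System.out.println(tokens);
        }
    }

Because all Han ideographs share a single character type, an entire Chinese clause comes out as one token, which is exactly the coarseness users complained about.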

For space-delimited languages (or Japanese, which has frequent character-type boundaries) this works OK, but it fails hard on Chinese.

An easy solution is to have the Chinese tokenizers perform unigram tokenization in tokenizeAllExactly(). This would be too granular for most languages, but I believe it to be acceptable for Chinese; a sketch follows the example below.

tokenizeAllExactly() example for Lucene(Smart)ChineseTokenizer:

  • Input:
    • 我们都在同一个地球上(英文当中说“a planet”)生活,而我们的全部是其生态之1.5部分。
  • Before (BreakIterator):
    • 我们都在同一个地球上, (, 英文当中说, “, a, , planet, ”, ), 生活, ,, 而我们的全部是其生态之, 1.5, 部分, 。
  • After (unigram):
    • 我, 们, 都, 在, 同, 一, 个, 地, 球, 上, (, 英, 文, 当, 中, 说, “, a, , p, l, a, n, e, t, ”, ), 生, 活, ,, 而, 我, 们, 的, 全, 部, 是, 其, 生, 态, 之, 1, ., 5, 部, 分, 。
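
The "After" output amounts to emitting one token per Unicode code point. Here is a minimal sketch of such a unigram tokenization, again with hypothetical names; this is not the committed patch:

    import java.util.ArrayList;
    import java.util.List;

    public class UnigramTokenizerSketch {
        // One token per Unicode code point; surrogate-pair safe.
        static List<String> tokenizeUnigram(String text) {
            List<String> tokens = new ArrayList<>();
            text.codePoints()
                .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
            return tokens;
        }

        public static void main(String[] args) {
            String original = "英文当中说“a planet”";
            List<String> tokens = tokenizeUnigram(original);
            // Unigrams trivially preserve the verbatim property.
            if (!String.join("", tokens).equals(original)) {
                throw new AssertionError("tokenizer is not verbatim");
            }
            System.out.println(tokens); // [英, 文, 当, 中, 说, “, a,  , p, l, a, n, e, t, ”]
        }
    }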

Discussion

  • Aaron Madlon-Kay

    I have a patch ready for this; I am waiting for the next beta branch.

  • Jason

    Jason - 2015-03-27

    I think that for English words (and numbers) in between Chinese characters, it may not be necessary to tokenize in units as small as single characters. I don't know whether this factor should be considered in the Chinese tokenizer; if so, it would be necessary to judge whether each character is English or Chinese.

  • Aaron Madlon-Kay

    True, it's not necessary, but it doesn't hurt either, and it would be much more complicated to have different behavior for Chinese and non-Chinese tokens.
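
    (A hypothetical illustration of the extra branching such script-dependent behavior would need: split only Han code points into unigrams while keeping other letter/digit runs whole. The names are made up for this sketch.)

        import java.util.ArrayList;
        import java.util.List;

        public class MixedScriptTokenizerSketch {
            static boolean isHan(int cp) {
                return Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            }

            // Han code points become unigrams; other runs of letters or
            // digits stay whole; everything else is emitted as-is.
            static List<String> tokenize(String text) {
                List<String> tokens = new ArrayList<>();
                StringBuilder run = new StringBuilder();
                text.codePoints().forEach(cp -> {
                    if (isHan(cp) || !Character.isLetterOrDigit(cp)) {
                        if (run.length() > 0) {
                            tokens.add(run.toString());
                            run.setLength(0);
                        }
                        tokens.add(new String(Character.toChars(cp)));
                    } else {
                        run.appendCodePoint(cp);
                    }
                });
                if (run.length() > 0) {
                    tokens.add(run.toString());
                }
                return tokens;
            }
        }

    Even this small variant must already decide what to do with strings like "1.5": the "." is neither a letter nor a digit, so the run splits anyway. Each such rule adds branching, which is the complication referred to above.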

    • Jason

      Jason - 2015-03-27

      Thanks, Aaron, for the quick feedback.

  • Aaron Madlon-Kay

    • status: open --> open-fixed
  • Aaron Madlon-Kay

    This is addressed in trunk, r7103.

  • Didier Briel

    Didier Briel - 2015-04-07
    • summary: Fuzzy match differences not granular enough in Chinese --> Improve fuzzy match difference granularity in Chinese
  • Didier Briel

    Didier Briel - 2015-04-22

    Implemented in the released version 3.4 of OmegaT.

    Didier

  • Didier Briel

    Didier Briel - 2015-04-22
    • status: open-fixed --> closed-fixed
