
#1077 Improve fuzzy match difference granularity in Chinese

Milestone: 3.4
Status: closed-fixed
Labels: None
Priority: 5
Updated: 2015-04-22
Created: 2015-03-25
Private: No

Users have complained that when the source language is Chinese, the fuzzy match display shows differences too coarsely: a single differing character can cause large chunks of text to be shown as different.

The issue is the following: tokenizing for display purposes requires "verbatim" tokenization (meaning, in pseudo-Python: "".join(tokens) == original_string; in OmegaT, ITokenizer.tokenizeAllExactly() must obey this property). Most tokenizers cannot do this in any mode (they swallow whitespace, punctuation, etc.), so they fall back to a BreakIterator-based solution that tokenizes on character-type boundaries.
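To make the verbatim requirement concrete, here is a minimal, self-contained Java sketch of a character-type fallback tokenizer. The names (CharTypeTokenizerSketch, tokenizeByCharType) are hypothetical, and this is an illustration of the idea, not OmegaT's actual implementation:

    import java.util.ArrayList;
    import java.util.List;

    public class CharTypeTokenizerSketch {
        // Group consecutive chars of the same Unicode category into one token.
        // (Char-based for brevity; a real implementation would walk code
        // points to handle surrogate pairs.)
        static List<String> tokenizeByCharType(String text) {
            List<String> tokens = new ArrayList<>();
            int start = 0;
            for (int i = 1; i <= text.length(); i++) {
                if (i == text.length()
                        || Character.getType(text.charAt(i)) != Character.getType(text.charAt(i - 1))) {
                    tokens.add(text.substring(start, i));
                    start = i;
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            String original = "我们都在同一个地球上(英文当中说“a planet”)生活";
            List<String> tokens = tokenizeByCharType(original);
            // The verbatim property: joining the tokens restores the input exactly.
            if (!String.join("", tokens).equals(original)) {
                throw new AssertionError("tokenizer is not verbatim");
            }
            System.out.println(tokens);
        }
    }

Because all Han ideographs share a single character type, an entire Chinese clause comes out as one token, which is exactly the coarseness users complained about.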

For space-delimited languages (or Japanese, which has frequent character-type boundaries) this works OK, but it fails hard on Chinese.

An easy solution is to have the Chinese tokenizers perform unigram tokenization in tokenizeAllExactly(). This would be too granular for most languages, but I believe it to be acceptable for Chinese; a sketch follows the example below.

tokenizeAllExactly() example for Lucene(Smart)ChineseTokenizer:

  • Input:
    • 我们都在同一个地球上(英文当中说“a planet”)生活,而我们的全部是其生态之1.5部分。
  • Before (BreakIterator):
    • 我们都在同一个地球上, (, 英文当中说, “, a, , planet, ”, ), 生活, ,, 而我们的全部是其生态之, 1.5, 部分, 。
  • After (unigram):
    • 我, 们, 都, 在, 同, 一, 个, 地, 球, 上, (, 英, 文, 当, 中, 说, “, a, , p, l, a, n, e, t, ”, ), 生, 活, ,, 而, 我, 们, 的, 全, 部, 是, 其, 生, 态, 之, 1, ., 5, 部, 分, 。
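
The "After" output amounts to emitting one token per Unicode code point. Here is a minimal sketch of such a unigram tokenization, again with hypothetical names; this is not the committed patch:

    import java.util.ArrayList;
    import java.util.List;

    public class UnigramTokenizerSketch {
        // One token per Unicode code point; surrogate-pair safe.
        static List<String> tokenizeUnigram(String text) {
            List<String> tokens = new ArrayList<>();
            text.codePoints()
                .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
            return tokens;
        }

        public static void main(String[] args) {
            String original = "英文当中说“a planet”";
            List<String> tokens = tokenizeUnigram(original);
            // Unigrams trivially preserve the verbatim property.
            if (!String.join("", tokens).equals(original)) {
                throw new AssertionError("tokenizer is not verbatim");
            }
            System.out.println(tokens); // [英, 文, 当, 中, 说, “, a,  , p, l, a, n, e, t, ”]
        }
    }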

Discussion

  • Aaron Madlon-Kay

    I have a patch ready for this; I am waiting for the next beta branch.

  • Jason

    Jason - 2015-03-27

    I think that for English words (and numbers) in between Chinese characters, it may not be necessary to tokenize in units as small as single characters. I don't know whether this factor should be considered in the Chinese tokenizer; if so, it would be necessary to judge whether each character is English or Chinese.

  • Aaron Madlon-Kay

    True, it's not necessary, but it doesn't hurt either, and it would be much more complicated to have different behavior for Chinese and non-Chinese tokens.
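
    (A hypothetical illustration of the extra branching such script-dependent behavior would need: split only Han code points into unigrams while keeping other letter/digit runs whole. The names are made up for this sketch.)

        import java.util.ArrayList;
        import java.util.List;

        public class MixedScriptTokenizerSketch {
            static boolean isHan(int cp) {
                return Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            }

            // Han code points become unigrams; other runs of letters or
            // digits stay whole; everything else is emitted as-is.
            static List<String> tokenize(String text) {
                List<String> tokens = new ArrayList<>();
                StringBuilder run = new StringBuilder();
                text.codePoints().forEach(cp -> {
                    if (isHan(cp) || !Character.isLetterOrDigit(cp)) {
                        if (run.length() > 0) {
                            tokens.add(run.toString());
                            run.setLength(0);
                        }
                        tokens.add(new String(Character.toChars(cp)));
                    } else {
                        run.appendCodePoint(cp);
                    }
                });
                if (run.length() > 0) {
                    tokens.add(run.toString());
                }
                return tokens;
            }
        }

    Even this small variant must already decide what to do with strings like "1.5": the "." is neither a letter nor a digit, so the run splits anyway. Each such rule adds branching, which is the complication referred to above.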

    • Jason

      Jason - 2015-03-27

      Thanks, Aaron, for the quick feedback.

  • Aaron Madlon-Kay

    • status: open --> open-fixed
  • Aaron Madlon-Kay

    This is addressed in trunk, r7103.

  • Didier Briel

    Didier Briel - 2015-04-07
    • summary: Fuzzy match differences not granular enough in Chinese --> Improve fuzzy match difference granularity in Chinese
  • Didier Briel

    Didier Briel - 2015-04-22

    Implemented in the released version 3.4 of OmegaT.

    Didier

  • Didier Briel

    Didier Briel - 2015-04-22
    • status: open-fixed --> closed-fixed
