KDiff3 / Bugs / #198 Corruption when merging some Chinese characters in Unicode files.

#198 Corruption when merging some Chinese characters in Unicode files.

Milestone: v1.0_(example)

Status: open

Owner: nobody

Labels: None

Priority: 1

Updated: 2015-01-08

Created: 2015-01-07

Creator: Benjamin Marty

Private: No

I have 2 files containing the same series of bytes:
FF FE CB 8A 0D 7A 85 5F
The first two bytes (FF FE) are the byte-order mark that identifies the file as Unicode (UTF-16, little-endian). The next 6 bytes represent 3 Chinese characters that translate to "Please Wait". If I run KDiff3 with the 2 files as input and write the output to a third file, the bytes in the output file are as follows:
FF FE CB 8A 0A 7A 85 5F
I expect the output file to be identical to the input files, but it is not: the fifth byte has changed from 0D to 0A. This problem exists in KDiff3 version 0.9.98 (64-bit), but not in version 0.9.96a, which I think is also 64-bit, although its About dialog doesn't indicate this.
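The two byte sequences above can be decoded to see which character is affected. This is only a small sketch using Python's standard `utf-16` codec; the variable names are made up for illustration and nothing here comes from KDiff3 itself.

```python
# Byte sequences as reported above: the input files and the merged output.
original = bytes.fromhex("FF FE CB 8A 0D 7A 85 5F")
merged = bytes.fromhex("FF FE CB 8A 0A 7A 85 5F")

# The "utf-16" codec consumes the FF FE byte-order mark and decodes
# the remaining bytes as little-endian 16-bit code units.
original_text = original.decode("utf-16")
merged_text = merged.decode("utf-16")

# Byte offsets (0-based) at which the two sequences differ.
diff_offsets = [i for i, (a, b) in enumerate(zip(original, merged)) if a != b]

print(original_text, [hex(ord(c)) for c in original_text])
print(merged_text, [hex(ord(c)) for c in merged_text])
print(diff_offsets)
```

The only differing byte is the low byte of the second character, so U+7A0D is turned into U+7A0A while the other two characters are untouched.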

1 Attachments

Chinese1.txt

Discussion

Benjamin Marty - 2015-01-07

FYI, the actual Chinese characters: 請稍待 becomes 請稊待.


Benjamin Marty - 2015-01-07

I have found that 0.9.96 and 0.9.97 exhibit the same problem. However, the 0.9.96a build delivered with TortoiseHg does not.


Joachim Eibl - 2015-01-08

Yes, I broke that by trying to fix issues with old Mac line endings (0D) and converting them to 0A.
Perhaps you can work around it by first converting the files to UTF-8, but I will try to fix this soon.
Joachim
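To illustrate the kind of failure described here, the sketch below contrasts a byte-wise 0D to 0A replacement with one that decodes the text first. It is not KDiff3's code: both function names are hypothetical, the byte-wise variant is only an assumption about how such a conversion could go wrong, and the second function assumes little-endian UTF-16 with an FF FE byte-order mark, as in the reported files.

```python
import re


def naive_cr_to_lf(data: bytes) -> bytes:
    """Hypothetical byte-wise conversion: every 0D byte becomes 0A."""
    return data.replace(b"\x0d", b"\x0a")


def cr_to_lf_utf16(data: bytes) -> bytes:
    """Decode first, so that only real U+000D characters are converted.

    Assumes the input starts with the FF FE byte-order mark.
    """
    text = data.decode("utf-16")  # consumes the byte-order mark
    # Convert lone CR (old Mac line ending); leave CR LF pairs alone.
    text = re.sub(r"\r(?!\n)", "\n", text)
    return b"\xff\xfe" + text.encode("utf-16-le")


sample = bytes.fromhex("FF FE CB 8A 0D 7A 85 5F")
mac_style = b"\xff\xfe" + "a\rb".encode("utf-16-le")

print(naive_cr_to_lf(sample).hex())   # the 0D inside U+7A0D is rewritten
print(cr_to_lf_utf16(sample).hex())   # unchanged, no CR character present
print(cr_to_lf_utf16(mac_style).decode("utf-16") == "a\nb")
```

Working on decoded characters (or on whole 16-bit code units) means a 0D byte that is merely half of another character is never mistaken for a line ending.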


Benjamin Marty - 2015-01-08

Looks like 0.9.95 is the last version without this problem. I will switch back to that until a new release is available (on systems where TortoiseHg is not installed). Normally I would prefer UTF-8, but I think it's sub-optimal in this case, where the files I'm merging are large and contain mostly Chinese text. Converting to UTF-8 might also interfere with the continuous history of the file in source control, since not all source control systems handle storing and comparing files with different encodings well. I also worry that UTF-8 could theoretically still exhibit the problem, because I suspect 0D can sometimes be a valid byte in UTF-8 other than as CR. Thanks.
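Both points about UTF-8 can be checked on the three characters from this report. This is only a rough sketch with illustrative variable names. By the design of UTF-8, every byte of a multi-byte sequence is 0x80 or above, so a 0D byte in UTF-8 data can only be a real carriage return; the size concern, on the other hand, does hold for this text.

```python
text = "\u8acb\u7a0d\u5f85"  # the three characters from the report

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # without byte-order mark

# Does a 0D byte occur anywhere in each encoding of this text?
has_0d_utf8 = 0x0D in utf8
has_0d_utf16 = 0x0D in utf16

# In UTF-8, all bytes of non-ASCII characters have the high bit set.
all_high = all(b >= 0x80 for b in utf8)

print(utf8.hex(), has_0d_utf8, all_high)
print(utf16.hex(), has_0d_utf16)
print(len(utf8), len(utf16))  # bytes needed in each encoding
```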

