
#198 Corruption when merging some Chinese characters in Unicode files.

v1.0_(example)
open
nobody
None
1
2015-01-08
2015-01-07
No

I have 2 files containing the same series of bytes:
FF FE CB 8A 0D 7A 85 5F
The first two bytes (FF FE) are the byte-order mark that identifies the file as Unicode (UTF-16, little-endian). The next 6 bytes represent 3 Chinese characters that translate to "Please Wait". If I run KDiff3 using the 2 files as input and output to a third file, the bytes in the output file are as follows:
FF FE CB 8A 0A 7A 85 5F
I expect the output file to be the same as the input files, but it is not: the fifth byte changes from 0D to 0A. This problem exists in KDiff3 version 0.9.98 (64-bit), but not in version 0.9.96a, which I think is also 64-bit, although the About dialog doesn't indicate this.
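
The byte sequences above can be checked with a short Python sketch. It assumes the files are UTF-16 with a byte-order mark, as the FF FE prefix suggests; the variable and function names are illustrative only.

```python
# Byte sequences copied from the report above.
original = bytes.fromhex("FF FE CB 8A 0D 7A 85 5F")
merged = bytes.fromhex("FF FE CB 8A 0A 7A 85 5F")


def decode_utf16_with_bom(data: bytes) -> str:
    # The "utf-16" codec consumes the BOM and takes the byte order from it.
    return data.decode("utf-16")


original_text = decode_utf16_with_bom(original)
merged_text = decode_utf16_with_bom(merged)

# Positions where the two byte strings differ: (offset, input byte, output byte).
changed = [
    (i, hex(a), hex(b))
    for i, (a, b) in enumerate(zip(original, merged))
    if a != b
]

print([hex(ord(c)) for c in original_text])
print([hex(ord(c)) for c in merged_text])
print(changed)
```

Only the low byte of the second character differs, which turns U+7A0D into U+7A0A while the other two characters stay intact.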

1 Attachments

Discussion

  • Benjamin Marty

    Benjamin Marty - 2015-01-07

    FYI, in terms of the actual Chinese characters, 請稍待 becomes 請稊待.

     
  • Benjamin Marty

    Benjamin Marty - 2015-01-07

    I have found that 0.9.96 and 0.9.97 exhibit the same problem, but the 0.9.96a delivered with TortoiseHg does not.

     
  • Joachim Eibl

    Joachim Eibl - 2015-01-08

    Yes, I broke that by trying to fix issues with old Mac line endings (0D) and converting them to 0A.
    Perhaps you can work around it by first converting the files to UTF-8, but I will try to fix this soon.
    Joachim
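
A hypothetical illustration of this kind of bug, not KDiff3's actual code: replacing every 0D byte before decoding damages any UTF-16 code unit that happens to contain 0D, whereas replacing the CR character after decoding does not. Both function names are made up for this sketch, and little-endian UTF-16 input is assumed.

```python
import re


def normalize_cr_bytewise(data: bytes) -> bytes:
    """Assumed buggy approach: treat every 0D byte as an old Mac line ending."""
    return data.replace(b"\x0d", b"\x0a")


def normalize_cr_utf16le(data: bytes) -> bytes:
    """Encoding-aware approach: decode first, then replace lone CR characters."""
    text = data.decode("utf-16-le")
    # Only a CR that is not part of a CR LF pair counts as an old Mac line ending.
    text = re.sub(r"\r(?!\n)", "\n", text)
    return text.encode("utf-16-le")


sample = bytes.fromhex("FF FE CB 8A 0D 7A 85 5F")
print(normalize_cr_bytewise(sample).hex(" "))  # byte 0D inside U+7A0D is rewritten
print(normalize_cr_utf16le(sample).hex(" "))   # no CR character present, so unchanged
```

The second function still converts a real old Mac line ending, because a CR character decodes to U+000D regardless of which bytes the neighbouring characters use.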

     
  • Benjamin Marty

    Benjamin Marty - 2015-01-08

    Looks like 0.9.95 is the last version without this problem. I will switch back to that until a new release is available (on systems where TortoiseHg is not installed). Normally I would prefer UTF-8, but I think it's sub-optimal in this case, where the files I'm merging are large and contain mostly Chinese text. Converting to UTF-8 might also interfere with the continuous history of the file in source control, since not all source control systems handle storing or comparing files with different encodings well. I also worry that UTF-8 could theoretically still exhibit the problem, because I suspect 0D is sometimes a valid byte in UTF-8 (besides CR) too. Thanks.
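
On the UTF-8 concern: by the design of UTF-8, every byte of a multi-byte sequence has its high bit set (0x80 or above), so 0D should only ever appear as an actual CR character. The following sketch checks this exhaustively over all encodable code points; the function name is made up for illustration.

```python
def utf8_bytes_containing(byte_value: int) -> list[int]:
    """Return code points, other than the ASCII character itself,
    whose UTF-8 encoding contains the given byte value."""
    hits = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded in UTF-8
        if cp == byte_value:
            continue  # skip the ASCII character itself (CR for 0x0D)
        if byte_value in chr(cp).encode("utf-8"):
            hits.append(cp)
    return hits


print(utf8_bytes_containing(0x0D))
print("\u7a0d".encode("utf-8").hex(" "))
print("\u7a0d".encode("utf-16-le").hex(" "))
```

For comparison, the last two lines encode U+7A0D (the character affected in this report) both ways: the 0D byte is present in the UTF-16 form but not in the UTF-8 form.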