#2159 UTF-8 encoding not reliably detected for files over 4K

Milestone: Branch
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2013-12-21
Created: 2013-12-18
Creator: Jim Maloy
Private: No

Using WinMerge 2.14.0.0 (Unicode)

The attached set contains three text files, all UTF-8 encoded, and any two of them differ by at most three lines. However, comparing file C-UTF8WithFirstUCAfter4096.txt to either A-UTF8WithFirstUCBefore4096.txt or B-UTF8WithFirstUCBefore4096.txt shows several spurious differences. This happens because WinMerge incorrectly classifies file C as CP1252 rather than UTF-8. The misclassification in turn appears to occur because the first non-ASCII character in that file comes after byte 4096.
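To illustrate the suspected mechanism, here is a minimal Python sketch. It is not WinMerge's code: `guess_encoding` and `SAMPLE_SIZE` are hypothetical names, and the 4096-byte sample size is an assumption taken from the behavior observed above. A detector that inspects only a leading sample cannot tell UTF-8 from CP1252 when that sample is pure ASCII, so it falls back to the ANSI code page.

```python
import codecs

SAMPLE_SIZE = 4096  # assumption: sample size inferred from the observed behavior


def guess_encoding(data, sample_size=SAMPLE_SIZE):
    """Hypothetical detector: report UTF-8 only if the inspected bytes
    contain non-ASCII bytes that form valid UTF-8; otherwise CP1252."""
    whole = sample_size is None or len(data) <= sample_size
    sample = data if whole else data[:sample_size]
    if sample.isascii():
        # Pure ASCII looks identical in both encodings, so fall back.
        return "CP1252"
    try:
        # final=False tolerates a multi-byte sequence cut off by the sample edge.
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=whole)
    except UnicodeDecodeError:
        return "CP1252"
    return "UTF-8"


# First non-ASCII character before vs. after the sample boundary.
early = "é".encode("utf-8") + b"a" * 5000
late = b"a" * 5000 + "é".encode("utf-8")

print(guess_encoding(early))                   # UTF-8
print(guess_encoding(late))                    # CP1252 (misdetected)
print(guess_encoding(late, sample_size=None))  # UTF-8 when the whole file is scanned

# Misreading UTF-8 bytes as CP1252 changes the text, which then shows up as a difference.
print("é".encode("utf-8").decode("cp1252"))    # Ã©
```

Scanning the whole file, or continuing until the first non-ASCII byte is found, avoids the misdetection in this sketch, at the cost of reading more data.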

Discussion

  • Jim Maloy
    2013-12-18

    Comparison files, plus configuration.

     
  • Jochen Tucht
    2013-12-21

    The situation is somewhat better in WinMerge 2011:
    the status bar shows the wrong encoding, but editing works correctly.

    https://bitbucket.org/jtuc/winmerge2011/commits/eb29a0a fixes the status bar issue.