#161 Unicode detection fails

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2014-08-07
Created: 2011-10-12
Creator: jpstotz
Private: No

When comparing two UTF-8 encoded files (without BOM), KDiff3 fails to detect that both files are UTF-8.

The comparison is only correct when I manually change the regional settings to UTF-8.
I would have expected that this is the job of the "Autodetect Unicode Function".

Affected versions: 0.9.96 (2011-09-02) from SourceForge, as well as the version that comes with TortoiseHg 2.1.3 ("kdiff3 Version 0.9.96a").

Discussion

  • jpstotz

    jpstotz - 2011-10-12
     
  • Joachim Eibl

    Joachim Eibl - 2011-10-15

    Hi,
    Let me clarify the current state (0.9.96).
    Automatic Unicode detection currently requires a BOM or, for XML or HTML files, an encoding specification in the file.
    If UTF-8 is selected manually and the file seems to contain invalid characters, a warning is shown in a message box.
    If some other 8-bit encoding is selected manually, it will always be used instead.

    Even if this is not ideal, it should normally prevent data loss.

    I have already received some suggestions for a more intuitive approach (e.g. assume Unicode first and automatically fall back to the default only if invalid characters are found) and will consider them for a future version.

    Joachim
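
The suggested approach (check for a BOM, otherwise assume UTF-8 and fall back to the default 8-bit encoding only if invalid byte sequences are found) could be sketched roughly as follows. This is an illustrative Python sketch, not KDiff3's implementation; the function name `guess_encoding` and the `cp1252` fallback are assumptions chosen for the example.

```python
import codecs

# BOMs are checked longest-first, because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Return a codec name for `data` (hypothetical helper).

    `fallback` stands in for the user's default 8-bit encoding.
    """
    # 1. An explicit BOM wins.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. No BOM: assume UTF-8 and verify by strict decoding.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 3. Invalid UTF-8 sequences found: use the default encoding.
        return fallback
```

Pure ASCII files are reported as UTF-8 here, which is harmless because ASCII decodes identically under UTF-8 and the common 8-bit encodings. Text in an 8-bit encoding that happens to form valid UTF-8 sequences would be misdetected, but that is rare in practice.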

     
  • Joachim Eibl

    Joachim Eibl - 2011-10-15

    By the way: 0.9.96 is the new official version, whereas 0.9.96a is probably a much older unofficial beta version.
    Joachim

     
  • Sebastian Auriol

    UTF-16 encoding detection fails on files saved from Textpad 5.3.1 and earlier (later versions untested). Textpad does not insert the BOM with its default settings (I've just found an option to enable it). Nevertheless, it is quite obvious that the file is UTF-16 if you look at the hex, because every other column is all 00.
    As the encoding detection fails, the entire file is marked as different compared with the ANSI (windows-1252) version I've just saved.
    I upgraded my kdiff3 to 0.9.97 to get the feature that enables me to choose the encoding manually so I can verify that there are indeed no changes between the previously committed version and the current version.
    It would be good to improve the automatic encoding detection!
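
The "every other byte is 00" observation above can be turned into a simple heuristic for BOM-less UTF-16. The sketch below is only an illustration under stated assumptions: the function name `looks_like_utf16`, the 4096-byte sample size, and both thresholds are made up for the example, and the heuristic only works for text that is mostly ASCII/Latin-1 (for which one byte of each UTF-16 code unit is zero).

```python
from typing import Optional


def looks_like_utf16(data: bytes, threshold: float = 0.4) -> Optional[str]:
    """Guess BOM-less UTF-16 from the position of NUL bytes (hypothetical helper).

    Returns "utf-16-le", "utf-16-be", or None if neither pattern is found.
    """
    sample = data[:4096]
    # Work on whole 2-byte code units only.
    sample = sample[: len(sample) - len(sample) % 2]
    if len(sample) < 2:
        return None
    pairs = len(sample) // 2
    even_nuls = sample[0::2].count(0)  # NULs at offsets 0, 2, 4, ...
    odd_nuls = sample[1::2].count(0)   # NULs at offsets 1, 3, 5, ...
    # Little-endian: the high (zero) byte comes second in each pair.
    if odd_nuls / pairs >= threshold and even_nuls / pairs < 0.05:
        return "utf-16-le"
    # Big-endian: the high (zero) byte comes first in each pair.
    if even_nuls / pairs >= threshold and odd_nuls / pairs < 0.05:
        return "utf-16-be"
    return None
```

Text dominated by non-Latin scripts contains few zero bytes and would not be recognized this way, so such a check would be a fallback before trying the 8-bit default rather than a complete detector.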

     
