#13 UTF-16 detection problem

Group: v1.0 (example)
Status: closed-fixed
Owner: nobody
Labels: None
Priority: 5
Updated: 2013-05-30
Created: 2012-04-14
Creator: khagaroth
Private: No

It looks like Diffuse has some issues with encoding detection. The algorithm it uses is either pretty dumb (judging by the bug in the Feature Request tracker, this seems to be the case) or pretty much broken.
The problem I have is that it doesn't correctly detect UTF-16 files (the file content looks like this: A@B@C@D@..., i.e. each 00 byte is displayed as @). If I add utf_16_le as the first entry in the regional settings (same as was done for utf-8 in the other bug), then UTF-16 files open correctly, but doing this causes other (non-UTF-16) files to open as garbage at random (some open correctly, some open as one long line of random characters, mostly Chinese).
One would think that detecting UTF-16 would be the simplest thing, because the file always starts with a BOM (FF FE in the case of UTF-16LE), so there is no need to guess.
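
For illustration, the BOM check described above could look roughly like the following Python sketch. It is not Diffuse's actual code: `codec_from_bom` is a hypothetical helper, and it only helps for files that really do start with a BOM (UTF-16 files written without one would still need guessing). The UTF-32 BOMs are tested first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.

```python
import codecs

# Hypothetical lookup table (not from Diffuse): BOM bytes -> Python codec name.
# Longer BOMs come first so FF FE 00 00 is not mistaken for UTF-16LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf_32"),
    (codecs.BOM_UTF32_BE, "utf_32"),
    (codecs.BOM_UTF8, "utf_8_sig"),
    (codecs.BOM_UTF16_LE, "utf_16"),
    (codecs.BOM_UTF16_BE, "utf_16"),
]


def codec_from_bom(data: bytes):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


sample = b"\xff\xfeA\x00B\x00"          # UTF-16LE bytes with a leading BOM
codec = codec_from_bom(sample)
text = sample.decode(codec)             # the utf_16 codec consumes the BOM
```

The `utf_16` and `utf_32` codecs read the BOM themselves to pick the byte order, so the helper does not need to return separate little-endian and big-endian names.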

Discussion

  • Derrick Moser

    Derrick Moser - 2012-04-16

    I have just committed a fix for this. Users will still need to explicitly add utf_16 to the list of auto-detect codecs. Without it, text will likely be identified as valid latin_1.

    Diffuse now requires text to contain a BOM sequence before identifying it as utf_16. This will prevent non-utf_16 files from being incorrectly identified as utf_16. Diffuse generates a BOM sequence when saving as utf_16. Use the utf_16_le or utf_16_be codecs (not to be confused with the utf_16 codec) only if you don't want to write BOM sequences. Of course, auto-detection will encounter ambiguous cases with the utf_16_le and utf_16_be codecs.
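
The behaviour described in this comment can be sketched in Python as follows. This is an illustration under assumptions, not the committed fix: `detect` and its default candidate list are hypothetical. It tries each codec in order, skips `utf_16` unless the data starts with a BOM, and falls through to `latin_1`, which accepts any byte sequence.

```python
import codecs

UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def detect(data: bytes, candidates=("utf_8", "utf_16", "latin_1")):
    """Return (codec, text) for the first candidate that decodes data.

    Hypothetical sketch: utf_16 is only considered when a BOM is present.
    """
    for name in candidates:
        if name == "utf_16" and not data.startswith(UTF16_BOMS):
            continue  # no BOM, so do not guess utf_16
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    return None, None


# utf_16 writes a BOM when encoding; utf_16_le does not.
with_bom = "A".encode("utf_16")
without_bom = "A".encode("utf_16_le")

# Why utf_16_le is ambiguous: even-length ASCII decodes without error,
# but each pair of bytes becomes a single (here CJK) character.
garbage = b"abcd".decode("utf_16_le")
```

With these definitions, `detect("héllo".encode("utf_16"))` picks `utf_16`, while `detect("café".encode("latin_1"))` skips `utf_16` and ends up at `latin_1`.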

     
  • Derrick Moser

    Derrick Moser - 2012-04-16

    I wasn't aware of how extensively UTF-16 was used in development on Windows. utf_16 is now included in the default list of auto-detect codecs.

     
  • khagaroth

    khagaroth - 2012-04-28

    A follow-up on this. As suggested here, I replaced utf_16_le with just utf_16, but this introduced another bug. With utf_16 as the first encoding, the BOM is written at the beginning of every single line in the file instead of just at the very beginning of the file.
    Unless this was already fixed by the change made for this bug report, it should be fixed before releasing a version with utf_16 enabled by default.

     
  • Derrick Moser

    Derrick Moser - 2012-04-30

    The problem with a BOM being written at the beginning of every line should now be fixed.
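
One way a BOM-per-line bug of this kind can arise in Python is by encoding each line separately with the `utf_16` codec. The sketch below is an assumption about the general mechanism, not a description of Diffuse's code or of the actual fix. It contrasts per-line encoding with an incremental encoder, which remembers that the BOM has already been written.

```python
import codecs

lines = ["first\n", "second\n"]

# Each separate call to encode("utf_16") emits its own BOM.
per_line = b"".join(line.encode("utf_16") for line in lines)

# An incremental encoder keeps state, so only the first chunk gets a BOM.
encoder = codecs.getincrementalencoder("utf_16")()
streamed = b"".join(encoder.encode(line) for line in lines)

# Reading the per-line output back leaves a stray U+FEFF before "second".
round_trip = per_line.decode("utf_16")
```

Encoding the joined text in one call, `"".join(lines).encode("utf_16")`, gives the same bytes as the incremental encoder.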

     
  • Derrick Moser

    Derrick Moser - 2013-05-30
    • status: open --> closed-fixed
    • Group: --> v1.0 (example)
     
