#513 UTF-8 without BOM Detection bug, for certain characters

Don HO

If a UTF-8 (no BOM) file contains certain characters, it is always detected as 8-bit ASCII, because the byte sequence for those characters is marked as invalid. Thai baht (U+0E3F) is one example, but I think anything from U+0E00 to U+0FFF would trigger it.

This patch fixes the valid UTF-8 tests.

For 3-byte codes, the ONLY thing we can say about the first byte is that ([byte] & 0xE0) == 0xE0. The result of ANDing with 0x0F depends entirely on the code point, and the test fails for codes that fall into the range U+0E00 to U+0FFF (I think). Thai baht (U+0E3F) is a good example. See https://sourceforge.net/p/notepad-plus/discussion/331754/thread/0fac5f5e/?limit=25

This patch also corrects the two-byte check to test for (byte & 0xC0) == 0xC0. The previous test worked, but only because all characters lay in the requisite range where byte & 0x1F would always be 0. The new test ((byte & 0xC0) == 0xC0) is more "correct", and says exactly what we should be checking for.
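To make the bit layout concrete, here is a minimal sketch (not the code from Utf8_16.cpp, which is not shown in this ticket). The helper names encode3 and sequenceLength are hypothetical. encode3 shows that U+0E3F encodes as E0 B8 BF, so the low nibble of its lead byte is zero. sequenceLength uses the conventional masks for classifying a lead byte, which are stricter than the tests described in the patch above, since (byte & 0xE0) == 0xE0 also matches 4-byte lead bytes.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: encode a code point in U+0800..U+FFFF as three
// UTF-8 bytes. Surrogates are not rejected; this only illustrates layout.
inline std::array<std::uint8_t, 3> encode3(std::uint32_t cp) {
    return {
        static_cast<std::uint8_t>(0xE0 | (cp >> 12)),          // 1110xxxx
        static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)),  // 10xxxxxx
        static_cast<std::uint8_t>(0x80 | (cp & 0x3F))          // 10xxxxxx
    };
}

// Hypothetical helper: length of the sequence a lead byte announces,
// using the conventional masks. Returns 0 for a continuation byte
// (10xxxxxx) or a byte that cannot start a sequence (11111xxx).
inline int sequenceLength(std::uint8_t b) {
    if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}
```

The four payload bits in a 3-byte lead byte are the top four bits of the code point, so they are all zero for any code point below U+1000. A test that requires (byte & 0x0F) to be non-zero would therefore reject the lead byte of U+0E3F.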

The attachment contains a patch that can be applied using the standard patch.exe, and the complete Utf8_16.cpp file, which can simply be dropped in as a replacement if nothing else has changed.

1 Attachments


  • Don HO

    Don HO - 2013-07-31
    • status: open --> accepted
    • Priority: 5 --> 7
  • Anonymous - 2013-08-26

    Was there (and is there still) a reason for not accepting 4-byte UTF-8 sequences as valid? In Utf8_16_Read::utf8_7bits_8bits() there's a comment saying that UTF-8 encodings of Unicode code points with "more than 16 bits are not allowed here", but it doesn't explain why.
    RFC 3629 says that valid UTF-8 byte sequences are:

    UTF8 with 1 octet:

    byte 1
    00-7F

    UTF8 with 2 octets:

    byte 1 | byte 2
    C2-DF | 80-BF

    UTF8 with 3 octets:

    byte 1 | byte 2 | byte 3
    E0 | A0-BF | 80-BF
    E1-EC | 80-BF | 80-BF
    ED | 80-9F | 80-BF
    EE-EF | 80-BF | 80-BF

    UTF8 with 4 octets:

    byte 1 | byte 2 | byte 3 | byte 4
    F0 | 90-BF | 80-BF | 80-BF
    F1-F3 | 80-BF | 80-BF | 80-BF
    F4 | 80-8F | 80-BF | 80-BF

    Source: http://tools.ietf.org/html/rfc3629#page-5

    It also seems that utf8_7bits_8bits() isn't strict in some places (for example, it doesn't reject C0 and C1, which are invalid lead bytes, and it doesn't enforce the restrictions on the second byte in 3-byte UTF-8). I take it that's so slightly malformed UTF-8 still gets recognised as such, with invalid characters hopefully being handled properly at a later stage (by removal or by use of replacement characters)?

    I propose to extend utf8_7bits_8bits() to accept 4-byte UTF-8 sequences, so that UTF-8 text with SMP code points like in the attachment doesn't get erroneously classified as ANSI.
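As an illustration of what a check that follows the RFC 3629 table (including 4-byte sequences) could look like, here is a minimal sketch. It is not the Notepad++ implementation: isStrictUtf8 and inRange are hypothetical names, and unlike the lenient behaviour discussed above, this version rejects C0, C1, surrogates and overlong forms outright.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// True if lo <= b <= hi.
static bool inRange(std::uint8_t b, std::uint8_t lo, std::uint8_t hi) {
    return b >= lo && b <= hi;
}

// Hypothetical helper: strict validation against the RFC 3629 byte table.
bool isStrictUtf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const std::uint8_t b0 = static_cast<std::uint8_t>(s[i]);
        std::size_t extra = 0;              // continuation bytes expected
        std::uint8_t lo = 0x80, hi = 0xBF;  // allowed range for byte 2
        if (b0 <= 0x7F)                   { extra = 0; }
        else if (inRange(b0, 0xC2, 0xDF)) { extra = 1; }
        else if (b0 == 0xE0)              { extra = 2; lo = 0xA0; }
        else if (b0 == 0xED)              { extra = 2; hi = 0x9F; }
        else if (inRange(b0, 0xE1, 0xEF)) { extra = 2; }
        else if (b0 == 0xF0)              { extra = 3; lo = 0x90; }
        else if (inRange(b0, 0xF1, 0xF3)) { extra = 3; }
        else if (b0 == 0xF4)              { extra = 3; hi = 0x8F; }
        else return false;  // 80-BF, C0, C1 and F5-FF cannot start a sequence

        if (extra > n - i - 1) return false;  // sequence is truncated

        for (std::size_t k = 1; k <= extra; ++k) {
            const std::uint8_t b = static_cast<std::uint8_t>(s[i + k]);
            // Byte 2 uses the lead-specific range; later bytes use 80-BF.
            const bool ok = (k == 1) ? inRange(b, lo, hi)
                                     : inRange(b, 0x80, 0xBF);
            if (!ok) return false;
        }
        i += extra + 1;
    }
    return true;
}
```

With this sketch, isStrictUtf8("\xE0\xB8\xBF") (U+0E3F) and isStrictUtf8("\xF0\x9F\x98\x80") (an SMP code point) both return true, while isStrictUtf8("\xC0\x80") returns false. A detector that wants to tolerate slightly malformed input would need to relax some of these ranges.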

    Last edit: Anonymous 2013-08-27
  • Don HO

    Don HO - 2013-08-29
    • status: accepted --> closed
