
UTF-8 support is broken

2008-02-23
2013-05-28
  • Matt Wozniski

    Matt Wozniski - 2008-02-23

    The UTF-8 handling in src/mbchar.c contains both false positives and false negatives in validating UTF-8 sequences.  As an example:

    No warning with an ill-formed string literal:
    ~>LC_ALL=en_US.UTF-8 echo \"$'\x80'\"
    "�"
    ~>LC_ALL=en_US.UTF-8 echo \"$'\x80'\" | iconv
    "iconv: illegal input sequence at position 1
    ~>LC_ALL=en_US.UTF-8 echo \"$'\x80'\" | mcpp
    #line 1 "<stdin>"
    "�"

    Warning with a well-formed string literal:
    ~>LC_ALL=en_US.UTF-8 echo \"$'\xE2\x82\x80'\"
    "₀"
    ~>LC_ALL=en_US.UTF-8 echo \"$'\xE2\x82\x80'\" | iconv
    "₀"
    ~>LC_ALL=en_US.UTF-8 echo \"$'\xE2\x82\x80'\" | mcpp
    #line 1 "<stdin>"
    <stdin>:1: warning: Illegal multi-byte character sequence "�" in quotation
        "₀"
    "₀"

    Also, no support is provided whatsoever for 4-byte sequences.

    A patch that should correct both of these types of errors, as well as enforce the strict checking required by the Unicode standard for overlong and underlong sequences and for encoded UTF-16 surrogate code points, can be found at

    http://www.cs.drexel.edu/~mjw452/mcpp-utf8.diff
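
For readers who cannot fetch the patch, the kind of strict checking described above can be sketched roughly as follows. This is not the patch itself and not mcpp code: the names `utf8_seq_len` and `utf8_is_well_formed` are hypothetical, and the byte ranges are the well-formed UTF-8 ranges from the Unicode standard as I understand them (lead bytes C2..F4, with a narrowed second-byte range after E0, ED, F0 and F4).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from mcpp or the patch).
 * Returns the length (1-4) of the well-formed UTF-8 sequence that starts
 * at s, or 0 if the bytes at s do not start a well-formed sequence. */
static size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of the 2nd byte */
    size_t need, i;

    assert(s != NULL);
    if (avail == 0)
        return 0;

    if (s[0] <= 0x7F) {
        return 1;                        /* ASCII */
    } else if (s[0] >= 0xC2 && s[0] <= 0xDF) {
        need = 2;                        /* C0 and C1 would be overlong */
    } else if (s[0] >= 0xE0 && s[0] <= 0xEF) {
        need = 3;
        if (s[0] == 0xE0) lo = 0xA0;     /* reject overlong 3-byte forms */
        if (s[0] == 0xED) hi = 0x9F;     /* reject surrogates D800..DFFF */
    } else if (s[0] >= 0xF0 && s[0] <= 0xF4) {
        need = 4;
        if (s[0] == 0xF0) lo = 0x90;     /* reject overlong 4-byte forms */
        if (s[0] == 0xF4) hi = 0x8F;     /* reject values above 10FFFF */
    } else {
        return 0;                        /* stray trail byte or bad lead */
    }

    if (avail < need)
        return 0;                        /* truncated sequence */
    if (s[1] < lo || s[1] > hi)
        return 0;
    for (i = 2; i < need; i++)
        if (s[i] < 0x80 || s[i] > 0xBF)
            return 0;
    return need;
}

/* Returns 1 if the whole buffer is well-formed UTF-8, 0 otherwise. */
int utf8_is_well_formed(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        size_t n = utf8_seq_len(s + i, len - i);
        if (n == 0)
            return 0;
        i += n;
    }
    return 1;
}
```

With this sketch, the lone byte 80 from the first example is rejected, while E2 82 80 (the subscript zero from the second example) is accepted, which is the opposite of what mcpp currently does.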

    Please let me know if you have any questions or comments.

    ~Matt

     
    • Kiyoshi Matsui

      Kiyoshi Matsui - 2008-02-25

      Thank you for the debugging and patching on mcpp!

      I can't remember why I excluded 0x80-0x9f from the second byte of
      3-byte sequences.  Anyway, it is broken.  Also, it does not check for
      illegal sequences beyond range checking of the 2 or 3 bytes.

      I have read some documents on UTF-8, and have checked the patch.  Your
      patch is perfect!

      May I take the patch into mcpp source?

      Also, I'm going to make all the mb_read_*() routines check all the
      8-bit characters as the new mb_read_utf8() does.

       
      • Matt Wozniski

        Matt Wozniski - 2008-02-25

        > Thank you for the debugging and patching on mcpp!
        No problem.  :)

        > I have read some documents on UTF-8, and have checked the patch.
        > Your patch is perfect!
        Glad to hear it.  I tested it with some fairly complicated input.

        > May I take the patch into mcpp source?
        Sure, that's why I gave it in patch form.  :)

        > Also, I'm going to make all the mb_read_*() routines check all
        > the 8-bit characters as the new mb_read_utf8() does.

        Extra checking is always nice, particularly since poor UTF-8 validation has been at the center of some recent security exploits.  Microsoft IIS 4 and 5 had a bug in handling overlong UTF-8 sequences that allowed access to paths outside of the virtual root (see http://seclists.org/bugtraq/2000/Oct/0264.html).
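
To illustrate why overlong sequences matter, here is a small sketch (hypothetical function names, not mcpp or IIS code) that contrasts a naive two-byte decoder with a strict one. The assumption is the usual overlong example, the bytes C0 AF: a decoder that only masks and shifts turns them into U+002F, the '/' character, so a path check that ran on the raw bytes never saw a slash.

```c
#include <assert.h>
#include <stddef.h>

/* Naive two-byte decoder (illustrative only): masks and shifts without
 * checking that the value really needed two bytes. */
unsigned naive_decode2(const unsigned char *s)
{
    assert(s != NULL);
    return ((unsigned)(s[0] & 0x1F) << 6) | (unsigned)(s[1] & 0x3F);
}

/* Strict two-byte decoder: returns the code point, or -1 if the two
 * bytes are not a well-formed two-byte sequence. */
long strict_decode2(const unsigned char *s)
{
    unsigned cp;

    assert(s != NULL);
    if ((s[0] & 0xE0) != 0xC0 || (s[1] & 0xC0) != 0x80)
        return -1;                  /* not a 2-byte lead plus trail byte */
    cp = ((unsigned)(s[0] & 0x1F) << 6) | (unsigned)(s[1] & 0x3F);
    if (cp < 0x80)
        return -1;                  /* overlong: fits in a single byte */
    return (long)cp;
}
```

Here `naive_decode2` maps C0 AF to 0x2F, while `strict_decode2` rejects it and still accepts a genuine two-byte sequence such as C3 A9 (U+00E9).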

        Good luck with mcpp! :)

         
