mcpp / Discussion / Open Discussion: UTF-8 support is broken

Matt Wozniski - 2008-02-23

The UTF-8 handling in src/mbchar.c contains both false positives and false negatives in validating UTF-8 sequences. As an example:

No warning with an ill-formed string literal:
~>LC_ALL=en_US.UTF-8 echo \"$'\x80'\"
"�"
~>LC_ALL=en_US.UTF-8 echo \"$'\x80'\" | iconv
"iconv: illegal input sequence at position 1
~>LC_ALL=en_US.UTF-8 echo \"$'\x80'\" | mcpp
#line 1 "<stdin>"
"�"

Warning with a well-formed string literal:
~>LC_ALL=en_US.UTF-8 echo \"$'\xE2\x82\x80'\"
"₀"
~>LC_ALL=en_US.UTF-8 echo \"$'\xE2\x82\x80'\" | iconv
"₀"
~>LC_ALL=en_US.UTF-8 echo \"$'\xE2\x82\x80'\" | mcpp
#line 1 "<stdin>"
<stdin>:1: warning: Illegal multi-byte character sequence "�" in quotation
"₀"
"₀"

Also, no support is provided whatsoever for 4-byte sequences.

A patch that should correct both of these types of errors, as well as enforce the strict checking required by the Unicode standard for overlong and underlong sequences and for the use of UTF-16 surrogate pairs, can be found at

http://www.cs.drexel.edu/~mjw452/mcpp-utf8.diff
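
For readers who cannot fetch the diff, the checks described above can be sketched roughly as follows. This is not the code from the patch and not mcpp's mb_read_utf8(); the function name utf8_seq_len and its interface are hypothetical, and "underlong" is taken here to mean a sequence that is cut short. It is a minimal illustration under those assumptions.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not taken from the mcpp patch): return the length in
 * bytes (1 to 4) of the well-formed UTF-8 sequence starting at s, or 0 if
 * the bytes are ill-formed or the sequence is cut off by the end of the
 * n-byte buffer.
 */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned long cp;
    unsigned char c;

    if (n == 0)
        return 0;
    c = s[0];

    if (c < 0x80) {
        return 1;                           /* plain ASCII */
    } else if (c >= 0xC2 && c <= 0xDF) {    /* 0xC0/0xC1 can only start overlong forms */
        len = 2;
        cp = c & 0x1F;
    } else if (c >= 0xE0 && c <= 0xEF) {
        len = 3;
        cp = c & 0x0F;
    } else if (c >= 0xF0 && c <= 0xF4) {    /* 0xF5..0xFF would exceed U+10FFFF */
        len = 4;
        cp = c & 0x07;
    } else {
        return 0;                           /* stray continuation or invalid lead byte */
    }

    if (n < len)
        return 0;                           /* sequence is cut short */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                       /* not a continuation byte (10xxxxxx) */
        cp = (cp << 6) | (unsigned long)(s[i] & 0x3F);
    }

    if (len == 3 && cp < 0x800)
        return 0;                           /* overlong 3-byte form */
    if (len == 4 && cp < 0x10000)
        return 0;                           /* overlong 4-byte form */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                           /* UTF-16 surrogate code point */
    if (cp > 0x10FFFF)
        return 0;                           /* beyond the Unicode range */

    return len;
}
```

With this sketch, the lone byte 0x80 from the first example is rejected (the function returns 0), while E2 82 80 from the second example is accepted as a 3-byte sequence.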

Please let me know if you have any questions or comments.

~Matt

- Kiyoshi Matsui - 2008-02-25
  
  Thank you for the debugging and patching on mcpp!
  
  I can't remember why I excluded 0x80-0x9f from the second byte of
  3-byte sequences. Anyway, it is broken. Also, it does not check for
  illegal sequences beyond range-checking the 2 or 3 bytes.
  
  I have read some documents on UTF-8, and have checked the patch. Your
  patch is perfect!
  
  May I take the patch into mcpp source?
  
  Also, I'm going to make all the mb_read_*() routines check all 8-bit
  characters, as the new mb_read_utf8() does.
  
  - Matt Wozniski - 2008-02-25
    
    > Thank you for the debugging and patching on mcpp!
    No problem. :)
    
    > I have read some documents on UTF-8, and have checked the patch.
    > Your patch is perfect!
    Glad to hear it. I tested it with some fairly complicated input.
    
    > May I take the patch into mcpp source?
    Sure, that's why I gave it in patch form. :)
    
    > Also, I'm going to make all the mb_read_*() routines check all
    > 8-bit characters, as the new mb_read_utf8() does.
    
    Extra checking is always nice, particularly since poor UTF-8 validation has been at the center of some recent security exploits. (Microsoft IIS 4 and 5 had a bug in handling overlong UTF-8 sequences that allowed access to paths outside of the virtual root; see http://seclists.org/bugtraq/2000/Oct/0264.html)
    
    Good luck with mcpp! :)
    