
#34 pcre_exec doesn't return PCRE_ERROR_BADUTF8 for invalid UTF8

open
nobody
None
5
2007-04-05
2007-04-05
No

I found some invalid UTF-8 byte sequences that do not trigger a PCRE_ERROR_BADUTF8 error when they appear in the subject string:

\xF4\x92\x94\x95
\xED\xB2\x94

Tested with versions 6.6 and 7.0.

See Table 3-6 of http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf for the well-formed UTF-8 byte sequences.

Because of this bug in libpcre and other implementations I tried, I went and implemented my own is_utf8() function (plus unit test). I'd be happy to send you the function or the unit test if you want it.

Discussion

  • Nobody/Anonymous

    Logged In: NO

    I don't know who set up this buglist for PCRE, but it was not the official maintainer.

    This bug is invalid. I'm afraid you have confused "invalid UTF-8" with "unassigned Unicode point". PCRE checks for the former, not for the latter. The strings you list above are valid UTF-8 sequences for unassigned code points. Incidentally, PCRE 7.0 uses Unicode 5.

     
  • Nobody/Anonymous

    Logged In: NO

    Oh, forgot to add that the above comment came from Philip Hazel.

     
  • Vincent de Phily

    Logged In: YES
    user_id=1144855
    Originator: YES

    Please check again, I don't think I got this wrong. I am not talking about unassigned code points.

    I would only be wrong if Unicode 5 changed the definition of a "well-formed UTF-8 byte sequence", but as far as I know new Unicode versions only assign new code points or refine character properties.
    I couldn't find appropriate version 5 documentation, so for the rest of this comment I'll assume that the v4 PDF I linked is still valid for v5 (if you can point me to complete version 5 documentation that proves me wrong, go ahead).

    According to the table mentioned above:
    * The byte sequence \xED\xB2\x94 is invalid because, after a first byte of \xED, the second byte must be between \x80 and \x9F, and \xB2 is not.
    * The byte sequence \xF4\x92\x94\x95 is invalid because, after a first byte of \xF4, the second byte must be between \x80 and \x8F, and \x92 is not.

    Looks like you're just checking whether the second byte is between \x80 and \xBF, but sometimes the allowed range is more restricted!

    BTW, I came upon this bug because postgres rejects this data, while you accept it. The postgres UTF8 functions are too complicated for my needs, though.
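
The second-byte restrictions described above can be sketched in C. This is only a minimal illustration of the Table 3-6 rules, not the is_utf8() function mentioned in the report and not PCRE's own check; the name `is_wellformed_utf8` is hypothetical. For context, \xED\xB2\x94 would decode to U+DC94 (a surrogate) and \xF4\x92\x94\x95 to U+112515 (above U+10FFFF), which is why the table excludes them.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) is a well-formed UTF-8
 * byte sequence according to Table 3-6 of Unicode 4.0, and 0 otherwise. */
int is_wellformed_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = buf[i];
        size_t trail;                        /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xBF;  /* allowed range for 2nd byte */
        size_t k;

        if (b <= 0x7F) { i++; continue; }    /* ASCII */
        else if (b >= 0xC2 && b <= 0xDF) { trail = 1; }
        else if (b == 0xE0) { trail = 2; lo = 0xA0; }  /* no overlong forms */
        else if (b >= 0xE1 && b <= 0xEC) { trail = 2; }
        else if (b == 0xED) { trail = 2; hi = 0x9F; }  /* no surrogates */
        else if (b >= 0xEE && b <= 0xEF) { trail = 2; }
        else if (b == 0xF0) { trail = 3; lo = 0x90; }  /* no overlong forms */
        else if (b >= 0xF1 && b <= 0xF3) { trail = 3; }
        else if (b == 0xF4) { trail = 3; hi = 0x8F; }  /* max is U+10FFFF */
        else { return 0; }  /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */

        if (len - i <= trail) return 0;      /* sequence is truncated */

        /* The second byte has the lead-dependent range; the rest use 80..BF. */
        if (buf[i + 1] < lo || buf[i + 1] > hi) return 0;
        for (k = 2; k <= trail; k++) {
            if (buf[i + k] < 0x80 || buf[i + k] > 0xBF) return 0;
        }
        i += trail + 1;
    }
    return 1;
}
```

With this sketch, both sequences from the report are rejected, while the boundary cases \xED\x9F\xBF (U+D7FF) and \xF4\x8F\xBF\xBF (U+10FFFF) are accepted.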

     
  • Magnus Holmgren

    Magnus Holmgren - 2007-08-07

    Logged In: YES
    user_id=669310
    Originator: NO

    Further follow-ups to this bug should be, and have been, directed to the "official" PCRE bugtracker at http://bugs.exim.org/530

     
