pcre_exec doesn't return PCRE_ERROR_BADUTF8 for invalid UTF8

PERL 5 regular expression pattern matching

Brought to you by: mish_the_fish

#34 pcre_exec doesn't return PCRE_ERROR_BADUTF8 for invalid UTF8

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2007-04-05

Created: 2007-04-05

Creator: Vincent de Phily

Private: No

I found some invalid UTF8 byte sequences that do not trigger a PCRE_ERROR_BADUTF8 error when they appear in the subject string:

\xF4\x92\x94\x95
\xED\xB2\x94

Tested with versions 6.6 and 7.0.

See table 3-6 of http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf for valid UTF8 chars.

Because of this bug in libpcre and other implementations I tried, I went and implemented my own is_utf8() function (plus unit test). I'd be happy to send you the function or the unit test if you want it.

Discussion

Nobody/Anonymous - 2007-04-17

Logged In: NO

I don't know who set up this buglist for PCRE, but it was not the official maintainer.

This bug is invalid. I'm afraid you have confused "invalid UTF-8" with "unassigned Unicode point". PCRE checks for the former, not for the latter. The strings you list above are valid UTF-8 sequences for unassigned code points. Incidentally, PCRE 7.0 uses Unicode 5.

Nobody/Anonymous - 2007-04-17

Logged In: NO

Oh, forgot to add that the above comment came from Philip Hazel.

Vincent de Phily - 2007-04-19

Logged In: YES
user_id=1144855
Originator: YES

Please check again, I don't think I got this wrong. I am not talking about unassigned code points.

That is, unless Unicode 5 changed the definition of a "well-formed UTF-8 byte sequence", but as far as I know new Unicode versions only assign new code points or refine character properties.
I couldn't find appropriate version 5 documentation, but I'll assume in the rest of my comment that the v4 PDF I gave is still valid for v5 (if you can point me to complete version 5 documentation that proves me wrong, go ahead).

According to the table mentioned above:
* The byte sequence \xED\xB2\x94 is invalid because the second byte is not between \x80 and \x9F.
* The byte sequence \xF4\x92\x94\x95 is invalid because the second byte is not between \x80 and \x8F.

Looks like you're just checking whether the second byte is between \x80 and \xBF, but sometimes it is more restricted!
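
To make the point concrete, here is a minimal C sketch of a check that follows the table row by row. It is an illustration only: the name is_well_formed_utf8 is hypothetical, and this is neither the is_utf8() function mentioned in the original report nor the code PCRE itself uses. The only ranges assumed are those in table 3-6 of the Unicode 4.0 chapter linked above.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): returns 1 if buf[0..len) is a
 * well-formed UTF-8 byte sequence per table 3-6 of Unicode 4.0, else 0. */
int is_well_formed_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char lead = buf[i];
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range for the 2nd byte */
        size_t trail;                       /* number of bytes after the lead */
        size_t k;

        if (lead <= 0x7F) {                 /* single-byte (ASCII) */
            i++;
            continue;
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            trail = 1;
        } else if (lead == 0xE0) {          /* excludes overlong 3-byte forms */
            trail = 2; lo = 0xA0;
        } else if (lead == 0xED) {          /* excludes surrogates D800..DFFF */
            trail = 2; hi = 0x9F;
        } else if (lead >= 0xE1 && lead <= 0xEF) {
            trail = 2;
        } else if (lead == 0xF0) {          /* excludes overlong 4-byte forms */
            trail = 3; lo = 0x90;
        } else if (lead == 0xF4) {          /* excludes values above 10FFFF */
            trail = 3; hi = 0x8F;
        } else if (lead >= 0xF1 && lead <= 0xF3) {
            trail = 3;
        } else {
            return 0;                       /* 80..C1 and F5..FF never lead */
        }

        if (len - i <= trail)
            return 0;                       /* sequence is truncated */
        if (buf[i + 1] < lo || buf[i + 1] > hi)
            return 0;                       /* restricted second byte */
        for (k = 2; k <= trail; k++) {
            if (buf[i + k] < 0x80 || buf[i + k] > 0xBF)
                return 0;                   /* ordinary continuation byte */
        }
        i += trail + 1;
    }
    return 1;
}
```

With this sketch, both sequences from the report are rejected: \xED\xB2\x94 would decode to 0xDC94, which is in the surrogate range, and \xF4\x92\x94\x95 would decode to 0x112515, which is above 0x10FFFF.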

BTW, I came upon this bug because postgres rejects this data, while you accept it. The postgres UTF8 functions are too complicated for my needs, though.

Magnus Holmgren - 2007-08-07

Logged In: YES
user_id=669310
Originator: NO

Further follow-ups to this bug should be, and have been, directed to the "official" PCRE bugtracker at http://bugs.exim.org/530
