#46 Detect invalid UTF-8 sequences

closed-accepted
None
5
2002-05-30
2002-05-29
No

This patch is based on the patch attached to bug #477667.

I have updated the patch to Unicode 3.2 and made
some modifications that should (hopefully) make
the code faster.

This is the table it is now based on:

Table 3.1B. Legal UTF-8 Byte Sequences in Unicode 3.2

Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF    80..BF
U+0800..U+0FFF       E0        A0..BF    80..BF
U+1000..U+CFFF       E1..EC    80..BF    80..BF
U+D000..U+D7FF       ED        80..9F    80..BF
U+D800..U+DFFF       ill-formed
U+E000..U+FFFF       EE..EF    80..BF    80..BF
U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF

Optimization 1)

Analyzing the code in xmltok.c, I found that the
functions utf8_isInvalid2,3,4 are called only when
the first byte of the UTF-8 sequence maps to
BT_LEAD2,3,4 respectively in the table in utf8tab.h
(I looked at the pre-processed output for that).

This means for the first byte p[0]:
BT_LEAD2 <==> p[0] >= 0xC0 and p[0] <= 0xDF,
  therefore we don't have to check for p[0] > 0xDF
BT_LEAD3 <==> p[0] >= 0xE0 and p[0] <= 0xEF,
  therefore we don't have to check for p[0] < 0xE0
  or p[0] > 0xEF
BT_LEAD4 <==> p[0] >= 0xF0 and p[0] <= 0xF4,
  therefore we don't have to check for p[0] < 0xF0
  or p[0] > 0xF4
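
To make the ranges above concrete, here is a minimal sketch of a lead-byte classifier. It is not the table in utf8tab.h; the function name sketch_lead_length is hypothetical and only mirrors the three ranges listed above.

```c
/* Hypothetical helper, for illustration only. The real classification
   is done by the byte-type table in utf8tab.h. Returns the sequence
   length implied by the lead byte, or 0 if the byte does not fall in
   one of the three lead-byte ranges listed above. */
static int sketch_lead_length(unsigned char b)
{
    if (b >= 0xC0 && b <= 0xDF) return 2;   /* BT_LEAD2 range */
    if (b >= 0xE0 && b <= 0xEF) return 3;   /* BT_LEAD3 range */
    if (b >= 0xF0 && b <= 0xF4) return 4;   /* BT_LEAD4 range */
    return 0;
}
```

Note that 0xC0 and 0xC1 still fall in the BT_LEAD2 range, which is why the 2-byte check has to keep the test p[0] < 0xC2.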

So our checks for an invalid UTF-8 sequence are:

BT_LEAD2:
  p[0] < 0xC2 || p[1] < 0x80 || p[1] > 0xBF

BT_LEAD3:
  p[2] < 0x80 || p[2] > 0xBF ||
    if p[0] == 0xE0 then  p[1] < 0xA0 || p[1] > 0xBF
    if p[0] == 0xED then  p[1] < 0x80 || p[1] > 0x9F
    otherwise             p[1] < 0x80 || p[1] > 0xBF

BT_LEAD4:
  p[3] < 0x80 || p[3] > 0xBF ||
  p[2] < 0x80 || p[2] > 0xBF ||
    if p[0] == 0xF0 then  p[1] < 0x90 || p[1] > 0xBF
    if p[0] == 0xF4 then  p[1] < 0x80 || p[1] > 0x8F
    otherwise             p[1] < 0x80 || p[1] > 0xBF
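
The three checks above can be sketched as plain C functions. This is an illustration, not the code in the attached xmltok.c: the names not_trail and sketch_isInvalid2/3/4 are hypothetical, and each function assumes the caller has already classified p[0] as BT_LEAD2, BT_LEAD3 or BT_LEAD4 and that enough bytes are available.

```c
/* Hypothetical sketch, not the attached xmltok.c. p[0] is assumed to
   be pre-classified as BT_LEAD2/3/4, so its range is not re-checked.
   Every function returns nonzero if the sequence is invalid. */

/* Nonzero if b is not a trailing byte in 80..BF. */
static int not_trail(unsigned char b)
{
    return b < 0x80 || b > 0xBF;
}

static int sketch_isInvalid2(const unsigned char *p)
{
    /* C0 and C1 could only encode U+0000..U+007F (overlong forms). */
    return p[0] < 0xC2 || not_trail(p[1]);
}

static int sketch_isInvalid3(const unsigned char *p)
{
    if (not_trail(p[2]))
        return 1;
    if (p[0] == 0xE0)               /* below A0: overlong form */
        return p[1] < 0xA0 || p[1] > 0xBF;
    if (p[0] == 0xED)               /* above 9F: U+D800..U+DFFF */
        return p[1] < 0x80 || p[1] > 0x9F;
    return not_trail(p[1]);
}

static int sketch_isInvalid4(const unsigned char *p)
{
    if (not_trail(p[3]) || not_trail(p[2]))
        return 1;
    if (p[0] == 0xF0)               /* below 90: overlong form */
        return p[1] < 0x90 || p[1] > 0xBF;
    if (p[0] == 0xF4)               /* above 8F: beyond U+10FFFF */
        return p[1] < 0x80 || p[1] > 0x8F;
    return not_trail(p[1]);
}
```

For example, the bytes ED A0 80 (which would encode U+D800) are rejected by sketch_isInvalid3, while ED 9F BF (U+D7FF) is accepted.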

Optimization 2)

Use conditional expressions, i.e. ( ? : )
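
As a rough sketch of what this could look like, the 3-byte and 4-byte checks can each be written as a single expression that selects the allowed range for p[1] with ( ? : ). The macro names SKETCH_INVALID_LEAD3 and SKETCH_INVALID_LEAD4 are hypothetical; see the attached xmltok.c for the actual implementation.

```c
/* Hypothetical macros, for illustration only. p must point to
   unsigned char so that the comparisons see values 0..255.
   Each macro evaluates to nonzero if the sequence is invalid. */
#define SKETCH_INVALID_LEAD3(p)                                  \
    ((p)[2] < 0x80 || (p)[2] > 0xBF ||                           \
     ((p)[0] == 0xE0 ? ((p)[1] < 0xA0 || (p)[1] > 0xBF)          \
    : (p)[0] == 0xED ? ((p)[1] < 0x80 || (p)[1] > 0x9F)          \
    :                  ((p)[1] < 0x80 || (p)[1] > 0xBF)))

#define SKETCH_INVALID_LEAD4(p)                                  \
    ((p)[3] < 0x80 || (p)[3] > 0xBF ||                           \
     (p)[2] < 0x80 || (p)[2] > 0xBF ||                           \
     ((p)[0] == 0xF0 ? ((p)[1] < 0x90 || (p)[1] > 0xBF)          \
    : (p)[0] == 0xF4 ? ((p)[1] < 0x80 || (p)[1] > 0x8F)          \
    :                  ((p)[1] < 0x80 || (p)[1] > 0xBF)))
```

Written this way, the whole check is one expression, so it can be used directly inside a macro or a return statement.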

Optimization 3)

In theory, it should be more efficient to write

(A & 0x80) == 0 instead of A < 0x80
and
(A & 0xC0) == 0xC0 instead of A > 0xBF
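
The two rewrites are only equivalent when A is an unsigned 8-bit value (0..255). A small exhaustive check over all byte values, sketched here with a hypothetical helper named mask_forms_agree, confirms this. Whether the mask form is actually faster depends on the compiler and target.

```c
/* Hypothetical helper, for illustration only. Returns 1 if, for every
   byte value 0..255, each mask test gives the same answer as the
   comparison it replaces; returns 0 on the first disagreement. */
static int mask_forms_agree(void)
{
    unsigned int a;
    for (a = 0; a <= 0xFF; a++) {
        unsigned char b = (unsigned char)a;
        /* (A & 0x80) == 0  versus  A < 0x80 */
        if (((b & 0x80) == 0) != (b < 0x80))
            return 0;
        /* (A & 0xC0) == 0xC0  versus  A > 0xBF */
        if (((b & 0xC0) == 0xC0) != (b > 0xBF))
            return 0;
    }
    return 1;
}
```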

Check the attached file xmltok.c for the actual
implementation. The patch is based on
revision 1.15 of xmltok.c.

Karl

Discussion

  • Karl Waclawek

    Karl Waclawek - 2002-05-29

    Patched xmltok.c

     
  • Karl Waclawek

    Karl Waclawek - 2002-05-29

    Diff against rev. 1.15 of xmltok.c

     
  • Fred L. Drake, Jr.

    Logged In: YES
    user_id=3066

    Works for me and causes the checked-in tests to pass again.
    Check it in & close the bug & patch reports!

     
  • Fred L. Drake, Jr.

    • assigned_to: nobody --> kwaclaw
    • status: open --> open-accepted
     
  • Karl Waclawek

    Karl Waclawek - 2002-05-30

    Logged In: YES
    user_id=290026

    Patch checked in - rev. 1.16 of xmltok.c

    Karl

     
  • Karl Waclawek

    Karl Waclawek - 2002-05-30
    • status: open-accepted --> closed-accepted
     
