#196 error decoding UTF-8 triplet

Test Required
closed-fixed
None
6
2002-08-27
2002-08-26
Anonymous
No

On Windows, when reading the UTF-8 sequence "EF BA BF",
utf8_isInvalid3 returns TRUE when it should return FALSE.
This UTF-8 sequence decodes to "FEBF" in UCS-2 (Unicode),
but because utf8_isInvalid3 returns TRUE, an error results
and the character isn't decoded properly.
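
The byte arithmetic behind that claim can be checked with a minimal sketch. The helper name decode_utf8_3 is hypothetical and is not part of expat; it assumes the input is already a well-formed three-byte sequence and does no validation.

```c
/* Hypothetical helper, not part of expat: decode one three-byte UTF-8
   sequence (1110xxxx 10xxxxxx 10xxxxxx) into a 16-bit code point.
   Assumes the lead and trail bytes are well formed; nothing is validated. */
static unsigned decode_utf8_3(unsigned char b0, unsigned char b1,
                              unsigned char b2)
{
    return ((unsigned)(b0 & 0x0F) << 12)   /* low 4 bits of the lead byte */
         | ((unsigned)(b1 & 0x3F) << 6)    /* 6 payload bits of byte 2    */
         |  (unsigned)(b2 & 0x3F);         /* 6 payload bits of byte 3    */
}
```

Calling decode_utf8_3(0xEF, 0xBA, 0xBF) yields 0xFEBF, which matches the value given in the report.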

This is using expat 1.95.4.

Attached is a simple XML file which illustrates the
problem.

Discussion

  • An XML file which includes this character. When it is parsed with expat 1.95.4, the parse fails.

  • Karl Waclawek
    2002-08-27

    Logged In: YES
    user_id=290026

    Yes, this is a bug.
    utf8_isInvalid3 tries to detect the invalid XML sequences
    (*not* invalid unicode) EF BF BE and EF BF BF, but
    only checks the first and third byte, not the second one.

    Fix already checked into CVS (xmltok.c 1.23).
    Please check out and test.
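
The difference described above can be sketched as two small predicates. This is an illustration only, not the expat source: the function names are hypothetical, and the real check in xmltok.c is a macro whose exact form is not shown in this thread.

```c
/* Illustration only; these hypothetical functions are NOT the expat code.
   Both try to flag the sequences EF BF BE (U+FFFE) and EF BF BF (U+FFFF),
   which are not allowed as XML characters. */

/* Buggy variant: inspects only the first and third byte, so any
   EF xx BE or EF xx BF is flagged, including the valid EF BA BF. */
static int is_xml_nonchar3_buggy(const unsigned char *p)
{
    return p[0] == 0xEF && (p[2] == 0xBE || p[2] == 0xBF);
}

/* Corrected variant: also requires the second byte to be BF. */
static int is_xml_nonchar3_fixed(const unsigned char *p)
{
    return p[0] == 0xEF && p[1] == 0xBF && (p[2] == 0xBE || p[2] == 0xBF);
}
```

With the bytes EF BA BF from the report, the first predicate returns 1 and the second returns 0; both still return 1 for EF BF BE and EF BF BF.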

     
  • Karl Waclawek
    2002-08-27

    • status: open --> open-fixed
     
    • labels: 436817 -->
    • milestone: --> Test Required
    • priority: 5 --> 6
     
  • Logged In: NO

    Thanks Karl. I applied your fix to the UTF8_INVALID3 define
    in the expat 1.95.4 tarball (xmltok.c) and this worked fine.
    However, when I tried using what was in CVS, everything blew
    up on me.

    I can pass you some further test cases and possibly some
    patches, if you like.

     
  • Karl Waclawek
    2002-08-27

    Logged In: YES
    user_id=290026

    Yes, please give us test cases that blow everything up.
    We can only learn from mistakes ...

     
  • Logged In: YES
    user_id=3066

    I've got a test case ready to check in for the specific
    reported character that caused problems in the original
    report, but will hold off checking it in until we have the
    additional failure information, so I can generalize the test.

     
  • Logged In: NO

    Sorry, I should be more specific: the problems I am having
    with the source from CVS do not relate specifically to UTF-8
    encodings, but lots of different problems like segmentation
    violations, etcetera. These problems did not occur with
    1.95.4, except the UTF-8 problem. I will isolate the specifics
    for you over the next couple of days.

    I haven't been using any new functionality. The code I have is
    using the 1.95.2 API.

     
  • Karl Waclawek
    2002-08-27

    Logged In: YES
    user_id=290026

    Maybe you checked out from CVS in between changes
    we made. Could you please re-try with the newest?

     
  • Logged In: YES
    user_id=3066

    Ok, I've committed the regression test for this as
    tests/runtests.c revision 1.34. The other bugs should be
    filed in new reports.

     
    • status: open-fixed --> closed-fixed