Menu

#539 Expat 2.2.0 doesn't support the XML tags larger than 15 characters in Unicode

Feature Request
closed-fixed
2.2.0 (1)
5
2017-09-03
2016-07-05
Andy Wang
No

Hi Sebastian,

There might be an issue in the latest expat release 2.2.0 that it doesn't support parsing the XML tags whose length is larger than 15 characters for Unicode and 31 characters for none Unicode.

I was trying to use the expat and our C++ expat wrapper to parse the XML piece like "<advanced-file-transfer> ... " on Windows platform with Unicode, only the name 'advanced-file-t' is recognized. After debugging into the source code of expat and looking into the changes of the submission "be4b1c06daba1849b8ff5e00bae5caf47f6c39fd Merge branch 'cve-2016-0718-fix-2-2-1'", I found it might be caused by the changes in utf8_toUtf16() in xmltok.c and doConent() in xmlparse.c. </advanced-file-transfer>

Could you please take a look?

Thanks
Andy

Related

Bugs: #539

Discussion

1 2 > >> (Page 1 of 2)
  • Sebastian Pipping

    Hello Andy,

    I cannot confirm the issue so far. xmlwf seems to handle a tag with 40 characters length fine:

    # cat input.xml
    <a123456789b123456789c123456789d123456789 />
    
    # mkdir -p output && xmlwf -d output/ input.xml && cat output/input.xml 
    <a123456789b123456789c123456789d123456789></a123456789b123456789c123456789d123456789>
    

    Can you share an XML file that triggers the bug with xmlwf?

    Best, Sebastian

     
  • Andy Wang

    Andy Wang - 2016-07-06

    Hi Sebastian,

    Thanks for your quick response.

    And I am sorry it is my mistake that I didn't check the implementation of utf8_toUtf8(). The expat works well with the utf8 encoding. It seems the issue I found only exists on Windows platform with libexpatw.dll which supports Unicode.

    Here is the XML snippet in our unit test case:

    <advanced-file-transfer xmlns='http://jabber.com/schemas/advanced-file-transfer'>
    <size-bytes>7292928</size-bytes>
    <url>xxx</url>
    <screen-capture/>
    </advanced-file-transfer>
    

    Our test case failed to parse the tag 'advanced-file-transfer', it only returns the tag 'advanced-file-t' on Windows platform with Unicode encoding.

    And when I debugging into the utf8_toUtf16() in xmltok.c, the switch block always went into the below 'default' block and finally the function utf_toUtf16() returns XML_CONVERT_COMPLETED even the tag has not been converted fully.

    default:
      to++ = from++;
      break;
    

    Thanks
    Andy

     

    Last edit: Andy Wang 2016-07-06
    • Karl Waclawek

      Karl Waclawek - 2016-07-06

      xmlwf always uses the UTF-8 version of libexpat, even on Windows.
      So this issue can't be checked using xmlwf.

      It might actually be present on Linux as well, if one uses the UTF-16
      version of the library.

      On Tue, Jul 5, 2016 at 11:13 PM, Andy Wang applewwhua@users.sf.net wrote:

      Hi Sebastian,

      Thanks for your quick response.

      And I am sorry it is my mistake that I didn't check the implementation of
      utf8_toUtf8(). The expat works well with the utf8 encoding. It seems the
      issue I found only exists on Windows platform with libexpatw.dll which
      supports Unicode.

      Here is the XML snippet in our unit test case:

      <advanced-file-transfer xmlns="http://jabber.com/schemas/advanced-file-transfer &lt;a href=" http:="" jabber.com="" schemas="" advanced-file-transfer"="">http://jabber.com/schemas/advanced-file-transfer">
      <size-bytes>7292928</size-bytes> <url>xxx</url> <screen-capture>
      </screen-capture></advanced-file-transfer>

      Our test case failed to parse the tag 'advanced-file-transfer', it only
      returns the tag 'advanced-file-t' on Windows platform with Unicode
      encoding.

      And when I debugging into the utf8_toUtf16() in xmltok.c, the switch block
      always went into the below 'default' block and finally the function
      utf_toUtf16() returns XML_CONVERT_COMPLETED even the tag has not been
      converted fully.

      default: to++ = from++; break;

      Thanks
      Andy


      Status: open
      Group: Feature Request
      Labels: 2.2.0
      Created: Tue Jul 05, 2016 12:24 AM UTC by Andy Wang
      Last Updated: Tue Jul 05, 2016 08:25 AM UTC
      Owner: Sebastian Pipping

      Hi Sebastian,

      There might be an issue in the latest expat release 2.2.0 that it doesn't
      support parsing the XML tags whose length is larger than 15 characters for
      Unicode and 31 characters for none Unicode.

      I was trying to use the expat and our C++ expat wrapper to parse the XML
      piece like "<advanced-file-transfer> ... " on
      Windows platform with Unicode, only the name 'advanced-file-t' is
      recognized. After debugging into the source code of expat and looking into
      the changes of the submission "be4b1c06daba1849b8ff5e00bae5caf47f6c39fd
      Merge branch 'cve-2016-0718-fix-2-2-1'", I found it might be caused by the
      changes in utf8_toUtf16() in xmltok.c and doConent() in xmlparse.c.</advanced-file-transfer>

      Could you please take a look?

      Thanks
      Andy


      Sent from sourceforge.net because you indicated interest in
      https://sourceforge.net/p/expat/bugs/539/

      To unsubscribe from further messages, please visit
      https://sourceforge.net/auth/subscriptions/

       

      Related

      Bugs: #539

  • Sebastian Pipping

    Hello Andy,

    could you attach it as a file (and re-verify that that very file still hits the issue) so we are sure talking about the very same bytes to parse?

    Do you have xmlwf built, too? Does xmlwf fail on that file for you?

    XML_CONVERT_COMPLETED means the encoding conversion was able to convert all input bytes. If your file is encoded well, that's to be expected. (We can still have a bug on another layer, though.)

    The commit you mentioned earlier is a merge commit. In case you use git-bisect once more, I would be curious about the non-merge commit you end up with.

    I don't have any Windows in sight so I will have to tweak my build settings on Linux to match yours on Windows somehow: should be doable. Are you using any of the bundled build systems or do you pass defines yourself if any? If so: which ones?

    Best, Sebastian

    PS: If you feel like calling you can whois pipping.org and call me on that number, ending on 17. I'm on UTC+2 over here. Just if you want.

     
  • Andy Wang

    Andy Wang - 2016-07-07

    Hi Sebastian,

    I have tried to build the xmlwf with Unicode supporting on Windows, but failed. So I don't know how to prove my findings. I attached my XML document as the attached test.xml with the file encoding utf8.

    So can I list my debug steps when I met the issue when using the libexpatw.dll to parse the XML document, so that we can verify if this issue exists or not with code review:

    1. Store the XML document "<advanced-file-transfer xmlns="http://jabber.com/schemas/advanced-file-transfer"><size-bytes>7292928</size-bytes><url>xxx</url><screen-capture></screen-capture></advanced-file-transfer>" to a variable with std::string type.

    2. Pass the std::string variable to our APIs which will call the APIs of expat to parse.

    3. The execution of the code went into the API doContnet() in xmlparse.c:

    2430     case XML_TOK_START_TAG_NO_ATTS:
     2431       /* fall through */
     2432     case XML_TOK_START_TAG_WITH_ATTS:
     2433       {   
     2434         TAG *tag;
     2435         enum XML_Error result;
     2436         XML_Char *toPtr;
     2437         if (freeTagList) {
     2438           tag = freeTagList;
     2439           freeTagList = freeTagList->parent;
     2440         } 
     2441         else {
     2442           tag = (TAG *)MALLOC(sizeof(TAG));
     2443           if (!tag)
     2444             return XML_ERROR_NO_MEMORY;
     2445           **tag->buf = (char *)MALLOC(INIT_TAG_BUF_SIZE);**
     2446           if (!tag->buf) {
     2447             FREE(tag);
     2448             return XML_ERROR_NO_MEMORY;
     2449           }
     2450           tag->bufEnd = tag->buf + INIT_TAG_BUF_SIZE;
     2451         }
     2452         tag->bindings = NULL;
     2453         tag->parent = tagStack;
     2454         tagStack = tag;
     2455         tag->name.localPart = NULL;
     2456         tag->name.prefix = NULL;
     2457         tag->rawName = s + enc->minBytesPerChar;
     2458         tag->rawNameLength = XmlNameLength(enc, tag->rawName);
     2459         ++tagLevel;
     2460         {
     2461           const char *rawNameEnd = tag->rawName + tag->rawNameLength;
     2462           const char *fromPtr = tag->rawName;
     2463           toPtr = (XML_Char *)tag->buf;
     2464           for (;;) { 
     2465             int bufSize;
     2466             int convLen;
     2467             const enum XML_Convert_Result convert_res = **XmlConvert**(enc,
     2468                        &fromPtr, rawNameEnd,
     2469                        (ICHAR **)&toPtr, (ICHAR *)tag->bufEnd - 1);
     2470             convLen = (int)(toPtr - (XML_Char *)tag->buf);
    ** 2471             if ((convert_res == XML_CONVERT_COMPLETED) || (convert_res == XML_CONVERT_INPUT_INCOMPLETE)) {
     2472               tag->name.strLen = convLen;
     2473               break;
     2474             } **
     2475             bufSize = (int)(tag->bufEnd - tag->buf) << 1;
     2476             { 
     2477               char *temp = (char *)REALLOC(tag->buf, bufSize);
     2478               if (temp == NULL)
     2479                 return XML_ERROR_NO_MEMORY;
     2480               tag->buf = temp;
     2481               tag->bufEnd = temp + bufSize;
     2482               toPtr = (XML_Char *)temp + convLen;
     2483             } 
     2484           }   
     2485         }   
     2486         tag->name.str = (XML_Char *)tag->buf;
     2487         *toPtr = XML_T('\0');
     2488         result = storeAtts(parser, enc, s, &(tag->name), &(tag->bindings));
     2489         if (result)
     2490           return result;
     2491         if (startElementHandler)
     2492           startElementHandler(handlerArg, tag->name.str,
     2493                               (const XML_Char **)atts);
     2494         else if (defaultHandler)
     2495           reportDefault(parser, enc, s, next);
     2496         poolClear(&tempPool);
     2497         break;
     2498       }
    

    At the line of 2467, the call of XmlConvert will actually call the utf8_toUtf16().

    1. In the utf8_toUtf16(), the switch block always goes into the 'default' block which will incrase both the to and from pointers. And looking back to the doContent(), the variables passed to utf8_toUtf16() are the following:

      1) fromP is the start of the tag 'advanced-file-transfer'
      2) fromLim is the end of the tag 'advanced-file-transfer', it is the address of the space character.
      3) toP is the the start of the tag->buf which points to a preallocated 32 bytes memory.
      4) toLim is the last unsigned short address of the preallocated 32 bytes memory.

    Since the switch block always goes into the 'default' block, there will be only the 15 characters (advanced-file-t) converted and stored in the tag->buf, and the utf_toUtf16() returns the value in res which is XML_CONVERT_COMPLETED by default. This will trigger the execution of line 2471 to 2474 and break the parsing of the whole XML tag.

    392 static enum XML_Convert_Result PTRCALL
    393 utf8_toUtf16(const ENCODING *enc,
    394              const char **fromP, const char *fromLim,
    395              unsigned short **toP, const unsigned short *toLim)
    396 {
    397   enum XML_Convert_Result res = **XML_CONVERT_COMPLETED**;
    398   unsigned short *to = *toP;
    399   const char *from = *fromP;
    400   while (from < fromLim && to < toLim) {
    401     switch (((struct normal_encoding *)enc)->type[(unsigned char)*from]) {
    402     case BT_LEAD2:
    403       if (fromLim - from < 2) {
    404         res = XML_CONVERT_INPUT_INCOMPLETE;
    405         break;
    406       }
    407       *to++ = (unsigned short)(((from[0] & 0x1f) << 6) | (from[1] & 0x3f));
    408       from += 2;
    409       break;
    410     case BT_LEAD3:
    411       if (fromLim - from < 3) {
    412         res = XML_CONVERT_INPUT_INCOMPLETE;
    413         break;
    414       }
    415       *to++ = (unsigned short)(((from[0] & 0xf) << 12)
    416                                | ((from[1] & 0x3f) << 6) | (from[2] & 0x3f));
    417       from += 3;
    418       break;
    419     case BT_LEAD4:
    420       {
    421         unsigned long n;
    422         if (toLim - to < 2) {
    423           res = XML_CONVERT_OUTPUT_EXHAUSTED;
    424           goto after;
    425         }
    426         if (fromLim - from < 4) {
    427           res = XML_CONVERT_INPUT_INCOMPLETE;
    428           goto after;
    429         }
    430         n = ((from[0] & 0x7) << 18) | ((from[1] & 0x3f) << 12)
    431             | ((from[2] & 0x3f) << 6) | (from[3] & 0x3f);
    432         n -= 0x10000;
    433         to[0] = (unsigned short)((n >> 10) | 0xD800);
    434         to[1] = (unsigned short)((n & 0x3FF) | 0xDC00);
    435         to += 2;
    436         from += 4;
    437       }
    438       break;
    **439     default:
    440       *to++ = *from++;
    441       break;**
    442     }
    443   }
    444 after:
    445   *fromP = from;
    446   *toP = to;
    447   return res;
    448 }
    
     
    • Karl Waclawek

      Karl Waclawek - 2016-07-07

      With the attached test.xml file (encoded in UTF-8) I can duplicate the
      issue using libexpatw.dll.
      When I convert the file to UCS2-BE or LE encoding the issue disappears.

      So it seems the problems is parsing UTF-8 input and outputting as UTF-16
      (utf8_to_utf16 as observed).

      I would think the same issue shows in other platforms when using the wide
      character version of libexpat.

      On Thu, Jul 7, 2016 at 9:38 AM, Andy Wang applewwhua@users.sf.net wrote:

      Hi Sebastian,

      I have tried to build the xmlwf with Unicode supporting on Windows, but
      failed. So I don't know how to prove my findings. I attached my XML
      document as the attached test.xml with the file encoding utf8.

       

      Last edit: Karl Waclawek 2016-07-07
  • Andy Wang

    Andy Wang - 2016-07-07

    Attach the test.xml

     
  • Andy Wang

    Andy Wang - 2016-07-08

    Thanks Karl and Sebastian.

    Could you please evaluate and suggest a fix for it?

    And now can I make this issue as public so that we can let other users know this issue?

    And thanks again for all your helps.

    Regards
    Andy

     
  • Andy Wang

    Andy Wang - 2016-07-11

    Hi Karl,

    Any comments?

    Regards
    Andy

     
  • Sebastian Pipping

     
  • Sebastian Pipping

    I need a few more days to have time for a closer look.

    No objections from my side to removing the "private bit" from this ticket.

    Best, Sebastian

     
    • Andy Wang

      Andy Wang - 2016-07-11

      Thanks Sebastian.

       
  • Sebastian Pipping

    • private: Yes --> No
     
  • Sebastian Pipping

    Hello again,

    I followed an idea by Karl with some custom debugging code and found a fix (https://sourceforge.net/p/expat/code_git/ci/af507cef2c93cb8d40062a0abe43a4f4e9158fb2) to the bug you reported.

    To reproduce the bug off a Git clone for an unfixed version on a Linux shell using your test.xml linked above and the tagwalk.c debugging code (attached to this reply) one would run:

    ./configure CPPFLAGS="-DXML_UNICODE"
    make clean buildlib
    cc -Wall -Wextra -DXML_UNICODE -Ilib -L.libs -lexpat -o tagwalk{,.c}
    LD_LIBRARY_PATH=.libs ./tagwalk test.xml ; echo $?
    

    To see it with libexpat installed system-wide, it would be something like:

    cc -Wall -Wextra -DXML_UNICODE -DXML_UNICODE_WCHAR_T -lexpatw -o tagwalk{,.c}  # a)
    cc -Wall -Wextra -DXML_UNICODE -lexpatu -o tagwalk{,.c}                        # b)
    ./tagwalk test.xml ; echo $?
    

    Please note that libexpatw and libexpatu may be different things across distributions of Linux.

    Please let me know if the bug can be closed.
    Thanks and best

    Sebastian

     
  • Sebastian Pipping

    PS: It may not be done yet, I noticed with -DXML_UNICODE_WCHAR_T tag names come out as "advanced-file-ransfer" (with "t" missing) for me.

     
    • Sebastian Pipping

      If I am not mistaken Expat really needs -fshort-wchar for -DXML_UNICODE_WCHAR_T (and the case without it would then be unsupported. If so we could catch that at runtime some way). With -fshort-wchar, tag names seem to get through fine. Maybe Karl knows more about it.

      If that's a bug it affects Expat 2.1.1 already which the rest of this report does not. So it's a different thing.

      PS: Attached is tagwalk.c with support for -fshort-wchar (and a wide wchar libc).

       
      • Karl Waclawek

        Karl Waclawek - 2016-07-17

        Well, -fshort-wchar does not apply to Visual Studio compiling, as far as i
        remember.

        I will take a closer look soon.

        On Jul 17, 2016 16:28, "Sebastian Pipping" hartwork@users.sf.net wrote:

        If I am not mistaken Expat really needs -fshort-wchar for
        -DXML_UNICODE_WCHAR_T (and the case without it would then be unsupported.
        If so we could catch that at runtime some way). With -fshort-wchar, tag
        names seem to get through fine. Maybe Karl knows more about it.

        If that's a bug it affects Expat 2.1.1 already which the rest of this
        report does not. So it's a different thing.

        PS: Attached is tagwalk.c with support for -fshort-wchar (and a wide wchar
        libc).


        Status: open
        Group: Feature Request
        Labels: 2.2.0
        Created: Tue Jul 05, 2016 12:24 AM UTC by Andy Wang
        Last Updated: Sun Jul 17, 2016 07:35 PM UTC
        Owner: Sebastian Pipping

        Hi Sebastian,

        There might be an issue in the latest expat release 2.2.0 that it doesn't
        support parsing the XML tags whose length is larger than 15 characters for
        Unicode and 31 characters for none Unicode.

        I was trying to use the expat and our C++ expat wrapper to parse the XML
        piece like "<advanced-file-transfer> ... " on
        Windows platform with Unicode, only the name 'advanced-file-t' is
        recognized. After debugging into the source code of expat and looking into
        the changes of the submission "be4b1c06daba1849b8ff5e00bae5caf47f6c39fd
        Merge branch 'cve-2016-0718-fix-2-2-1'", I found it might be caused by the
        changes in utf8_toUtf16() in xmltok.c and doConent() in xmlparse.c.</advanced-file-transfer>

        Could you please take a look?

        Thanks
        Andy


        Sent from sourceforge.net because you indicated interest in
        https://sourceforge.net/p/expat/bugs/539/

        To unsubscribe from further messages, please visit
        https://sourceforge.net/auth/subscriptions/

         

        Related

        Bugs: #539

        • Sebastian Pipping

          Thank you!

          Maybe -fshort-wchar is enabled on Windows all the time. I think it's 32bit on Linux but 16bit by default on Windows. Maybe not the best source but: https://msdn.microsoft.com/library/windows/desktop/aa367308(v=vs.85).aspx

           
          • Karl Waclawek

            Karl Waclawek - 2016-07-18

            Yes, on Windows WCHAR_T is always 16 bits. That's why the MS compiler does
            not understand the -fshort-wchar flag. Visual Studio only
            needs XML_UNICODE_WCHAR_T defined without any additional flags.

            On Sun, Jul 17, 2016 at 6:19 PM, Sebastian Pipping hartwork@users.sf.net
            wrote:

            Thank you!

            Maybe -fshort-wchar is enabled on Windows all the time. I think it's 32bit
            on Linux but 16bit by default on Windows. Maybe not the best source but:
            https://msdn.microsoft.com/library/windows/desktop/aa367308(v=vs.85).aspx


            Status: open
            Group: Feature Request
            Labels: 2.2.0
            Created: Tue Jul 05, 2016 12:24 AM UTC by Andy Wang
            Last Updated: Sun Jul 17, 2016 07:35 PM UTC
            Owner: Sebastian Pipping

            Hi Sebastian,

            There might be an issue in the latest expat release 2.2.0 that it doesn't
            support parsing the XML tags whose length is larger than 15 characters for
            Unicode and 31 characters for none Unicode.

            I was trying to use the expat and our C++ expat wrapper to parse the XML
            piece like "<advanced-file-transfer> ... " on
            Windows platform with Unicode, only the name 'advanced-file-t' is
            recognized. After debugging into the source code of expat and looking into
            the changes of the submission "be4b1c06daba1849b8ff5e00bae5caf47f6c39fd
            Merge branch 'cve-2016-0718-fix-2-2-1'", I found it might be caused by the
            changes in utf8_toUtf16() in xmltok.c and doConent() in xmlparse.c.</advanced-file-transfer>

            Could you please take a look?

            Thanks
            Andy


            Sent from sourceforge.net because you indicated interest in
            https://sourceforge.net/p/expat/bugs/539/

            To unsubscribe from further messages, please visit
            https://sourceforge.net/auth/subscriptions/

             

            Related

            Bugs: #539

  • Sebastian Pipping

    Hello Andy,
    many stable Linux distributions have that bug unfixed and I would like to push the fix to them in a few days. Please report back if https://sourceforge.net/p/expat/code_git/ci/af507cef2c93cb8d40062a0abe43a4f4e9158fb2 fixed the issue for you.

    Many thanks and best

    Sebastian

     
    • Don Lewis

      Don Lewis - 2016-08-05

      I ran into this issue when attempting to update the version of expat that is bundled with Apache OpenOffice. It is compiled with -DXML_UNICODE and I saw the truncation to 15 character problem on FreeBSD. I believe that the problem is actually in utf8_toUtf16(), which is incorrectly returning XML_CONVERT_COMPLETED, when it should be returning XML_CONVERT_OUTPUT_EXHAUSTED because the output buffer is too small for the result. The attached patch fixes the problem for me.

      I think most of the rest of the conversion routines work properly, but they should be audited.

       
      • Sebastian Pipping

        Looks like you are right. Please share with me (up here or by mail to firstname@lastname.org):

        • A mail address of yours for Git commit authorship
        • An (ideally minimal) XML file triggering that very bug with Git master
         

        Last edit: Sebastian Pipping 2016-08-05
        • Don Lewis

          Don Lewis - 2016-08-05

          When I wrote that patch, my concern was with the mixed use of break vs goto in the error checks inside the loop. That's why I added the (res == XML_CONVERT_COMPLETED) condition to avoid overwriting an error status that was set insdie the loop. After looking at the code some more, I noticed that all of those are inside a switch block, so the error checks that use break don't even terminate the loop!

          In this new patch, I changed those break statements to goto statements. Then if execution falls off the bottom of the loop, we know that it was due to the condition at the top of the loop evaluating to false. If (from < fromLim) is true, then we know that the reason that the loop terminated must be due to running out of output space.

          As I previously mentioned, the other toUtf16() routines should be checked. In particular, DEFINE_UTF16_TO_UTF16(E) looks wrong to me. I don't have any way of testing it, though.

           
          • Sebastian Pipping

            Many thanks for that patch of yours! It's 896b6c1fd3b842f377d1b62135dccf0a579cf65d in Git now.
            My understanding is that:

            • utf8_toUtf16 is used with -DXML_UNICODE / expatw, only
            • UTF-8 input with incomplete two-byte and/or three-byte characters would be trouble to the unpatched code
            • .. if such bytes get that far in.

            I have not been able to get input of such kind though to utf8_toUtf16 myself, yet. If it is possible, it should be an infinite loop. (If you manage to get it through, please contact us off-web with details.)

            Please elaborate on how DEFINE_UTF16_TO_UTF16 looks wrong to you. Maybe in a new ticket, ideally. This one is getting too large already. Thanks!

             
            • Don Lewis

              Don Lewis - 2016-08-12

              Thanks!

              • Yes, though In my case, I have an application that cherry picks a few files out of expat and compiles them with -DXML_UNICODE.

              • I don't think incomplete UTF-8 would cause problems with the unpatched code and the patch should not change anything there. It should return XML_CONVERT_INPUT_INCOMPLETE, which should break out of the for (;;) loop.

              • I don't know if it is possible to get that as input.

              The earlier change to doContent() did have me concerned about the possiblity of an infinite loop because the (fromPtr >= rawNameEnd) test will be false in this case, and without the XML_CONVERT_INPUT_INCOMPLETE, there would be an infinite loop until all memory was consumed. I think it's safe, but suboptimal because the pointer test is redundant with what all the other converters do to generate the XML_CONVERT_COMPLETED return value.

              It looks like what happens in the XML_CONVERT_INPUT_INCOMPLETE is that tag->name-str just gets set to the part of the name formed by the complete characters and the incomplete character at the end gets dropped.

               
1 2 > >> (Page 1 of 2)

Log in to post a comment.

MongoDB Logo MongoDB