Expat XML Parser / Bugs / #539 Expat 2.2.0 doesn't support the XML tags larger than 15 characters in Unicode

Sebastian Pipping - 2016-07-05

Hello Andy,

I cannot confirm the issue so far. xmlwf seems to handle a tag with 40 characters length fine:

# cat input.xml <a123456789b123456789c123456789d123456789 /> # mkdir -p output && xmlwf -d output/ input.xml && cat output/input.xml <a123456789b123456789c123456789d123456789></a123456789b123456789c123456789d123456789>

Can you share an XML file that triggers the bug with xmlwf?

Best, Sebastian
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andy Wang - 2016-07-06

Hi Sebastian,

Thanks for your quick response.

And I am sorry it is my mistake that I didn't check the implementation of utf8_toUtf8(). The expat works well with the utf8 encoding. It seems the issue I found only exists on Windows platform with libexpatw.dll which supports Unicode.

Here is the XML snippet in our unit test case:

<advanced-file-transfer xmlns='http://jabber.com/schemas/advanced-file-transfer'> <size-bytes>7292928</size-bytes> <url>xxx</url> <screen-capture/> </advanced-file-transfer>

Our test case failed to parse the tag 'advanced-file-transfer', it only returns the tag 'advanced-file-t' on Windows platform with Unicode encoding.

And when I debugging into the utf8_toUtf16() in xmltok.c, the switch block always went into the below 'default' block and finally the function utf_toUtf16() returns XML_CONVERT_COMPLETED even the tag has not been converted fully.

default: to++ = from++; break;

Thanks
Andy

Last edit: Andy Wang 2016-07-06
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Karl Waclawek - 2016-07-06
  
  xmlwf always uses the UTF-8 version of libexpat, even on Windows.
  So this issue can't be checked using xmlwf.
  
  It might actually be present on Linux as well, if one uses the UTF-16
  version of the library.
  
  On Tue, Jul 5, 2016 at 11:13 PM, Andy Wang applewwhua@users.sf.net wrote:
  
  Hi Sebastian,
  
  Thanks for your quick response.
  
  And I am sorry it is my mistake that I didn't check the implementation of
  utf8_toUtf8(). The expat works well with the utf8 encoding. It seems the
  issue I found only exists on Windows platform with libexpatw.dll which
  supports Unicode.
  
  Here is the XML snippet in our unit test case:
  
  <advanced-file-transfer xmlns="http://jabber.com/schemas/advanced-file-transfer <a href=" http:="" jabber.com="" schemas="" advanced-file-transfer"="">http://jabber.com/schemas/advanced-file-transfer">
  <size-bytes>7292928</size-bytes> <url>xxx</url> <screen-capture>
  </screen-capture></advanced-file-transfer>
  Our test case failed to parse the tag 'advanced-file-transfer', it only
  returns the tag 'advanced-file-t' on Windows platform with Unicode
  encoding.
  
  And when I debugging into the utf8_toUtf16() in xmltok.c, the switch block
  always went into the below 'default' block and finally the function
  utf_toUtf16() returns XML_CONVERT_COMPLETED even the tag has not been
  converted fully.
  
  default: to++ = from++; break;
  
  Thanks
  Andy
  
  [bugs:#539] https://sourceforge.net/p/expat/bugs/539/ Expat 2.2.0
  doesn't support the XML tags larger than 15 characters in Unicode*
  
  Status: open
  Group: Feature Request
  Labels: 2.2.0
  Created: Tue Jul 05, 2016 12:24 AM UTC by Andy Wang
  Last Updated: Tue Jul 05, 2016 08:25 AM UTC
  Owner: Sebastian Pipping
  
  Hi Sebastian,
  
  There might be an issue in the latest expat release 2.2.0 that it doesn't
  support parsing the XML tags whose length is larger than 15 characters for
  Unicode and 31 characters for none Unicode.
  
  I was trying to use the expat and our C++ expat wrapper to parse the XML
  piece like "<advanced-file-transfer> ... " on
  Windows platform with Unicode, only the name 'advanced-file-t' is
  recognized. After debugging into the source code of expat and looking into
  the changes of the submission "be4b1c06daba1849b8ff5e00bae5caf47f6c39fd
  Merge branch 'cve-2016-0718-fix-2-2-1'", I found it might be caused by the
  changes in utf8_toUtf16() in xmltok.c and doConent() in xmlparse.c.</advanced-file-transfer>
  
  Could you please take a look?
  
  Thanks
  Andy
  
  Sent from sourceforge.net because you indicated interest in
  https://sourceforge.net/p/expat/bugs/539/
  
  To unsubscribe from further messages, please visit
  https://sourceforge.net/auth/subscriptions/
  
  Related
  
  Bugs: ~~#539~~
  
  alternate
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sebastian Pipping - 2016-07-06

Hello Andy,

could you attach it as a file (and re-verify that that very file still hits the issue) so we are sure talking about the very same bytes to parse?

Do you have xmlwf built, too? Does xmlwf fail on that file for you?

XML_CONVERT_COMPLETED means the encoding conversion was able to convert all input bytes. If your file is encoded well, that's to be expected. (We can still have a bug on another layer, though.)

The commit you mentioned earlier is a merge commit. In case you use git-bisect once more, I would be curious about the non-merge commit you end up with.

I don't have any Windows in sight so I will have to tweak my build settings on Linux to match yours on Windows somehow: should be doable. Are you using any of the bundled build systems or do you pass defines yourself if any? If so: which ones?

Best, Sebastian

PS: If you feel like calling you can whois pipping.org and call me on that number, ending on 17. I'm on UTC+2 over here. Just if you want.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Hi Sebastian,

I have tried to build the xmlwf with Unicode supporting on Windows, but failed. So I don't know how to prove my findings. I attached my XML document as the attached test.xml with the file encoding utf8.

So can I list my debug steps when I met the issue when using the libexpatw.dll to parse the XML document, so that we can verify if this issue exists or not with code review:

Store the XML document "<advanced-file-transfer xmlns="http://jabber.com/schemas/advanced-file-transfer"><size-bytes>7292928</size-bytes><url>xxx</url><screen-capture></screen-capture></advanced-file-transfer>" to a variable with std::string type.
Pass the std::string variable to our APIs which will call the APIs of expat to parse.
The execution of the code went into the API doContnet() in xmlparse.c:

2430     case XML_TOK_START_TAG_NO_ATTS:
 2431       /* fall through */
 2432     case XML_TOK_START_TAG_WITH_ATTS:
 2433       {   
 2434         TAG *tag;
 2435         enum XML_Error result;
 2436         XML_Char *toPtr;
 2437         if (freeTagList) {
 2438           tag = freeTagList;
 2439           freeTagList = freeTagList->parent;
 2440         } 
 2441         else {
 2442           tag = (TAG *)MALLOC(sizeof(TAG));
 2443           if (!tag)
 2444             return XML_ERROR_NO_MEMORY;
 2445           **tag->buf = (char *)MALLOC(INIT_TAG_BUF_SIZE);**
 2446           if (!tag->buf) {
 2447             FREE(tag);
 2448             return XML_ERROR_NO_MEMORY;
 2449           }
 2450           tag->bufEnd = tag->buf + INIT_TAG_BUF_SIZE;
 2451         }
 2452         tag->bindings = NULL;
 2453         tag->parent = tagStack;
 2454         tagStack = tag;
 2455         tag->name.localPart = NULL;
 2456         tag->name.prefix = NULL;
 2457         tag->rawName = s + enc->minBytesPerChar;
 2458         tag->rawNameLength = XmlNameLength(enc, tag->rawName);
 2459         ++tagLevel;
 2460         {
 2461           const char *rawNameEnd = tag->rawName + tag->rawNameLength;
 2462           const char *fromPtr = tag->rawName;
 2463           toPtr = (XML_Char *)tag->buf;
 2464           for (;;) { 
 2465             int bufSize;
 2466             int convLen;
 2467             const enum XML_Convert_Result convert_res = **XmlConvert**(enc,
 2468                        &fromPtr, rawNameEnd,
 2469                        (ICHAR **)&toPtr, (ICHAR *)tag->bufEnd - 1);
 2470             convLen = (int)(toPtr - (XML_Char *)tag->buf);
** 2471             if ((convert_res == XML_CONVERT_COMPLETED) || (convert_res == XML_CONVERT_INPUT_INCOMPLETE)) {
 2472               tag->name.strLen = convLen;
 2473               break;
 2474             } **
 2475             bufSize = (int)(tag->bufEnd - tag->buf) << 1;
 2476             { 
 2477               char *temp = (char *)REALLOC(tag->buf, bufSize);
 2478               if (temp == NULL)
 2479                 return XML_ERROR_NO_MEMORY;
 2480               tag->buf = temp;
 2481               tag->bufEnd = temp + bufSize;
 2482               toPtr = (XML_Char *)temp + convLen;
 2483             } 
 2484           }   
 2485         }   
 2486         tag->name.str = (XML_Char *)tag->buf;
 2487         *toPtr = XML_T('\0');
 2488         result = storeAtts(parser, enc, s, &(tag->name), &(tag->bindings));
 2489         if (result)
 2490           return result;
 2491         if (startElementHandler)
 2492           startElementHandler(handlerArg, tag->name.str,
 2493                               (const XML_Char **)atts);
 2494         else if (defaultHandler)
 2495           reportDefault(parser, enc, s, next);
 2496         poolClear(&tempPool);
 2497         break;
 2498       }

At the line of 2467, the call of XmlConvert will actually call the utf8_toUtf16().

In the utf8_toUtf16(), the switch block always goes into the 'default' block which will incrase both the to and from pointers. And looking back to the doContent(), the variables passed to utf8_toUtf16() are the following:

1) fromP is the start of the tag 'advanced-file-transfer'
2) fromLim is the end of the tag 'advanced-file-transfer', it is the address of the space character.
3) toP is the the start of the tag->buf which points to a preallocated 32 bytes memory.
4) toLim is the last unsigned short address of the preallocated 32 bytes memory.

Since the switch block always goes into the 'default' block, there will be only the 15 characters (advanced-file-t) converted and stored in the tag->buf, and the utf_toUtf16() returns the value in res which is XML_CONVERT_COMPLETED by default. This will trigger the execution of line 2471 to 2474 and break the parsing of the whole XML tag.

392 static enum XML_Convert_Result PTRCALL
393 utf8_toUtf16(const ENCODING *enc,
394              const char **fromP, const char *fromLim,
395              unsigned short **toP, const unsigned short *toLim)
396 {
397   enum XML_Convert_Result res = **XML_CONVERT_COMPLETED**;
398   unsigned short *to = *toP;
399   const char *from = *fromP;
400   while (from < fromLim && to < toLim) {
401     switch (((struct normal_encoding *)enc)->type[(unsigned char)*from]) {
402     case BT_LEAD2:
403       if (fromLim - from < 2) {
404         res = XML_CONVERT_INPUT_INCOMPLETE;
405         break;
406       }
407       *to++ = (unsigned short)(((from[0] & 0x1f) << 6) | (from[1] & 0x3f));
408       from += 2;
409       break;
410     case BT_LEAD3:
411       if (fromLim - from < 3) {
412         res = XML_CONVERT_INPUT_INCOMPLETE;
413         break;
414       }
415       *to++ = (unsigned short)(((from[0] & 0xf) << 12)
416                                | ((from[1] & 0x3f) << 6) | (from[2] & 0x3f));
417       from += 3;
418       break;
419     case BT_LEAD4:
420       {
421         unsigned long n;
422         if (toLim - to < 2) {
423           res = XML_CONVERT_OUTPUT_EXHAUSTED;
424           goto after;
425         }
426         if (fromLim - from < 4) {
427           res = XML_CONVERT_INPUT_INCOMPLETE;
428           goto after;
429         }
430         n = ((from[0] & 0x7) << 18) | ((from[1] & 0x3f) << 12)
431             | ((from[2] & 0x3f) << 6) | (from[3] & 0x3f);
432         n -= 0x10000;
433         to[0] = (unsigned short)((n >> 10) | 0xD800);
434         to[1] = (unsigned short)((n & 0x3FF) | 0xDC00);
435         to += 2;
436         from += 4;
437       }
438       break;
**439     default:
440       *to++ = *from++;
441       break;**
442     }
443   }
444 after:
445   *fromP = from;
446   *toP = to;
447   return res;
448 }

Karl Waclawek - 2016-07-07

With the attached test.xml file (encoded in UTF-8) I can duplicate the
issue using libexpatw.dll.
When I convert the file to UCS2-BE or LE encoding the issue disappears.

So it seems the problems is parsing UTF-8 input and outputting as UTF-16
(utf8_to_utf16 as observed).

I would think the same issue shows in other platforms when using the wide
character version of libexpat.

On Thu, Jul 7, 2016 at 9:38 AM, Andy Wang applewwhua@users.sf.net wrote:

Hi Sebastian,

I have tried to build the xmlwf with Unicode supporting on Windows, but
failed. So I don't know how to prove my findings. I attached my XML
document as the attached test.xml with the file encoding utf8.

Last edit: Karl Waclawek 2016-07-07

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andy Wang - 2016-07-07

Attach the test.xml

test.xml

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andy Wang - 2016-07-08

Thanks Karl and Sebastian.

Could you please evaluate and suggest a fix for it?

And now can I make this issue as public so that we can let other users know this issue?

And thanks again for all your helps.

Regards
Andy

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Andy Wang - 2016-07-11

Hi Karl,

Any comments?

Regards
Andy

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sebastian Pipping - 2016-07-11

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sebastian Pipping - 2016-07-11

I need a few more days to have time for a closer look.

No objections from my side to removing the "private bit" from this ticket.

Best, Sebastian

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Andy Wang - 2016-07-11
  
  Thanks Sebastian.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sebastian Pipping - 2016-07-17

private: Yes --> No
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sebastian Pipping - 2016-07-17

Hello again,

I followed an idea by Karl with some custom debugging code and found a fix (https://sourceforge.net/p/expat/code_git/ci/af507cef2c93cb8d40062a0abe43a4f4e9158fb2) to the bug you reported.

To reproduce the bug off a Git clone for an unfixed version on a Linux shell using your test.xml linked above and the tagwalk.c debugging code (attached to this reply) one would run:

./configure CPPFLAGS="-DXML_UNICODE" make clean buildlib cc -Wall -Wextra -DXML_UNICODE -Ilib -L.libs -lexpat -o tagwalk{,.c} LD_LIBRARY_PATH=.libs ./tagwalk test.xml ; echo $?

To see it with libexpat installed system-wide, it would be something like:

cc -Wall -Wextra -DXML_UNICODE -DXML_UNICODE_WCHAR_T -lexpatw -o tagwalk{,.c} # a) cc -Wall -Wextra -DXML_UNICODE -lexpatu -o tagwalk{,.c} # b) ./tagwalk test.xml ; echo $?

Please note that libexpatw and libexpatu may be different things across distributions of Linux.

Please let me know if the bug can be closed.
Thanks and best

Sebastian

tagwalk.c
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sebastian Pipping - 2016-07-17

PS: It may not be done yet, I noticed with -DXML_UNICODE_WCHAR_T tag names come out as "advanced-file-ransfer" (with "t" missing) for me.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Sebastian Pipping - 2016-07-17
  
  If I am not mistaken Expat really needs -fshort-wchar for -DXML_UNICODE_WCHAR_T (and the case without it would then be unsupported. If so we could catch that at runtime some way). With -fshort-wchar, tag names seem to get through fine. Maybe Karl knows more about it.
  
  If that's a bug it affects Expat 2.1.1 already which the rest of this report does not. So it's a different thing.
  
  PS: Attached is tagwalk.c with support for -fshort-wchar (and a wide wchar libc).
  
  tagwalk.c
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Karl Waclawek - 2016-07-17
    
    Well, -fshort-wchar does not apply to Visual Studio compiling, as far as i
    remember.
    
    I will take a closer look soon.
    
    On Jul 17, 2016 16:28, "Sebastian Pipping" hartwork@users.sf.net wrote:
    
    If I am not mistaken Expat really needs -fshort-wchar for
    -DXML_UNICODE_WCHAR_T (and the case without it would then be unsupported.
    If so we could catch that at runtime some way). With -fshort-wchar, tag
    names seem to get through fine. Maybe Karl knows more about it.
    
    If that's a bug it affects Expat 2.1.1 already which the rest of this
    report does not. So it's a different thing.
    
    PS: Attached is tagwalk.c with support for -fshort-wchar (and a wide wchar
    libc).
    
    [bugs:#539] https://sourceforge.net/p/expat/bugs/539/ Expat 2.2.0
    doesn't support the XML tags larger than 15 characters in Unicode*
    
    Status: open
    Group: Feature Request
    Labels: 2.2.0
    Created: Tue Jul 05, 2016 12:24 AM UTC by Andy Wang
    Last Updated: Sun Jul 17, 2016 07:35 PM UTC
    Owner: Sebastian Pipping
    
    Hi Sebastian,
    
    There might be an issue in the latest expat release 2.2.0 that it doesn't
    support parsing the XML tags whose length is larger than 15 characters for
    Unicode and 31 characters for none Unicode.
    
    I was trying to use the expat and our C++ expat wrapper to parse the XML
    piece like "<advanced-file-transfer> ... " on
    Windows platform with Unicode, only the name 'advanced-file-t' is
    recognized. After debugging into the source code of expat and looking into
    the changes of the submission "be4b1c06daba1849b8ff5e00bae5caf47f6c39fd
    Merge branch 'cve-2016-0718-fix-2-2-1'", I found it might be caused by the
    changes in utf8_toUtf16() in xmltok.c and doConent() in xmlparse.c.</advanced-file-transfer>
    
    Could you please take a look?
    
    Thanks
    Andy
    
    Sent from sourceforge.net because you indicated interest in
    https://sourceforge.net/p/expat/bugs/539/
    
    To unsubscribe from further messages, please visit
    https://sourceforge.net/auth/subscriptions/
    
    Related
    
    Bugs: ~~#539~~
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Sebastian Pipping - 2016-07-17
      
      Thank you!
      
      Maybe -fshort-wchar is enabled on Windows all the time. I think it's 32bit on Linux but 16bit by default on Windows. Maybe not the best source but: https://msdn.microsoft.com/library/windows/desktop/aa367308(v=vs.85).aspx
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Karl Waclawek - 2016-07-18
        
        Yes, on Windows WCHAR_T is always 16 bits. That's why the MS compiler does
        not understand the -fshort-wchar flag. Visual Studio only
        needs XML_UNICODE_WCHAR_T defined without any additional flags.
        
        On Sun, Jul 17, 2016 at 6:19 PM, Sebastian Pipping hartwork@users.sf.net
        wrote:
        
        Thank you!
        
        Maybe -fshort-wchar is enabled on Windows all the time. I think it's 32bit
        on Linux but 16bit by default on Windows. Maybe not the best source but:
        https://msdn.microsoft.com/library/windows/desktop/aa367308(v=vs.85).aspx
        
        [bugs:#539] https://sourceforge.net/p/expat/bugs/539/ Expat 2.2.0
        doesn't support the XML tags larger than 15 characters in Unicode*
        
        Status: open
        Group: Feature Request
        Labels: 2.2.0
        Created: Tue Jul 05, 2016 12:24 AM UTC by Andy Wang
        Last Updated: Sun Jul 17, 2016 07:35 PM UTC
        Owner: Sebastian Pipping
        
        Hi Sebastian,
        
        There might be an issue in the latest expat release 2.2.0 that it doesn't
        support parsing the XML tags whose length is larger than 15 characters for
        Unicode and 31 characters for none Unicode.
        
        I was trying to use the expat and our C++ expat wrapper to parse the XML
        piece like "<advanced-file-transfer> ... " on
        Windows platform with Unicode, only the name 'advanced-file-t' is
        recognized. After debugging into the source code of expat and looking into
        the changes of the submission "be4b1c06daba1849b8ff5e00bae5caf47f6c39fd
        Merge branch 'cve-2016-0718-fix-2-2-1'", I found it might be caused by the
        changes in utf8_toUtf16() in xmltok.c and doConent() in xmlparse.c.</advanced-file-transfer>
        
        Could you please take a look?
        
        Thanks
        Andy
        
        Sent from sourceforge.net because you indicated interest in
        https://sourceforge.net/p/expat/bugs/539/
        
        To unsubscribe from further messages, please visit
        https://sourceforge.net/auth/subscriptions/
        
        Related
        
        Bugs: ~~#539~~
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sebastian Pipping - 2016-07-23

Hello Andy,
many stable Linux distributions have that bug unfixed and I would like to push the fix to them in a few days. Please report back if https://sourceforge.net/p/expat/code_git/ci/af507cef2c93cb8d40062a0abe43a4f4e9158fb2 fixed the issue for you.

Many thanks and best

Sebastian

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Don Lewis - 2016-08-05
  
  I ran into this issue when attempting to update the version of expat that is bundled with Apache OpenOffice. It is compiled with -DXML_UNICODE and I saw the truncation to 15 character problem on FreeBSD. I believe that the problem is actually in utf8_toUtf16(), which is incorrectly returning XML_CONVERT_COMPLETED, when it should be returning XML_CONVERT_OUTPUT_EXHAUSTED because the output buffer is too small for the result. The attached patch fixes the problem for me.
  
  I think most of the rest of the conversion routines work properly, but they should be audited.
  
  expatbug539.patch
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Sebastian Pipping - 2016-08-05
    
    Looks like you are right. Please share with me (up here or by mail to firstname@lastname.org):
    
    A mail address of yours for Git commit authorship
    
    An (ideally minimal) XML file triggering that very bug with Git master
    
    Last edit: Sebastian Pipping 2016-08-05
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Don Lewis - 2016-08-05
      
      When I wrote that patch, my concern was with the mixed use of break vs goto in the error checks inside the loop. That's why I added the (res == XML_CONVERT_COMPLETED) condition to avoid overwriting an error status that was set insdie the loop. After looking at the code some more, I noticed that all of those are inside a switch block, so the error checks that use break don't even terminate the loop!
      
      In this new patch, I changed those break statements to goto statements. Then if execution falls off the bottom of the loop, we know that it was due to the condition at the top of the loop evaluating to false. If (from < fromLim) is true, then we know that the reason that the loop terminated must be due to running out of output space.
      
      As I previously mentioned, the other toUtf16() routines should be checked. In particular, DEFINE_UTF16_TO_UTF16(E) looks wrong to me. I don't have any way of testing it, though.
      
      expatbug539-1.patch
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Sebastian Pipping - 2016-08-12
        
        Many thanks for that patch of yours! It's 896b6c1fd3b842f377d1b62135dccf0a579cf65d in Git now.
        My understanding is that:
        
        utf8_toUtf16 is used with -DXML_UNICODE / expatw, only
        
        UTF-8 input with incomplete two-byte and/or three-byte characters would be trouble to the unpatched code
        
        .. if such bytes get that far in.
        
        I have not been able to get input of such kind though to utf8_toUtf16 myself, yet. If it is possible, it should be an infinite loop. (If you manage to get it through, please contact us off-web with details.)
        
        Please elaborate on how DEFINE_UTF16_TO_UTF16 looks wrong to you. Maybe in a new ticket, ideally. This one is getting too large already. Thanks!
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Don Lewis - 2016-08-12
        
        Thanks!
        
        Yes, though In my case, I have an application that cherry picks a few files out of expat and compiles them with -DXML_UNICODE.
        
        I don't think incomplete UTF-8 would cause problems with the unpatched code and the patch should not change anything there. It should return XML_CONVERT_INPUT_INCOMPLETE, which should break out of the for (;;) loop.
        
        I don't know if it is possible to get that as input.
        
        The earlier change to doContent() did have me concerned about the possiblity of an infinite loop because the (fromPtr >= rawNameEnd) test will be false in this case, and without the XML_CONVERT_INPUT_INCOMPLETE, there would be an infinite loop until all memory was consumed. I think it's safe, but suboptimal because the pointer test is redundant with what all the other converters do to generate the XML_CONVERT_COMPLETED return value.
        
        It looks like what happens in the XML_CONVERT_INPUT_INCOMPLETE is that tag->name-str just gets set to the part of the name formed by the complete characters and the incomplete character at the end gets dropped.
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Expat 2.2.0 doesn't support the XML tags larger than 15 characters in Unicode

Fast XML parser library in C

Group

Searches

Help

#539 Expat 2.2.0 doesn't support the XML tags larger than 15 characters in Unicode

Related

Discussion

Related

Related

Related