Hi Sebastian,
There might be an issue in the latest expat release 2.2.0 that it doesn't support parsing the XML tags whose length is larger than 15 characters for Unicode and 31 characters for none Unicode.
I was trying to use the expat and our C++ expat wrapper to parse the XML piece like "<advanced-file-transfer> ... " on Windows platform with Unicode, only the name 'advanced-file-t' is recognized. After debugging into the source code of expat and looking into the changes of the submission "be4b1c06daba1849b8ff5e00bae5caf47f6c39fd Merge branch 'cve-2016-0718-fix-2-2-1'", I found it might be caused by the changes in utf8_toUtf16() in xmltok.c and doConent() in xmlparse.c. </advanced-file-transfer>
Could you please take a look?
Thanks
Andy
Hello Andy,
I cannot confirm the issue so far. xmlwf seems to handle a tag with 40 characters length fine:
Can you share an XML file that triggers the bug with xmlwf?
Best, Sebastian
Hi Sebastian,
Thanks for your quick response.
And I am sorry it is my mistake that I didn't check the implementation of utf8_toUtf8(). The expat works well with the utf8 encoding. It seems the issue I found only exists on Windows platform with libexpatw.dll which supports Unicode.
Here is the XML snippet in our unit test case:
Our test case failed to parse the tag 'advanced-file-transfer', it only returns the tag 'advanced-file-t' on Windows platform with Unicode encoding.
And when I debugging into the utf8_toUtf16() in xmltok.c, the switch block always went into the below 'default' block and finally the function utf_toUtf16() returns XML_CONVERT_COMPLETED even the tag has not been converted fully.
Thanks
Andy
Last edit: Andy Wang 2016-07-06
xmlwf always uses the UTF-8 version of libexpat, even on Windows.
So this issue can't be checked using xmlwf.
It might actually be present on Linux as well, if one uses the UTF-16
version of the library.
On Tue, Jul 5, 2016 at 11:13 PM, Andy Wang applewwhua@users.sf.net wrote:
Related
Bugs:
#539Hello Andy,
could you attach it as a file (and re-verify that that very file still hits the issue) so we are sure talking about the very same bytes to parse?
Do you have xmlwf built, too? Does xmlwf fail on that file for you?
XML_CONVERT_COMPLETED means the encoding conversion was able to convert all input bytes. If your file is encoded well, that's to be expected. (We can still have a bug on another layer, though.)
The commit you mentioned earlier is a merge commit. In case you use git-bisect once more, I would be curious about the non-merge commit you end up with.
I don't have any Windows in sight so I will have to tweak my build settings on Linux to match yours on Windows somehow: should be doable. Are you using any of the bundled build systems or do you pass defines yourself if any? If so: which ones?
Best, Sebastian
PS: If you feel like calling you can whois pipping.org and call me on that number, ending on 17. I'm on UTC+2 over here. Just if you want.
Hi Sebastian,
I have tried to build the xmlwf with Unicode supporting on Windows, but failed. So I don't know how to prove my findings. I attached my XML document as the attached test.xml with the file encoding utf8.
So can I list my debug steps when I met the issue when using the libexpatw.dll to parse the XML document, so that we can verify if this issue exists or not with code review:
Store the XML document "<advanced-file-transfer xmlns="http://jabber.com/schemas/advanced-file-transfer"><size-bytes>7292928</size-bytes><url>xxx</url><screen-capture></screen-capture></advanced-file-transfer>" to a variable with std::string type.
Pass the std::string variable to our APIs which will call the APIs of expat to parse.
The execution of the code went into the API doContnet() in xmlparse.c:
At the line of 2467, the call of XmlConvert will actually call the utf8_toUtf16().
In the utf8_toUtf16(), the switch block always goes into the 'default' block which will incrase both the to and from pointers. And looking back to the doContent(), the variables passed to utf8_toUtf16() are the following:
1) fromP is the start of the tag 'advanced-file-transfer'
2) fromLim is the end of the tag 'advanced-file-transfer', it is the address of the space character.
3) toP is the the start of the tag->buf which points to a preallocated 32 bytes memory.
4) toLim is the last unsigned short address of the preallocated 32 bytes memory.
Since the switch block always goes into the 'default' block, there will be only the 15 characters (advanced-file-t) converted and stored in the tag->buf, and the utf_toUtf16() returns the value in res which is XML_CONVERT_COMPLETED by default. This will trigger the execution of line 2471 to 2474 and break the parsing of the whole XML tag.
With the attached test.xml file (encoded in UTF-8) I can duplicate the
issue using libexpatw.dll.
When I convert the file to UCS2-BE or LE encoding the issue disappears.
So it seems the problems is parsing UTF-8 input and outputting as UTF-16
(utf8_to_utf16 as observed).
I would think the same issue shows in other platforms when using the wide
character version of libexpat.
On Thu, Jul 7, 2016 at 9:38 AM, Andy Wang applewwhua@users.sf.net wrote:
Last edit: Karl Waclawek 2016-07-07
Attach the test.xml
Thanks Karl and Sebastian.
Could you please evaluate and suggest a fix for it?
And now can I make this issue as public so that we can let other users know this issue?
And thanks again for all your helps.
Regards
Andy
Hi Karl,
Any comments?
Regards
Andy
I need a few more days to have time for a closer look.
No objections from my side to removing the "private bit" from this ticket.
Best, Sebastian
Thanks Sebastian.
Hello again,
I followed an idea by Karl with some custom debugging code and found a fix (https://sourceforge.net/p/expat/code_git/ci/af507cef2c93cb8d40062a0abe43a4f4e9158fb2) to the bug you reported.
To reproduce the bug off a Git clone for an unfixed version on a Linux shell using your test.xml linked above and the tagwalk.c debugging code (attached to this reply) one would run:
To see it with libexpat installed system-wide, it would be something like:
Please note that libexpatw and libexpatu may be different things across distributions of Linux.
Please let me know if the bug can be closed.
Thanks and best
Sebastian
PS: It may not be done yet, I noticed with -DXML_UNICODE_WCHAR_T tag names come out as "advanced-file-ransfer" (with "t" missing) for me.
If I am not mistaken Expat really needs -fshort-wchar for -DXML_UNICODE_WCHAR_T (and the case without it would then be unsupported. If so we could catch that at runtime some way). With -fshort-wchar, tag names seem to get through fine. Maybe Karl knows more about it.
If that's a bug it affects Expat 2.1.1 already which the rest of this report does not. So it's a different thing.
PS: Attached is tagwalk.c with support for -fshort-wchar (and a wide wchar libc).
Well, -fshort-wchar does not apply to Visual Studio compiling, as far as i
remember.
I will take a closer look soon.
On Jul 17, 2016 16:28, "Sebastian Pipping" hartwork@users.sf.net wrote:
Related
Bugs:
#539Thank you!
Maybe -fshort-wchar is enabled on Windows all the time. I think it's 32bit on Linux but 16bit by default on Windows. Maybe not the best source but: https://msdn.microsoft.com/library/windows/desktop/aa367308(v=vs.85).aspx
Yes, on Windows WCHAR_T is always 16 bits. That's why the MS compiler does
not understand the -fshort-wchar flag. Visual Studio only
needs XML_UNICODE_WCHAR_T defined without any additional flags.
On Sun, Jul 17, 2016 at 6:19 PM, Sebastian Pipping hartwork@users.sf.net
wrote:
Related
Bugs:
#539Hello Andy,
many stable Linux distributions have that bug unfixed and I would like to push the fix to them in a few days. Please report back if https://sourceforge.net/p/expat/code_git/ci/af507cef2c93cb8d40062a0abe43a4f4e9158fb2 fixed the issue for you.
Many thanks and best
Sebastian
I ran into this issue when attempting to update the version of expat that is bundled with Apache OpenOffice. It is compiled with -DXML_UNICODE and I saw the truncation to 15 character problem on FreeBSD. I believe that the problem is actually in utf8_toUtf16(), which is incorrectly returning XML_CONVERT_COMPLETED, when it should be returning XML_CONVERT_OUTPUT_EXHAUSTED because the output buffer is too small for the result. The attached patch fixes the problem for me.
I think most of the rest of the conversion routines work properly, but they should be audited.
Looks like you are right. Please share with me (up here or by mail to firstname@lastname.org):
Last edit: Sebastian Pipping 2016-08-05
When I wrote that patch, my concern was with the mixed use of break vs goto in the error checks inside the loop. That's why I added the (res == XML_CONVERT_COMPLETED) condition to avoid overwriting an error status that was set insdie the loop. After looking at the code some more, I noticed that all of those are inside a switch block, so the error checks that use break don't even terminate the loop!
In this new patch, I changed those break statements to goto statements. Then if execution falls off the bottom of the loop, we know that it was due to the condition at the top of the loop evaluating to false. If (from < fromLim) is true, then we know that the reason that the loop terminated must be due to running out of output space.
As I previously mentioned, the other toUtf16() routines should be checked. In particular, DEFINE_UTF16_TO_UTF16(E) looks wrong to me. I don't have any way of testing it, though.
Many thanks for that patch of yours! It's 896b6c1fd3b842f377d1b62135dccf0a579cf65d in Git now.
My understanding is that:
I have not been able to get input of such kind though to utf8_toUtf16 myself, yet. If it is possible, it should be an infinite loop. (If you manage to get it through, please contact us off-web with details.)
Please elaborate on how DEFINE_UTF16_TO_UTF16 looks wrong to you. Maybe in a new ticket, ideally. This one is getting too large already. Thanks!
Thanks!
Yes, though In my case, I have an application that cherry picks a few files out of expat and compiles them with -DXML_UNICODE.
I don't think incomplete UTF-8 would cause problems with the unpatched code and the patch should not change anything there. It should return XML_CONVERT_INPUT_INCOMPLETE, which should break out of the for (;;) loop.
The earlier change to doContent() did have me concerned about the possiblity of an infinite loop because the (fromPtr >= rawNameEnd) test will be false in this case, and without the XML_CONVERT_INPUT_INCOMPLETE, there would be an infinite loop until all memory was consumed. I think it's safe, but suboptimal because the pointer test is redundant with what all the other converters do to generate the XML_CONVERT_COMPLETED return value.
It looks like what happens in the XML_CONVERT_INPUT_INCOMPLETE is that tag->name-str just gets set to the part of the name formed by the complete characters and the incomplete character at the end gets dropped.