#888 gSOAP returns invalid 6-byte characters for UTF-8

closed-fixed
None
5
2013-02-03
2012-10-18
No

Code generated with gSOAP 2.7.6 worked correctly, and the UTF-8 characters were returned as entities. Code regenerated with gSOAP 2.8.8 through 2.8.11 returns invalid 6-byte characters. Opening the returned XML file in Firefox gives the following error (with a pointer to the first invalid character found):

XML Parsing Error: not well-formed

Discussion

  • Robert van Engelen

    Example code please? Note that UTF-8 handling depends on many parameters and flags, which we need to know in order to determine whether this is a bug and to fix it. Note that for XML literal strings (_XML) the app is responsible for populating these, and that the representation of entities is affected by the CANONICAL flag for c14n compliance.

     
  • Robert van Engelen

    Sorry, I am still confused as to what exact example case causes this issue. Is it the serialization of the literal XML string in your code (typedef wchar_t* XML), when wchar_t* strings are converted to UTF8?

     
  • Robert van Engelen

    • assigned_to: nobody --> engelen
     
  • Robert van Engelen

    Please try this patch for stdsoap2.c/.cpp soap_wstring_out():

    else /* check for UTF16 encoding when wchar_t is too small to hold UCS */
    { if (sizeof(wchar_t) < 4 && (c & 0xFC00) == 0xD800)
      { register soap_wchar d = *s++;
        if ((d & 0xFC00) == 0xDC00)
          c = ((c - 0xD800) << 10) + (d - 0xDC00) + 0x10000;
        else
          c = 0xFFFD; /* Malformed */
      }
      if (soap_pututf8(soap, (unsigned long)c))
        return soap->error;
    }

    I do not have any other information from you, so my best guess is that this is happening in the wchar_t serializer.
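
The surrogate-pair arithmetic in the patch above can be tried in isolation. The following is a minimal standalone sketch, not gSOAP code: the helper names `combine_surrogates` and `encode_utf8` are hypothetical and exist only for illustration. It assumes the input is a pair of UTF-16 code units and substitutes U+FFFD for anything malformed, as the patch does.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: combine a UTF-16 high/low surrogate pair into one
// code point, using the same arithmetic as the patch. Returns U+FFFD when
// the two units do not form a valid pair.
std::uint32_t combine_surrogates(std::uint16_t hi, std::uint16_t lo)
{
    if ((hi & 0xFC00) == 0xD800 && (lo & 0xFC00) == 0xDC00)
        return ((static_cast<std::uint32_t>(hi) - 0xD800) << 10)
             + (static_cast<std::uint32_t>(lo) - 0xDC00) + 0x10000;
    return 0xFFFD; // malformed pair
}

// Hypothetical helper: encode one code point as UTF-8 (1 to 4 bytes).
// Lone surrogates and values above U+10FFFF are replaced by U+FFFD.
std::string encode_utf8(std::uint32_t c)
{
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        c = 0xFFFD;
    std::string out;
    if (c < 0x80) {
        out += static_cast<char>(c);
    } else if (c < 0x800) {
        out += static_cast<char>(0xC0 | (c >> 6));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
        out += static_cast<char>(0xE0 | (c >> 12));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (c >> 18));
        out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (c & 0x3F));
    }
    return out;
}
```

For instance, the pair 0xD83D, 0xDE00 combines to U+1F600, which encodes as four UTF-8 bytes; well-formed UTF-8 under the current standard never needs more than four bytes per character.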

     
  • Daniel Bolgheroni

    Hi Robert,

    applied your patch but, unfortunately, I still get the "not well-formed" message.

    Sorry for the delay.
    Thank you.

     
  • Robert van Engelen

    Note that certain character codes cannot be represented in XML, so string interoperability is known to fail for many codes below 0x20 and some others. I've added a serialization test to confirm that the XML serialization of strings in UTF8 is correct. There may have been some changes from 2.7 to 2.8, but the use of XML entities versus &#x coding makes no difference.

    I still do not know if your code fails for char* or wchar_t and whether you use _XML to serialize literal XML content. If you use the latter, then please be reminded that the XML content is supposed to be correctly put in the string before the engine serializes it. That is, the responsibility lies in the app logic.

     
  • Robert van Engelen

    Should be fixed in 2.8.12 (report lacks details to replicate the problem, but UTF8 improvements are implemented).

     
  • Robert van Engelen

    • status: open --> pending-fixed
     
  • Daniel Bolgheroni

     
  • Daniel Bolgheroni

    • status: pending-fixed --> open-fixed
     
  • Daniel Bolgheroni

    Hi Robert, sorry for taking so long to respond.

    The last version of gSOAP (2.8.12) didn't solve the problem. I did a simple test case for you to be able to reproduce the error. The tar is attached and contains a README.txt file with the steps followed.

    This used to work with gSOAP 2.7.6, but since I upgraded to 2.8.8 and beyond (up to 2.8.12), the clients (Python suds or SOAP::Lite) always crash, as they can't parse the returned XML.

    I appreciate any enlightenment on this.
    Thank you.

     
  • regiov

    regiov - 2013-01-08

    Dear Robert,

    I'm afraid the title of this report is misleading. We assumed that we were dealing with UTF-8 data when it was actually latin1. In any case, the files recently provided to replicate the problem (gsoap_encoding_test.tar.gz) seem to follow all the directions given in the documentation, but the result still doesn't work. Our service uses the document/literal approach, and it needs to interact with another library which produces pieces of XML encoded in latin1. So we feed gsoap with this data as wchar_t* according to the documentation, expecting that gsoap would make the right conversion to UTF-8. As Daniel said, this used to work in previous versions but not in the latest ones. Are we doing something wrong? We would really appreciate your feedback. The example that we provided is very simple: it's a single operation that reads an XML file, transforms the content to wchar_t*, and returns this as part of the response.

    Thanks in advance,
    Renato

     
  • Daniel Bolgheroni

    New version as the other one seems corrupted.

     
  • Daniel Bolgheroni

    New version.

     
  • Daniel Bolgheroni

    Hi Robert,

    apparently sf.net is corrupting the file containing the test case. If you cannot untar the attached file, please take a look at:

    http://www.cria.org.br/~daniel/gsoap_encoding_test-20130114.tar.gz

    MD5: 23ce7c7d553bf7e5f1c6ca211d31fad4

    Thank you.

     
  • Robert van Engelen

    • status: open-fixed --> open-accepted
     
  • Robert van Engelen

    Great. Now we are getting somewhere. I will test your package.

     
  • Robert van Engelen

    • status: open-accepted --> pending-fixed
     
  • Robert van Engelen

    In your code the latin1 codes 0x80~0xFF are translated to sign-extended wide characters, which I assume is not intended since it leads to invalid wide unicode chars.

    For example, 0xF1 is converted to 0xFFF1, which the gSOAP engine then emits as the UTF8 encoded sequence of this invalid character.

    Please change your code to handle this properly by casting to unsigned char:

    static wchar_t* convertToWideChar( const char* p )
    { wchar_t *r;
      r = new wchar_t[strlen(p)+1];
      const char *tempsource = p;
      wchar_t *tempdest = r;
      /* cast to unsigned char so that 0x80~0xFF are not sign-extended */
      while (( *tempdest++ = (unsigned char)*tempsource++ ));
      return r;
    }

    A simple way to test this is:

    ./etest.cgi < test.test.req.xml > result.xml

    and then edit result.xml to remove the HTTP header lines, save, and open it in a browser, say Firefox. Note that test.test.req.xml is auto-generated by soapcpp2.
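
The sign-extension effect described above can be reproduced outside gSOAP. The following is a minimal sketch under stated assumptions: `latin1_to_wide` and `widen_signed` are hypothetical names, and `std::wstring` is used here instead of a `new[]` buffer purely to keep the example free of manual memory management. Whether plain `char` is signed, and how wide `wchar_t` is, both depend on the platform, so the exact wrong value varies.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical helper: widen a Latin-1 (ISO-8859-1) byte string.
// Each byte goes through unsigned char first, so 0x80..0xFF map to
// U+0080..U+00FF instead of being sign-extended.
std::wstring latin1_to_wide(const char* p)
{
    assert(p != nullptr);
    std::wstring out;
    out.reserve(std::strlen(p));
    for (; *p != '\0'; ++p)
        out.push_back(static_cast<wchar_t>(static_cast<unsigned char>(*p)));
    return out;
}

// For comparison only: widening a byte that is treated as signed.
// A byte with the high bit set becomes a negative or very large value,
// never the intended U+0080..U+00FF code point.
wchar_t widen_signed(signed char c)
{
    return static_cast<wchar_t>(c);
}
```

With this sketch, the byte 0xF1 (ñ in Latin-1) widens to the code point 0xF1 through `latin1_to_wide`, while `widen_signed` yields some other value on every common platform. If the result has to be handed to code that expects a `wchar_t*`, the contents would still need to be copied into a buffer with the appropriate lifetime.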

     
  • Daniel Bolgheroni

    • status: pending-fixed --> open-fixed
     
  • Daniel Bolgheroni

    Hi Robert,

    It works with the cast you proposed.

    Thank you very much.

     
  • Robert van Engelen

    • status: open-fixed --> closed-fixed
     
