Character encoding problem

Help
CAZypedia
2010-05-23
2013-06-06
  • CAZypedia
    CAZypedia
    2010-05-23

    Hi,

    I'm having a problem with nusoap (v0.7.3 or v0.9.5, PHP 5.1.6) and academic journal reference information served from the National Center for Biotechnology Information's Entrez Utilities SOAP server (http://www.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html)

    The problem is related to character encoding.  The NCBI sends data in XML in ISO-8859-1 encoding (although this is not actually specified in the XML), and the nusoap parser is choking on all non-English characters.  The end result for us is that a Mediawiki extension, Biblio.php, which normally makes nice bibliographic lists in the wiki, fails since no data is received from nusoap.

    I've attached the raw XML data from the SOAP server, plus some debugging info and the error ("Could not parse xml, error: XML error parsing SOAP payload on line 14: Invalid character") at the end of this post.  Here, line 14 contains a French author name with the letter "é" (<Item Name="Author" Type="String">Béguin P</Item>).

    Any suggestions on how to fix this issue would be very much appreciated.  There are a number of Mediawiki-based sites around the world that are affected by this issue.

    Cheers,

    The CAZypedia Curator (www.cazypedia.org)


    nusoap_parser Object
    (
         => <?xml version="1.0"?>
    <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" >
    <SOAP-ENV:Body><eSummaryResult xmlns="http://www.ncbi.nlm.nih.gov/soap/eutils/esummary">
    <DocSum>
    <Id>1618761</Id>
    <Item Name="PubDate" Type="Date">1992 Jun 25</Item>
    <Item Name="EPubDate" Type="Date"></Item>
    <Item Name="Source" Type="String">J Biol Chem</Item>
    <Item Name="AuthorList" Type="List">
    <Item Name="Author" Type="String">Gebler J</Item>
    <Item Name="Author" Type="String">Gilkes NR</Item>
    <Item Name="Author" Type="String">Claeyssens M</Item>
    <Item Name="Author" Type="String">Wilson DB</Item>
    <Item Name="Author" Type="String">Béguin P</Item>
    <Item Name="Author" Type="String">Wakarchuk WW</Item>
    <Item Name="Author" Type="String">Kilburn DG</Item>
    <Item Name="Author" Type="String">Miller RC Jr</Item>
    <Item Name="Author" Type="String">Warren RA</Item>
    <Item Name="Author" Type="String">Withers SG</Item>
    </Item>
    <Item Name="LastAuthor" Type="String">Withers SG</Item>
    <Item Name="Title" Type="String">Stereoselective hydrolysis catalyzed by related beta-1,4-glucanases and beta-1,4-xylanases.</Item>
    <Item Name="Volume" Type="String">267</Item>
    <Item Name="Issue" Type="String">18</Item>
    <Item Name="Pages" Type="String">12559-61</Item>
    <Item Name="LangList" Type="List">
    <Item Name="Lang" Type="String">English</Item>
    </Item>
    <Item Name="NlmUniqueID" Type="String">2985121R</Item>
    <Item Name="ISSN" Type="String">0021-9258</Item>
    <Item Name="ESSN" Type="String">1083-351X</Item>
    <Item Name="PubTypeList" Type="List">
    <Item Name="PubType" Type="String">Journal Article</Item>
    </Item>
    <Item Name="RecordStatus" Type="String">PubMed - indexed for MEDLINE</Item>
    <Item Name="PubStatus" Type="String">ppublish</Item>
    <Item Name="ArticleIds" Type="List">
    <Item Name="pubmed" Type="String">1618761</Item>
    </Item>
    <Item Name="History" Type="List">
    <Item Name="pubmed" Type="Date">1992/07/05 19:15</Item>
    <Item Name="medline" Type="Date">2001/03/28 10:01</Item>
    <Item Name="entrez" Type="Date">1992/07/05 19:15</Item>
    </Item>
    <Item Name="References" Type="List"></Item>
    <Item Name="HasAbstract" Type="Integer">1</Item>
    <Item Name="PmcRefCount" Type="Integer">26</Item>
    <Item Name="FullJournalName" Type="String">The Journal of biological chemistry</Item>
    <Item Name="ELocationID" Type="String"></Item>
    <Item Name="SO" Type="String">1992 Jun 25;267(18):12559-61</Item>
    </DocSum>

    …370 DocSum RECORD LINES CUT…

    </eSummaryResult>

    </SOAP-ENV:Body>
    </SOAP-ENV:Envelope>
         => ISO-8859-1
         => run_eSummary
         =>
         =>
         =>
         =>
         =>
         =>
         => 0
         => 0
         =>
         => Array
            (
            )

         => Array
            (
            )

         =>
         =>
         =>
         =>
         =>
         => Array
            (
            )

         => 1
         =>
         =>
         =>
         => 0
         => Array
            (
            )

         => Array
            (
            )

         => 0
         => NuSOAP
         => 0.9.5
         => $Revision: 1.123 $
         =>
         => 2010-05-17 12:51:43.795422 nusoap_parser: No encoding specified in XML declaration
    2010-05-17 12:51:43.795463 nusoap_parser: Entering nusoap_parser(), length=19640, encoding=ISO-8859-1

         => 1
         => 9
         => http://www.w3.org/2001/XMLSchema
         => ISO-8859-1
         => Array
            (
            )

         => Array
            (
                 => Array
                    (
                         => string
                         => boolean
                         => double
                         => double
                         => double
                         =>
                         => string
                         => string
                         => string
                         =>
                         =>
                         =>
                         =>
                         =>
                         => string
                         => string
                         => string
                         => string
                         => string
                         => string
                         =>
                         =>
                         =>
                         =>
                         =>
                         =>
                         =>
                         =>
                         =>
                         =>
                         => integer
                         => integer
                         => integer
                         => integer
                         => integer
                         => integer
                         => integer
                         => integer
                         =>
                         =>
                         =>
                         =>
                         =>
                    )

                 => Array
                    (
                         =>
                         => integer
                         => boolean
                         => string
                         => double
                         => double
                         => string
                         => string
                         => string
                         => string
                         => array
                    )

                 => Array
                    (
                         =>
                         => integer
                         => boolean
                         => string
                         => double
                         => double
                         => string
                         => string
                         => string
                         => string
                         => array
                    )

                 => Array
                    (
                         => struct
                    )

                 => Array
                    (
                         => string
                         => array
                         => array
                    )

                 => Array
                    (
                         => Map
                    )

            )

         => Array
            (
                 => "
                 => &
                 => <
                 => >
                 => '
            )

         => Resource id #100
    )
    Could not parse xml, error: XML error parsing SOAP payload on line 14: Invalid character
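
For readers following along, the failure can be illustrated without any SOAP machinery. This is only a sketch, assuming PHP 7 or later with the iconv extension; the helper name `is_valid_utf8` is made up for this example. In ISO-8859-1 the letter "é" is the single byte 0xE9, which is not a well-formed sequence when the same bytes are read as UTF-8, the encoding an XML parser falls back to when nothing else is declared.

```php
<?php
// Hypothetical helper: true when $bytes is a well-formed UTF-8 sequence.
// preg_match() with the /u modifier returns false on malformed UTF-8.
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// "Béguin P" as ISO-8859-1: é is the single byte 0xE9.
$latin1 = "B\xE9guin P";
// The same name as UTF-8: é becomes the two-byte sequence 0xC3 0xA9.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

var_dump(is_valid_utf8($latin1)); // bool(false)
var_dump(is_valid_utf8($utf8));   // bool(true)
echo bin2hex($utf8), "\n";        // 42c3a96775696e2050
```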

     
  • jskywalker
    jskywalker
    2010-05-24

    I think you should write to their helpdesk and ask them to add the proper encoding information to their XML messages.

    source: http://www.w3.org/TR/REC-xml/#charencoding
    Although an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them. In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration (see 4.3.1 The Text Declaration) containing an encoding declaration:

     
  • CAZypedia
    CAZypedia
    2010-05-24

    Thanks for the suggestion.  I have in fact been in contact with the NCBI several times, but this has been largely unhelpful.  The trouble is that in Dec. 2008 they changed their WSDL file from v1.5 to 2.0:

    http://www.ncbi.nlm.nih.gov/entrez/eutils/soap/v1.5/eutils.wsdl
    http://www.ncbi.nlm.nih.gov/entrez/eutils/soap/v2.0/eutils.wsdl

    v1.5 of the WSDL is no longer supported, and simply switching to v2.0 breaks things totally.  Note that the v1.5 WSDL worked fine until about 3 weeks ago (that's ca. 2 years since it was abandoned), with absolutely no changes to any of the software on my side (PHP, Mediawiki, NuSOAP, Apache, etc., etc.).  So, my conclusion is that something changed in the XML data that the NCBI sends…but they are not telling what.

    Predictably, the NCBI only says to use the 2.0 WSDL, and that v1.5 is not supported.  As I mentioned, 2.0 doesn't work for us, for reasons I haven't tried yet to troubleshoot. However, both our NuSOAP installation and a generic SOAP client (http://www.soapclient.com/soaptest.html) return the same error.  When I asked NCBI about this, the reply was simply that "well, it works for us via Apache Axis2".

    So…I'm a bit stuck, since I don't want to rewrite the Biblio.php extension to work with Axis2.  The Biblio/NuSOAP combo actually works great on most of our pages, and has run really well for 2+ years.  For such a small bug, I am hoping someone out there might know a simple hack.

    Cheers.

     
  • Scott Nichol
    Scott Nichol
    2010-05-26

    jskywalker - because the XML in question is transported by HTTP, the encoding used by the XML can be deduced from the encoding of the HTTP stream per the spec you quote.

    The fault here is xml_parser_create in PHP 5, which always deduces the encoding from the XML.  This violates the XML spec.

    The workaround is to inject an encoding declaration in XML that lacks it.  I will look at doing this today and post here when the new code is in CVS.

     
  • Scott Nichol
    Scott Nichol
    2010-05-26

    I have committed a 3-line update to nusoap_parser that injects encoding information from HTTP into the xml declaration if the xml declaration does not already have it.

    cazypedia -  please get this (nusoap.php or class.soap_parser.php) from CVS and confirm that it works for you.
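
The idea of injecting an encoding declaration can be sketched roughly as follows. This is not the code committed to CVS, only an illustration under stated assumptions: PHP 7 or later, and a made-up function name `inject_xml_encoding`. It also does not handle a `standalone` pseudo-attribute, which would have to come after `encoding`.

```php
<?php
// Hypothetical helper (not the actual NuSOAP change): add an encoding
// pseudo-attribute to the XML declaration when it is missing.
function inject_xml_encoding(string $xml, string $encoding): string
{
    $start = strpos($xml, '<?xml');
    if ($start === false) {
        // No XML declaration at all: prepend a complete one.
        return '<?xml version="1.0" encoding="' . $encoding . '"?>' . $xml;
    }
    $end = strpos($xml, '?>', $start);
    if ($end === false) {
        return $xml; // malformed declaration; leave the payload untouched
    }
    $decl = substr($xml, $start, $end - $start);
    if (preg_match('/encoding\s*=\s*["\'][^"\']*["\']/', $decl)) {
        return $xml; // an encoding is already declared
    }
    // Insert the attribute just before the closing "?>".
    return substr($xml, 0, $end) . ' encoding="' . $encoding . '"' . substr($xml, $end);
}

echo inject_xml_encoding('<?xml version="1.0"?><Item>x</Item>', 'ISO-8859-1'), "\n";
```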

     
  • CAZypedia
    CAZypedia
    2010-05-26

    snichol - Just tried both nusoap.php and class.soap_parser.php by replacing these in the v0.9.5 package, but unfortunately it didn't fix the issue:  the biblio.php output breaks with the same symptoms as before.

    We found a hack that works for us, but that is certainly not a general fix.  In the nusoap_parser function, we explicitly convert the XML data into UTF-8 with iconv, and everything parses fine.  Here's the code snippet for reference:

    function nusoap_parser($xml,$encoding='UTF-8',$method='',$decode_utf8=true){
    parent::nusoap_base();

    // Hack by CAZypedia crew to fix character encoding of NCBI XML data from SOAP
    // This prevents non-English characters from causing the parser to choke.
    $xml = iconv("ISO-8859-1", "UTF-8//TRANSLIT", $xml);
    // End hack.
    $this->xml = $xml;
    $this->xml_encoding = $encoding;
    $this->method = $method;
    $this->decode_utf8 = $decode_utf8;

    …of course, we know that the incoming XML is in ISO-8859-1.  The annoying thing is that we have to do the same conversion again in biblio.php on the data returned from nusoap, otherwise our wiki page ends up with ISO-8859-1 encoding in the journal references added by biblio.  We are seriously lost (in translation) - but our pages now look fine…
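
The second conversion, on the data returned from nusoap, could look something like the sketch below. It assumes PHP 7 or later with iconv, and the function name and array shape are made up for illustration; this is not code from Biblio.php. One possible reason the returned strings are ISO-8859-1 again is the `$decode_utf8=true` default visible in the constructor signature above, but that is an assumption, not something verified here.

```php
<?php
// Hypothetical helper: convert every string in a (possibly nested) result
// array from ISO-8859-1 to UTF-8 before handing it to the wiki.
function latin1_to_utf8_recursive(array $data): array
{
    array_walk_recursive($data, function (&$value) {
        if (is_string($value)) {
            $value = iconv('ISO-8859-1', 'UTF-8', $value);
        }
    });
    return $data;
}

// Shape loosely modelled on the eSummary data quoted in this thread.
$parsed = array(
    'Id' => '1618761',
    'AuthorList' => array("Gebler J", "B\xE9guin P"),
);
$utf8 = latin1_to_utf8_recursive($parsed);
echo bin2hex($utf8['AuthorList'][1]), "\n";
```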

     
  • jskywalker
    jskywalker
    2010-05-26

    I checked, and I see the headers and data below.
    Could you please give me a hint as to WHERE the encoding is to be found?

    HTTP/1.0 200 OK
    Date: Wed, 26 May 2010 18:34:19 GMT
    Server: Apache
    Last-Modified: Thu, 15 Apr 2010 08:36:30 GMT
    ETag: "7100af-282d-4844264299f80"-gzip
    Accept-Ranges: bytes
    Content-Type: text/plain
    Vary: Accept-Encoding
    Connection: close
    Content-Length: 10285
    <?xml version="1.0"?>
    <wsdl:definitions 
        xmlns:s="http://www.w3.org/2001/XMLSchema" 
        xmlns:http="http://schemas.xmlsoap.org/wsdl/http/" 
        xmlns:mime="http://schemas.xmlsoap.org/wsdl/mime/" 
        xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/" 
        xmlns:soapenc="http://schemas.xmlsoap.org/soap/encoding/" 
        xmlns:s0="http://www.ncbi.nlm.nih.gov/soap/eutils/" 
        xmlns:wsdl="http://schemas.xmlsoap...
    

    ….

     
  • Scott Nichol
    Scott Nichol
    2010-05-27

    Can you run my test app to confirm that there is not a difference between my version of PHP and yours that causes different behaviors?  If the test works for you but the nusoap.php does not, then there is something wrong with the code I have added to nusoap.php.

    I have an xml file (cazypedia.xml) and php (cazypedia.php) to test injection of xml declaration information.  The path to cazypedia.xml is hard-coded in cazypedia.php, so you need to change it for your environment.  The files are

    cazypedia.xml

    <?xml version="1.0"?>
    <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" >
    <SOAP-ENV:Body><eSummaryResult xmlns="http://www.ncbi.nlm.nih.gov/soap/eutils/esummary">
    <DocSum>
    <Id>1618761</Id>
    <Item Name="PubDate" Type="Date">1992 Jun 25</Item>
    <Item Name="EPubDate" Type="Date"></Item>
    <Item Name="Source" Type="String">J Biol Chem</Item>
    <Item Name="AuthorList" Type="List">
    <Item Name="Author" Type="String">Gebler J</Item>
    <Item Name="Author" Type="String">Gilkes NR</Item>
    <Item Name="Author" Type="String">Claeyssens M</Item>
    <Item Name="Author" Type="String">Wilson DB</Item>
    <Item Name="Author" Type="String">Béguin P</Item>
    <Item Name="Author" Type="String">Wakarchuk WW</Item>
    <Item Name="Author" Type="String">Kilburn DG</Item>
    <Item Name="Author" Type="String">Miller RC Jr</Item>
    <Item Name="Author" Type="String">Warren RA</Item>
    <Item Name="Author" Type="String">Withers SG</Item>
    </Item>
    </DocSum>
    </eSummaryResult>
    </SOAP-ENV:Body>
    </SOAP-ENV:Envelope>

    cazypedia.php

    <?php
    function start_element_handler($parser, $name, $attribs) {
    echo "start $name\n";
    }

    function end_element_handler($parser, $name) {
    echo "end $name\n";
    }

    function handler ($parser, $data) {
    echo "cdata $data\n";
    }

    echo "<pre>\n";

    /* Change this path for your environment */
    $filename = "c:\\code\\phphack\\cazypedia.xml";
    $len = filesize($filename);
    $fp = fopen($filename, "r");
    $xml = fread($fp, $len);
    fclose($fp);
    if (strlen($xml) != $len) {
    echo "Read " . strlen($xml) . " expected " . $len . "\n";
    }

    $encoding = "ISO-8859-1";

    $pos_xml = strpos($xml, '<?xml');
    if ($pos_xml !== FALSE) {
    $xml_decl = substr($xml, $pos_xml, strpos($xml, '?>', $pos_xml + 2) - $pos_xml + 1);
    if (preg_match("/encoding=[\"']([^\"']*)[\"']/", $xml_decl, $res)) {
    $xml_encoding = $res[1];
    if (strtoupper($xml_encoding) != $encoding) {
    $err = "Charset from HTTP Content-Type '" . $encoding . "' does not match encoding from XML declaration '" . $xml_encoding . "'";
    echo "$err\n";
    } else {
    echo "Charset from HTTP Content-Type matches encoding from XML declaration\n";
    }
    } else {
    echo "No encoding specified in XML declaration\n";
    //$xml = substr($xml, 0, $pos_xml + 5) . " encoding=\"$encoding\"" . substr($xml, $pos_xml + 5);
    $pos_end = strpos($xml, '?>', $pos_xml + 2);
    $xml = substr($xml, 0, $pos_end)  . " encoding=\"$encoding\"" . substr($xml, $pos_end);
    echo "********* Modified XML to be parsed *************\n";
    echo htmlspecialchars($xml, ENT_QUOTES);
    echo "********* End of modified XML *************\n";
    }
    } else {
    echo "No XML declaration\n";
    $xml = "<?xml version=\"1.0\" encoding=\"$encoding\"?>" . $xml;
    echo "********* Modified XML to be parsed *************\n";
    echo htmlspecialchars($xml, ENT_QUOTES);
    echo "********* End of modified XML *************\n";
    }

    $parser = xml_parser_create("ISO-8859-1");
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "ISO-8859-1");
    xml_set_element_handler($parser, 'start_element_handler', 'end_element_handler');
    xml_set_character_data_handler($parser, 'handler');

    if (!xml_parse($parser, $xml, true)) {
        $err = sprintf('XML error parsing SOAP payload on line %d: %s',
        xml_get_current_line_number($parser),
        xml_error_string(xml_get_error_code($parser)));
    echo "$err\n";
    } else {
    echo "parsed successfully\n";
    }

    xml_parser_free($parser);

    echo "</pre>\n";
    ?>

     
  • Scott Nichol
    Scott Nichol
    2010-05-27

    The XML 1.0 spec (http://www.w3.org/TR/2008/REC-xml-20081126/#charencoding) *implies*, but is not explicit, that if there is no encoding in the XML declaration, the encoding is taken from MIME headers, transport, etc.  The HTTP 1.1 spec (http://www.ietf.org/rfc/rfc2616.txt) section 3.7.1 says

    >>>
    When no explicit charset parameter is provided by the sender, media subtypes of the "text"  type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.
    <<<

    In your example, there is no charset in the HTTP Content-Type, so the charset is ISO-8859-1.  Since the xml declaration has no encoding attribute, the XML should also be encoded in ISO-8859-1.
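
The rule quoted above can be written down as a small function. This is a sketch only, assuming PHP 7.1 or later; `charset_from_content_type` is a made-up name, not a NuSOAP function.

```php
<?php
// Hypothetical helper: effective charset for an HTTP Content-Type value,
// using the RFC 2616 section 3.7.1 default for "text" subtypes.
function charset_from_content_type(string $contentType): ?string
{
    // An explicit charset parameter wins.
    if (preg_match('/;\s*charset\s*=\s*"?([^\s;"]+)"?/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    // No explicit charset: "text/*" defaults to ISO-8859-1 over HTTP.
    if (stripos(ltrim($contentType), 'text/') === 0) {
        return 'ISO-8859-1';
    }
    return null; // this sketch defines no default for other media types
}

var_dump(charset_from_content_type('text/plain'));              // string(10) "ISO-8859-1"
var_dump(charset_from_content_type('text/xml; charset=utf-8')); // string(5) "UTF-8"
```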

    Note that strictly the HTTP spec is for version 1.1.  By default, NuSOAP uses HTTP 1.0.  You can call $client->usePersistentConnection() or $client->setEncoding() to force HTTP 1.1.  That may change the behavior of the server, but I don't think it will affect your problem.

     
  • jskywalker
    jskywalker
    2010-05-27

    About the default encoding of XML:
    I used Google a bit and found http://www.opentag.com/xfaq_enc.htm#enc_default

    it says:
    If no encoding declaration is present in the XML document …….the assumed encoding of an XML document depends on the presence of the Byte-Order-Mark (BOM).
    …..The BOM is optional for UTF-8.

    So, if I approach this from an XML point of view, this XML is encoded in UTF-8 (I did not verify whether there really is no BOM).
    When approached from HTTP, it is encoded in ISO-8859-1.

    This seems to be a contradiction, doesn't it?

     
  • Scott Nichol
    Scott Nichol
    2010-05-27

    To round out the context for anyone who does not follow the link, the full quote is
    >>>>
    If no encoding declaration is present in the XML document (and no external encoding declaration mechanism such as the HTTP header is available), the assumed encoding of an XML document depends on the presence of the Byte-Order-Mark (BOM).
    <<<<

    This says that when an XML document does not specify its own encoding, the inferred encoding depends on its context.  If the XML is, say, in a file, the BOM is used to determine the encoding.  If the XML is an e-mail attachment (with MIME headers) or in an HTTP body (with a Content-Type or defaults), its encoding is specified by HTTP, MIME, etc.  I don't see this as a contradiction.  It just says that the encoding is context-dependent.  For me, it reinforces the specification of encoding as a best practice.  Leaving the encoding out is asking for trouble.
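
One way to picture this context dependence is a function that picks the encoding in a fixed order. The order used here (external information first, then a BOM, then the XML declaration, then UTF-8) is one reading of the discussion above, and the name `resolve_xml_encoding` is made up; treat it as a sketch, assuming PHP 7.1 or later.

```php
<?php
// Hypothetical helper: pick the encoding an XML payload should be read in.
function resolve_xml_encoding(string $xml, ?string $externalCharset = null): string
{
    // 1. External information, e.g. a charset taken from HTTP or MIME headers.
    if ($externalCharset !== null && $externalCharset !== '') {
        return strtoupper($externalCharset);
    }
    // 2. A byte-order mark at the very start of the document.
    if (strncmp($xml, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }
    if (strncmp($xml, "\xFE\xFF", 2) === 0 || strncmp($xml, "\xFF\xFE", 2) === 0) {
        return 'UTF-16';
    }
    // 3. The encoding named in the XML declaration, if there is one.
    if (preg_match('/^<\?xml[^>]*\sencoding\s*=\s*["\']([^"\']+)["\']/', $xml, $m)) {
        return strtoupper($m[1]);
    }
    // 4. Nothing found: fall back to UTF-8.
    return 'UTF-8';
}

echo resolve_xml_encoding('<?xml version="1.0"?><a/>'), "\n";               // UTF-8
echo resolve_xml_encoding('<?xml version="1.0"?><a/>', 'iso-8859-1'), "\n"; // ISO-8859-1
```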

     
  • Bill  Flanagan
    Bill Flanagan
    2010-05-28

    I just came upon this thread after finding this problem and coming up with a fix for a MediaWiki extension I maintain. I followed the brute-force method cited above/below by forcing encoding to UTF-8. We use a web service to read Pubmed biomedical citation metadata. The input XML SOAP messages do not declare the character set encoding information. I suspect this is a pretty recent change in the service. Since the service is provided by the NIH and used by a lot of people already, I don't anticipate that they will change the service soon. But we can all hope. The other possibility is that Pubmed internationalized their character set encoding but just didn't identify it. I'm betting that I'll find that they do tell developers to use UTF-8 but didn't change their server when they started sending it.

    Before using the convert-to-UTF-8 hack, I did note that on one of my systems, the problem was mitigated by forcing the input character set for content to ISO-8859-1. This is probably irrelevant, but it did allow the XML SOAP payload to be parsed, although accented characters were mangled. This particular hack failed on other systems. This was the only one of the systems that had multibyte character support functions enabled, although they were not being called. When I recompiled PHP to support the mbstring functions and forced the encoding to UTF-8, it worked.

    On a final note, the characters now are bereft of their accents. When I forced the input type to ISO-8859-1, the XML parser returned with success but the accented characters returned as the infamous 'question mark' characters.

    Ideally, I'd like to get a fixed version of Nusoap that at least allows me to specify the default input character set to something like UTF-8. Several other sites that I'm aware of all are broken because of this and I'm trying to get a release to them as well as one I can use.

    By the way, Nusoap is a great package. I have no issues with PEAR and the native PHP version, but having one that allowed me to do hard things without regard for the PHP server config really is huge. Requiring the mbstring package isn't great but at least it's a way to solve this problem.

    Thanks for all the work you people put into it.

    Bill

     
  • CAZypedia
    CAZypedia
    2010-05-28

    @snichol, re: 11

    Here's the full output of cazypedia.php run on our server  - note the double 'cdata' lines for the parsed "Béguin P" entry.


    No encoding specified in XML declaration
    ********* Modified XML to be parsed *************
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" >
    <SOAP-ENV:Body><eSummaryResult xmlns="http://www.ncbi.nlm.nih.gov/soap/eutils/esummary">
    <DocSum>
    <Id>1618761</Id>
    <Item Name="PubDate" Type="Date">1992 Jun 25</Item>
    <Item Name="EPubDate" Type="Date"></Item>
    <Item Name="Source" Type="String">J Biol Chem</Item>
    <Item Name="AuthorList" Type="List">
    <Item Name="Author" Type="String">Gebler J</Item>
    <Item Name="Author" Type="String">Gilkes NR</Item>
    <Item Name="Author" Type="String">Claeyssens M</Item>
    <Item Name="Author" Type="String">Wilson DB</Item>
    <Item Name="Author" Type="String">Béguin P</Item>
    <Item Name="Author" Type="String">Wakarchuk WW</Item>
    <Item Name="Author" Type="String">Kilburn DG</Item>
    <Item Name="Author" Type="String">Miller RC Jr</Item>
    <Item Name="Author" Type="String">Warren RA</Item>
    <Item Name="Author" Type="String">Withers SG</Item>
    </Item>
    </DocSum>
    </eSummaryResult>
    </SOAP-ENV:Body>
    </SOAP-ENV:Envelope>********* End of modified XML *************
    start SOAP-ENV:Envelope
    cdata

    start SOAP-ENV:Body
    start eSummaryResult
    cdata

    start DocSum
    cdata

    start Id
    cdata 1618761
    end Id
    cdata

    start Item
    cdata 1992 Jun 25
    end Item
    cdata

    start Item
    end Item
    cdata

    start Item
    cdata J Biol Chem
    end Item
    cdata

    start Item
    cdata

    start Item
    cdata Gebler J
    end Item
    cdata

    start Item
    cdata Gilkes NR
    end Item
    cdata

    start Item
    cdata Claeyssens M
    end Item
    cdata

    start Item
    cdata Wilson DB
    end Item
    cdata

    start Item
    cdata B
    cdata éguin P
    end Item
    cdata

    start Item
    cdata Wakarchuk WW
    end Item
    cdata

    start Item
    cdata Kilburn DG
    end Item
    cdata

    start Item
    cdata Miller RC Jr
    end Item
    cdata

    start Item
    cdata Warren RA
    end Item
    cdata

    start Item
    cdata Withers SG
    end Item
    cdata

    end Item
    cdata

    end DocSum
    cdata

    end eSummaryResult
    cdata

    end SOAP-ENV:Body
    cdata

    end SOAP-ENV:Envelope
    parsed successfully
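
The double 'cdata' lines are worth a note: PHP's character data handler may be called more than once for a single text node, so a handler that overwrites its value would keep only "éguin P". A common way to cope is to append to a buffer and read it at the end tag. The sketch below assumes PHP 7 or later with the xml extension; `collect_item_text` is a made-up name, and the input is UTF-8 with a declared encoding to keep the example self-contained.

```php
<?php
// Hypothetical helper: collect the text of each <Item> element, joining
// however many chunks the parser delivers for one text node.
function collect_item_text(string $xml): array
{
    $items = array();
    $buffer = '';

    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attribs) use (&$buffer) {
            $buffer = ''; // a new element starts: reset the buffer
        },
        function ($p, $name) use (&$buffer, &$items) {
            if ($name === 'Item' && trim($buffer) !== '') {
                $items[] = trim($buffer);
            }
            $buffer = '';
        }
    );
    xml_set_character_data_handler($parser, function ($p, $data) use (&$buffer) {
        $buffer .= $data; // append, never overwrite
    });

    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $items;
}

$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . "<List><Item>Wilson DB</Item><Item>B\xC3\xA9guin P</Item></List>";
print_r(collect_item_text($xml));
```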

     
  • CAZypedia
    CAZypedia
    2010-05-28

    @wjf42:  Out of curiosity, which Mediawiki extension are you maintaining, and can you send me a link with some info?  We like Biblio.php for formatting PubMed and ISBNdb data at www.cazypedia.org, but we're always on the lookout for new features.