#157 UTF-8 nbsp character causes incorrect content-length setting

open
Transport (26)
5
2009-04-02
2009-03-31
No

using 0.710.07 with Bugzilla 3.2.2 and Testopia 2.2. I've found that certain fields I'm retrieving via XMLRPC include the character sequence &#xd; (ampersand hash x d semi-colon), apparently the XML encoding for a carriage return. when such a character is encountered, my client barfs with something like:
Failed to parse servers response: XML document structures must start and end within the same entity.
looking at the raw TCP/HTTP with wireshark, its protocol analyzer is also confused, seeing garbage at the end of the XML response (an incomplete element, typically), and then additional HTTP data which is in fact the remainder of the XML data. basically the content-length is being set incorrectly, I believe because somewhere the &#xd; is being interpreted as a single character -- the extent to which content-length is off does seem to correspond to the number of &#xd;'s present.
I see comments indicative that this sort of thing is recognized as a potential issue in HTTP.pm:
# what's this all about?
# unfortunately combination of LWP and Perl 5.6.1 and later has bug
# in sending multibyte characters. LWP uses length() to calculate
# content-length header and starting 5.6.1 length() calculates chars
# instead of bytes. 'use bytes' in THIS file doesn't work, because
....
but the code therein does not seem to handle this situation.
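The mismatch described in that HTTP.pm comment can be sketched in a few lines. This is only an illustration in Python (not the Perl code from SOAP::Lite); the response body is a made-up example, and the only assumption is that the header is computed from the character count while the body goes out as UTF-8 bytes.

```python
# Hypothetical response body containing one non-breaking space (U+00A0).
body = "<value>foo\u00a0bar</value>"

char_count = len(body)                   # what a character-based length() reports
wire = body.encode("utf-8")              # what is actually sent on the socket
byte_count = len(wire)                   # what Content-Length should be

# A client that trusts a character-based Content-Length stops reading early:
seen_by_client = wire[:char_count]       # truncated XML, last element incomplete
leftover = wire[char_count:]             # shows up as stray data after the response

print(char_count, byte_count)
print(seen_by_client)
print(leftover)
```

Each multi-byte character makes the header one or more bytes too small, which would match the observation that the error grows with the number of offending characters.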

Discussion

  • desolat

    desolat - 2009-04-01

    Can confirm that. Same Perl environment here and also using Bugzilla/Testopia in same versions.

     
  • desolat

    desolat - 2009-04-01

    More testing shows: HTML entity encoding seems to be the cause. For me it was mainly German umlauts.

     
  • Nobody/Anonymous

    it appears that the &#xd; was a red herring, and rather the problem is with non-breaking spaces, i.e. &nbsp; or \xc2\xa0. it looks like in some cases I was storing (and attempting to send via xml-rpc) the utf8 sequence \xc2\xa0; when I edited the cases through the testopia web UI something (whether testopia or firefox) converted these to be stored in the database as &nbsp;, and thereafter they came across the wire as the text characters "&nbsp;", making the problem go away. desolat, I suspect that the same thing was happening for you, and again the umlauts and other such HTML entities existed alongside the nbsp's, and editing of any sort converted the nbsp's to &nbsp; in the back-end database. furthermore, I suspect that someone who starts from a recent version of testopia will never have this problem, and only cases created in older (pre-2.0?) versions manifest the problem.

    our workaround for the problem is to replace the occurrences of \xc2\xa0 with &nbsp; in HTTP.pm. I'm not sure how reasonable this is, frankly, but it works for us.

    --- HTTP.pm.orig 2009-03-31 13:37:50.000000000 -0700
    +++ HTTP.pm 2009-04-01 11:18:28.000000000 -0700
    @@ -403,6 +403,8 @@
    sub make_response {
    my ($self, $code, $response) = @_;

    + $response =~ s/\xc2\xa0/&nbsp;/g;
    +
    my $encoding = $1
    if $response =~ /^<\?xml(?: version="1.0"| encoding="([^\"]+)")+\?>/;
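Why the substitution above hides the symptom can be sketched as follows. This is a Python illustration, not the Perl patch itself; `replace_nbsp` is a hypothetical helper name, and it assumes the response is handled as raw UTF-8 bytes at that point.

```python
NBSP_UTF8 = b"\xc2\xa0"  # UTF-8 encoding of U+00A0

def replace_nbsp(response: bytes) -> bytes:
    """Swap raw UTF-8 nbsp bytes for the ASCII text '&nbsp;' (hypothetical helper)."""
    return response.replace(NBSP_UTF8, b"&nbsp;")

raw = "<value>foo\u00a0bar</value>".encode("utf-8")
patched = replace_nbsp(raw)

# After the replacement the body is pure ASCII, so counting characters
# and counting bytes give the same Content-Length.
text = patched.decode("utf-8")
print(len(text), len(patched))
```

The replacement only sidesteps the length mismatch for this one character; any other multi-byte character in the response would still trigger it.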

     
  • Nathan Parrish

    Nathan Parrish - 2009-04-02
    • summary: character causing incorrect content-length setting --> UTF-8 nbsp character causes incorrect content-length setting
     
  • Nobody/Anonymous

    The basic problem seems to be the incorrect content-length calculation for multi-byte characters such as used in UTF-8.

    Example: German "ä" is a UTF-8 2-byte character (0xc3a4). This is taken into account for the content-length. But in the HTTP dump (see https://bugzilla.mozilla.org/attachment.cgi?id=370398) there are 4 1-byte characters for the "ä" (\303\203\302\244 or 0x c3 83 c2 a4). This means 2 additional bytes of which no account is being taken in the content-length. And this clearly shifts out the last 2 1-byte characters of the http-response.

    So the question is: why is a 2-byte character being translated into a 4-byte representation (is this UTF-16?)?

     
  • Nobody/Anonymous

    Ah, getting closer to it. Seems to be a double-encoding problem: the already UTF-8-encoded 2-byte character (my database is completely UTF-8) gets encoded again after the content-length calculation.
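The byte values quoted in the previous comment are what double encoding would produce. A minimal Python sketch, assuming the second pass treats each already-encoded byte as a Latin-1 character before encoding to UTF-8 again:

```python
ch = "\u00e4"                                   # German "ä"

once = ch.encode("utf-8")                       # correct encoding, 2 bytes
# Second pass: each byte is mistaken for a character and encoded again.
twice = once.decode("latin-1").encode("utf-8")  # 4 bytes

print(once.hex(" "))    # c3 a4
print(twice.hex(" "))   # c3 83 c2 a4
print(len(twice) - len(once))
```

So this is not UTF-16: it is UTF-8 applied twice, and every 2-byte character adds 2 bytes that a Content-Length computed before the second pass does not cover.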

     
  • Nobody/Anonymous

    Finally fixed it. The cause was a Bugzilla WebService workaround for a SOAP::Lite bug that seems to have been fixed in the meantime. See the Bugzilla bug report mentioned above for details.

     
