UTF-8 nbsp character causes incorrect content-length setting

Brought to you by: byrnereese, kutterma, paulclinger

#157 UTF-8 nbsp character causes incorrect content-length setting

Status: open

Owner: Martin Kutter

Labels: Transport (26)

Priority: 5

Updated: 2009-04-02

Created: 2009-03-31

Creator: Nathan Parrish

Private: No

using SOAP::Lite 0.710.07 with Bugzilla 3.2.2 and Testopia 2.2. I've found that certain fields I'm retrieving via XMLRPC include the character sequence &#xd; (ampersand hash x d semi-colon), apparently the XML encoding for a carriage return. when such a character is encountered, my client barfs with something like:
Failed to parse servers response: XML document structures must start and end within the same entity.
looking at the raw TCP/HTTP with wireshark, its protocol analyzer is also confused: it sees garbage at the end of the XML response (typically an incomplete element), followed by additional HTTP data which is in fact the remainder of the XML data. basically the content-length is being set incorrectly, I believe because somewhere the &#xd; is being interpreted as a single character. the extent to which content-length is off does seem to correspond to the number of &#xd;'s present.
I see comments in HTTP.pm indicating that this sort of thing is recognized as a potential issue:
# what's this all about?
# unfortunately combination of LWP and Perl 5.6.1 and later has bug
# in sending multibyte characters. LWP uses length() to calculate
# content-length header and starting 5.6.1 length() calculates chars
# instead of bytes. 'use bytes' in THIS file doesn't work, because
....
but the code therein does not seem to handle this situation.
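
A minimal sketch of the character-versus-byte mismatch those comments describe, assuming the body is a Perl character string that will be sent as UTF-8. The helper `content_length_for` is hypothetical and is not part of HTTP.pm or LWP; it only illustrates counting the encoded bytes instead of the characters.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not from SOAP::Lite or LWP): count the bytes that
# would go on the wire, rather than the number of characters.
sub content_length_for {
    my ($text) = @_;
    return length(encode('UTF-8', $text));
}

# 'a', U+00A0 NO-BREAK SPACE, 'b'
my $body = "a\x{00A0}b";

printf "characters: %d\n", length($body);              # characters: 3
printf "bytes:      %d\n", content_length_for($body);  # bytes:      4
```

If the header is filled in from the first number while the second number of bytes is sent, the response is one byte longer than announced for every such character.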

Discussion

desolat - 2009-04-01

Can confirm that. Same Perl environment here, and also using the same versions of Bugzilla/Testopia.

desolat - 2009-04-01

See Bugzilla/Testopia bug: https://bugzilla.mozilla.org/show_bug.cgi?id=486306

desolat - 2009-04-01

More testing shows: HTML entity encoding seems to be the cause. For me it was mainly German umlauts.

Nobody/Anonymous - 2009-04-01

it appears that the &#xd; was a red herring, and the real problem is with non-breaking spaces (&nbsp;) sent as the raw UTF-8 sequence \xc2\xa0. it looks like in some cases I was storing (and attempting to send via xml-rpc) the UTF-8 sequence \xc2\xa0; when I edited the cases through the testopia web UI, something (whether testopia or firefox) converted these to be stored in the database as &nbsp;, and thereafter they came across the wire as the text characters "&nbsp;", making the problem go away. desolat, I suspect that the same thing was happening for you: the umlauts and other such HTML entities existed alongside the nbsp's, and editing of any sort converted the nbsp's to &nbsp; in the back-end database. furthermore, I suspect that someone who starts from a recent version of testopia will never have this problem, and only cases created in older (pre-2.0?) versions manifest it.

our workaround for the problem is to replace the occurrences of \xc2\xa0 with &nbsp; in HTTP.pm. I'm not sure how reasonable this is, frankly, but it works for us.

--- HTTP.pm.orig 2009-03-31 13:37:50.000000000 -0700
+++ HTTP.pm 2009-04-01 11:18:28.000000000 -0700
@@ -403,6 +403,8 @@
 sub make_response {
     my ($self, $code, $response) = @_;
 
+    $response =~ s/\xc2\xa0/&nbsp;/g;
+
     my $encoding = $1
         if $response =~ /^<\?xml(?: version="1.0"| encoding="([^\"]+)")+\?>/;

Nathan Parrish - 2009-04-02

summary: &#xd; character causing incorrect content-length setting --> UTF-8 nbsp character causes incorrect content-length setting

Nobody/Anonymous - 2009-04-02

The basic problem seems to be an incorrect content-length calculation for multi-byte characters such as those used in UTF-8.

Example: German "ä" is a 2-byte character in UTF-8 (0xc3 0xa4), and this 2-byte length is what is counted for the content-length. But in the HTTP dump (see https://bugzilla.mozilla.org/attachment.cgi?id=370398) the "ä" appears as four bytes (\303\203\302\244 or 0x c3 83 c2 a4). That is 2 additional bytes that are not accounted for in the content-length, which pushes the last 2 bytes of the HTTP response past the announced length.

So the question is: why is a 2-byte character being translated into a 4-byte representation (is this UTF-16?)?
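
A rough sketch of the arithmetic in the example above, using a made-up 9-byte body rather than the real dump. The helper `read_announced` and the `<v>` element are hypothetical; the sketch only assumes a client that reads exactly as many bytes as the content-length announces.

```perl
use strict;
use warnings;

# Hypothetical helper: what a client sees when it reads exactly the number
# of bytes announced in the Content-Length header.
sub read_announced {
    my ($wire_bytes, $content_length) = @_;
    return substr($wire_bytes, 0, $content_length);
}

# Body as it was measured: "ä" as the 2 bytes c3 a4 (9 bytes in total).
my $measured = "<v>\xc3\xa4</v>";
# Body as it was sent: "ä" as the 4 bytes c3 83 c2 a4 (11 bytes in total).
my $sent = "<v>\xc3\x83\xc2\xa4</v>";

my $seen     = read_announced($sent, length($measured));
my $leftover = substr($sent, length($measured));

printf "announced %d, sent %d, leftover '%s'\n",
    length($measured), length($sent), $leftover;
```

The client ends up with an unterminated `</` and the trailing `v>` arrives as stray data, which matches the incomplete element and extra HTTP data seen in wireshark.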

Nobody/Anonymous - 2009-04-02

Ah, getting closer to it. It seems to be a double-encoding problem: the already UTF-8-encoded 2-byte character (my database is completely UTF-8) gets encoded again after the content-length calculation.
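
The double encoding can be reproduced in isolation with the core Encode module. This is only a sketch of the effect, not the actual code path in SOAP::Lite or Bugzilla; `hex_bytes` is a hypothetical display helper.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper for display only: render a byte string as hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

my $char  = "\x{00E4}";               # "ä" as a single character
my $once  = encode('UTF-8', $char);   # correct encoding: 2 bytes
my $twice = encode('UTF-8', $once);   # the 2 bytes are treated as 2 characters
                                      # and each one is encoded again

print hex_bytes($once),  "\n";   # c3 a4
print hex_bytes($twice), "\n";   # c3 83 c2 a4
```

The second line is the same byte sequence reported from the HTTP dump, so the 4-byte form is UTF-8 applied twice rather than UTF-16.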

Nobody/Anonymous - 2009-04-02

This seems to be related to a known SOAP::Lite bug filed here: http://rt.cpan.org/Public/Bug/Display.html?id=32952

Bugzilla uses a workaround in the package Bugzilla::WebService::XMLRPC::Serializer

Nobody/Anonymous - 2009-04-02

Now filed as a ticket at https://rt.cpan.org/Ticket/Display.html?id=44755

Nobody/Anonymous - 2009-05-07

Finally managed to fix it. The cause was a Bugzilla WebService fix that tried to work around a SOAP::Lite bug which seems to have been fixed in the meantime. See the Bugzilla bug report mentioned above for details.
