From: Stefan S. <ze...@ze...> - 2001-03-30 16:25:20
On Fri, Mar 30, 2001 at 10:15:12AM -0500, Chris Nandor wrote:
>
> It is just one extra complication we didn't get around to doing. On the
> one hand it seems easy, but on the other ... what if the encoding for your
> site ALLOWS these characters? Then maybe you wouldn't want them encoded.
> It could be a preference, of course, but I hope you can see why it isn't as
> easy as just doing a conversion on submission.
%-) Thanks for the hint ...
>
>
> >But all of this doesn't have to do anything with the problem I have:
> >encode_entities() can't convert UTF-8 encoded strings :-(
>
> Where do you have a UTF-8 string?
>
After doing the
    $p->parse($d, ProtocolEncoding => 'ISO-8859-1') or
        portaldLog("$bid did not parse properly");
in portald, $str seems to contain characters (in my case the German
umlauts) that HTML::Entities::encode can't convert anymore (and which don't
display right in the browser). I wanted to use it in "sub char_handler".
From the manpage of XML::Parser for the Char-Handler:
This event is generated when non-markup is recognized. The
non-markup sequence of characters is in String. A single
non-markup sequence of characters may generate multiple
calls to this handler. Whatever the encoding of the string
in the original document, this is given to the handler in
UTF-8.
So, for example, instead of an "ü" there shows up an "Ã¼" (I hope you get the
characters transmitted right ...).
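
To make the byte-level picture concrete, here is a minimal sketch of what seems to be happening and one possible way around it. It assumes a Perl that ships the Encode module, which portald itself may not have; the variable names are only for illustration. HTML::Entities is deliberately left out so the snippet depends on nothing beyond Encode, and the last step builds a numeric character reference by hand instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# The two bytes that show up in place of the u-umlaut: 0xC3 0xBC.
# Displayed as Latin-1 they look like "A-tilde" followed by "1/4".
my $utf8_bytes = "\xC3\xBC";

# Decode the UTF-8 bytes into a single Perl character.
my $chars = decode('UTF-8', $utf8_bytes);
printf "code point: %d\n", ord $chars;

# Re-encode as ISO-8859-1 to get back the one byte a Latin-1 page expects.
my $latin1 = encode('ISO-8859-1', $chars);
printf "latin-1 byte: 0x%02X\n", ord $latin1;

# Alternatively, replace every non-ASCII character with a numeric
# character reference, which does not depend on the page encoding.
(my $entity = $chars) =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge;
print "entity: $entity\n";
```

If that is right, the conversion would have to happen inside the Char handler, before the string is handed to anything that assumes Latin-1.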
>
> >And I still think that XML::Parser returns all strings in UTF-8 after
> >doing a parse() regardless the encoding of the strings before.
>
> I am pretty sure it just returns raw data. If you put in a u with an
> umlaut (Latin-1 character 252), then that is what you should get in return.
>
> #!/usr/bin/perl -wl
> use XML::RSS;
> $x = XML::RSS->new(encoding=>"ISO-8859-1");
> $y = XML::RSS->new(encoding=>"ISO-8859-1");
> $x->channel(title => chr(252));
> $y->parse($x->as_string);
> print ord $y->{channel}{title};
>
> Returns:
>
> 252
>
> That's just Latin-1. There's no UTF-8 string there.
>
Hmmm ... who else is killing my umlauts then?
Greets
Steve