|
From: Stefan S. <ze...@ze...> - 2001-03-26 21:52:49
|
Hi and hello,
portald doesn't seem to work with RDF containing German umlauts (ü, ö, ä,
etc.).
XML::Parser just stops with an error if it encounters such a character.
(Well, in fact, it only does this if the encoding isn't set right in the
RDF file.) So I added the following:
$p->parse($d, ProtocolEncoding => 'ISO-8859-1') or
portaldLog("$bid did not parse properly");
This seems to fix the problem as a first step, but now, if I understand it
correctly, the resulting data is encoded in UTF-8 and therefore doesn't
display right in the browser. I only get very ugly symbols, not the
umlauts (in fact, these should be HTML-encoded, but that is another
problem, which I can easily deal with).
Any idea how to fix this?
Greets,
Steve
|
|
From: Chris N. <pu...@po...> - 2001-03-26 22:08:02
|
At 23:52 +0200 2001.03.26, Stefan Strigler wrote:
>portald doesn't seem to work with RDF containing german Umlauts (ü, ö, ä,
>etc.)
>
>XML::Parser just stops with en error, if it encounters such a character.
>(Well in fact, he just does this, if the encoding isn't set right in the
>RDF-file). So I added the following:
>
>$p->parse($d, ProtocolEncoding => 'ISO-8859-1') or
> portaldLog("$bid did not parse properly");
The fact that $p->parse($d) fails means the document is broken. If Slash
created that document, then the way to fix it is to include
"encoding => 'ISO-8859-1'" in the newrdf function in slashd. In Slash 2.0,
this is handled by "rdfencoding" in the vars table.
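A minimal sketch of why the unpatched parse fails, assuming the core Encode module (bundled with Perl from 5.8 on, so newer than this thread): an XML document without an encoding declaration is read as UTF-8, and a lone Latin-1 byte such as 0xFC is not well-formed UTF-8. The helper name is_valid_utf8 is hypothetical; it is not part of Slash or XML::Parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if $bytes is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input instead of
    # silently substituting a replacement character.
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $latin1_title = "f\xFCr";        # u-umlaut as the single Latin-1 byte FC
my $utf8_title   = "f\xC3\xBCr";    # the same word encoded as UTF-8

print "Latin-1 bytes valid as UTF-8? ", is_valid_utf8($latin1_title), "\n";
print "UTF-8 bytes valid as UTF-8?   ", is_valid_utf8($utf8_title), "\n";
```

The first check should fail and the second succeed, which is the situation an encoding declaration in the RDF (or ProtocolEncoding on the parser side) is meant to resolve.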
>This seems to fix the problem for the first step, but now, if I got it
>right, the resulting data is encoded in UTF-8 and therefore doesn't
Nope. UTF-8 is the default. ISO-8859-1 is not UTF-8, but Latin-1.
>display right in the browser. I only get very ugly symbols but not the
>umlauts (in fact, these should be html-encoded, but this is another
>problem, I can easily deal with).
Nope, the encoding of the RDF really has nothing to do with how it is
displayed. "ü" will look bad on most browsers, because chances are your
HTML document is set to only display ASCII characters, and even if it is
not, most clients won't display it properly. The answer is to do as you
suggest: convert the data to the proper HTML entities.
When you put data into Slash, in a story etc., you should use the HTML
entities. The only problem with this is in the daily mailer, which (I
think) would just put the HTML entities in verbatim. It's on my list of
things to look at.
--
Chris Nandor pu...@po... http://pudge.net/
Open Source Development Network pu...@os... http://osdn.com/
|
|
From: Alessio B. <al...@al...> - 2001-03-30 12:19:15
|
Chris Nandor wrote:
> When you put data into Slash, in a story etc., you should use the HTML
> entities. The only problem with this is in the daily mailer, which (I
> think) would just put the HTML entities in verbatim. It's on my list of
> things to look at.
One of the points of Slash is to let less computer-literate people work
with the system without trouble. Asking them to replace characters they
have on their keyboard with HTML entities is not very popular and
probably not useful. Is it possible/reasonable to have a conversion from
special chars to entities built into the admin interface?
P.S. Thanks to Chris for looking into these hairy problems, which are really
important for us old Europeans. :-)
--
Alessio F. Bragadini al...@al...
APL Financial Services http://village.albourne.com
Nicosia, Cyprus phone: +357-2-755750
"It is more complicated than you think"
-- The Eighth Networking Truth from RFC 1925
|
|
From: Stefan S. <ze...@ze...> - 2001-03-30 12:42:23
|
On Fri, Mar 30, 2001 at 03:19:04PM +0300, Alessio Bragadini wrote:
>
> One of the point of Slash is to have less-computer-literated people work
> with the system with no troubles. Asking them to change characters they
> have on their keyboard with HTML entities is not very popular and
> probably not useful. Is it possible/reasonable to have a conversion from
> special chars to entities built in the admin interface?
>
Using HTML::Entities, all this is no problem at all (apart from the work of
using it consistently ... ;-) ).
Just do a decode_entities() followed by an encode_entities() for every
text input.
Sure, it would be better *not* to store HTML entities in the DB. With
encode_entities() you could generate them every time some text from the
DB is to be sent over HTTP.
But none of this has anything to do with the problem I have:
encode_entities() can't convert UTF-8 encoded strings :-(
And I still think that XML::Parser returns all strings in UTF-8 after
doing a parse(), regardless of the encoding of the strings before.
If you still don't believe me, have a look at http://wurbel.spline.de and
select the "Heise Newsticker" box from "Benutzeraccount" on the left.
(And, for sure, Mozilla for example can display umlauts without them
needing to be encoded.)
Greets
Steve
|
|
From: Chris N. <pu...@po...> - 2001-03-30 15:15:57
|
At 14:42 +0200 2001.03.30, Stefan Strigler wrote:
>On Fri, Mar 30, 2001 at 03:19:04PM +0300, Alessio Bragadini wrote:
>>
>> One of the point of Slash is to have less-computer-literated people work
>> with the system with no troubles. Asking them to change characters they
>> have on their keyboard with HTML entities is not very popular and
>> probably not useful. Is it possible/reasonable to have a conversion from
>> special chars to entities built in the admin interface?
>
>Using HTML::Entities all this is no problem at all (despite the work to
>use it consistently ... ;-) ).
>Just do a decode_entities() followed by a encode_entities() for every
>text-input.
>Sure it would be better to *not* store HTML-entities in the DB. With
>encode_entities() you could generate them everytime some text from the
>db ist to be sent over HTTP.
It is just one extra complication we didn't get around to doing. On the
one hand it seems easy, but on the other ... what if the encoding for your
site ALLOWS these characters? Then maybe you wouldn't want them encoded.
It could be a preference, of course, but I hope you can see why it isn't as
easy as just doing a conversion on submission.
>But all of this doesn't have to do anything with the problem I have:
>encode_entities() can't convert UTF-8 encoded strings :-(
Where do you have a UTF-8 string?
>And I still think that XML::Parser returns all strings in UTF-8 after
>doing a parse() regardless the encoding of the strings before.
I am pretty sure it just returns raw data. If you put in a u with an
umlaut (Latin-1 character 252), then that is what you should get in return.
#!/usr/bin/perl -wl
use XML::RSS;
$x = XML::RSS->new(encoding=>"ISO-8859-1");
$y = XML::RSS->new(encoding=>"ISO-8859-1");
$x->channel(title => chr(252));
$y->parse($x->as_string);
print ord $y->{channel}{title};
Returns:
252
That's just Latin-1. There's no UTF-8 string there.
>If you still don't believe have a look at http://wurbel.spline.de and
>select the "Heise Newsticker"-box from "Benutzeraccount" on the left.
I can't find anything like that on the site; it might not help that I don't
speak German.
--
Chris Nandor pu...@po... http://pudge.net/
Open Source Development Network pu...@os... http://osdn.com/
|
|
From: Stefan S. <ze...@ze...> - 2001-03-30 16:25:20
|
On Fri, Mar 30, 2001 at 10:15:12AM -0500, Chris Nandor wrote:
>
> It is just one extra complication we didn't get around to doing. On the
> one hand it seems easy, but on the other ... what if the encoding for your
> site ALLOWS these characters? Then maybe you wouldn't want them encoded.
> It could be a preference, of course, but I hope you can see why it isn't as
> easy as just doing a conversion on submission.
%-) Thanks for the hint ...
>
>
> >But all of this doesn't have to do anything with the problem I have:
> >encode_entities() can't convert UTF-8 encoded strings :-(
>
> Where do you have a UTF-8 string?
>
After doing the
$p->parse($d, ProtocolEncoding => 'ISO-8859-1') or
portaldLog("$bid did not parse properly");
in portald, $str seems to contain characters (in my case the German
umlauts) that HTML::Entities::encode can't convert anymore (and which don't
display right in the browser). I wanted to use it in "sub char_handler".
From the manpage of XML::Parser for the Char-Handler:
This event is generated when non-markup is recognized. The
non-markup sequence of characters is in String. A single
non-markup sequence of characters may generate multiple
calls to this handler. Whatever the encoding of the string
in the original document, this is given to the handler in
UTF-8.
So, for example, instead of an "ü" there shows up an "Ã¼" (I hope the
characters get transmitted right ...).
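The difference described above can be made visible by listing the bytes. This is a small sketch using the core Encode module (an assumption, since it postdates this thread); hex_bytes is a hypothetical helper name. It writes the same character, code point 252, in both encodings.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex byte values of $char when written out in $encoding.
sub hex_bytes {
    my ($encoding, $char) = @_;
    return map { sprintf '%02X', $_ } unpack 'C*', encode($encoding, $char);
}

my $u_umlaut = chr(252);    # one character, code point 252

print "ISO-8859-1: ", join(' ', hex_bytes('ISO-8859-1', $u_umlaut)), "\n";
print "UTF-8:      ", join(' ', hex_bytes('UTF-8', $u_umlaut)), "\n";
```

A browser that treats the two UTF-8 bytes C3 BC as Latin-1 shows them as two separate characters, which matches the "ugly symbols" reported earlier in the thread.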
>
> >And I still think that XML::Parser returns all strings in UTF-8 after
> >doing a parse() regardless the encoding of the strings before.
>
> I am pretty sure it just returns raw data. If you put in a u with an
> umlaut (Latin-1 character 252), then that is what you should get in return.
>
> #!/usr/bin/perl -wl
> use XML::RSS;
> $x = XML::RSS->new(encoding=>"ISO-8859-1");
> $y = XML::RSS->new(encoding=>"ISO-8859-1");
> $x->channel(title => chr(252));
> $y->parse($x->as_string);
> print ord $y->{channel}{title};
>
> Returns:
>
> 252
>
> That's just Latin-1. There's no UTF-8 string there.
>
Hmmm ... who else is killing my umlauts then?
Greets
Steve
|
|
From: Chris N. <pu...@po...> - 2001-03-30 16:35:16
|
At 18:09 +0200 2001.03.30, Stefan Strigler wrote:
>So for example instead an "ü" there shows up an "Ã¼" (I hope you get the
>characters transmitted right ...).
Hm. I am not sure what the problem is, I don't do a lot of work with
XML::Parser directly. However, you might want to look into Unicode::Map8:
http://search.cpan.org/search?dist=Unicode-Map8
I guess this is something we will need to revisit. Maybe in the new
slashcode-i18n mailing list, which I plan to announce this afternoon?
--
Chris Nandor pu...@po... http://pudge.net/
Open Source Development Network pu...@os... http://osdn.com/
|
|
From: Stefan S. <st...@ze...> - 2001-03-30 16:44:34
|
On Fri, Mar 30, 2001 at 11:34:50AM -0500, Chris Nandor wrote:
>
> I guess this is something we will need to revisit. Maybe in the new
> slashcode-i18n mailing list, which I plan to announce this afternoon?
>
Oh cool! :-)
Greets
Steve
|
|
From: Stefan S. <st...@ze...> - 2001-03-30 18:09:02
|
On Fri, Mar 30, 2001 at 11:34:50AM -0500, Chris Nandor wrote:
> At 18:09 +0200 2001.03.30, Stefan Strigler wrote:
> >So for example instead an "ü" there shows up an "Ã¼" (I hope you get the
> >characters transmitted right ...).
>
> Hm. I am not sure what the problem is, I don't do a lot of work with
> XML::Parser directly. However, you might want to look into Unicode::Map8:
>
> http://search.cpan.org/search?dist=Unicode-Map8
>
Thanks for the hint, I think I have a solution:
%<-------<schnipp>-------
#!/usr/bin/perl -wl
use Unicode::Map8;
use Unicode::String qw(utf8 latin1);
my $map = Unicode::Map8->new("latin1") || die;
my $string = "Einschüchterung gehört zur politischen Kultur";
print "original string -> $string\n";
print "broken string -> ", $map->tou($string), "\n";
print "repaired string -> ", $map->tou($string)->latin1, "\n";
%<-------<schnapp>-------
So, after including the libs from above in portald, I changed char_handler to
%<-------<schnipp>-------
[ ... ]
use Unicode::String qw(utf8 latin1);
[ ... ]
sub char_handler {
    my($p, $data) = @_;
    $data =~ s/\s/ /g;
    # XML::Parser hands the text over in UTF-8; convert it back to
    # Latin-1 so that encode_entities() recognizes the umlauts.
    $data = encode_entities(utf8($data)->latin1);
    if ($snatchtitle) {
        $title .= $data;
    } elsif ($snatchlink) {
        $link .= $data;
    }
}
%<-------<schnapp>-------
And it seems to work now ...
Greets
Steve
|
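For comparison, the same idea can be sketched with only core modules, assuming a Perl new enough to ship Encode (5.8 or later) and assuming the handler receives raw UTF-8 bytes as described in the thread. The name utf8_to_entities is hypothetical, and unlike encode_entities() this sketch emits numeric entities and leaves &, < and > untouched.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn UTF-8 bytes (as handed to an XML::Parser Char
# handler in this thread) into ASCII text with numeric HTML entities.
sub utf8_to_entities {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);    # bytes -> characters
    # Replace every non-ASCII character with its numeric entity.
    $chars =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge;
    return $chars;
}

print utf8_to_entities("Einsch\xC3\xBCchterung geh\xC3\xB6rt"), "\n";
```

Decoding first means the substitution sees one character per umlaut rather than two bytes, which is the step the Unicode::String call performs in the char_handler above.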