Non-Latin 1 characters shown as % encodings
Status: Beta
Brought to you by:
morbus
The CVS version now shows non-Latin 1 characters as %nn
encoded bytes, instead of UTF-8 characters as before.
Sample feeds:
http://www.ibm.com/news/at/de/index.rss
"Österreich" is rendered as "%C3%96sterreich"
http://www.djeaux.com/rss/insecure-bugtraq.rss
"bugtraq @ insecure.org" is rendered as "bugtraq %2540 insecure.org"
Logged In: YES
user_id=69804
This is a side effect of the bugfix for "non-latin1 characters cause
the mySubscriptions.opml to be destroyed", which is still open as a
bug in the tracker (with the understanding that the bugfix is solely
a "lesser of two evils": subscription corruption or corrupted display;
though see below). Roughly, the %nn ("decimal") encoding happens in
save_my_channels (MyChannels.pm), provided cheaply by URI::Escape, a
default AmphetaDesk module, so that when the subscriptions are saved
to disk, they're readable on the next restart without crashing.
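For reference, a minimal sketch of what this kind of escaping does to the two sample strings. The percent_encode helper is a hypothetical stand-in that behaves roughly like URI::Escape's uri_escape, written out here only so the snippet runs without extra modules; the exact "safe" character set AmphetaDesk uses is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for URI::Escape's uri_escape: every byte outside a
# small "safe" set becomes %XX (two uppercase hex digits).
sub percent_encode {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord($1))/ge;
    return $bytes;
}

# "Österreich" as UTF-8 bytes: the O-umlaut is the two bytes C3 96.
my $once = percent_encode("\xC3\x96sterreich");

# Escaping an already-escaped value re-encodes the "%" itself as %25.
my $twice = percent_encode(percent_encode('@'));

print "$once\n";   # %C3%96sterreich
print "$twice\n";  # %2540
```

The second result suggests the "%2540" in the bugtraq feed comes from a value being escaped twice, though that is a guess from the output rather than something confirmed in the code.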
The side effect occurs because this translation happens on the
in-memory data, and the encoded values stay in memory until the user
refreshes index.html (when the internal subscription data is refreshed
with the "real" data from the channel itself). Case in point: add one of
those feeds and see how the "My Channels" page is incorrect. Refresh the
"Channels Home" page, and things are fine (everywhere save for
myChannels.opml).
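One way to keep the display intact, sketched under assumptions: escape a copy of each record at save time and leave the live data alone. The %channels layout and the escaped_copy helper are hypothetical; the real %CHANNELS structure may be nested and need a deeper copy.

```perl
use strict;
use warnings;

# Hypothetical in-memory subscription data; the real layout is assumed.
my %channels = (
    ibm => {
        title  => "\xC3\x96sterreich",
        xmlurl => 'http://www.ibm.com/news/at/de/index.rss',
    },
);

# Return an escaped copy of one flat record; the original is not modified.
sub escaped_copy {
    my ($channel) = @_;
    my %copy = %$channel;    # shallow copy is enough for flat records
    for my $key (keys %copy) {
        # Escape non-ASCII bytes, plus "%" so the result can be reversed.
        $copy{$key} =~ s/([^\x20-\x7E]|%)/sprintf("%%%02X", ord($1))/ge;
    }
    return \%copy;
}

# Only this list would be handed to the code that writes the OPML file.
my @for_disk = map { escaped_copy($_) } values %channels;
```

With this split, the "My Channels" page would keep reading the untouched values while only the saved file sees the escapes.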
I'm still looking around for an elegant solution. As you've
discovered, the only decent magick is in Perl 5.8 and Encode
(which I can depend on in Mac OS X Panther but currently nowhere
else). There are various other modules that purport to do
encoding translation (some Unicode:: modules, etc.), but
further investigation turned up nothing entirely worth using. After
about six hours of searching on Friday, I cheated futilely with the
decimal encoding.
Likewise, the "use utf8" code you posted in the "non-latin1
characters in gui" bug would only work on 5.8, as the utf8 pragma
exists only there.
Is there a regexp you can provide that would strip everything
non-latin 1 and non-utf8? Even though we'd be losing proper data,
the more important goals are a) not to lose data in the OPML due to
invalid XML, and b) to provide some semblance of readability in
the GUI. Since this problem won't be properly solved until I can
depend on Perl 5.8 everywhere (which I have control over ONLY on
Windows 98, but nowhere else), an inelegant hack is probably
something we're going to depend on.
I'll already need to do a "require Encode; import Encode;" once I
start fiddling with MD5s (otherwise, Digest::MD5 chokes under
Perl 5.8/Panther), but that means an inelegant "what version of Perl
are we currently using" check on each run. Sigh.
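A possible alternative to the version check, as a sketch: probe for Encode itself and fall back when it is missing. The to_utf8_bytes helper is a hypothetical name, and whether this is enough to keep Digest::MD5 happy is an assumption.

```perl
use strict;
use warnings;

# Load Encode only if it exists; the eval keeps a missing module from
# being fatal on Perls that do not ship it.
my $have_encode = eval { require Encode; 1 } ? 1 : 0;

# Hypothetical helper: turn text into UTF-8 bytes when Encode is
# available, and pass the data through untouched otherwise.
sub to_utf8_bytes {
    my ($text) = @_;
    return $have_encode ? Encode::encode('UTF-8', $text) : $text;
}
```

The probe runs once at startup, so the rest of the code only needs to check $have_encode (or just call the helper) instead of comparing Perl versions.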
Logged In: YES
user_id=365576
use utf8; is available in Perl 5.6; only the conversion functions
are not implemented. It should be usable to strip all non-ASCII
characters, like this:
use utf8;
s/[^\x01-\x7f]//g;
Alternatively, you could encode non-ASCII characters as UTF-8
entities, like this:
use utf8;
s/([^\x20-\x7f])/sprintf("&#%d;", ord($1)/eg;
*should* work
Logged In: YES
user_id=69804
I'd be much more inclined to attempt to translate them to utf-8
entities: that seems more "right" then just stripping without
making a positive attempt. I'll take a look into it (and yup,
certifiably my mistake on utf8/5.6. blame it on poor usage of
perldoc.com).
Logged In: YES
user_id=69804
Heh, heh. That regular expression (not the "use") fails with
"Unknown error" on ActivePerl 5.6.1. Still investigating. My
test code (for save_my_channels) is:
my $temp; # used for temporary storage.
foreach my $channel ( values %CHANNELS ) {
    push @{ $temp->{body}->{outline} }, $channel;
}
my $data = XMLout( $temp, noattr => 0, rootname => 'opml', xmldecl => 1 );
use utf8; $data =~ s/([^\x20-\x7f])/sprintf("&#%d;", ord($1)/eg;
my $my_channels = $passed_path || get_setting("files_myChannels");
if (open(OPML, ">$my_channels")) { print OPML $my_channels; close(OPML); }
else { note("There was an error saving your subscriptions: $!"); return 0; }
Logged In: YES
user_id=69804
Alright, stripping entirely (no sprintf/ord) works, but it
suffers the same sort of visual bug as the decimal encoding:
if the user opens the "My Channels" page before they open
the Channels Home, they'll see the stripped (or, previously,
decimal-encoded) text. Any idea on the sprintf/ord issue?
Logged In: YES
user_id=69804
Likewise, the OPML file is now bereft of newlines.
Logged In: YES
user_id=69804
Oddly, the reason the sprintf/ord failed was a missing closing
parenthesis in the regexp. Why I received "Unknown error" and
not a syntax error, I'm not sure. Using the following script:
#!/usr/bin/perl
use strict;
use utf8;
use LWP::Simple;
my $data = get("http://www.buergerportal.de/xml/RSS");
$data =~ s/([^\x20-\x7f])/sprintf("&#%d;", ord($1))/eg;
I receive zillions of errors concerning "Malformed UTF-8
characters", regardless of strict, warnings, etc., and the
resultant output of $data is something I didn't expect (run
it yourself to see - I fear pasting will cause SF's tracker
to interpret the paste incorrectly).
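One way to sidestep ord() on undecoded data, sketched under the assumption that the feed arrives as well-formed UTF-8 bytes: decode the multi-byte sequences by hand with byte arithmetic, so neither Encode nor the utf8 pragma is needed. The utf8_bytes_to_entities name is hypothetical, and the pattern does not reject overlong or otherwise invalid sequences.

```perl
use strict;
use warnings;

# Turn multi-byte UTF-8 sequences in a byte string into numeric character
# references, using only byte arithmetic. ASCII bytes pass through as-is.
sub utf8_bytes_to_entities {
    my ($bytes) = @_;
    $bytes =~ s{
        ( [\xC2-\xDF][\x80-\xBF]       # 2-byte sequence
        | [\xE0-\xEF][\x80-\xBF]{2}    # 3-byte sequence
        | [\xF0-\xF4][\x80-\xBF]{3}    # 4-byte sequence
        )
    }{
        my ($lead, @rest) = unpack('C*', $1);
        # Drop the length-marker bits from the lead byte.
        my $code = $lead & (0xFF >> (@rest + 2));
        # Each continuation byte carries six payload bits.
        $code = ($code << 6) | ($_ & 0x3F) for @rest;
        "&#$code;";
    }gex;
    return $bytes;
}

print utf8_bytes_to_entities("\xC3\x96sterreich"), "\n";   # &#214;sterreich
```

Because the input is treated purely as bytes, this should behave the same on 5.6 and 5.8, at the cost of doing the decoding by hand.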
Logged In: YES
user_id=69804
Check the CVS on MyChannels.pm ("use utf8;" is now part of
AmphetaDesk.pl, a move that we'll have to make eventually -
might as well start now). Currently, we use your regexp, but
without stripping. Adding in the corrected sprintf/ord works
(without a bunch of multibyte errors), but the
replaced characters always seem to be #0.