#31 Non-Latin 1 characters shown as % encodings

open
Morbus Iff
None
5
2003-12-21
2003-12-21
No

The CVS version now shows non-Latin 1 characters as %nn-encoded
bytes, instead of UTF-8 characters as before.

Sample feeds:
http://www.ibm.com/news/at/de/index.rss
"Österreich" is rendered as "%C3%96sterreich"

http://www.djeaux.com/rss/insecure-bugtraq.rss
"bugtraq @ insecure.org" is rendered as "bugtraq %2540 insecure.org"

Discussion

  • Morbus Iff
    2003-12-21

    • assigned_to: nobody --> morbus
     
  • Morbus Iff
    2003-12-21

    Logged In: YES
    user_id=69804

    This is a side-effect of the bugfix for "non-latin1 characters cause
    the mySubscriptions.opml to be destroyed", which is still open as a
    bug in the tracker (with the understanding that the bugfix is solely
    a "lesser of two evils": subscription corruption or corrupted display,
    though see below). Roughly, in save_my_channels (MyChannels.pm), the
    decimal encoding happens (provided cheaply by URI::Escape, a default
    AmphetaDesk module), so that when the subscriptions are saved to disk,
    they're readable on the next restart without crashing.

    The side-effect occurs because this translation is applied to the data
    in memory, and the translated values remain there until the user
    refreshes the index.html (when the internal subscription data is
    refreshed with the "real" data from the channel itself). Case in point:
    add one of those feeds and see how "My Channels" is incorrect. Refresh
    the "Channels Home" page and things are fine (everywhere save for
    myChannels.opml).

    I'm still looking around for an elegant solution. As you've
    discovered, the only decent magick is in Perl 5.8 and Encode
    (which I can depend on in Mac OS X Panther but currently nowhere
    else). There are various other modules that purport to do
    encoding translation (some Unicode:: modules, etc., etc.), but
    further investigation shows nothing entirely worth using. After
    about six hours of searching on Friday, I cheated futilely with the
    decimal encoding.

    Likewise, the "use utf8" code you posted in the "non-latin1
    characters in gui" bug would only work on 5.8, as the utf8 pragma
    exists only there.

    Is there a regexp that you can provide that would strip everything
    non-latin 1 and non-utf8? Even though we'd be losing proper data,
    the more important goals are a) not to lose data in the OPML due to
    invalid XML, and b) to provide some semblance of readability in
    the GUI. Since this problem won't be properly solved until I can
    depend on Perl 5.8 everywhere (which I have control over ONLY on
    Windows 98, but nowhere else), an inelegant hack is probably
    something we're going to depend on.

    I'll already need to do a "require Encode; import Encode;" once I
    start fiddling with MD5s (otherwise, Digest::MD5 chokes under
    Perl 5.8/Panther), but that's an inelegant "what version of Perl are
    we currently using" check for each run. Sigh.
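
To picture the side effect described above, here is a minimal sketch. It is not AmphetaDesk's actual code: `percent_encode` is a hypothetical hand-rolled stand-in for URI::Escape's `uri_escape`, its "safe" character set is an assumption, and `%channel` is a made-up subscription record. It encodes a copy destined for disk and leaves the in-memory record untouched, and it also shows that encoding a value twice turns "@" into the "%2540" seen in the second sample feed.

```perl
use strict;
use warnings;

# Hypothetical stand-in for URI::Escape's uri_escape: percent-encode every
# byte outside a small "safe" set. The safe set chosen here is an assumption.
sub percent_encode {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord($1))/ge;
    return $bytes;
}

# Hypothetical in-memory subscription record; the title holds raw UTF-8 bytes.
my %channel = ( title => "\xC3\x96sterreich" );

# Encode a copy for the file on disk and leave the in-memory record alone,
# so anything displaying %channel still sees the original bytes.
my %for_disk = ( %channel, title => percent_encode( $channel{title} ) );

print "disk:   $for_disk{title}\n";    # %C3%96sterreich
print "memory: ",
    ( $channel{title} eq "\xC3\x96sterreich" ? "unchanged" : "changed" ), "\n";

# Encoding an already-encoded value escapes the "%" itself.
my $twice = percent_encode( percent_encode('@') );    # %2540
print "twice:  $twice\n";
```

Whether the real save routine can work on a copy depends on how the channel data is shared with the GUI, which this thread does not show.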

     
  • Logged In: YES
    user_id=365576

    use utf8; is available in Perl 5.6; only the conversion functions
    are not implemented. It should be usable to strip all non-ASCII
    characters like this:

    use utf8;
    s/[^\x01-\x7f]//g;

    Alternatively, you could encode non-ASCII characters as
    UTF-8 entities, like this:

    use utf8;
    s/([^\x20-\x7f])/sprintf("&#%d;", ord($1)/eg;

    *should* work

     
  • Morbus Iff
    2003-12-22

    Logged In: YES
    user_id=69804

    I'd be much more inclined to attempt to translate them to utf-8
    entities: that seems more "right" than just stripping without
    making a positive attempt. I'll take a look into it (and yup,
    certifiably my mistake on utf8/5.6. Blame it on poor usage of
    perldoc.com).

     
  • Morbus Iff
    2003-12-22

    Logged In: YES
    user_id=69804

    Heh, heh. That regular expression (not the "use") fails with
    "Unknown error" on ActivePerl 5.6.1. Still investigating. My
    test code (for save_my_channels) is:

    my $temp; # used for temporary storage.
    foreach my $channel ( values %CHANNELS ) {
        push @{ $temp->{body}->{outline} }, $channel;
    }

    my $data = XMLout( $temp, noattr => 0, rootname => 'opml', xmldecl => 1 );
    use utf8; $data =~ s/([^\x20-\x7f])/sprintf("&#%d;", ord($1)/eg;
    my $my_channels = $passed_path || get_setting("files_myChannels");
    if (open(OPML, ">$my_channels")) { print OPML $my_channels; close(OPML); }
    else { note("There was an error saving your subscriptions: $!"); return 0; }

     
  • Morbus Iff
    2003-12-22

    Logged In: YES
    user_id=69804

    Alright, the stripping entirely (no sprintf/ord) works, but
    suffers the same sort of visual bug as the decimal encoding:
    if the user opens the "My Channels" page before they open
    the channels home, they'll see the stripped (or, previously,
    decimal) entities. Any idea on the sprintf/ord issue?

     
  • Morbus Iff
    2003-12-22

    Logged In: YES
    user_id=69804

    Likewise, the OPML file is now bereft of newlines.
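
One likely contributor to the missing newlines, assuming the pattern in use was the `[^\x20-\x7f]` class from the test code in this thread: tab, newline and carriage return all sit below `\x20`, so they are removed along with everything else. The sketch below uses a made-up three-line OPML string, and the widened character class is an assumed variant rather than anything taken from CVS.

```perl
use strict;
use warnings;

# Made-up sample standing in for the XML that would be written to disk.
my $xml = "<opml>\n\t<body/>\n</opml>\n";

# The pattern from the thread's test code: everything outside \x20-\x7f is
# removed, which includes tab (\x09), newline (\x0a) and return (\x0d).
( my $stripped = $xml ) =~ s/[^\x20-\x7f]//g;

# Assumed variant: exempt the three whitespace controls that XML allows.
( my $kept = $xml ) =~ s/[^\x09\x0a\x0d\x20-\x7f]//g;

print "stripped: $stripped\n";    # <opml><body/></opml>
print "newlines kept: ", ( $kept eq $xml ? "yes" : "no" ), "\n";    # yes
```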

     
  • Morbus Iff
    2003-12-22

    Logged In: YES
    user_id=69804

    Oddly, the reason the sprintf/ord failed was due to a
    missing end-parens in the regexp - why I received "unknown
    error" and not the syntax warning, I'm not sure. Using the
    following script:

    #!/usr/bin/perl
    use strict;

    use utf8;
    use LWP::Simple;
    my $data = get("http://www.buergerportal.de/xml/RSS");
    $data =~ s/([^\x20-\x7f])/sprintf("&#%d;", ord($1))/eg;

    I'll receive zillions of errors concerning "Malformed UTF-8
    characters", regardless of strict, warnings, etc, and the
    resultant output of $data is something I didn't expect (run
    it yourself to see - I fear pasting will cause SF's tracker
    to interpret the paste incorrectly).
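
For comparison, here is a sketch of how the same substitution behaves on raw bytes versus decoded characters. It assumes Perl 5.8 or later with the Encode module, which this thread says cannot yet be relied on everywhere, so it is illustrative rather than a proposed fix. `to_numeric_refs` is a hypothetical helper name, and the sample string is the "Österreich" title from the report, not the buergerportal.de feed (whose encoding is not stated here).

```perl
use strict;
use warnings;
use Encode qw(decode);    # ships with Perl 5.8 and later, not with 5.6

# Hypothetical helper: turn every character outside \x20-\x7f into a
# decimal numeric reference, using the pattern from the script above.
sub to_numeric_refs {
    my ($text) = @_;
    $text =~ s/([^\x20-\x7f])/sprintf("&#%d;", ord($1))/eg;
    return $text;
}

my $bytes = "\xC3\x96sterreich";          # "Österreich" as raw UTF-8 bytes
my $chars = decode( 'UTF-8', $bytes );    # the same text as characters

print to_numeric_refs($bytes), "\n";    # &#195;&#150;sterreich (one per byte)
print to_numeric_refs($chars), "\n";    # &#214;sterreich (one per character)
```

Only the second form names the character itself (U+00D6); the first emits one reference per byte, which a reader of the OPML would decode as two unrelated characters.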

     
  • Morbus Iff
    2003-12-22

    Logged In: YES
    user_id=69804

    Check the CVS on myChannels.pm ("use utf8;" is now part of
    AmphetaDesk.pl, a move that we'll have to make eventually -
    might as well start now). Currently, we use your regexp, but
    without stripping. Adding in the corrected sprintf/ord runs
    without a bunch of multibyte errors, but the replaced
    characters always seem to be #0.