[amphetadesk-develop] [ amphetadesk-Bugs-864067 ] Non-Latin 1 characters shown as % encodings
From: SourceForge.net <no...@so...> - 2003-12-21 23:33:28
Bugs item #864067, was opened at 2003-12-21 19:49
Message generated for change (Comment added) made by krusch
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=372519&aid=864067&group_id=21649

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Klaus Johannes Rusch (krusch)
Assigned to: Morbus Iff (morbus)
Summary: Non-Latin 1 characters shown as % encodings

Initial Comment:
The CVS version now shows non-Latin 1 characters as %nn encoded bytes, instead of UTF-8 characters as before.

Sample feeds:

http://www.ibm.com/news/at/de/index.rss
"Österreich" is rendered as "%C3%96sterreich"

http://www.djeaux.com/rss/insecure-bugtraq.rss
"bugtraq @ insecure.org" is rendered as "bugtraq %2540 insecure.org"

----------------------------------------------------------------------

>Comment By: Klaus Johannes Rusch (krusch)
Date: 2003-12-22 00:33

Message:
Logged In: YES
user_id=365576

use utf8; is available in Perl 5.6; only the conversion functions are not implemented. It should be usable to strip all non-ASCII characters like this:

    use utf8;
    s/[^\x01-\x7f]//g;

Alternatively, you could encode non-ASCII characters as UTF-8 entities, like this:

    use utf8;
    s/([^\x20-\x7f])/sprintf("&#%d;", ord($1))/eg;

*should* work

----------------------------------------------------------------------

Comment By: Morbus Iff (morbus)
Date: 2003-12-21 23:14

Message:
Logged In: YES
user_id=69804

This is a side-effect of the "non-latin1 characters cause the mySubscriptions.opml to be destroyed" bugfix, still opened as a bug in the tracker (with the understanding that the bugfix is solely a "lesser of two evils" choice between subscription corruption and corrupted display; though see below).

Roughly, in save_my_channels (MyChannels.pm), the percent-encoding happens (provided cheaply by URI::Escape, a default AmphetaDesk module), so that when the subscriptions are saved to disk, they're readable on the next restart without crashing.
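To make the escaping step concrete, here is a minimal sketch. It is an illustration under assumptions, not the code in MyChannels.pm: percent_escape is a hypothetical stand-in for URI::Escape's uri_escape, written with core Perl only so it runs without extra modules. It shows how raw UTF-8 bytes become %XX sequences, and how escaping an already-escaped value turns "%" into "%25", which is one plausible (unconfirmed) explanation for the "%2540" sample in the report.

```perl
use strict;
use warnings;

# Hypothetical stand-in for URI::Escape's uri_escape(): percent-encode every
# byte outside the RFC 3986 "unreserved" set. Not AmphetaDesk's actual code.
sub percent_escape {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
    return $bytes;
}

# "Österreich" as raw UTF-8 bytes: 0xC3 0x96 encodes U+00D6.
my $title = "\xC3\x96sterreich";
print percent_escape($title), "\n";    # %C3%96sterreich

# Escaping twice: "@" becomes "%40", then its "%" becomes "%25".
my $once  = percent_escape('@');
my $twice = percent_escape($once);
print "$once -> $twice\n";             # %40 -> %2540
```

If this reading is right, the in-memory copy would need to be unescaped after saving (or the escaping applied only to the copy written to disk) for the display to stay correct.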
The side-effect occurs because this translation happens in memory, and the translated data remains in memory until the user refreshes the index.html (when the internal subscription data is refreshed with the "real" data from the channel itself). Case in point: add one of those feeds, and see how "My Channels" is incorrect. Refresh the "Channels Home" page, and things are fine (everywhere save for myChannels.opml).

I'm still looking around for an elegant solution. As you've discovered, the only decent magick is in Perl 5.8 and Encode (which I can depend on in Mac OS X Panther but currently nowhere else). There are various other modules that purport to do encoding translation (some Unicode:: modules, etc.), but further investigation shows nothing entirely worthwhile using. After about six hours of searching on Friday, I cheated futilely with the percent-encoding.

Likewise, the "use utf8" code you posted in the "non-latin1 characters in gui" bug would only work on 5.8, as the utf8 pragma exists only there. Is there a regexp that you can provide that would strip everything non-latin 1 and non-utf8? Even though we'd be losing proper data, the more important goal is a) not to lose data in the OPML due to invalid XML, and b) to provide some semblance of readability in the GUI.

Since this problem won't be properly solved until I can depend on Perl 5.8 everywhere (which I have control over ONLY on Windows 98, but nowhere else), an inelegant hack is probably something we're going to depend on. I'll already need to do a "require Encode; import Encode;" once I start fiddling with MD5s (otherwise, Digest::MD5 chokes under Perl 5.8/Panther), but that's an inelegant "what version of Perl are we currently using" check for each run. Sigh.
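One possible shape for the regexp requested above, offered as a sketch under assumptions rather than a tested fix for AmphetaDesk: it treats the input as raw bytes, rewrites well-formed 2- and 3-byte UTF-8 sequences as numeric character references, and strips every other byte outside plain ASCII (including stray Latin-1 bytes and 4-byte sequences). The names ascii_with_entities and _entity are hypothetical, and nothing here depends on Encode or the utf8 pragma. It does not reject overlong encodings, so it is a readability hack, not a validator.

```perl
use strict;
use warnings;

# Hypothetical helper: compute the code point of one 2- or 3-byte UTF-8
# sequence and return it as a numeric character reference.
sub _entity {
    my @b  = unpack('C*', $_[0]);
    my $cp = @b == 2
        ? (($b[0] & 0x1F) << 6)  |  ($b[1] & 0x3F)
        : (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
    return "&#$cp;";
}

# Hypothetical helper: return ASCII-only text that is safe to write into XML
# as far as character encoding goes (it does not escape "&" or "<").
sub ascii_with_entities {
    my ($bytes) = @_;
    $bytes =~ s{
        ([\xC2-\xDF][\x80-\xBF])              # 2-byte sequence
      | ([\xE0-\xEF][\x80-\xBF][\x80-\xBF])   # 3-byte sequence
      | ([^\x09\x0A\x0D\x20-\x7E])            # any other non-ASCII or control byte
    }{
        defined($1) ? _entity($1) : defined($2) ? _entity($2) : ''
    }gex;
    return $bytes;
}

print ascii_with_entities("\xC3\x96sterreich"), "\n";        # &#214;sterreich
print ascii_with_entities("bugtraq @ insecure.org"), "\n";   # unchanged
print ascii_with_entities("\xD6sterreich"), "\n";            # sterreich (stray Latin-1 byte dropped)
```

The trade-off matches the goals stated in the message: data that cannot be interpreted is lost, but the output is always plain ASCII, so the saved OPML cannot become invalid because of these bytes.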