From: Chris N. <pu...@po...> - 2001-02-02 18:46:11
OK, first, the problem: taking data from a Slash database and encoding it so that it will be 1. legal XML, and 2. easily decoded back into something resembling the original data. This is compounded by the problem that some people may put illegal data into the database to begin with (e.g., a lone "&" should never be in a title or description). Then, we want to decode the data reasonably.

All of the above also assumes that we are encoding from and decoding to HTML. If a user of our RSS file then wants to run something like HTML::Entities::decode_entities() on the result, they can get a non-HTML version of it.

The short of it is the program below. It will take some data, encode it for inclusion in an RSS file, then decode it to see what it would be on output. For example:

Original: <em>I've &quot;a&quot; <a href="bio.html">"Bio"</a> && a &#x3c;R&eacute;sumé!&#x3E;</em>
Encoded:  &lt;em&gt;I've &amp;quot;a&amp;quot; &lt;a href="bio.html"&gt;"Bio"&lt;/a&gt; &amp;amp;&amp;amp; a &amp;#x3c;R&amp;eacute;sum&#233;!&amp;#x3E;&lt;/em&gt;
Decoded:  <em>I've &quot;a&quot; <a href="bio.html">"Bio"</a> &amp;&amp; a &#x3c;R&eacute;sum&#233;!&#x3E;</em>

Note that in the original, we have a character (e with an acute accent) that we want to have encoded. We want to preserve the &#x3c; and &#x3E; as entities; we don't want the &#x3c; to become a literal <, or the &#x3E; to become a literal >.

Anyway, if you can, please follow the code and let me know of any problems you have with our methods here. I realize I might not be very clear; it's been a long day. Let me know if I can clarify anything for you.

Thanks,

--Chris

#!/usr/bin/perl -wl
use strict;
use XML::RSS;  # includes XML::Parser::Expat

my $text = <<EOT;
<em>I've &quot;a&quot; <a href="bio.html">"Bio"</a> && a &#x3c;R&eacute;sumé!&#x3E;</em>
EOT

sub encode_text {
    my($text) = @_;

    # if there is an & that is not part of an entity, convert it
    # to &amp;
    $text =~ s/&(?!#?[a-zA-Z0-9]+;)/&amp;/g;

    # convert & < > to XML entities
    $text = XML::Parser::Expat->xml_escape($text, ">");

    # convert ASCII-non-printable to numeric entities
    $text =~ s/([^\s\040-\176])/ "&#" . ord($1) . ";" /ge;

    return $text;
}

{
    # for all following chars but &, convert entities back to
    # the actual character

    # for &, convert &amp; back to &, but only if it is the
    # beginning of an entity (like "&amp;#32;")

    # precompile these so we only do it once
    my %e = qw(< lt > gt " quot ' apos & amp);
    for my $chr (keys %e) {
        my $word = $e{$chr};
        my $ord = ord $chr;
        my $hex = sprintf "%x", $ord;
        $hex =~ s/([a-f])/[$1\U$1]/g;
        my $regex = qq/&(?:$word|#$ord|#[xX]$hex);/;
        $regex .= qq/(?=#?[a-zA-Z0-9]+;)/ if $chr eq "&";
        $e{$chr} = qr/$regex/;
    }

    sub decode_text {
        my($text) = @_;

        # do & only _after_ the others
        for my $chr ( (grep !/^&$/, keys %e), "&") {
            $text =~ s/$e{$chr}/$chr/g;
        }
        return $text;
    }
}

print $text;
print $text = encode_text($text);
print $text = decode_text($text);

__END__

-- 
Chris Nandor                      pu...@po...    http://pudge.net/
Open Source Development Network   pu...@os...    http://osdn.com/
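
The lone-ampersand rule in encode_text can be tried on its own, without XML::Parser::Expat installed. What follows is only a minimal sketch under that assumption: the helper name escape_lone_amps and the sample strings are made up for illustration and are not part of the program above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original program): escape only those
# ampersands that do not already begin something shaped like an entity,
# i.e. "&" followed by an optional "#", some alphanumerics, and ";".
sub escape_lone_amps {
    my ($text) = @_;
    $text =~ s/&(?!#?[a-zA-Z0-9]+;)/&amp;/g;
    return $text;
}

# Illustrative inputs only; none of these come from a real Slash database.
my @samples = (
    'AT&T',                 # lone ampersand inside a word
    'a && b',               # two lone ampersands in a row
    '&quot;quoted&quot;',   # named entities, left alone
    '&#x3c;tag&#x3E;',      # hex numeric entities, left alone
    'R&D; budget',          # looks like an entity, so it is left alone
);

for my $in (@samples) {
    printf "%-22s => %s\n", $in, escape_lone_amps($in);
}
```

The last sample is the one to watch: the lookahead only checks that what follows the & is shaped like an entity, so a stray "R&D;" passes through untouched even though D is not a real entity name.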