From: Chris N. <pu...@po...> - 2001-02-02 18:46:11
OK, first, the problem: taking data from a Slash database and encoding it
so it will be 1. legal XML, and 2. easily decoded back into something
resembling the original data. This is compounded by the problem that some
people may put illegal data into the database to begin with (e.g., a lone
"&" should not ever be in a title or description).
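To make the lone "&" case concrete, here is a minimal sketch using only core Perl. The sub name fix_lone_amps and the sample title are made up for illustration. It assumes that anything shaped like &name; or &#123; is an intentional entity, so a title such as "R&D; more" would be left alone even though it is not a real entity.

```perl
use strict;
use warnings;

# Hypothetical helper: escape any "&" that does not start something
# that looks like an entity (&name;  &#123;  &#x3c;).
sub fix_lone_amps {
    my ($text) = @_;
    $text =~ s/&(?!#?[a-zA-Z0-9]+;)/&amp;/g;
    return $text;
}

# The existing &amp; is left alone; the two bare "&" are escaped.
print fix_lone_amps('Q&A: Tom & Jerry &amp; friends'), "\n";
# prints: Q&amp;A: Tom &amp; Jerry &amp; friends
```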
Then, we want to decode the data reasonably.
All of the above also assumes that we are encoding from and decoding to
HTML. If a user of our RSS file wants to then run something like
HTML::Entities::decode_entities() on the result, they can get a non-HTML
version of it.
The short of it is the program below. It will take some data, encode it
for inclusion in an RSS file, then decode it to see what it would be on
output. For example:
Original:
<em>I've &quot;a&quot; <a href="bio.html">"Bio"</a> && a
&#x3c;R&eacute;sumé!&#x3E;</em>
Encoded:
&lt;em&gt;I've &amp;quot;a&amp;quot; &lt;a
href="bio.html"&gt;"Bio"&lt;/a&gt; &amp;amp;&amp;amp; a
&amp;#x3c;R&amp;eacute;sum&#233;!&amp;#x3E;&lt;/em&gt;
Decoded:
<em>I've &quot;a&quot; <a href="bio.html">"Bio"</a> &amp;&amp; a
&#x3c;R&eacute;sum&#233;!&#x3E;</em>
Note that in the original, we have a character (e with an acute accent)
that we want to have encoded. We want to preserve the &#x3c; and &#x3E;, but
we don't want the &#x3c; to become <, or the &#x3E; to become >.
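As a rough illustration of how a raw accented character can be turned into a numeric entity, here is a small core-Perl sketch. The sub name numeric_escape is invented for this example, and it assumes the input holds one character per accented letter (Latin-1 style); UTF-8 encoded bytes would be escaped byte by byte instead.

```perl
use strict;
use warnings;

# Hypothetical helper: replace anything that is neither whitespace nor
# printable ASCII (octal 040-176) with a decimal character reference.
sub numeric_escape {
    my ($text) = @_;
    $text =~ s/([^\s\040-\176])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# \xE9 is e with an acute accent in Latin-1.
# The named entity is plain ASCII already, so it passes through unchanged.
print numeric_escape("R&eacute;sum\xE9!"), "\n";
# prints: R&eacute;sum&#233;!
```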
Anyway, if you can, please follow the code and let me know any problems you
have with our methods here. I realize I might not be very clear; it's been
a long day. Let me know if I can clarify anything for you.
Thanks,
--Chris
#!/usr/bin/perl -wl
use strict;
use XML::RSS; # includes XML::Parser::Expat
use XML::Parser::Expat; # loaded explicitly so xml_escape is always available

# \xE9 is a raw e-acute (Latin-1); the heredoc interpolates it
my $text = <<EOT;
<em>I've &quot;a&quot; <a href="bio.html">"Bio"</a> && a
&#x3c;R&eacute;sum\xE9!&#x3E;</em>
EOT

sub encode_text {
    my($text) = @_;

    # if there is an & that is not part of an entity, convert it
    # to &amp;
    $text =~ s/&(?!#?[a-zA-Z0-9]+;)/&amp;/g;

    # convert & < > to XML entities
    $text = XML::Parser::Expat->xml_escape($text, ">");

    # convert ASCII-non-printable to numeric entities
    $text =~ s/([^\s\040-\176])/ "&#" . ord($1) . ";" /ge;

    return $text;
}

{
    # for all following chars but &, convert entities back to
    # the actual character

    # for &, convert &amp; back to &, but only if it is the
    # beginning of an entity (like "&amp;#32;")

    # precompile these so we only do it once
    my %e = qw(< lt > gt " quot ' apos & amp);
    for my $chr (keys %e) {
        my $word = $e{$chr};
        my $ord  = ord $chr;
        my $hex  = sprintf "%x", $ord;
        # match hex digits in either case, e.g. 3c becomes 3[cC]
        $hex =~ s/([a-f])/[$1\U$1]/g;

        my $regex = qq/&(?:$word|#$ord|#[xX]$hex);/;
        $regex .= qq/(?=#?[a-zA-Z0-9]+;)/ if $chr eq "&";
        $e{$chr} = qr/$regex/;
    }

    sub decode_text {
        my($text) = @_;

        # do & only _after_ the others
        for my $chr ( (grep !/^&$/, keys %e), "&") {
            $text =~ s/$e{$chr}/$chr/g;
        }

        return $text;
    }
}

print $text;
print $text = encode_text($text);
print $text = decode_text($text);

__END__
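
The "do & only _after_ the others" ordering in decode_text is easy to get wrong, so here is a simplified core-Perl sketch of why it matters. It is not the program above: it handles named entities only, and the names %plain, decode_amp_last and decode_amp_first are invented for this comparison.

```perl
use strict;
use warnings;

# Simplified stand-in for the precompiled table above (named entities only).
my %plain = ('&lt;' => '<', '&gt;' => '>', '&quot;' => '"', '&apos;' => "'");

# "&amp;" is handled last, and only when it starts another entity.
sub decode_amp_last {
    my ($text) = @_;
    $text =~ s/(&(?:lt|gt|quot|apos);)/$plain{$1}/g;
    $text =~ s/&amp;(?=#?[a-zA-Z0-9]+;)/&/g;
    return $text;
}

# "&amp;" is handled first: the entities it exposes are then decoded
# a second time by the following substitution.
sub decode_amp_first {
    my ($text) = @_;
    $text =~ s/&amp;(?=#?[a-zA-Z0-9]+;)/&/g;
    $text =~ s/(&(?:lt|gt|quot|apos);)/$plain{$1}/g;
    return $text;
}

my $encoded = '&lt;em&gt;&amp;lt;R&amp;eacute;sum&#233;!&amp;gt;&lt;/em&gt;';
print decode_amp_last($encoded), "\n";
# prints: <em>&lt;R&eacute;sum&#233;!&gt;</em>
print decode_amp_first($encoded), "\n";
# prints: <em><R&eacute;sum&#233;!></em>
```

With "&amp;" decoded first, the escaped angle brackets inside the element come out as literal < and >, which is exactly the result the note about preserving them is trying to avoid.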
-- 
Chris Nandor pu...@po... http://pudge.net/
Open Source Development Network pu...@os... http://osdn.com/