[Slashcode-development] encoding / decoding RSS


OK, first, the problem: taking data from a Slash database and encoding it
so that it will be (1) legal XML and (2) easily decoded back into something
resembling the original data.  This is compounded by the fact that some
people may put illegal data into the database to begin with (e.g., a lone
"&" should never be in a title or description).

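To make "a lone &" concrete, here is a minimal sketch using core Perl only.
It uses the same entity-shaped test as the program below; the helper name
lone_ampersands and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offsets of every "&" that does not start
# something shaped like an entity (&name; &#123; &#x7B;).
sub lone_ampersands {
    my ($text) = @_;
    my @offsets;
    while ($text =~ /&(?!#?[a-zA-Z0-9]+;)/g) {
        push @offsets, $-[0];    # start offset of this match
    }
    return @offsets;
}

my @found = lone_ampersands('Q&A && R&eacute;');
print "@found\n";    # prints "1 4 5"; the & in &eacute; is not reported
```
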
Then, we want to decode the data reasonably.

All of the above also assumes that we are encoding from and decoding to
HTML.  If a user of our RSS file wants to then run something like
HTML::Entities::decode_entities() on the result, they can get a non-HTML
version of it.
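
A rough sketch of that last step, assuming the HTML::Entities module from
CPAN is installed (it is not part of core Perl).  The sample string is made
up, in the shape of our decoded output.  Note that decode_entities() only
resolves entities; any real tags in the text are left alone.

```perl
use strict;
use warnings;
use HTML::Entities ();    # CPAN (HTML-Parser distribution), not core Perl

# Made-up sample in the shape of the decoded RSS text.
my $html  = 'R&eacute;sum&#233; &amp;&amp; &#x3c;b&#x3E;';
my $plain = HTML::Entities::decode_entities($html);

# Show non-ASCII characters as \xNN so the output does not depend on
# the terminal's encoding.
(my $shown = $plain) =~ s/([^\x20-\x7E])/sprintf("\\x%02X", ord $1)/ge;
print "$shown\n";    # prints: R\xE9sum\xE9 && <b>
```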

The short of it is the program below.  It will take some data, encode it
for inclusion in an RSS file, then decode it to see what it would be on
output.  For example:

Original:
<em>I've &quot;a&quot; <a href="bio.html">"Bio"</a> && a
&#x3c;R&eacute;sumé!&#x3E;</em>

Encoded:
&lt;em&gt;I've &amp;quot;a&amp;quot; &lt;a
href="bio.html"&gt;"Bio"&lt;/a&gt; &amp;amp;&amp;amp; a
&amp;#x3c;R&amp;eacute;sum&#233;!&amp;#x3E;&lt;/em&gt;

Decoded:
<em>I've &quot;a&quot; <a href="bio.html">"Bio"</a> &amp;&amp; a
&#x3c;R&eacute;sum&#233;!&#x3E;</em>

Note that in the original, we have a character (e with an acute accent)
that we want to have encoded.  We want to preserve the < and >, but we
don't want the &#x3c; to become <, or the &#x3E; to become >.
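
To see why the order of decoding matters for this, here is a small sketch
in core Perl.  It is a simplified stand-in for the real decoder, handling
only "<", ">" and "&", and the input is a shortened version of the encoded
example above.  Decoding "&amp;" first turns "&amp;#x3c;" into "&#x3c;",
which the next step then turns into a real "<".

```perl
use strict;
use warnings;

my $encoded = '&lt;em&gt;&amp;#x3c;R&amp;eacute;sum&#233;!&amp;#x3E;&lt;/em&gt;';

# Naive order: "&" first, then "<" and ">".
(my $naive = $encoded) =~ s/&amp;/&/g;
$naive =~ s/&(?:lt|#60|#[xX]3[cC]);/</g;
$naive =~ s/&(?:gt|#62|#[xX]3[eE]);/>/g;

# Careful order: "<" and ">" first, then "&", and only where the "&amp;"
# is directly followed by something shaped like an entity.
(my $careful = $encoded) =~ s/&(?:lt|#60|#[xX]3[cC]);/</g;
$careful =~ s/&(?:gt|#62|#[xX]3[eE]);/>/g;
$careful =~ s/&amp;(?=#?[a-zA-Z0-9]+;)/&/g;

print "naive:   $naive\n";      # <em><R&eacute;sum&#233;!></em>
print "careful: $careful\n";    # <em>&#x3c;R&eacute;sum&#233;!&#x3E;</em>
```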

Anyway, if you can, please follow the code and let me know any problems you
have with our methods here.  I realize I might not be very clear; it's been
a long day.  Let me know if I can clarify anything for you.

Thanks,

--Chris

#!/usr/bin/perl -wl

use strict;
use XML::RSS;  # includes XML::Parser::Expat

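# Assumes this file is saved as Latin-1, so the é below is the single
# byte 0xE9; in a UTF-8 file, write \xE9 in its place.
my $text = <<EOT;
<em>I've &quot;a&quot; <a href="bio.html">"Bio"</a> && a
&#x3c;R&eacute;sumé!&#x3E;</em>
EOT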

sub encode_text {
	my($text) = @_;

	# if there is an & that is not part of an entity, convert it
	# to &amp;
	$text =~ s/&(?!#?[a-zA-Z0-9]+;)/&amp;/g;

	# convert & < > to XML entities
	$text = XML::Parser::Expat->xml_escape($text, ">");

	# convert ASCII-non-printable to numeric entities
	$text =~ s/([^\s\040-\176])/ "&#" . ord($1) . ";" /ge;

	return $text;
}

{
	# for all following chars but &, convert entities back to
	# the actual character

	# for &, convert &amp; back to &, but only if it is the
	# beginning of an entity (like "&amp;#32;")

	# precompile these so we only do it once

	my %e = qw(< lt > gt " quot ' apos & amp);
	for my $chr (keys %e) {
		my $word = $e{$chr};
		my $ord = ord $chr;
		my $hex = sprintf "%x", $ord;
		$hex =~ s/([a-f])/[$1\U$1]/g;
		my $regex = qq/&(?:$word|#$ord|#[xX]$hex);/;
		$regex .= qq/(?=#?[a-zA-Z0-9]+;)/ if $chr eq "&";
		$e{$chr} = qr/$regex/;
	}

	sub decode_text {
		my($text) = @_;

		# do & only _after_ the others
		for my $chr ( (grep !/^&$/, keys %e), "&") {
			$text =~ s/$e{$chr}/$chr/g;
		}

		return $text;
	}
}

print $text;
print $text = encode_text($text);
print $text = decode_text($text);

__END__

-- 
Chris Nandor                      pu...@po...    http://pudge.net/
Open Source Development Network    pu...@os...     http://osdn.com/