Special Characters in RSS Feed
Brought to you by:
adamkennedy
Hi,
We use German in our CVS commit messages, and therefore
some German special characters (umlauts).
These special characters break the XML file, so it
cannot be parsed. They need to be stored as XML
entities, which is not the case at the moment.
Logged In: YES
user_id=153576
Andrew, the function you are probably looking for is
HTML::Entities::encode_entities( $string )
Logged In: YES
user_id=646767
I don't think HTML::Entities will help. XML::Generator
already takes care of escaping characters that aren't
allowed in XML. The problem is that the RSS XML is encoded
as UTF-8 (actually, it doesn't specify an encoding, so
parsers must treat it as UTF-8). I guess the umlaut in the
CVS log message is encoded as iso-8859-1, which is not a valid
UTF-8 byte sequence, so it breaks the XML. As far as I know, CVS
has no concept of character encoding, so we have a number
of options:
1: Force all feeds to be iso-8859-1, fixing this case,
breaking others
2: Use the default encoding of the machine as the encoding
for the XML
3: Add a configuration option, specifying the encoding of
log messages in the CVS repository, and use that
Option 2 sounds like the best, but would be a problem if
CVS Monitor is hosted on a machine outside the control of
its users. Maybe adding a per-repository configuration
option for the encoding, and defaulting it to the default
encoding of the machine, is the best solution.
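To make the "per-repository option, defaulting to the machine's encoding" idea concrete, here is a minimal sketch in Python (not CVS Monitor's own Perl code). The `REPO_ENCODINGS` table, the repository name, and both function names are hypothetical; only the fallback logic is the point.

```python
import locale

# Hypothetical per-repository settings; the names here are illustrative only.
REPO_ENCODINGS = {"german-project": "iso-8859-1"}


def log_encoding(repo_name, repo_encodings=REPO_ENCODINGS):
    """Return the configured encoding for a repository, else the machine default."""
    return repo_encodings.get(repo_name) or locale.getpreferredencoding(False)


def decode_log_message(raw, repo_name):
    """Turn raw CVS log bytes into text using the repository's encoding."""
    # errors="replace" keeps the feed parsable even if the configured
    # encoding turns out to be wrong for a particular message.
    return raw.decode(log_encoding(repo_name), errors="replace")
```

Once the log message has been decoded to text, it can be written out in whatever encoding the feed declares, independently of how the repository stored it.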
Logged In: YES
user_id=153576
Actually, I thought German umlauts etc. were in iso-8859-1,
or am I thinking of something else?
For the moment, we need to go with something like option 1.
Until CVS Monitor itself understands non-iso-8859-1
charsets, we should just force the RSS to do the same.
Better to be consistently broken than sometimes this,
sometimes that.
At the moment, to handle a previous Chinese bug, there is a
single hard-coded per-installation charset variable located at
CVSMonitor.pm:16
I recommend using that for now. Write the RSS using whatever
charset is specified there...
At the moment, I'm assuming that HEAD is going towards 0.7,
but I forked off a stable branch. Set the RSS to use
$CVSMonitor::CHARSET on that stable branch.
For 0.7, I'm definitely thinking of a per-repository charset value.
But that's another story :)
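As a rough illustration of "write the RSS using whatever charset is specified there", the sketch below (Python, not the project's Perl) emits an explicit encoding declaration and encodes the body to match. `CHARSET` is a hypothetical stand-in for `$CVSMonitor::CHARSET`, and `render_item` is an invented helper, not part of CVS Monitor.

```python
from xml.sax.saxutils import escape

# Hypothetical stand-in for the per-installation $CVSMonitor::CHARSET value.
CHARSET = "iso-8859-1"


def render_item(title, charset=CHARSET):
    """Return a tiny XML fragment as bytes, declared and encoded in `charset`."""
    xml = (
        f'<?xml version="1.0" encoding="{charset}"?>\n'
        f"<item><title>{escape(title)}</title></item>\n"
    )
    # Characters the charset cannot represent become numeric character
    # references (e.g. &#20013;), so the output stays well-formed.
    return xml.encode(charset, errors="xmlcharrefreplace")
```

Because the declaration and the byte encoding come from the same variable, a parser reading the feed is told exactly how to interpret the umlaut bytes.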
Logged In: YES
user_id=646767
I've put a patch that adds an explicit encoding in the tracker.
Umlauts are in iso-8859-1, but unfortunately they aren't in the
subset that maps 1-1 onto UTF-8 (only the ASCII range does).
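The mismatch is easy to see at the byte level. This small Python sketch (illustrative only; `is_valid_utf8` is a helper written for this example) compares how "ü" is stored in each encoding and checks whether the iso-8859-1 byte is acceptable as UTF-8.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


umlaut = "ü"
latin1_bytes = umlaut.encode("iso-8859-1")  # single byte 0xFC
utf8_bytes = umlaut.encode("utf-8")         # two bytes 0xC3 0xBC

# ASCII characters are byte-identical in both encodings, umlauts are not.
ascii_same = "u".encode("iso-8859-1") == "u".encode("utf-8")
umlaut_same = latin1_bytes == utf8_bytes
```

Here `is_valid_utf8(latin1_bytes)` is false, which is why a feed that carries the raw iso-8859-1 byte without declaring that encoding is rejected by XML parsers.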