#63 Special Characters in RSS Feed

0.6.4
open
RSS (3)
5
2004-03-13
2004-03-09
Anonymous
No

Hi,

We use German in our CVS commit messages, so they contain
German special characters (umlauts).
These characters break the XML file so that it cannot be
parsed. They must be stored as XML entities, which is not
the case at the moment.

Discussion

  • Adam Kennedy

    Adam Kennedy - 2004-03-13

    Andrew, the function you are probably looking for is
    HTML::Entities::encode_entities( $string )

     
  • Adam Kennedy

    Adam Kennedy - 2004-03-13
    • milestone: --> 0.6.4
     
  • Andrew Eland

    Andrew Eland - 2004-03-15

    I don't think HTML::Entities will help. XML::Generator
    already takes care of escaping characters that aren't
    allowed in XML. The problem is that the RSS XML is encoded
    as utf8 (actually, it doesn't specify an encoding, so
    parsers must treat it as utf8). I guess the umlaut in the
    CVS log message is encoded as iso-8859-1, which is not a
    valid utf8 byte sequence, so it breaks the XML. As far as I
    know, CVS has no concept of character encoding, so we have
    a number of options:
    1: Force all feeds to be iso-8859-1, fixing this case,
    breaking others
    2: Use the default encoding of the machine as the encoding
    for the XML
    3: Add a configuration option, specifying the encoding of
    log messages in the CVS repository, and use that

    Option 2 sounds like the best, but it would be a problem if
    CVS Monitor is hosted on a machine outside the users'
    control. Maybe adding a per-repository configuration
    option for the encoding, defaulting to the default
    encoding of the machine, is the best solution.
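
A minimal sketch of the per-repository idea, assuming the log message arrives as raw bytes and using Perl's core Encode module. The helper name `log_to_utf8` and its parameters are hypothetical and not part of CVS Monitor; the site-wide default is passed in explicitly here rather than detected from the machine.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert raw log bytes to UTF-8 using the charset
# configured for the repository, falling back to a site-wide default
# when the repository has no setting of its own.
sub log_to_utf8 {
    my ($bytes, $repo_charset, $default_charset) = @_;
    my $charset = defined $repo_charset ? $repo_charset : $default_charset;
    my $chars   = decode($charset, $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);           # characters -> UTF-8 bytes
}

# "für" as iso-8859-1 bytes, with no per-repository setting.
my $utf8 = log_to_utf8("f\xFCr", undef, 'iso-8859-1');
```

With the fallback in place, a repository that never sets a charset behaves like option 2, while one that does set it behaves like option 3.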

     
  • Adam Kennedy

    Adam Kennedy - 2004-03-16

    Actually, I thought German umlauts etc. were in iso-8859-1,
    or am I thinking of something else?

    For the moment, we need to go with something like option 1.

    Until CVS Monitor itself understands non-iso-8859-1
    charsets, we should just force the RSS to do the same.
    Better to be consistently broken than sometimes this,
    sometimes that.

    At the moment, to handle a previous Chinese bug, there is a
    single hard-coded per-installation charset variable located at

    CVSMonitor.pm:16

    I recommend using that for now. Write the RSS using whatever
    charset is specified there...

    At the moment, I'm assuming that HEAD is going towards 0.7,
    but I forked off a stable branch. Set the RSS to use
    $CVSMonitor::CHARSET on that stable branch.
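
As a rough illustration of writing the feed with an explicitly declared charset: the sketch below uses a local `$charset` as a stand-in for `$CVSMonitor::CHARSET`, and `xml_declaration` is a hypothetical helper, not a function from CVS Monitor or XML::Generator. It shows only the principle that the declared encoding must match the bytes actually written.

```perl
use strict;
use warnings;

# Stand-in for $CVSMonitor::CHARSET; the real value lives in CVSMonitor.pm.
my $charset = 'iso-8859-1';

# Hypothetical helper: emit an XML declaration that names the encoding,
# so parsers no longer fall back to assuming utf8.
sub xml_declaration {
    my ($encoding) = @_;
    return qq{<?xml version="1.0" encoding="$encoding"?>\n};
}

# The body bytes are written in the same charset the declaration names
# ("für" with the umlaut as the single iso-8859-1 byte 0xFC).
my $feed = xml_declaration($charset) . qq{<title>f\xFCr</title>\n};
```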

    For 0.7, I'm definitely thinking of a per-repository charset value.

    But that's another story :)

     
  • Andrew Eland

    Andrew Eland - 2004-03-16

    I've put a patch in the tracker that adds an explicit encoding.

    Umlauts are in iso-8859-1 but, unfortunately, aren't in the
    subset that maps 1-1 onto utf8.
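
The byte-level difference can be checked with a short sketch using Perl's core Encode module. The helper name `utf8_hex_for_latin1_byte` is made up for this illustration. It compares an ASCII letter with a u-umlaut, then confirms that a strict utf8 decoder rejects the bare iso-8859-1 byte.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Return the utf8 bytes, in hex, for a single iso-8859-1 byte.
sub utf8_hex_for_latin1_byte {
    my ($byte) = @_;
    my $utf8 = encode('UTF-8', decode('iso-8859-1', chr $byte));
    return join ' ', map { sprintf '%02X', ord } split //, $utf8;
}

my $ascii_a  = utf8_hex_for_latin1_byte(0x61);  # 'a': stays a single byte
my $u_umlaut = utf8_hex_for_latin1_byte(0xFC);  # u-umlaut: becomes two bytes

# A lone 0xFC byte is not valid utf8, so strict decoding croaks.
my $lone = "\xFC";
my $ok = eval {
    decode('UTF-8', $lone, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
} ? 1 : 0;
```

Only the bytes 0x00 to 0x7F are identical in both encodings, which is why plain ASCII log messages never triggered the problem.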

     
