#2236 Ampersand in slashdot.rdf encoded incorrectly

Status: closed
Owner: Rob Malda
Labels: None
Priority: 5
Updated: 2003-01-23
Created: 2003-01-23
Creator: Steve Pick
Private: No

In the auto-generated slashdot.rdf file
(http://slashdot.org/slashdot.rdf) the ampersand (&)
character is encoded as "&amp;amp;", which means that, when
converted to HTML, news stories that contain an & have
&amp; shown instead. For example, AT&T appears as
AT&amp;T. Clearly the & should be encoded as &amp;, not
&amp;amp;, or maybe I'm being an idiot who doesn't
understand XML.

Weevil
slashdot@baxpace.com

Discussion

  • Chris Nandor
    2003-01-23

    • assigned_to: nobody --> cmdrtaco
     
  • Chris Nandor
    2003-01-23

    Logged In: YES
    user_id=3660

    It is not incorrect; however, you are not misunderstanding
    XML, but the purpose of the data. The data is supplied as
    ready-to-use HTML. While we realize this is not ideal, the
    alternatives were not ideal either: we use many different
    HTML entities in the data (for example, "&eacute;"), and
    if we left it as-is, it would break the XML. So somehow,
    the entities need conversion.

    We could try to convert all entities to Latin-1 or UTF-8,
    but that would break for the majority of users who would not
    bother to re-encode it into HTML. We could use CDATA,
    which would break many RSS readers, and in the end, people
    would still get the same HTML entities anyway. Or, we could
    convert the "&" to "&amp;", and just tell people the data is
    HTML, which is what we do, since that is how most people
    will use the data.

    So the bottom line is that we decided long ago that the
    least painful way was to leave everything encoded as HTML.
    It was a design decision, and perhaps one that not everyone
    likes, but we rarely get complaints, and if we did it any
    other way, we'd have gotten more complaints (or if not
    complaints, more broken output from the renderers).

    Something you should consider doing is what we do for our
    RSS reader: run a regex on the data that only HTML-escapes
    ampersands that are not already escaped. It should handle
    99%, if not 100%, of the issues of double-escaping, not just
    for Slash, but for other sites. Here's the one we use:

    $xml =~ s/&(?!#?[a-zA-Z0-9]+;)/&amp;/g;

    It will only turn "&" into "&amp;" if it is not already
    part of an entity, so you will not convert "&amp;" into
    "&amp;amp;".
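The substitution above can be sketched outside Perl as well. The following is a minimal Python port, not the code Slash itself runs; the function name `escape_bare_ampersands` is made up for illustration. It escapes an ampersand only when it is not already followed by something that looks like an entity or character reference.

```python
import re

# Same pattern as the Perl one-liner: an "&" that is NOT followed by
# an optional "#", one or more alphanumerics, and a ";".
_BARE_AMP = re.compile(r"&(?!#?[a-zA-Z0-9]+;)")


def escape_bare_ampersands(text: str) -> str:
    """Escape only ampersands that do not already start an entity.

    Hypothetical helper name; the regex is the one quoted in the comment.
    """
    return _BARE_AMP.sub("&amp;", text)


if __name__ == "__main__":
    for sample in ("AT&T", "AT&amp;T", "R&eacute;sum&eacute;", "&#233; & more"):
        print(sample, "->", escape_bare_ampersands(sample))
```

The pattern only checks whether the text looks like an entity, so a literal string such as "AT&T;" is left untouched even though "&T;" is not a real entity. That is presumably the remaining fraction the "99%, if not 100%" estimate allows for.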

     
  • Rob Malda
    2003-01-23

    • status: open --> closed
     
  • Steve Pick
    2003-01-23

    Logged In: YES
    user_id=690817

    I think you misunderstood me.

    Currently you're running a story on AT&T. In the
    slashdot.rdf file you will see this line:

    <title>AT&amp;amp;T Identifies Widespread Security Hole - In
    Locks</title>

    The above line was taken right out of the slashdot.rdf file.
    I've not changed or parsed it at all; that's the line in
    there, with that &amp;amp; included. That's the fault I'm
    pointing out.

    It should be as follows:
    <title>AT&amp;T Identifies Widespread Security Hole - In
    Locks</title>

    But it isn't. This happens with all ampersands and it's been
    like this for ages. I checked it by downloading the file
    with wget and it's not a client side parse error. If I'm
    still missing a point just class me as an idiot and close
    this thread, but I'm pretty sure this isn't how it should be.

    Weevil

     
  • Steve Pick
    2003-01-23

    • status: closed --> open
     
  • Chris Nandor
    2003-01-23

    Logged In: YES
    user_id=3660

    No, I understood you perfectly. It is correct to have
    "&amp;amp;".

    If we put only "&amp;" in there, then what do we do with a
    headline like "Write Your R&eacute;sum&eacute;"? If we
    leave it like that, it is broken XML. If we encode it into
    Latin-1 or UTF-8, then many readers would break for various
    reasons. We need to make a decision on how to encode it.
    Rather than encode it into a given character set, we chose
    the path of least pain, and encoded it as HTML, so what you
    get back is HTML.

    In our experience, this is the path of least pain for
    everyone. No method will make everyone happy.

    I know this is confusing. I am probably the only one here
    who still understands it, after we went around and around
    trying to figure out the best way to solve our problems for
    us and our users. This is the best compromise we have.
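The two layers described here (XML on the outside, HTML on the inside) can be illustrated with a short Python sketch. It assumes a consumer that parses the feed with a standard XML parser and then treats the element text as HTML; the helper names `title_text` and `parses_as_xml` are hypothetical, and the title strings are the ones quoted in this thread.

```python
import html
import xml.etree.ElementTree as ET


def title_text(xml_fragment: str) -> str:
    """Parse a one-element XML fragment and return its text (XML layer only)."""
    return ET.fromstring(xml_fragment).text


def parses_as_xml(xml_fragment: str) -> bool:
    """Report whether the fragment is accepted by the XML parser."""
    try:
        ET.fromstring(xml_fragment)
        return True
    except ET.ParseError:
        return False


# What the feed ships: the HTML string "AT&amp;T", XML-escaped once more.
shipped = "<title>AT&amp;amp;T Identifies Widespread Security Hole - In Locks</title>"

# Step 1: XML parsing removes one layer and yields ready-to-use HTML.
as_html = title_text(shipped)

# Step 2: an HTML renderer (approximated here by html.unescape) removes the other.
as_display = html.unescape(as_html)

# A bare HTML entity is undefined in plain XML, so this fragment is rejected,
# while the double-escaped form parses and yields HTML containing &eacute;.
bare_ok = parses_as_xml("<title>Write Your R&eacute;sum&eacute;</title>")
escaped_ok = parses_as_xml("<title>Write Your R&amp;eacute;sum&amp;eacute;</title>")
```

Under these assumptions, a reader that inserts the parsed text into a page as HTML shows "AT&T", while one that escapes it again as plain text shows "AT&amp;T", which is the symptom reported at the top of this ticket.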

     
  • Chris Nandor
    2003-01-23

    • status: open --> closed