#200 CDATA section in titles is not parsed correctly

closed-fixed
nobody
None
1
2005-03-12
2005-01-26
No

See the feed at http://www.gizmodo.com/atom.xml for
frequent examples. Right now it has an entry with this
title:

<title><![CDATA[M&frac14;-Card: Yet Another Memory Card
Format]]></title>

(Of course, the feed shouldn't be using CDATA *and*
encode the entity, but Liferea isn't doing anything
wrong in that regard.)

Discussion

  • Screenshot of the Gizmodo feed display

     
    Attachments
  • Lars Windolf
    Lars Windolf
    2005-01-27

    Logged In: YES
    user_id=834800

    Looks more like a parser bug (libxml2). If the XML parser
    passes the CDATA tag through the processing program must
    handle it like text.

     
  • Lars Windolf
    Lars Windolf
    2005-01-27

    Logged In: YES
    user_id=834800

    BTW cool screenshot!

     
  • Logged In: YES
    user_id=416550

    I think that this is based on how we are parsing the titles.
    A while ago, I decided that we should re-do how we parsed
    titles, but we never did since the current code works 99.9%
    of the time. The double-encoding may not be wrong. I'll have
    to look at the atom spec again to see what the default mode
    for reading the title is. If it is encoded HTML data, then
    it may be correct to do....

     
  • Lars Windolf
    Lars Windolf
    2005-01-27

    Logged In: YES
    user_id=834800

    I disagree with you Nathan. This bug is totally independent
    of what we do. CDATA sections are a XML encoding form. They
    must never end up in the result string. Maybe we/I should
    write a minimal test program to reproduce this and post a
    libxml2 bug.

    Concerning the encoding I believe that my last changes (with
    the HTMl entity support) solved all remaining problems. They
    ensure that all XML and HTML entities are resolved and the
    resulting text is either UTF-8 text or valid HTML. And all
    strings that are displayed with Pango are preprocessed with
    unhtmlize(). So all cases are covered. I wrote a summary on
    this a while ago in the mailing list.

     
  • Logged In: YES
    user_id=416550

    Liferea has two different ways to get the data from the XML
    tree. First is pretty much dumping the XML data verbitim:
    xmlNodeListGetString. It is often used so that unescaped
    HTML data can be extracted from a XML tree easily instead of
    having to parse the whole tree and reconvert the tree into
    flat HTML data....

    The second way concatinates the various text sub-nodes of a
    node in the XML tree:extractHTMLNode/xmlNodeDump. Unlike
    above, it will unescape stuff.

    In the case of a ATOM title, it will try to parse it as a
    content construct which should default to treating it as
    plain text.... So, the current behaviour of Liferea to just
    dump everything with xmlNodeListGetString is wrong since it
    is not unescaping it.

    I think that this is a desired behaviour of libxml and it a
    improper usage of the library by Liferea.

     
  • Lars Windolf
    Lars Windolf
    2005-01-27

    Logged In: YES
    user_id=834800

    Now I understand. If this is the case we just have to used
    the second way to extract titles. So like you proposed some
    time ago to split the atom:content extraction method.

     
  • Logged In: YES
    user_id=416550

    I think that the first thing that we need to do is to examin
    what the atom specifications say are valid titles... I don't
    want to expend effort to make it parse non-standard
    things.... But, if I am remembering correctly, it should be
    un-escaping once by default....

     
  • Logged In: YES
    user_id=956060

    atom:title is a content construct. By default it is expected
    to be a PCDATA text node. You can put a subtree in there
    using atom:title/@mode='xml' and an appropriate
    content-type. I've used HTML tags in the titles of some
    entries of the plasmasturm.org feed; Liferea handled those
    correctly for the list view (though it didn't show the title
    in the box in the content pane).

     
  • Screenshot of 43 Folders feed display

     
    Attachments
  • Logged In: YES
    user_id=956060

    Here's another screenshot showing a different feed where
    there's not even a CDATA section, just a numerical character
    entity reference. Liferea incorrectly displays it verbatim.

     
  • Lars Windolf
    Lars Windolf
    2005-01-28

    Logged In: YES
    user_id=834800

    I've seen it with 0.9.0 too and did some fixes in the CVS
    version. I think this problem is fixed now.

     
  • Logged In: YES
    user_id=416550

    Ah! I finally figured out what the bug was. I've been wrong
    all along....

    The issue is that the text widget in the item list cannot
    handle CDATA sections. I will try to think of a good
    work-around. I'll probably have to re-read a lot of the
    libxml API to figure out what should be done....

     
  • Lars Windolf
    Lars Windolf
    2005-01-30

    Logged In: YES
    user_id=834800

    I don't think that we need a workaround. Because we must
    never pass XML encoding stuff (CDATA,entities...) to the
    displaying widgets (Pango and the HTML rendering widgets).
    It must be ensured that we only use the libxml2 node dump
    method if:

    - type/mode of the element is HTML
    - the elements content is not to be displayed in Pango (true
    for all titles)

    Only if both conditions are true we can just dump a node.
    Currently this isn't the case. And I think we should ignore
    the case where CDATA is used in a HTML node. So someone
    should grep' the code for all xmlNodeDump() occurences and
    ensure that they are never used for titles. This should fix
    all problems we have now.

     
  • Logged In: YES
    user_id=956060

    Yes, that sounds good.

    Question: would that mean that once the new Liferea version
    with the fix is out, all my old cached feed items will show
    up correctly?

     
  • Logged In: YES
    user_id=416550

    Lars: We must pass entities to the pango markup. Otherwise,
    it would get confused by extra '<' and '> characters. So, I
    think that we must un-entity encode all the data in the
    parsers and then re-encode it while displaying it. You
    decided that the data should be treated as encoded HTML by
    default. So, the double-encoded thing in the bug report
    seems like it is correct and should work. I got this code
    working. I'm checking it into CVS, but it might need changes
    if it doesn't work for various things... I have not done
    that much testing.

    I think that we probably should change the default type of
    content from HTML back to text as specified in the RFC.

    Apag: The old cached feed items will not show up correctly.
    However, if items are still in the feed, they will probably
    be updated with their correct titles.

     
  • Lars Windolf
    Lars Windolf
    2005-01-30

    Logged In: YES
    user_id=834800

    I believe the current solution including entity resolving
    and reescaping is 100% correct. The bug we are currently
    talking about concerns only XML encoding tags that are not
    interpreted by Pango and therefore must not end up there.

    Again: I'm 100% sure, the encoding handling is correct. The
    escaping you demand is in the code (ui_itemlist.c:211). This
    works. Please don't change it when there are no problems
    with it. I have extensivly tested. But I must admit that I
    never had the idea to test CDATA sections with Atom titles.

    I think only the Atom parsing code (I think you checked in a
    fix some minutes ago) needs to be changed.

     
  • Logged In: NO

    It looks like we are agreed, then. No changes to the
    itemlist code, but changes to the parser. My only concern
    was that I had broken the parser.

    The example in this bug report now (with the checkin I did)
    gets treated as entity-encoded HTML by default. So the item
    title is "&frac14;-Card: Yet Another Memory Card". The This
    is treated as HTML. It is passed through unhtmlize because
    it is used as a title. unhtmlize unencoded entities (other
    than amp,gt,lt) , so it ends up saving the item title as
    "M1/4-Card....".

    If it were treated as text instead of HTML, it would end up
    being "M&frac14;-Card:...."

    -Nathan

     
    • status: open --> open-fixed
     
    • priority: 5 --> 1
     
  • Logged In: YES
    user_id=416550

    Fixed in CVS.

     
  • Lars Windolf
    Lars Windolf
    2005-03-12

    • status: open-fixed --> closed-fixed
     
  • Lars Windolf
    Lars Windolf
    2005-03-12

    Logged In: YES
    user_id=834800

    Fix released with 0.9.1. Please reopen if the problem remains.