#3 No parsing with blank lines

Status: closed-invalid
Owner: nobody
Labels: XML-Parsing (4)
Priority: 5
Updated: 2002-08-13
Created: 2002-08-05
Private: No

Version 1.1.5 doesn't appear to be able to parse RSS documents
when they contain blank lines right before the <?xml ...?>
declaration.

Blank lines should be ignored when parsing xml
documents.

Discussion

  • Cinek

    Cinek - 2002-08-12

    Logged In: YES
    user_id=107752

    Well, the exception which is being thrown is not caused by
    RssView directly. This is thrown by crimson and maybe it is
    right.

    Please read the XML specification on:
    http://www.w3.org/TR/2000/REC-xml-20001006#NT-document

    It appears to me that no blank lines are allowed before the
    <?xml ...?> declaration. I cannot simply ignore the
    specification. If the document is not well-formed, the
    application is allowed to ignore it or show errors.

    Can You give me a proof that it is allowed to insert empty
    lines before "<?xml ...?>" ?

     
  • Cinek

    Cinek - 2002-08-12
    • status: open --> pending
     
  • Erik C. Thauvin

    Erik C. Thauvin - 2002-08-12
    • status: pending --> open
     
  • Erik C. Thauvin

    Erik C. Thauvin - 2002-08-12

    Logged In: YES
    user_id=216797

    You're looking at the wrong specification. Before starting to
    parse the xml document, you need to properly extract the
    response from the HTTP request.

    See:

    http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.1

    Hope this helps,

    E.

     
  • Cinek

    Cinek - 2002-08-12

    Logged In: YES
    user_id=107752

    It doesn't help very much, because I don't handle any
    HTTP protocol in my sources.

    The whole fetching is done at this place:

    try {
        parser.parse(new InputSource(channel.getURL()), this);
    } catch (Exception e) {
        System.err.println("Error while fetching channel: " + channel.getName());
        e.printStackTrace();
        return;
    }

    It's the InputSource class which fetches everything
    automatically. I just pass the URL, that's all.

    Any other suggestions?

     
  • Cinek

    Cinek - 2002-08-12
    • status: open --> pending
     
  • Erik C. Thauvin

    Erik C. Thauvin - 2002-08-12

    Logged In: YES
    user_id=216797

    You could implement a simple error handler for your parser. I
    believe it will return a warning/recoverable error and not a fatal
    error.

    E.

     
  • Erik C. Thauvin

    Erik C. Thauvin - 2002-08-12
    • status: pending --> open
     
  • Cinek

    Cinek - 2002-08-13
    • status: open --> pending
     
  • Cinek

    Cinek - 2002-08-13

    Logged In: YES
    user_id=107752

    I am catching the parser exception, as You can see. It
    produces informative output and still lets RssView work. The
    page contents are ignored if they are not RSS.

    Can You send me the URL of the RSS document which You are
    trying to access with RssView?

    I want to check whether it complies with the XML specification.

     
  • Erik C. Thauvin

    Erik C. Thauvin - 2002-08-13

    Logged In: YES
    user_id=216797

    No, not really. The only thing you're doing is trapping the
    exception. If you look at the SAX documentation, you'll see
    two things:

    1. It explicitly recommends that all applications implement an
    error handler.

    2. If no error handler is provided, all errors will return an
    exception, regardless of whether they are warnings,
    recoverable errors or fatal errors.
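
The error handler discussed in this thread could look roughly like the sketch below. It is only an illustration built on the standard `org.xml.sax.ErrorHandler` interface; the class name `LoggingErrorHandler`, its counters and the `parse` helper are made up for this example and are not part of RssView. It assumes the parser bundled with a current JDK, which may behave differently from crimson. With that parser, whitespace before the `<?xml ...?>` declaration is expected to arrive through `fatalError`, so a handler can log it more clearly but probably cannot recover from it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical handler: counts and logs each severity, rethrows fatal errors.
final class LoggingErrorHandler implements ErrorHandler {
    int warnings;
    int errors;
    int fatals;

    @Override
    public void warning(SAXParseException e) {
        warnings++;
        log("warning", e);
    }

    @Override
    public void error(SAXParseException e) {
        // Recoverable error: log it and let the parser continue.
        errors++;
        log("error", e);
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        // Well-formedness violation: parsing cannot safely continue.
        fatals++;
        log("fatal", e);
        throw e;
    }

    private static void log(String level, SAXParseException e) {
        System.err.println(level + " at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    // Parses the given text with the handler installed and returns the handler.
    static LoggingErrorHandler parse(String xml) {
        LoggingErrorHandler handler = new LoggingErrorHandler();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // Already counted and logged by fatalError above.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler;
    }
}
```

Calling `LoggingErrorHandler.parse(text)` and inspecting the three counters shows which callback a given document triggers.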

    I don't recall which site it was, but I've set up a test page for
    you at:

    http://www.thauvin.net/blog/xmltest.jsp

    which RSSViewer will fail to parse. The exact same page
    with the <?xml ...?> as the first line is at:

    http://www.thauvin.net/blog/xml.jsp

    Hope this helps,

    E.

     
  • Erik C. Thauvin

    Erik C. Thauvin - 2002-08-13
    • status: pending --> open
     
  • Cinek

    Cinek - 2002-08-13

    Logged In: YES
    user_id=107752

    OK, I have checked this. You are of course right that an
    error handler should be used for errors, but it is a simple
    application which doesn't use SAX very intensively.

    You might be surprised by what I say now, but I have to tell You
    that the first page which You told me of:
    http://www.thauvin.net/blog/xmltest.jsp
    is NOT a VALID and not even a WELL-FORMED XML document.
    You might want to look at the specification of XML version
    1.0, as I told You before. You can also try to load it with
    Mozilla or parse it with any available parser.
    Another reason is that I cannot simply fix this issue
    without breaking XML conformance:
    http://www.w3.org/XML/Test/
    http://www.w3.org/XML/Test/

    Also look at the test-suite directly, there is an example
    from OASIS which says that the XML-document is not well-formed:
    p01fail1.xml (Reason: S cannot occur before the prolog)
    That means no whitespace before prolog.
    Whitespace includes \n and \r, of course.
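
The rule cited from the test suite can be checked locally with a small sketch like the one below. It assumes the SAX parser bundled with a current JDK rather than crimson, and the class and method names (`PrologCheck`, `isWellFormed`) are invented for this illustration. The same document is parsed twice, once as-is and once with two blank lines in front of the declaration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for experimenting with well-formedness checks.
final class PrologCheck {
    // Returns true if the default JDK SAX parser accepts the text as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            // DefaultHandler rethrows fatal errors, so a violation ends up here.
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String doc = "<?xml version=\"1.0\"?><rss version=\"0.91\"/>";
        System.out.println(isWellFormed(doc));          // declaration on the first line
        System.out.println(isWellFormed("\n\n" + doc)); // blank lines before the declaration
    }
}
```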

    According to this test suite, parsers are forced to show
    errors; any other behavior would break XML conformance.
    Sorry that I have to insist on this like I do here, but You
    should not forget that XML is not HTML, where tags can be
    placed in every possible way (even breaking everything!).
    XML is not designed to be flexible about error correction.

     
  • Cinek

    Cinek - 2002-08-13
    • status: open --> closed-invalid
     
  • Erik C. Thauvin

    Erik C. Thauvin - 2002-08-13

    Logged In: YES
    user_id=216797

    That's beside the point. Your software is not properly
    extracting the HTML document body, per the previously
    referenced specifications. If it did, the document would
    validate as well-formed.

    It's up to you if you want to be the only one not dealing with
    it, but all of the other viewers I've tried (Radio UserLand,
    NetNewsWire, AmphetaDesk, Headline Viewer, FreeReader
    and even IE), had no problems with it.

    Let me know if you ever decide to fix it and I might give
    RSSViewer another try.

     
  • Cinek

    Cinek - 2002-08-13

    Logged In: YES
    user_id=107752

    Well, I'm sorry, but I parse only well-formed RSS/RDF files.
    You can parse Your example sites with the validator on:
    http://www.w3.org/RDF/Validator/

    And then compare (e.g.) to the correct output of:
    http://slashdot.org/slashdot.rdf

    You will see that Slashdot's site produces a nice RDF-graph,
    which is full of information.

    Once again, I want to remind You that I don't have any HTTP
    protocol handlers or any parser handling in my code.
    Everything is done automatically by Java. The parser I am
    using is called crimson. The HTTP protocol is directly
    supported in Java. I simply give crimson the stream which is
    being decoded by Java. I've shown You that I only pass
    the URL in my code and nothing else.
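
For readers who want the lenient behavior requested in this report, one possible workaround is to open the stream yourself and skip leading whitespace before handing it to the parser. The sketch below is only an illustration and is not what RssView does; `LenientFeedInput`, `skipLeadingWhitespace` and `rootElementOf` are hypothetical names. It assumes an ASCII-compatible encoding such as UTF-8 without a byte order mark, and it deliberately accepts input that a conforming parser would reject.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: tolerates blank lines in front of the XML declaration.
final class LenientFeedInput {
    // Returns a stream positioned at the first byte that is not space, tab, CR or LF.
    static InputStream skipLeadingWhitespace(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 1);
        int b;
        while ((b = pushback.read()) != -1) {
            if (b != ' ' && b != '\t' && b != '\r' && b != '\n') {
                pushback.unread(b); // put the first real byte back
                break;
            }
        }
        return pushback;
    }

    // Parses the bytes after skipping leading whitespace and returns the root element name.
    static String rootElementOf(byte[] feedBytes) {
        final String[] root = new String[1];
        try {
            InputStream body = skipLeadingWhitespace(new ByteArrayInputStream(feedBytes));
            SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(body),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if (root[0] == null) {
                            root[0] = qName; // remember only the first element seen
                        }
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("feed could not be parsed", e);
        }
        return root[0];
    }
}
```

In an application, the byte array would be replaced by a stream opened from the feed URL, and the resulting `InputSource` would be passed to the existing `parser.parse(...)` call.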

     
