Thread: [Sax-devel] SAX - endDocument() confusion again
Brought to you by:
dmegginson
From: Karl W. <ka...@wa...> - 2004-02-27 19:17:55
|
I was just reading through the ErrorHandler.fatalError() documentation, and there it says: <quote> The application must assume that the document is unusable after the parser has invoked this method, and should continue (if at all) only for the sake of collecting additional error messages: in fact, SAX parsers are free to stop reporting any other events once this method has been invoked. </quote> Since there is no strong and explicit assertion in the SAX docs that EndDocument *must* always be called, this would indicate to me that in fact there is no such assertion at all, as this looks like a proof to the contrary. To me this would also mean that any exceptions thrown should stop the parser without calling endDocument(), as an exception is certainly a stronger error condition than a fatalError() call-back that does not throw an exception. Any comments? Did I overlook/misunderstand something? Karl |
From: Karl W. <ka...@wa...> - 2004-02-27 20:38:14
|
----- Original Message ----- From: "Jeff Rafter" <li...@je...> To: "Karl Waclawek" <ka...@wa...>; <xm...@li...> Cc: <sax...@li...> Sent: Friday, February 27, 2004 2:54 PM > > To me this would also mean that any exceptions thrown > > should stop the parser without calling endDocument(), as > > an exception is certainly a stronger error condition than > > a fatalError() call-back that does not throw an exception. > > I would be a little more open about it. I would say that because it "may" > continue passing events, it is optional either way. With that being said, I > think that guaranteeing endDocument for the purpose of cleanup is useful-- I > just can't find it anywhere explicit. For cleanup, especially for a chain of SAX filters, one problem is that when parsing stops due to an exception, the processors down the chain will not know whether the stop was due to an error or not, as they will know nothing about the exception. What about this then: - Add an error argument to endDocument, like in public void endDocument(SAXParseException exception) throws SAXException which can be null. - Require endDocument to be called even in case of a call-back exception, and have its argument wrap that exception. This could even replace the fatalError call-back, unless a parser wants to continue reporting (but what? - isn't that contrary to the definition of a fatal error?). IMO, guaranteeing the call to endDocument() without the ability to pass information about the reason/status severely limits the usefulness of that guarantee. Karl |
From: Elliotte R. H. <el...@me...> - 2004-02-27 21:05:25
|
At 3:27 PM -0500 2/27/04, Karl Waclawek wrote: >What about this then: > >- Add an error argument to endDocument, like in > > public void endDocument(SAXParseException exception) > throws SAXException > > which can be null. This would be backwards incompatible. Maybe if at some point in the future it's decided we need a backwards incompatible version of SAX, but not feasible for the immediate future in the Java 1.5 time frame. Even in the indefinite future, I think the real way to handle this is a stack of nested exceptions thrown by parse(). -- Elliotte Rusty Harold el...@me... Effective XML (Addison-Wesley, 2003) http://www.cafeconleche.org/books/effectivexml http://www.amazon.com/exec/obidos/ISBN%3D0321150406/ref%3Dnosim/cafeaulaitA |
From: Karl W. <ka...@wa...> - 2004-02-27 23:52:55
|
----- Original Message ----- From: "Elliotte Rusty Harold" <el...@me...> To: "Karl Waclawek" <ka...@wa...> Cc: <xm...@li...>; <sax...@li...> Sent: Friday, February 27, 2004 3:42 PM > At 3:27 PM -0500 2/27/04, Karl Waclawek wrote: > > > >What about this then: > > > >- Add an error argument to endDocument, like in > > > > public void endDocument(SAXParseException exception) > > throws SAXException > > > > which can be null. > > This would be backwards incompatible. Maybe if at some point in the > future it's decided we need a backwards incompatible version of SAX, > but not feasible for the immediate future in the Java 1.5 time frame. Yes, I was just thinking out loud. > Even in the indefinite future, I think the real way to handle this > is a stack of nested exceptions thrown by parse(). Would this solve the problem of propagating the error info through a chain of filters? Karl |
From: Elliotte R. H. <el...@me...> - 2004-02-28 00:54:19
|
At 6:43 PM -0500 2/27/04, Karl Waclawek wrote: >> Even in the indefinite future, I think the real way to handle this >> is a stack of nested exceptions thrown by parse(). > >Would this solve the problem of propagating the error info through >a chain of filters? I think it could. Each layer just catches the exception thrown by the underlying layer, wraps it in a new exception, and tosses it to the layer above it (unless it wants to fix the problem somehow. Right now I'm toying with the idea of an XMLFilter that fixes all the bugs I've uncovered in Xerces, including incorrect exception handling.) -- Elliotte Rusty Harold |
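A minimal sketch of the catch-wrap-rethrow idea described above, assuming each layer is an XMLFilterImpl subclass. The class names WrappingFilter and NestedExceptionDemo and the message text are hypothetical and only serve the illustration; the helper counts how many exceptions end up nested when a malformed document passes through two such layers.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter layer: catches whatever the layer below throws,
// wraps it in a new SAXException, and passes it to the layer above.
class WrappingFilter extends XMLFilterImpl {
    private final String layerName;

    WrappingFilter(String layerName, XMLReader parent) {
        super(parent);
        this.layerName = layerName;
    }

    @Override
    public void parse(InputSource input) throws SAXException, IOException {
        try {
            super.parse(input);
        } catch (SAXException e) {
            // Keep the original exception reachable through getException().
            throw new SAXException(layerName + ": parse failed", e);
        }
    }
}

class NestedExceptionDemo {
    // Returns 0 if parse() returned normally, otherwise the number of
    // exceptions found by following SAXException.getException().
    static int nestingDepth(String xml) {
        try {
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            XMLReader chain = new WrappingFilter("outer", new WrappingFilter("inner", parser));
            chain.parse(new InputSource(new StringReader(xml)));
            return 0;
        } catch (SAXException e) {
            int depth = 0;
            Exception current = e;
            while (current != null) {
                depth++;
                current = (current instanceof SAXException)
                        ? ((SAXException) current).getException()
                        : null;
            }
            return depth;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The caller of the outermost parse() can walk the chain to find the original SAXParseException. Handlers further down a ContentHandler chain still see none of this, which is the point raised in the follow-up messages.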
From: Karl W. <ka...@wa...> - 2004-02-28 04:48:28
|
----- Original Message ----- From: "Elliotte Rusty Harold" <el...@me...> To: "Karl Waclawek" <ka...@wa...> Cc: <xm...@li...>; <sax...@li...> Sent: Friday, February 27, 2004 7:33 PM > At 6:43 PM -0500 2/27/04, Karl Waclawek wrote: > > >> Even in the indefinite future, I think the real way to handle this > >> is a stack of nested exceptions thrown by parse(). > > > >Would this solve the problem of propagating the error info through > >a chain of filters? > > I think it could. Each layer just catches the exception thrown by the > underlying layer, wraps it in a new exception, and tosses it to the > layer above it (unless it wants to fix the problem somehow. Right now > I'm toying with the idea of an XMLFilter that fixes all the bugs I've > uncovered in Xerces, including incorrect exception handling.) I thought of the reverse direction. How would a filter *down* the chain deal with an "end of parsing" if it doesn't know why parsing stopped? Was it the end of the document? Was there an exception? I think someone else pointed that out already, but I missed the name. Karl |
From: Elliotte R. H. <el...@me...> - 2004-02-28 13:16:22
|
At 11:38 PM -0500 2/27/04, Karl Waclawek wrote: >I thought of the reverse direction. >How would a filter *down* the chain deal with an "end of parsing" >if it doesn't know why parsing stopped. Was it the end of the document? >Was there an exception? I think someone else pointed that out already, >but I missed the name. I'm not sure I see your problem. The parse() method either returns or throws an exception. If it returns normally, the end of document was seen. If it throws an exception it wasn't. Of course the filter only really knows what the previous filter in the chain tells it. It doesn't know and doesn't need to know what filters earlier in the chain say. That's the point of filtering. -- Elliotte Rusty Harold |
From: Karl W. <ka...@wa...> - 2004-02-28 15:39:41
|
----- Original Message ----- From: "Elliotte Rusty Harold" <el...@me...> To: "Karl Waclawek" <ka...@wa...> Cc: <xm...@li...>; <sax...@li...> Sent: Saturday, February 28, 2004 6:35 AM > At 11:38 PM -0500 2/27/04, Karl Waclawek wrote: > > >I thought of the reverse direction. > >How would a filter *down* the chain deal with an "end of parsing" > >if it doesn't know why parsing stopped. Was it the end of the document? > >Was there an exception? I think someone else pointed that out already, > >but I missed the name. > > > I'm not sure I see your problem. The parse() method either returns or > throws an exception. If it returns normally, the end of document was > seen. If it throws an exception it wasn't. Actually, I was thinking rather in terms of a chain of ContentHandlers, not true XMLFilter implementations. I didn't make that clear. Sometimes you don't know where the end of the chain is, so you can't call parse() on it and you have to drive the chain from the parser end. You are obviously right with your argument for XMLFilters. > Of course the filter only really knows what the previous filter in the > chain tells it. It doesn't know and doesn't need to know what filters > earlier in the chain say. That's the point of filtering. Yes, as above, I was not clear at all. The problem I described only arises when driving the "contentHandler-filter" chain from the parser end, in which case the filters down the chain have no parse() call to evaluate. Karl |
From: Karl W. <ka...@wa...> - 2004-02-28 16:39:06
|
----- Original Message ----- From: "Jeff Rafter" <li...@je...> To: "Karl Waclawek" <ka...@wa...>; "Elliotte Rusty Harold" <el...@me...> Cc: <xm...@li...>; <sax...@li...> Sent: Saturday, February 28, 2004 10:07 AM > Perhaps Karl is referring to what I see as a far more common approach-- > which is to simply filter a specific handler (i.e., create a passthrough > ContentHandler that acts on or modifies ContentHandler information as it is > passed through). Correct. My fault that I didn't make that clear. Karl |
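To make the pass-through scenario concrete, here is a rough sketch of a ContentHandler chain driven from the parser end. The AbortListener interface and its parseAborted() method are purely hypothetical and not part of SAX; they stand in for "some way to tell downstream handlers why parsing stopped". The other class names are illustrative as well.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical extension, NOT part of SAX: tells a handler that the
// parse was abandoned and why.
interface AbortListener {
    void parseAborted(Exception cause);
}

// Pass-through stage. XMLFilterImpl forwards every ContentHandler event
// to the handler registered with setContentHandler(); no parent
// XMLReader is involved, so nobody ever calls parse() on this stage.
class PassThroughStage extends XMLFilterImpl implements AbortListener {
    PassThroughStage(ContentHandler next) {
        setContentHandler(next);
    }

    @Override
    public void parseAborted(Exception cause) {
        ContentHandler next = getContentHandler();
        if (next instanceof AbortListener) {
            ((AbortListener) next).parseAborted(cause);
        }
    }
}

// End of the chain: records what it was told.
class RecordingSink extends DefaultHandler implements AbortListener {
    boolean sawEndDocument;
    Exception abortCause;

    @Override
    public void endDocument() {
        sawEndDocument = true;
    }

    @Override
    public void parseAborted(Exception cause) {
        abortCause = cause;
    }
}

class ChainDriver {
    // Drives the chain from the parser end. Only this code sees the
    // exception thrown by parse(), so it has to pass the news along.
    static RecordingSink drive(String xml) {
        RecordingSink sink = new RecordingSink();
        PassThroughStage head = new PassThroughStage(new PassThroughStage(sink));
        try {
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            parser.setContentHandler(head);
            parser.setErrorHandler(new DefaultHandler()); // rethrows fatal errors
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            head.parseAborted(e);
        }
        return sink;
    }
}
```

Without something like the hypothetical parseAborted() call, the sink only ever receives ContentHandler events and cannot tell an abandoned parse from one that is merely slow to finish.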
From: Elliotte R. H. <el...@me...> - 2004-02-27 20:49:03
|
At 2:07 PM -0500 2/27/04, Karl Waclawek wrote: ><quote> >The application must assume that the document is unusable after the parser >has invoked this method, and should continue (if at all) only for the sake >of collecting additional error messages: in fact, SAX parsers are free to >stop reporting any other events once this method has been invoked. ></quote> Yuck. That is nasty. This is why specs need a normative test suite. However, I think the text under endDocument() is more important: The SAX parser will invoke this method only once, and it will be the last method invoked during the parse. The parser shall not invoke this method until it has either abandoned parsing (because of an unrecoverable error) or reached the end of input. Clearly it's OK to call endDocument after a fatalError. Is it required? I think so, but it's arguable. David Brownell claims it's required <http://www.geocrawler.com/mail/msg.php3?msg_id=8561128&list=13179> and it's certainly useful to be able to depend on this. In practice, the only two actively developed SAX parsers for Java (Xerces-J and Oracle) do not call endDocument. Neither does Crimson. Some lesser known parsers like the various AElfred derivatives do. -- Elliotte Rusty Harold |
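Given that parsers differ on whether endDocument() follows a fatal error, one defensive sketch is to keep cleanup idempotent and trigger it from a finally block around parse() as well as from endDocument(). The class names LifecycleHandler and EndDocumentDemo are hypothetical, and the flags exist only so the behaviour of a particular parser can be observed.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records which lifecycle callbacks it received.
class LifecycleHandler extends DefaultHandler {
    boolean started;
    boolean ended;
    boolean fatal;
    boolean cleanedUp;

    @Override
    public void startDocument() {
        started = true;
    }

    @Override
    public void endDocument() {
        ended = true;
        cleanUp();
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        fatal = true;
        throw e; // same behaviour as DefaultHandler, plus the flag
    }

    // Idempotent: safe to call from endDocument() and again from finally.
    void cleanUp() {
        cleanedUp = true;
    }
}

class EndDocumentDemo {
    static LifecycleHandler run(String xml) {
        LifecycleHandler handler = new LifecycleHandler();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Malformed input, I/O failure, or parser setup problem; the
            // sketch only cares that parse() did not return normally.
        } finally {
            // Does not rely on endDocument() having been called.
            handler.cleanUp();
        }
        return handler;
    }
}
```

After run() on a malformed document, the ended flag shows whether the parser in use called endDocument() after the fatal error; cleanedUp is true either way.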
From: Karl W. <ka...@wa...> - 2004-02-27 23:50:12
|
----- Original Message ----- From: "Elliotte Rusty Harold" <el...@me...> To: "Karl Waclawek" <ka...@wa...> Cc: <xm...@li...>; <sax...@li...> Sent: Friday, February 27, 2004 3:27 PM > ><quote> > >The application must assume that the document is unusable after the parser > >has invoked this method, and should continue (if at all) only for the sake > >of collecting additional error messages: in fact, SAX parsers are free to > >stop reporting any other events once this method has been invoked. > ></quote> > > Yuck. That is nasty. This is why specs need a normative test suite. > However, I think the text under endDocument() is more important: > > The SAX parser will invoke this method only once, and it will be the > last method invoked during the parse. The parser shall not invoke > this method until it has either abandoned parsing (because of an > unrecoverable error) or reached the end of input. > > Clearly it's OK to call endDocument after a fatalError. Is it > required? I think so, but it's arguable. David Brownell claims it's > required > <http://www.geocrawler.com/mail/msg.php3?msg_id=8561128&list=13179> > and it's certainly useful to be able to depend on this. I admit I am not a native speaker, but IMO the wording above would not contradict the behaviour of an exception stopping the parser cold. Exceptions are normally thought of as the "exceptional" case, and documenting the behaviour of an implementation does usually not imply that it will behave the same when an exception is thrown. I would think calling endDocument() always, as long as no exception is thrown, is a reasonable behaviour from a programmer's point of view. Karl |
From: Elliotte R. H. <el...@me...> - 2004-02-28 00:54:13
|
At 6:40 PM -0500 2/27/04, Karl Waclawek wrote: >I admit I am not a native speaker, but IMO the wording above >would not contradict the behaviour of an exception stopping >the parser cold. Exceptions are normally thought of as >the "exceptional" case, and documenting the behaviour of an >implementation does usually not imply that it will behave >the same when an exception is thrown. > A very good point. However, I think the sort of exception you're describing is only a truly exceptional exception such as an I/O error like a broken socket or an out of memory condition. I'm not sure a malformed document qualifies as exceptional in this context. There's no reason, after all, the parse method has to throw an exception to indicate malformedness. It could easily have returned a boolean indicating whether or not the document was well-formed. Not that I'm suggesting such a change at this late date, of course. I just want to point out that there are other ways to design such an API that don't rely on exceptions. In practice I encounter malformed documents far more often than I/O errors, out of memory errors, and similar problems. They just don't feel that exceptional to me. -- Elliotte Rusty Harold |
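The boolean-returning design mentioned above can be approximated on top of the existing API. This is only a sketch of a wrapper, not a proposed change to SAX; the class and method names are hypothetical. Malformedness becomes a false return value, while I/O and configuration problems remain exceptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedness {
    // Returns true if the parse completed, false if the parser reported
    // a parse error. Anything else is treated as truly exceptional.
    static boolean isWellFormed(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // DefaultHandler.fatalError() rethrows the SAXParseException.
            reader.setErrorHandler(new DefaultHandler());
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            return false; // malformed document: an ordinary outcome here
        } catch (IOException e) {
            throw new UncheckedIOException(e); // broken stream and the like
        } catch (SAXException | ParserConfigurationException e) {
            throw new IllegalStateException(e); // parser setup or other SAX failure
        }
    }
}
```

A caller can then write `if (WellFormedness.isWellFormed(text)) { ... }` and reserve try/catch for the failures that are not about the document's content.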
From: Karl W. <ka...@wa...> - 2004-02-28 04:45:55
|
----- Original Message ----- From: "Elliotte Rusty Harold" <el...@me...> To: "Karl Waclawek" <ka...@wa...> Cc: <xm...@li...>; <sax...@li...> Sent: Friday, February 27, 2004 7:30 PM > At 6:40 PM -0500 2/27/04, Karl Waclawek wrote: > > >I admit I am not a native speaker, but IMO the wording above > >would not contradict the behaviour of an exception stopping > >the parser cold. Exceptions are normally thought of as > >the "exceptional" case, and documenting the behaviour of an > >implementation does usually not imply that it will behave > >the same when an exception is thrown. > > > > A very good point. However, I think the sort of exception you're > describing is only a truly exceptional exception such as an I/O error > like a broken socket or an out of memory condition. I'm not sure a > malformed document qualifies as exceptional in this context. I agree. I rather thought of exceptions thrown in the call-backs, based on how the application deals with the information it receives. > There's > no reason, after all, the parse method has to throw an exception to > indicate malformedness. It could easily have returned a boolean > indicating whether or not the document was well-formed. Not that I'm > suggesting such a change at this late date, of course. I just want to > point out that there are other ways to design such an API that don't > rely on exceptions. That would be more along my line of thinking anyway ... > In practice I encounter malformed documents far > more often than I/O errors, out of memory errors, and similar > problems. They just don't feel that exceptional to me. Absolutely. Karl |
From: Elliotte R. H. <el...@me...> - 2004-02-27 22:05:21
|
At 2:07 PM -0500 2/27/04, Karl Waclawek wrote: >Since there is no strong and explicit assertion in the SAX docs >that EndDocument *must* always be called, this would indicate to me >that in fact there is no such assertion at all, as this looks >like a proof to the contrary. Taking this argument to extremes, is it acceptable for a parser not to call startDocument? Just to call fatalError? I have caught parsers doing this, especially when the error is very early in the document; e.g. in the byte order mark or the XML declaration. -- Elliotte Rusty Harold |
From: Karl W. <ka...@wa...> - 2004-02-27 23:36:34
|
----- Original Message ----- From: "Elliotte Rusty Harold" <el...@me...> To: "Karl Waclawek" <ka...@wa...> Cc: <xm...@li...>; <sax...@li...> Sent: Friday, February 27, 2004 4:45 PM > At 2:07 PM -0500 2/27/04, Karl Waclawek wrote: > > >Since there is no strong and explicit assertion in the SAX docs > >that EndDocument *must* always be called, this would indicate to me > >that in fact there is no such assertion at all, as this looks > >like a proof to the contrary. > > > Taking this argument to extremes, is it acceptable for a parser not > to call startDocument? Just to call fatalError? I have caught parsers > doing this, especially when the error is very early in the document; > e.g. in the byte order mark or the XML declaration. I would say that at this point the document has started, and it is very simple to call startDocument before reading a byte order mark or XML declaration, and at the moment I can't think of any practical implications as for endDocument(). In short, I would say no, but I am writing this in a hurry and haven't checked the specs. On the other hand, there are two problems with endDocument(): 1) Not clearly documented 2) Which is the right behaviour? Not so easy to answer. There are issues around cleanup and filter chains, the semantics of an implicit try .. finally (you normally don't expect observable behaviour from a call once your call-back has thrown an exception) Karl |
From: Elliotte R. H. <el...@me...> - 2004-02-28 13:16:25
|
At 12:03 PM +0200 2/28/04, Toni Uusitalo wrote: >I decided not to call startDocument/endDocument-pair when BOM or XML >declaration or setting forced encoding fails. Reason for this was I >wanted document's actual encoding to be known at the startDocument >stage (as there isn't necessarily xml declaration present of course). I see your point. Hmm, should we adjust Locator2.getEncoding() and getXMLVersion() so that they can return null if this information is not available? -- Elliotte Rusty Harold |
From: Karl W. <ka...@wa...> - 2004-02-28 15:48:23
|
----- Original Message ----- From: "Toni Uusitalo" <ton...@pa...> To: "Elliotte Rusty Harold" <el...@me...> Cc: <xm...@li...> Sent: Saturday, February 28, 2004 5:03 AM > I decided not to call startDocument/endDocument-pair when BOM or XML > declaration or setting forced encoding fails. Reason for this was I wanted > document's actual encoding to be known at the startDocument stage (as there > isn't necessarily xml declaration present of course). Then again I'm > talking about my SAX C library Parsifal which is far away from "official" > SAX java implementations. Good point. However, the docs say this: <quote> Note that the locator will return correct information only during the invocation SAX event callbacks after startDocument returns and before endDocument is called. The application should not attempt to use it at any other time. </quote> Karl |
From: Elliotte R. H. <el...@me...> - 2004-02-28 17:25:20
|
At 10:38 AM -0500 2/28/04, Karl Waclawek wrote: >Good point. However, the docs say this: > ><quote> >Note that the locator will return correct information only during >the invocation >SAX event callbacks after startDocument returns and before >endDocument is called. >The application should not attempt to use it at any other time. ></quote> >Karl
Yes, but the situation is this:
1. Parser opens stream.
2. Parser calls setDocumentLocator and passes in a Locator2.
3. Parser calls startDocument.
4. Parser parses XML declaration. Oops, there's an error!
5. Parser calls fatalError.
6. In fatalError, client calls Locator2.getXMLVersion and/or getEncoding.
7. Parser calls endDocument.
What is the Locator2 supposed to return? It's to avoid this problem that Parsifal is not calling either setDocumentLocator or startDocument until the declaration (or lack thereof) has been successfully parsed.
Possibly there's a loophole, however. The Locator JavaDoc says: Note that the results returned by the object will be valid only during the scope of each callback method: the application will receive unpredictable results if it attempts to use the locator at any other time, or after parsing completes.
I think it's reasonable to say that parsing was completed before fatalError was called; not completed successfully of course, but definitely completed. Therefore, there's no place in this chain where the Locator methods can be expected to return correct information. Therefore Parsifal (and other SAX parsers) should indeed call startDocument, fatalError, and endDocument in that order when they encounter an error very early in the XML document. If the client uses the Locator2 object at any point in this process, they deserve what they get.
Possible problem with this chain of reasoning: in more normal circumstances it's very useful to use a Locator in fatalError() to get the line and column number where the error appears. This logic would forbid such use.
Hmm, still another tricky bit, this one not even requiring an error:
1. Parser opens stream.
2. Parser calls setDocumentLocator and passes in a Locator2.
3. Parser calls startDocument. startDocument returns.
4. Another thread uses the Locator2 object before the parser has parsed the XML declaration.
What do getXMLVersion and getEncoding return? I think the best solution is to make a minor fix to the Locator2 JavaDocs that allows (indeed requires) these methods to return null at points where the version and encoding are not known. -- Elliotte Rusty Harold |
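On the client side, the suggestion that getEncoding() and getXMLVersion() may return null translates into a small amount of defensive code in fatalError(). The following is a sketch under that assumption; the class name LocatorAwareHandler, the describe() helper, and the report format are made up for illustration.

```java
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: builds an error report without assuming that
// the encoding or XML version is known when the fatal error occurs.
class LocatorAwareHandler extends DefaultHandler {
    private Locator locator;
    String report;

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        report = describe(locator, e);
        throw e;
    }

    // Line and column come from the exception itself; encoding and
    // version are taken from the locator only if it is a Locator2 and
    // only if it actually has values for them.
    static String describe(Locator locator, SAXParseException e) {
        String encoding = null;
        String version = null;
        if (locator instanceof Locator2) {
            Locator2 extended = (Locator2) locator;
            encoding = extended.getEncoding();
            version = extended.getXMLVersion();
        }
        return "line " + e.getLineNumber()
                + ", column " + e.getColumnNumber()
                + ", encoding " + (encoding != null ? encoding : "unknown")
                + ", version " + (version != null ? version : "unknown");
    }
}
```

Reading the position from the SAXParseException rather than from the Locator also sidesteps the question of whether the Locator is still valid inside fatalError().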
From: Karl W. <ka...@wa...> - 2004-02-28 18:09:44
|
----- Original Message ----- From: "Elliotte Rusty Harold" <el...@me...> To: "Karl Waclawek" <ka...@wa...> Cc: "Toni Uusitalo" <ton...@pa...>; <xm...@li...>; <sax...@li...> > Yes, but the situation is this: > > 1. Parser opens stream. > 2. Parser calls setDocumentLocator and passes in a Locator2. > 3. Parser calls startDocument > 4. Parser parses XML declaration. Oops there's an error! > 5. Parser calls fatalError > 6. In fatalError Client calls Locator2.getXMLVersion and/or getEncoding. > 7. Parser calls endDocument. > > What is the Locator2 supposed to return? > > It's to avoid this problem, that Parsifal is not calling either > setDocumentLocator or startDocument until the declaration (or lack > thereof) has been successfully parsed. > > Possibly, there's a loophole, however The Locator JavaDoc says: > > Note that the results returned by the object will be valid only > during the scope of each callback method: the application will > receive unpredictable results if it attempts to use the locator at > any other time, or after parsing completes. > > I think it's reasonable to say that parsing was completed before > fatalError was called; not completed successfully of course, but > definitely completed. Therefore, there's no place in this chain where > the Locator methods can be expected to return correct information. Actually, I am not sure that one can say that formally, but it doesn't matter, IMO. > Therefore Parsifal (and other SAX parsers) should indeed call > startDocument, fatalError, and endDocument in that order when they > encounter an error very early in the XML document. If the client uses > the Locator2 object at any point in this process, they deserve what > they get. I agree. If the error says that the very information you want could not be retrieved, well, then you have to take that into account. 
> Possible problem with this chain of reasoning: in more normal > circumstances it's very useful to use a Locator in fatalError() to > get the line and column number where the error appears. This logic > would forbid such use. I would not say so. You just have to take the type of error into account. Why should Locator be useful when the error says that it can't be? > Hmm, still another tricky bit, this one not even requiring an error: > > 1. Parser opens stream. > 2. Parser calls setDocumentLocator and passes in a Locator2. > 3. Parser calls startDocument. startDocument returns. > 4. Another thread uses the Locator2 object before the parser has > parsed the XML declaration. > > What do getXMLVersion and getEncoding return? You are only allowed to use Locator in a call-back, I think. How can the other thread be in a call-back? Is the parser supposed to be thread-safe? > I think the best solution is to make a minor fix to the Locator2 > JavaDocs that allow (indeed require) these methods to return null at > points where the version and encoding are not known. Sure, that sounds like a good requirement in any case. Karl |