saxdotnet-devel Mailing List for SAX for .NET

I think we have everything at a release level
except for the Conformance demo.
Anyone interested should check out current CVS.

Karl

Jeff Rafter wrote:

>> However, I think I mentioned that RPC is not a common SAX use case.
> 
> 
> I agree, it is not common. I think that when we have that documentation 
> around end document, if we include information about this, we need to 
> explain it clearly (as you did above). Also, I think we need to be 
> explicit about what happens in the case of a user generated exception in 
> a callback (i.e., EndDocument is *still* called).

This is what I have in CVS currently as doc for EndDocument():

     /// <summary>See <see href="http://www.saxproject.org/apidoc/org/xml/sax/ContentHandler.html#endDocument()">
     /// ContentHandler.endDocument</see> on www.saxproject.org.</summary>
     /// <remarks>Differences to Java:
     /// <list type="bullet">
     ///   <item>Stricter about when to call: <c>EndDocument</c> <b>must</b> be called by the
     ///     SAX event producer exactly once as the last event in a SAX event stream initiated
     ///     by a <see cref="IContentHandler.StartDocument"/> call, regardless of any exceptional
     ///     or error situation encountered. Depending on the call communication mechanism, however,
     ///     this is no guarantee that the SAX event consumer will also receive that call.</item>
     /// </list></remarks>
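
A minimal sketch of how an in-process event producer might honour this rule. It assumes a cut-down, hypothetical handler interface (IMiniContentHandler is an illustration, not the real IContentHandler), and it only covers the in-process case; it says nothing about RPC transports.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, cut-down stand-in for IContentHandler (an assumption for this sketch).
public interface IMiniContentHandler
{
    void StartDocument();
    void Characters(string text);
    void EndDocument();
}

public static class MiniProducer
{
    // Emits one Characters event per chunk. Once StartDocument has been called,
    // EndDocument is always the last event, even if a callback throws.
    public static void Produce(IEnumerable<string> chunks, IMiniContentHandler handler)
    {
        handler.StartDocument();
        try
        {
            foreach (string chunk in chunks)
                handler.Characters(chunk);
        }
        finally
        {
            handler.EndDocument();
        }
    }
}

// Records events and throws on the chunk "boom" to simulate a user exception in a callback.
public sealed class RecordingHandler : IMiniContentHandler
{
    public readonly List<string> Events = new List<string>();

    public void StartDocument() { Events.Add("StartDocument"); }

    public void Characters(string text)
    {
        if (text == "boom") throw new InvalidOperationException("user error in callback");
        Events.Add("Characters:" + text);
    }

    public void EndDocument() { Events.Add("EndDocument"); }
}

public static class EndDocumentDemo
{
    public static void Main()
    {
        var handler = new RecordingHandler();
        try { MiniProducer.Produce(new[] { "a", "boom", "b" }, handler); }
        catch (InvalidOperationException) { /* the exception still reaches the caller */ }

        // Prints: StartDocument, Characters:a, EndDocument
        Console.WriteLine(string.Join(", ", handler.Events));
    }
}
```

The try/finally keeps the guarantee local to the producer; the user exception is not swallowed, it simply propagates after EndDocument has been delivered.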

>> I am not sure I understand you fully.
> 
> 
> Well, I was imagining the case where there was a two-gig element name, 
> or two-gig (count) of attributes on an element. We cannot pass the 
> information back clearly... maybe a specialized exception, or something 
> similar to an API FatalError would be useful at that point... so that 
> the cause is clearly identified. Again, this is not something for this 
> release. I just feel as though there are things that we could do to fix 
> this...

Normally, overflow situations are already handled by the runtime system.
What is it we should do over and above that?

Karl

> What I mean is that in a scenario like DCOM (or some other RPC mechanism)
> one cannot guarantee that the call reaches the target, since RPC
> mechanisms may have communication failures. You may guarantee that
> the call is made (i.e. that it originates), but not that it arrives.
> So in such a situation the callee cannot rely on the EndDocument() 
> callback.

> However, I think I mentioned that RPC is not a common SAX use case.

I agree, it is not common. I think that when we have that documentation 
around end document, if we include information about this, we need to 
explain it clearly (as you did above). Also, I think we need to be 
explicit about what happens in the case of a user generated exception in 
a callback (i.e., EndDocument is *still* called).

> A feature called reader-control has been added. It is obviously a read-only
> feature.

Excellent.

> I am not sure I understand you fully.

Well, I was imagining the case where there was a two-gig element name, 
or two-gig (count) of attributes on an element. We cannot pass the 
information back clearly... maybe a specialized exception, or something 
similar to an API FatalError would be useful at that point... so that 
the cause is clearly identified. Again, this is not something for this 
release. I just feel as though there are things that we could do to fix 
this...

> Very good indeed!

Yes quite. I also took Elliotte's advice and looked through the cvs on 
the sax.sf.net site for changes to documentation. There were no changes 
regarding the issues we discussed (as near as I can tell) since last 
April. So we have been working with the latest documentation base...

Cheers,
Jeff

Jeff Rafter wrote:

>>  D) Requiring EndDocument
> 
> 
>>  However, this can be guaranteed only for in-process call-backs.
> 
> 
> Can you explain this a little more? Is this the point about non SAX 
> generated Exceptions (i.e., user throws an exception in a callback)? 
> Otherwise, agreed.

What I mean is that in a scenario like DCOM (or some other RPC mechanism)
one cannot guarantee that the call reaches the target, since RPC
mechanisms may have communication failures. You may guarantee that
the call is made (i.e. that it originates), but not that it arrives.
So in such a situation the callee cannot rely on the EndDocument() callback.
However, I think I mentioned that RPC is not a common SAX use case.

>>  F) Merging the IXmlReaderControl with IXmlReader
> 
> 
> I agree with Elliotte's opposition here. But at the time we were talking 
> about adding a lot more exceptional circumstances. Since that time, we 
> have scaled back that approach quite a bit. So I think that we can merge 
> still. We also wanted to have an accompanying mechanism to check support 
> without raising the exception (a feature?). But I think this is 
> manageable. So I think the 2:1 is accurate.

A feature called reader-control has been added. It is obviously a read-only
feature.

> 
>>  G) SAX holes
>>
>>  - Max string length: String.Length is defined as int,
>>     not much we can do here.
> 
> 
> The only thing we could do is come up with some mechanism for (a) 
> passing the stream info in a callback, (b) allowing subsequent calls on 
> all callbacks for the items to be aggregated (c) raise an exception in 
> the event that something exceeds the limit.

I am not sure I understand you fully.

>>  I) Re-define what "parsing" means in SAX.
>>  allow for "optional" argument or return values.
> 
> 
> Agreed, I will work on the new wording.

Thanks!

> 
> So based on that we have very little left undecided... just a little 
> more work to do...

Very good indeed!

Karl

Another review of the current status for the various threads:
(only 3 participants so far)
When I mark an item/solution as accepted, please speak up
if you have objections.

  A) "Unifying" the core and extension interfaces:

  No opposition, seems accepted.

  B) Changing Exceptions in the API

  SaxNotSupportedException and SaxNotRecognizedException have been
  removed in favour of the built-in .NET exceptions ArgumentException
  and NotSupportedException. I assume this is accepted.

  C) Reducing the versioning / eliminating the Ixxx2 interfaces

  When unifying IAttributes and IAttributes2, the IsDeclared()
  methods have been removed in favour of an additional return value
  for GetType(). This value is UNDECLARED.

  Also, GetType() differentiates between NMTOKEN and ENUMERATION,
  adding the latter as another new return value.

  I assume this is accepted as well.

  D) Requiring EndDocument

  For the purpose of enabling content handler chains
  (filter pipelines), we should require that EndDocument()
  is always called when StartDocument() was called.
  However, this can be guaranteed only for in-process call-backs.
  Documentation has been updated, and I assume this is accepted.

  E) StartElement, when URI is not present

  We decided to follow what the standard .NET libraries expect
  as input for a URI, if the URI is supposed to be absent.
  It was found that many of them expect an empty string and not null.
  So, the solution proposed is to follow the Java API and require
  that a namespace URI has the value "" when a qualified name
  is not in any namespace. This applies to IAttributes.GetUri()
  and the URI arguments passed to Start/EndElement().
  As a similar consequence we require that the prefix argument
  passed to Start/EndPrefixMapping is "" for the default namespace.
  For the rest of the API, absence of a string parameter will still
  be indicated through a null reference. This is especially necessary
  for IDeclHandler.AttributeDecl(), where the value parameter has
  a defined meaning for "".
  Assuming acceptance.

  F) Merging the IXmlReaderControl with IXmlReader

  Some opposition from Elliotte, that is, 2:1 in favour of adding
  Suspend/Resume/Abort() to IXmlReader. I propose we go with the merge
  and make these methods optional (allowed to throw NotSupportedException).
  However, the IXmlReader.Status property will be non-optional,
  as it is easy to implement. Still open, but I hope acceptance is not far.

  G) SAX holes

  - Max string length: String.Length is defined as int,
     not much we can do here.
  - Skipped entities in attribute values: possibility of
     marking spot in attribute value (at user option).
  I suggest we do nothing for this release, as the API
  changes might be invasive. Assuming acceptance.

  H) Replace is-standalone feature with ILocator member

  The is-standalone feature was removed and a new property was added
  to ILocator called EntityType. It is an enumeration of this type:

   public enum ParsedEntityType
   {
     /// <summary>Document entity without specified value for the standalone flag.</summary>
     Document,
     /// <summary>Document entity with standalone="no".</summary>
     NotStandalone,
     /// <summary>Document entity with standalone="yes".</summary>
     Standalone,
     /// <summary>External general entity.</summary>
     General,
     /// <summary>External parameter entity.</summary>
     Parameter
   }

   No discussion yet (but off-list agreement between Jeff and Karl).
   We might want to add a value of "Other" or "Unknown" for those
   cases where we are not actually parsing an XML document.

  I) Re-define what "parsing" means in SAX.

  This is a new issue that came up while discussing the fact that
  certain API members and/or names only make sense when parsing an actual
  XML document (ParseError, SaxParseException, IXmlReader.Parse(),
  ILocator.PublicId, ...). This puts unnecessary limitations on SAX event
  generation not based on parsing a document. We have two choices, rename
  and re-design part of the SAX API, or re-define the meaning of parsing.
  Luckily, it seems that those API members that force a document interpretation
  allow for "optional" argument or return values.

  So, as a starting point this was proposed:

    "For the purpose of the SAX API we define "parsing" as
    generating a sequence of call-backs on SAX handler interfaces
    that represent a well-formed XML document. This applies even
    if no actual XML document is being processed."

  Jeff thinks it needs some refinement, and I asked him to refine ...
  I also think we should add this:

    Whenever the term "document" is used in SAX, it should be
    interpreted more generally as "source of well-formed SAX events".

  This would for instance make it unnecessary to add a new member
  (see above) to ParsedEntityType.

  SUMMARY

  So, we have F, H and I as open issues, but I think we are getting close.
  If anyone sees new issues, please come forward.

Karl

Jeff Rafter wrote:
>> One of the things Elliotte mentioned was passing the URI to
>> other libraries, instead of checking it. That is what D.Megginson 
>> probably
>> meant with convenience. So, which libs in .NET would one use?
>> And what do they accept? It would be funny if after all the
>> discussion they accepted null for an URI.

I just checked a few: Uri constructor accepts an empty relative Uri,
but not null. XmlQualifiedName has an empty string for "no namespace".
So I guess it's settled, prefix and Uri are empty strings in
StartElementHandler and IAttributes when no namespace is present.
Incorrect, but convenient.

> Elliotte also mentioned that he didn't like to include if (uri == null) 
> everywhere-- and that knowing the uri would never be null saved him from 
> the clutter.

Yes, but that only applies if he can just pass it on
and let someone else make the check. If some of *his* code
needs to be executed depending on whether the name is
in a namespace or not, then he needs to check anyway,
and it makes no difference whether you check against null or "".

Just imagine if all those framework classes above would accept
null instead. Our decision would certainly be different.

>> I am hesitant to dictate my opinion here. In the end, anything is 
>> workable.
>> Unfortunately not too much feedback.
> 
> 
> So you are a benevolent dictator... I like that...
> 
>> If we go with String.Empty for URIs and prefixes, then I would suggest
>> they should be the only such case, as then we would be pretty much in 
>> agreement
>> with the original Java API, and the work effort to change all other such
>> API cases and their implementations could be quite inconvenient.
> 
> 
> 100% agree...

OK.
> 
>>> In any event-- we need to decide, we need to document the decision 
>>> very clearly and we need to make sure our SAX conformance application 
>>> checks for all of the appropriate cases. Which may involve adding a 
>>> secondary test suite.
>>
>>
>> Yes, definitely.
> 
> 
> As a side note-- the Java SAX Conformance suite relies on the fact that 
> the URI will not be null (as a side effect). Anything that does not do 
> so is not "SAX Conformant" according to Elliotte's suite. Now, I think 
> that is wrong because it is not legislated-- but every parser he tested 
> is conformant on that point... meaning they all pass string.empty and 
> not null.
> 
> ===============
> The XmlReader.NamespaceURI has this:
> 
>   Property Value
>   The namespace URI of the current node; otherwise an empty string.
> 
>   Remarks
>   This property is relevant to Element and Attribute nodes only.
> 
> In XmlDocument.CreateElement:
> 
>   namespaceURI
>   The namespace URI of the new element (if any). String.Empty and a
>   null reference (Nothing in Visual Basic) are equivalent.
> 
> In XmlElement.NamespaceURI:
> 
>   The namespace URI of this node. If there is no namespace URI, this
>   property returns String.Empty.
> 
> And finally, XmlNamespaceManager.AddNamespace Method throws an 
> ArgumentNullException in the case:
> 
>   The value for prefix or uri is a null reference (Nothing in Visual
>   Basic).
> ===========
> 
> This all seems pretty compelling... :)

Yes, as I said - convenience interacting with the libraries.
It would be interesting if these are new libs adjusting to
a precedent set by SAX originally?

Karl

> One of the things Elliotte mentioned was passing the URI to
> other libraries, instead of checking it. That is what D.Megginson probably
> meant with convenience. So, which libs in .NET would one use?
> And what do they accept? It would be funny if after all the
> discussion they accepted null for an URI.

Elliotte also mentioned that he didn't like to include if (uri == null) 
everywhere-- and that knowing the uri would never be null saved him from 
the clutter.

> I am hesitant to dictate my opinion here. In the end, anything is workable.
> Unfortunately not too much feedback.

So you are a benevolent dictator... I like that...

> If we go with String.Empty for URIs and prefixes, then I would suggest
> they should be the only such case, as then we would be pretty much in 
> agreement
> with the original Java API, and the work effort to change all other such
> API cases and their implementations could be quite inconvenient.

100% agree...

>> In any event-- we need to decide, we need to document the decision 
>> very clearly and we need to make sure our SAX conformance application 
>> checks for all of the appropriate cases. Which may involve adding a 
>> secondary test suite.
> 
> Yes, definitely.

As a side note-- the Java SAX Conformance suite relies on the fact that 
the URI will not be null (as a side effect). Anything that does not do 
so is not "SAX Conformant" according to Elliotte's suite. Now, I think 
that is wrong because it is not legislated-- but every parser he tested 
is conformant on that point... meaning they all pass string.empty and 
not null.

===============
The XmlReader.NamespaceURI has this:

   Property Value
   The namespace URI of the current node; otherwise an empty string.

   Remarks
   This property is relevant to Element and Attribute nodes only.

In XmlDocument.CreateElement:

   namespaceURI
   The namespace URI of the new element (if any). String.Empty and a
   null reference (Nothing in Visual Basic) are equivalent.

In XmlElement.NamespaceURI:

   The namespace URI of this node. If there is no namespace URI, this
   property returns String.Empty.

And finally, XmlNamespaceManager.AddNamespace Method throws an 
ArgumentNullException in the case:

   The value for prefix or uri is a null reference (Nothing in Visual
   Basic).
===========

This all seems pretty compelling... :)
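
The library behaviour quoted above can be checked with a short program. This is only a sketch using the System.Xml classes named in the excerpts; the class and method names of the demo itself (NamespaceConventionDemo, RootNamespace, AddNamespaceRejectsNullUri) are made up for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

public static class NamespaceConventionDemo
{
    // Returns the namespace URI that XmlReader reports for the root element.
    public static string RootNamespace(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();
            return reader.NamespaceURI;
        }
    }

    // Returns true if XmlNamespaceManager.AddNamespace rejects a null URI.
    public static bool AddNamespaceRejectsNullUri()
    {
        var manager = new XmlNamespaceManager(new NameTable());
        string uri = null;
        try
        {
            manager.AddNamespace("p", uri);
            return false;
        }
        catch (ArgumentNullException)
        {
            return true;
        }
    }

    public static void Main()
    {
        // No namespace: the reader reports an empty string, not null.
        Console.WriteLine(RootNamespace("<foo/>").Length);                // 0
        Console.WriteLine(RootNamespace("<foo xmlns='http://foo'/>"));    // http://foo

        // XmlQualifiedName also uses an empty string for "no namespace".
        Console.WriteLine(new XmlQualifiedName("foo").Namespace.Length);  // 0

        Console.WriteLine(AddNamespaceRejectsNullUri());                  // True
    }
}
```

A handler that receives "" for an absent namespace can therefore pass the value straight on to these classes without a null check.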

Jeff

Jeff Rafter wrote:
>> So, I simply don't see a contradiction.
> 
> 
> It seems that we are the only ones arguing about this. Elliotte seemed 
> to be in favor of string.empty as well and his last email on the subject 
> was very strong-- but he also added the caveat that he is a Java guy for 
> the sake of this discussion. 

When I asked David Megginson, he said the original reason was
for programming convenience.

> So if we took a vote:
> 
> ======================================
> 2 use string.empty for the URI param
> when xmlns="" or xmlns is not present.
> 
> 1 use null for the above
> ======================================
> 
> This would indicate that we would either need to change such params in 
> other callbacks or live with the inconsistency.

I guess we already have one inconsistency at our hands that
we cannot bypass: In the AttributeDecl() call-back, there are
defined meanings for value=="" and value==null, so we must allow both.

One of the things Elliotte mentioned was passing the URI to
other libraries, instead of checking it. That is what D.Megginson probably
meant with convenience. So, which libs in .NET would one use?
And what do they accept? It would be funny if after all the
discussion they accepted null for an URI.

> Now of course, you are project admin and SAX is historically a 
> dictatorship-- as the only other implementer I can tell you that I will 
> implement it however you decide.

I am hesitant to dictate my opinion here. In the end, anything is workable.
Unfortunately not too much feedback.

If we go with String.Empty for URIs and prefixes, then I would suggest
they should be the only such case, as then we would be pretty much in agreement
with the original Java API, and the work effort to change all other such
API cases and their implementations could be quite inconvenient.

> In any event-- we need to decide, we need to document the decision very 
> clearly and we need to make sure our SAX conformance application checks 
> for all of the appropriate cases. Which may involve adding a secondary 
> test suite.

Yes, definitely.

Karl

Jeff Rafter wrote:
>> For the purpose of the SAX API we define "parsing" as
>> generating a sequence of call-backs on SAX handler interfaces
>> that represent a well-formed XML document. This applies even
>> if no actual XML document is being processed.
> 
> 
> I can agree with that-- it would be good to couch it in such wording 
> though with the statement that we know what SAX Processing is and the 
> difference between an XML Processor and Application [1] proper are... 
> also, it should be noted (maybe in an example) that in the case of 
> something like a CSV parser that generates SAX events line number and 
> column number are still useful concepts. When not useful they should be 
> -1 and the entity should be (dare I say it?) null...
> 
> [1] http://www.w3.org/TR/REC-xml/#dt-xml-proc

Would you mind making such additions/corrections?
You already seem to know what should be added.
I can do the CVS stuff, to save you time, as there are already
lots of changes in CVS that you would have to check out first.

:-)

Karl

> For the purpose of the SAX API we define "parsing" as
> generating a sequence of call-backs on SAX handler interfaces
> that represent a well-formed XML document. This applies even
> if no actual XML document is being processed.

I can agree with that-- it would be good to couch it in such wording 
though with the statement that we know what SAX Processing is and the 
difference between an XML Processor and Application [1] proper are... 
also, it should be noted (maybe in an example) that in the case of 
something like a CSV parser that generates SAX events line number and 
column number are still useful concepts. When not useful they should be 
-1 and the entity should be (dare I say it?) null...

[1] http://www.w3.org/TR/REC-xml/#dt-xml-proc
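
To make the CSV case concrete, here is a minimal sketch of a non-XML source that "parses" in the sense defined above. ISimpleHandler is a hypothetical, simplified handler invented for this example (no attributes, no locator); the real IContentHandler has more members. Namespace URIs are passed as "" since the names are not in any namespace.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical, simplified handler (an assumption for this sketch).
public interface ISimpleHandler
{
    void StartDocument();
    void StartElement(string uri, string localName);
    void Characters(string text);
    void EndElement(string uri, string localName);
    void EndDocument();
}

public static class CsvEventSource
{
    // Emits a well-formed event sequence of the shape
    // <rows><row><cell>..</cell>..</row>..</rows>; no XML text is ever read.
    public static void Parse(IEnumerable<string> lines, ISimpleHandler h)
    {
        h.StartDocument();
        try
        {
            h.StartElement("", "rows");
            foreach (string line in lines)
            {
                h.StartElement("", "row");
                foreach (string cell in line.Split(','))
                {
                    h.StartElement("", "cell");
                    h.Characters(cell);
                    h.EndElement("", "cell");
                }
                h.EndElement("", "row");
            }
            h.EndElement("", "rows");
        }
        finally
        {
            h.EndDocument();
        }
    }
}

// Renders the events as XML-like text so the sequence can be inspected.
// Character data is not escaped; this is for demonstration only.
public sealed class TextHandler : ISimpleHandler
{
    public readonly StringBuilder Output = new StringBuilder();

    public void StartDocument() { }
    public void StartElement(string uri, string localName) { Output.Append("<" + localName + ">"); }
    public void Characters(string text) { Output.Append(text); }
    public void EndElement(string uri, string localName) { Output.Append("</" + localName + ">"); }
    public void EndDocument() { }
}

public static class CsvDemo
{
    public static void Main()
    {
        var handler = new TextHandler();
        CsvEventSource.Parse(new[] { "a,b", "c" }, handler);
        // Prints: <rows><row><cell>a</cell><cell>b</cell></row><row><cell>c</cell></row></rows>
        Console.WriteLine(handler.Output.ToString());
    }
}
```

A fuller version could also report the CSV line and column through a locator, which is where the -1 convention for "not available" would apply.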

Otherwise, sounds good.

Jeff

Jeff Rafter wrote:
>> I must have missed this. Could you explain it again?
> 
> 
> Something like this could be placed in your object hierarchy. One could 
> also very easily create a tee-like observer pattern and some 
> ExceptionHandler class. This could be designed into a subclass 
> ContentHandler and used from within callbacks
> 
> public class FooContentHandler : ContentHandler {
> 
>   public ExceptionHandler exceptionHandler;
> 
>   public void startElement(...) {
>     // some code happens, we need to throw an exception
>     exceptionHandler.Handle(new FooException());
>   }
> }
> 
> public class ExceptionHandler {
> 
>   // may throw SAXException
>   public void Handle(Exception e) {
>     if (....) {
>       // nothing is thrown in this case
>     } else {
>       throw new SAXException(e);
>     }
>   }
> }

Yes, this is definitely possible. Although I like the name
of ErrorHandler better, since passing error information does
not have to be based on using Exception objects.

This was already discussed (sort of) on xml-dev,
but I really would prefer to use the existing API.
The API above would be similar to any (proprietary)
way of exchanging information between those content
handlers that you have control over, and your application.

The question is, do we need to standardize on this in SAX?
As far as all the other interfaces are concerned, they form
a contract for an IXmlReader implementation, so standardizing
them is good.

But I don't think we should make too many restrictions on
how content handler implementations and the application communicate.
This could not even be tested for conformance.

What about a validating IXmlFilter implementation?
Well, it must conform to the IXmlReader contract, and therefore
should call back on IErrorHandler.

> Now these are just some random ideas thrown out after a long night in 
> the rain so I could be way off. Also, in the back of my mind I am 
> wondering about how Java handles that exception class if not actually 
> thrown. 

I guess the GC disposes of them.

> On top of that I saw in some of your examples that you handled a 
> FileNotFound exception in a slightly different way (it seems that IO 
> exceptions would need a slightly different pattern anyway because they 
> are not treated strictly as SAXParseExceptions to begin with). But all 
> of this leads me to think that you can have your cake and eat it too...
> 
>> It would be a simple documentation change.
> 
> 
> I like simple...

What do you think of this:

For the purpose of the SAX API we define "parsing" as
generating a sequence of call-backs on SAX handler interfaces
that represent a well-formed XML document. This applies even
if no actual XML document is being processed.

Karl

> So, I simply don't see a contradiction.

It seems that we are the only ones arguing about this. Elliotte seemed 
to be in favor of string.empty as well and his last email on the subject 
was very strong-- but he also added the caveat that he is a Java guy for 
the sake of this discussion. So if we took a vote:

======================================
2 use string.empty for the URI param
when xmlns="" or xmlns is not present.

1 use null for the above
======================================

This would indicate that we would either need to change such params in 
other callbacks or live with the inconsistency.

Now of course, you are project admin and SAX is historically a 
dictatorship-- as the only other implementer I can tell you that I will 
implement it however you decide.

In any event-- we need to decide, we need to document the decision very 
clearly and we need to make sure our SAX conformance application checks 
for all of the appropriate cases. Which may involve adding a secondary 
test suite.

Cheers,
Jeff

Jeff Rafter wrote:
>> We really only have two cases: foo has a namespace, or it doesn't.
>>
>> Or did you mean something else?
> 
> 
> <snip/>
> 
>>> <foo bar="">
>>> Would the bar attribute's value be null or string.empty?
>>
>> String.Empty. null would mean: no value.
>>
> 
> This is the contradiction I am referring to... technically the value of 
> bar is empty which is more or less null. But we make the distinction 
> because it is helpful to know that even though there is no value, the 
> attribute bar is present. 

I am not sure if so documented, but if an attribute is passed
through SAX (IAttributes), it can never have a null value argument,
simply because an attribute without value is nonsense.
null means "no value" (see below).

> This is the same as the xmlns="" declarations.
> Technically the value means null-- literally it is empty-- and it is 
> helpful to be able to distinguish between the two (at least in editor 
> applications)...

I don't quite agree - null means "no value". There is no null value
for strings. null is a value for references/pointers. The string
arguments in SAX are not strings, but string references, and as such
they can be null, meaning they do not point to any string object.

This is also one of the reasons why namespace URI references are not
allowed to be empty strings in XML, even though URI references in general
are allowed to, and do have a specific meaning assigned to an empty string.
Otherwise one could not express "absence" or "removal" of namespaces
in the serialized format, because XML is a text format - everything must
be expressed as text. xmlns="" can be interpreted on two levels:
1) an attribute with name xmlns and value "".
2) an expression of the fact that the default namespace is turned off.
The concept of null comes into play at a later stage, and that is
when the parser wants to pass a name to the app. How does it express
that this name has/does not have a namespace? If the uri argument,
which is a string reference, points to nothing (==null) then we don't
have an uri string object, and therefore no namespace.

So, I simply don't see a contradiction.

Karl

> We really only have two cases: foo has a namespace, or it doesn't.
> 
> Or did you mean something else?

<snip/>

>> <foo bar="">
>> Would the bar attribute's value be null or string.empty?
> String.Empty. null would mean: no value.
> 

This is the contradiction I am referring to... technically the value of 
bar is empty which is more or less null. But we make the distinction 
because it is helpful to know that even though there is no value, the 
attribute bar is present. This is the same as the xmlns="" declarations. 
Technically the value means null-- literally it is empty-- and it is 
helpful to be able to distinguish between the two (at least in editor 
applications)...

Jeff

> I must have missed this. Could you explain it again?

Something like this could be placed in your object hierarchy. One could 
also very easily create a tee-like observer pattern and some 
ExceptionHandler class. This could be designed into a subclass 
ContentHandler and used from within callbacks

public class FooContentHandler : ContentHandler {

   public ExceptionHandler exceptionHandler;

   public void startElement(...) {
     // some code happens, we need to throw an exception
     exceptionHandler.Handle(new FooException());
   }
}

public class ExceptionHandler {

   // may throw SAXException
   public void Handle(Exception e) {
     if (....) {
       // nothing is thrown in this case
     } else {
       throw new SAXException(e);
     }
   }
}

Now these are just some random ideas thrown out after a long night in 
the rain so I could be way off. Also, in the back of my mind I am 
wondering about how Java handles that exception class if not actually 
thrown. On top of that I saw in some of your examples that you handled a 
FileNotFound exception in a slightly different way (it seems that IO 
exceptions would need a slightly different pattern anyway because they 
are not treated strictly as SAXParseExceptions to begin with). But all 
of this leads me to think that you can have your cake and eat it too...

> It would be a simple documentation change.

I like simple...

Jeff

Jeff Rafter wrote:
> Karl,
> 
> What did you think about my idea for a supplemental Exception handler 
> interface? This could be grafted on without much change to the 
> fundamentals of the API.

I must have missed this. Could you explain it again?

> Otherwise, might I suggest making ParseError 
> descend from SAXError and making SAXError more generic. Then you could 
> simply use "is" or a cast to get the more complete parse error 
> information when applicable.

If you look at SAX more closely, a lot of stuff that
now is based on the assumption of "parsing an XML document"
could be re-defined more generically in terms of a well-formed event sequence.
That could be a lot of work.

For instance: ParseError.Throw() has this code:
   throw new SaxParseException(this).
So, if we change to SAXError, we should first make "Throw()"
a virtual method and add it to SAXError where it throws a SaxException.
Then we need to add a SAXError member to SaxException and a corresponding
constructor. We have to remove it from SAXParseException (as it already exists
in the base class) and change its constructor.
Then we need to override ParseError.Throw().
Then we have to modify the ParseErrorImpl class accordingly.
And then we still haven't covered all areas where "parsing"
should be generalized.
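
For illustration, the refactoring steps listed above might end up looking roughly like this. It is a sketch only: the member names (Message, LineNumber, Error) are assumptions, the real ParseError and SaxParseException carry more information, and ParseErrorImpl is left out.

```csharp
using System;

// Proposed generic base class for error information.
public class SaxError
{
    public string Message { get; private set; }

    public SaxError(string message) { Message = message; }

    // Virtual, so that a subclass can throw a more specific exception.
    public virtual void Throw() { throw new SaxException(this); }
}

// Parse-specific error; adds (optional) location information.
public class ParseError : SaxError
{
    public int LineNumber { get; private set; }   // -1 when not available

    public ParseError(string message, int lineNumber) : base(message)
    {
        LineNumber = lineNumber;
    }

    public override void Throw() { throw new SaxParseException(this); }
}

// The error member now lives in the base exception class.
public class SaxException : Exception
{
    public SaxError Error { get; private set; }

    public SaxException(SaxError error) : base(error.Message) { Error = error; }
}

// No longer needs its own error member; it only narrows the constructor argument.
public class SaxParseException : SaxException
{
    public SaxParseException(ParseError error) : base(error) { }
}

public static class SaxErrorDemo
{
    public static void Main()
    {
        try
        {
            new ParseError("unexpected token", 3).Throw();
        }
        catch (SaxException e)
        {
            // A cast recovers the parse-specific details when they exist.
            ParseError parseError = e.Error as ParseError;
            int line = parseError != null ? parseError.LineNumber : -1;
            // Prints: SaxParseException at line 3
            Console.WriteLine(e.GetType().Name + " at line " + line);
        }
    }
}
```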

Yes, I could introduce a SAXError class, but what if we
just document that "parsing" really does not mean that
there has to be an underlying XML document? Any well-formed
event stream is "parsing".
Then we could leave ParseError and even SaxParseException as is.
The Locator related fields in ParseError (and SaxParseException)
are already optional (they can return null or -1), even in Java.

It would be a simple documentation change.

Karl

Jeff Rafter wrote:
>> One could think of this as a guideline:
>>
>> - If we would say: this string parameter/argument/value
>>   can be absent, then let's use null to indicate it.
>> - If we would rather say: this string parameter/argument/value
>>   can be empty, then let's use "" to indicate it.
>>
>> Coming back to the SAX API:
>> How would the above guideline be resolved for namespace URIs
>> and prefixes when an XML name is not in any namespace?
> 
> 
> I think that these are good guidelines-- and tough to argue with... but 
> for namespace URIs I think there is some ambiguity still...
> 
> a) <foo/>
> b) <foo xmlns="http://foo"/>
> c) <foo xmlns=""/>
> 
> Most naturally I would see this as
> 
> a) null
> b) "http://foo"
> c) string.empty

Are you asking what to pass for the namespace uri of foo
in the StartElementHandler()?

a) don't know, depends if there is a default namespace
b) "http://foo"
c) null (there is no namespace for foo)

We really only have two cases: foo has a namespace, or it doesn't.

Or did you mean something else?

> But I could see this making for needlessly complex handler code. Which 
> is why we want to land on either null or string.empty. string.empty 
> gives you less chance of a runtime exception but does not represent case 
> (a) very well. Using null does not represent (c) very well. I think that 
> in the XML Corpus this is one of the few areas where "" has a specific 
> meaning. You brought up arguments about API consistency and should we 
> use string.empty if no Public ID is provided (for instance)... but what 
> about the case where you have:
> 
> <foo bar="">
> 
> Would the bar attribute's value be null or string.empty?

String.Empty. null would mean: no value.

Karl

Trying to get the empty string vs. null discussion restarted.

Karl Waclawek wrote:

> 
> I had another look at the Java API.
> It seems it is quite inconsistent with respect to empty string vs. null.
> 
> Examples:
> 
> - In EntityResolver, all occurrences of publicId or baseUri
>   are supposed to be null, when absent.
> 
> - DTDHandler.notationDecl(): publicId, systemId can be null,
>   if not provided
> 
> - DTDHandler.unparsedEntityDecl: publicId can be null if not provided
> 
> - DeclHandler.attributeDecl(): mode and value can be null, where
>   in the latter there is actually a semantic difference between
>   value = null (meaning: none defined) and value = empty string
>   (meaning: a value is specified, and it is the empty string).
> 
> - DeclHandler.externalEntityDecl: publicId can be null, if not provided
> 
> - LexicalHandler.startDTD(): publicId, systemId can be null,
>   if not declared
> 
> It actually seems that the prevalent approach is to use null for absence 
> of a string parameter, and only in the case of namespaces does the API
> stray from this rule.
> 
> However, I would strongly suggest that we remain consistent in
> SAX for .NET. If we pick String.Empty, then we need to allow one
> inconsistency - and that is for the Value parameter passed to the
> attributeDecl() call-back.

Originally, this thread was about namespace URI references
and prefixes. The question was whether they should be passed
as empty strings or as null references when absent/not applicable.

However, whatever the outcome of this discussion, it should be applicable
to the whole API, but not as a general rule of either always null
or always empty. The reason is that in the case of DeclHandler.AttributeDecl(),
the "value" parameter has well-defined meanings for both the empty string
and null, so both must be allowed depending on the intended semantics.

Maybe we should first ask when such a problem of deciding
between an empty string and null actually exists.
I would say that is the case when the meaning of both
is "roughly" the same: an undefined or absent string value.

Two examples where I would intuitively make different decisions:

1) ParseError.Message, or SAXParseException.Message:
    In this case I would always assume that there is a message,
    even if it is an empty string. It looks wrong to me to
    even allow a null value.

2) Locator.PublicId: If there is no public identifier, then
    one should make its absence clear. null is better at that.

One could think of this as a guideline:

- If we would say: this string parameter/argument/value
   can be absent, then let's use null to indicate it.
- If we would rather say: this string parameter/argument/value
   can be empty, then let's use "" to indicate it.

Coming back to the SAX API:
How would the above guideline be resolved for namespace URIs
and prefixes when an XML name is not in any namespace?
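
To make the guideline concrete for the one place where both values carry
meaning, here is a small sketch against the Java DeclHandler interface.
The class AttributeDefaultDemo and its describe() helper are invented for
this illustration, and the handler is called by hand rather than by a
parser, so the argument values below are assumptions about what a parser
would pass.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ext.DeclHandler;

// Hypothetical consumer that keeps "absent" (null) and "empty" ("") apart.
final class AttributeDefaultDemo implements DeclHandler {

    final List<String> seen = new ArrayList<>();

    /** null means: no default declared; "" means: the default is the empty string. */
    static String describe(String value) {
        if (value == null) {
            return "no default declared";
        }
        if (value.isEmpty()) {
            return "default is the empty string";
        }
        return "default is \"" + value + "\"";
    }

    @Override
    public void attributeDecl(String eName, String aName, String type,
                              String mode, String value) {
        seen.add(eName + "/@" + aName + ": " + describe(value));
    }

    // The remaining DeclHandler call-backs are not needed for this sketch.
    @Override public void elementDecl(String name, String model) { }
    @Override public void internalEntityDecl(String name, String value) { }
    @Override public void externalEntityDecl(String name, String publicId, String systemId) { }

    public static void main(String[] args) {
        AttributeDefaultDemo handler = new AttributeDefaultDemo();
        // Hand-written calls standing in for <!ATTLIST foo a CDATA #IMPLIED b CDATA "">
        handler.attributeDecl("foo", "a", "CDATA", "#IMPLIED", null);
        handler.attributeDecl("foo", "b", "CDATA", null, "");
        handler.seen.forEach(System.out::println);
    }
}
```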

Karl

Reply to self:

Karl Waclawek wrote:
 > On xml-dev there is a thread called "[xml-dev] SAXException, checked, but why?".
 > The problem that came up is what to do when trying to pass recoverable
 > errors to the application? Alan Gutierrez thinks one is not allowed to use
 > error handler call-backs, but I think that maybe this is just a result
 > of underspecification. In any case, the objects passed to the error
 > call-backs are SAXParseException objects in Java, and ParseError objects
 > in SAX for .NET.
 >
 > I am thinking we should rename the class from ParseError to SAXError,
 > as errors from application code (as you might have in your content handlers)
 > are not really parse errors, and you should still be allowed to pass
 > them to the error handlers, to avoid throwing an exception.
 >
 > However, all members of ParseError (or SAXParseException ) are geared towards
 > errors that can occur when parsing an XML document (publicId, SystemId,
 > Line/ColumnNumber, etc.).

There is also the problem that SaxParseException has ParseError as member.
So, renaming ParseError gives cascading problems.

 > So, what to do?
 > Should we stick with the limitation that only parse errors can
 > be reported to the error handlers? Is that not rather limiting?

Maybe the right approach is to simply document a broader definition
of "parsing". One could say that in the context of SAX,

  "parsing" denotes any form of generating a stream of
   ContentHandler/LexicalHandler events that correspond
   to a well-formed XML document.

With that definition, a "parse" error could also be
an error in the underlying event generation when it
is not XML document based.

So, with giving "parsing" a broader definition
we also give ParseError a broader meaning.
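
A rough sketch of what that broader reading could look like, written
against the Java API since it has the same shape. The event source here
is a list of "key=value" records rather than an XML document; the class
NonXmlSourceErrorDemo, its generate() method and the Collector handler
are all hypothetical. The location fields are simply left at null and -1,
which the existing API already permits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical event generator whose input is not an XML document.
final class NonXmlSourceErrorDemo {

    /** Collects recoverable errors instead of throwing. */
    static final class Collector extends DefaultHandler {
        final List<SAXParseException> errors = new ArrayList<>();

        @Override
        public void error(SAXParseException e) {
            errors.add(e);
        }
    }

    /**
     * Walks the records and returns how many could be turned into events.
     * Malformed records are reported as recoverable "parse" errors.
     */
    static int generate(List<String> records, ErrorHandler errorHandler) {
        int emitted = 0;
        try {
            for (String record : records) {
                if (record.indexOf('=') < 0) {
                    // No entity and no position: ids are null, line/column are -1.
                    errorHandler.error(new SAXParseException(
                        "record without '=': " + record, null, null, -1, -1));
                    continue; // recoverable, so keep going
                }
                emitted++; // a real generator would call the ContentHandler here
            }
        } catch (SAXException e) {
            throw new RuntimeException(e);
        }
        return emitted;
    }

    public static void main(String[] args) {
        Collector collector = new Collector();
        int emitted = generate(Arrays.asList("a=1", "oops", "b=2"), collector);
        System.out.println("emitted=" + emitted + " errors=" + collector.errors.size());
    }
}
```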

Karl

On xml-dev there is a thread called "[xml-dev] SAXException, checked, but why?".
The problem that came up is what to do when trying to pass recoverable
errors to the application? Alan Gutierrez thinks one is not allowed to use
error handler call-backs, but I think that maybe this is just a result
of underspecification. In any case, the objects passed to the error
call-backs are SAXParseException objects in Java, and ParseError objects
in SAX for .NET.

I am thinking we should rename the class from ParseError to SAXError,
as errors from application code (as you might have in your content handlers)
are not really parse errors, and you should still be allowed to pass
them to the error handlers, to avoid throwing an exception.

However, all members of ParseError (or SAXParseException ) are geared towards
errors that can occur when parsing an XML document (publicId, SystemId,
Line/ColumnNumber, etc.).

So, what to do?
Should we stick with the limitation that only parse errors can
be reported to the error handlers? Is that not rather limiting?

Karl

Jeff Rafter wrote:
>> In C# this seems less of an issue, as there are two
>> other null-safe options:
>> 1) static method: if (String.Equals(uri, "http://my.namespace.com")) 
>> {...}
>> 2) operator: if (uri == "http://my.namespace.com") {...}
> 
> 
> Admittedly this is pretty standard in C#, however it is plausible that 
> some Java coder will come to C# with limited training and just hack 
> away-- in which case they may not follow this pattern (or if someone is 
> porting a Java library for instance...)
> 
> In any event it can go either way and I will code to it... but I believe 
> that it should be very very clearly documented if we do not go with 
> string.empty. If I get a vote I still vote for string.empty-- though 
> your arguments have made me less zealous.

Would you vote for String.Empty in all cases, or just for namespace
URIs and prefixes? The former would make us even less conformant
with the Java specs.

> OFFTOPIC: sorry I have been out of the discussion for the past two 
> weeks-- I have been doing a lot of unexpected business travel...

You may have a few posts to reply to... :-)

Karl

Elliotte Harold wrote:
> Folks,
> 
> I just got an e-mail from David Megginson informing me that the JavaDocs 
> and some other docs at sax.sourceforge.net have been updated. 
> Apparently, he did not have access to the web site for some time, and 
> the documentation there was not up to date with the latest round of 
> revisions for SAX 2.0.2 that went on some months back for Java 1.5.
> 
> Anyway, this probably fixes at least some but not all of the 
> inconsistencies that have been noted here about what's null and what's 
> the empty string. I haven't checked in detail yet, but it's worth double 
> checking all of our assumptions and comments over the last month or so 
> against the latest docs.

I asked Dave Megginson and he does not remember why exactly
namespace URI string and prefixes are treated differently,
but he thinks it might have been programming convenience.

One xml-dev reference I found was by Tim Bray, arguing
that since namespace URI references are not allowed to
be empty strings (as per the namespace specs), passing
an empty string should indicate an absent URI ref and then
one can use the Equals method on it without worrying about
null. (RFC 2396 gives a different meaning to empty URI references).

In C# this seems less of an issue, as there are two
other null-safe options:
1) static method: if (String.Equals(uri, "http://my.namespace.com")) {...}
2) operator: if (uri == "http://my.namespace.com") {...}

Of which the latter would be the standard way to compare strings.
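
For contrast, a short sketch of the Java side, where the convenience
argument comes from: calling equals() on the received value fails when
that value is null, so Java code needs one of the null-safe forms. The
class and method names below are invented for this illustration.

```java
import java.util.Objects;

// Hypothetical demo of string comparison when the URI may be null.
final class NullSafeCompareDemo {

    static final String NS = "http://my.namespace.com";

    /** Null-safe: equals() is called on the constant. */
    static boolean literalFirst(String uri) {
        return NS.equals(uri);
    }

    /** Null-safe: Objects.equals() handles null on either side. */
    static boolean viaObjects(String uri) {
        return Objects.equals(uri, NS);
    }

    /** Not null-safe: throws NullPointerException when uri is null. */
    static boolean naive(String uri) {
        return uri.equals(NS);
    }

    /** Returns true if the naive form throws for a null URI. */
    static boolean naiveThrowsOnNull() {
        try {
            naive(null);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("literalFirst(null) = " + literalFirst(null));
        System.out.println("viaObjects(null)   = " + viaObjects(null));
        System.out.println("naive(null) throws = " + naiveThrowsOnNull());
    }
}
```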

Karl

Folks,

I just got an e-mail from David Megginson informing me that the JavaDocs 
and some other docs at sax.sourceforge.net have been updated. 
Apparently, he did not have access to the web site for some time, and 
the documentation there was not up to date with the latest round of 
revisions for SAX 2.0.2 that went on some months back for Java 1.5.

Anyway, this probably fixes at least some but not all of the 
inconsistencies that have been noted here about what's null and what's 
the empty string. I haven't checked in detail yet, but it's worth double 
checking all of our assumptions and comments over the last month or so 
against the latest docs.

-- 
Elliotte Rusty Harold  el...@me...
XML in a Nutshell 3rd Edition Just Published!
http://www.cafeconleche.org/books/xian3/
http://www.amazon.com/exec/obidos/ISBN=0596007647/cafeaulaitA/ref=nosim

Jeff Rafter wrote:
> This is actually something that is already in place and only affects 
> implementations, but we need to clarify when we require EndDocument to
> be called.
> 
> This is the proposal:
> 
> (a) EndDocument is required if StartDocument is called
> (b) Even when there is a FatalError
> (c) In the event of an encoding error, StartDocument should have been 
> called so EndDocument still needs to be called.
> (d) If there is an external exception (such as an IOException or a user 
> generated SaxException in a callback) EndDocument is still required.
> (e) EndDocument should be the final event in a SAX event stream.
> 
> Comments welcome,

Just a few general thoughts about this issue:

A) When is it really important for the SAX consumer to be able to rely
    on EndDocument() being called after StartDocument() has been called?

Clearly this is the case when the link between the SAX event generator
and the event consumer is restricted to the IContentHandler interface.
That is, the event generator has no access to, or control over, the
consumer other than through the IContentHandler call-backs.

This excludes the most common situation of calling a Parse() method
on an XML parser, as returning from the Parse() call (with or without
exception) is the equivalent of EndDocument(). The Parse() call is the
additional link.

It also excludes IXmlFilter chains, as with them the process is driven
from the end of the chain, again calling a Parse() method.

Where it applies would be a chain (pipeline) of ContentHandler
instances driven from the event generator, especially when they are
dynamically assembled.

B) Is it actually possible for the SAX event generator to guarantee
    to the consumer that it will receive the EndDocument() event?

The general answer would be no. The reason is that the connection
between event generator and consumer can cross process boundaries.
Especially remote call-backs simply cannot guarantee that EndDocument()
will be called on the consumer, but even inter-process calls on the same
machine are not fail-safe. That makes it impossible to fulfill the
EndDocument() contract on anything but in-process call-backs
(as long as the hardware has no problem, of course).

In practice, non-local SAX call-backs will rarely be used, but it is not
a completely unrealistic use case either.

C) So how would one deal with a scenario where the final call to
    EndDocument() is not guaranteed?

The cleanup routines normally called from EndDocument() should also
be callable from Finalize(), or any other "Reset()" kind of method.
So, once the consumer gets notified of the end of the event stream
(successful or not), the cleanup should proceed.

- For in-process consumers (.NET), driving the event generation:
   On return from Parse(), call EndDocument() (or the equivalent cleanup code)
   if it has not been called yet.

- For in-process consumers (.NET) not driving the event generation
   (downstream modules in a content handler chain):
   This depends on whether the consumer has knowledge of how it is going to
   be called. If it knows it will never be called across process boundaries,
   then it should assume that EndDocument() will be called. Otherwise it is the
   same case as for out-of-process consumers. See below.

- For out-of-process consumers:
   The "end-of-document" cleanup code should be put into a separate method
   callable from several points - from EndDocument() and from Finalize(), or
   any other "Reset()" kind of method. It is the responsibility of whatever
   controller/container operates the consumer to make sure it gets notified
   of any communication errors. The SAX event generator cannot make such
   guarantees.
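
The consumer side of this could be sketched as follows, using the Java
handler classes for illustration. CleanupOnceHandler and its reset()
method are hypothetical names; the point is only that the cleanup code
lives in one place, can be reached from more than one entry point, and
runs at most once per document. A finalizer is left out of the sketch.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical consumer whose end-of-document cleanup is safe to trigger twice.
final class CleanupOnceHandler extends DefaultHandler {

    private boolean open;   // true between startDocument() and cleanup
    int cleanups;           // how often the cleanup actually ran

    @Override
    public void startDocument() {
        open = true;
    }

    @Override
    public void endDocument() {
        cleanup();
    }

    /**
     * Called by whatever controls the consumer when it learns that the
     * event stream has ended without endDocument() arriving.
     */
    void reset() {
        cleanup();
    }

    private void cleanup() {
        if (!open) {
            return;         // already cleaned up, or no document in progress
        }
        open = false;
        cleanups++;         // release per-document resources here
    }

    public static void main(String[] args) {
        CleanupOnceHandler handler = new CleanupOnceHandler();
        handler.startDocument();
        handler.reset();        // e.g. after a communication error
        handler.endDocument();  // a late or duplicate notification is harmless
        System.out.println("cleanups = " + handler.cleanups);
    }
}
```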

SUMMARY

It seems one has to separate the IXmlReader.Parse() method (and its 
equivalents) from the contract defined for IContentHandler. Exceptions thrown 
during Parse() may or may not be detectable by the event consumer and 
therefore should not have an influence on the contract.

Even if one requires that the SAX event generator must always make a call to 
IContentHandler.EndDocument() (when StartDocument() was called), one cannot 
guarantee that the SAX consumer will also receive that event. This means, the 
SAX consumer has to be aware of how it communicates with the SAX generator.

So let's then phrase the requirements again:

===========
For a stream of SAX events that represent an XML document, the SAX event 
producer must call IContentHandler.StartDocument() exactly once, *before* any 
part of the input, on which the SAX events are based, is processed.

IContentHandler.EndDocument() *must* be called by the SAX event producer 
exactly once as the last event in a SAX event stream initiated by a 
IContentHandler.StartDocument() call, regardless of any exceptional or error 
situation encountered. Depending on the call communication mechanism, however, 
this is no guarantee that the SAX event consumer will also receive that call.
===========
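
On the producer side, one way to honour this wording is a try/finally
around everything that follows StartDocument(). The sketch below uses
the Java ContentHandler interface; EndDocumentContractDemo, produce(),
Recorder and run() are hypothetical names, and the consumer is rigged to
throw from a call-back on an element named "bad". One caveat of this
simple form: if endDocument() itself throws, that exception replaces the
original one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical event producer that always closes the stream with endDocument().
final class EndDocumentContractDemo {

    /** Emits a root element with one empty child per name. */
    static void produce(List<String> names, ContentHandler handler) throws SAXException {
        handler.startDocument();            // exactly once, before anything else
        try {
            handler.startElement("", "root", "root", new AttributesImpl());
            for (String name : names) {
                handler.startElement("", name, name, new AttributesImpl());
                handler.endElement("", name, name);
            }
            handler.endElement("", "root", "root");
        } finally {
            handler.endDocument();          // last event, even after an exception
        }
    }

    /** Records events and throws from a call-back for the element "bad". */
    static final class Recorder extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override public void startDocument() { events.add("startDocument"); }
        @Override public void endDocument() { events.add("endDocument"); }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            events.add("start:" + qName);
            if ("bad".equals(qName)) {
                throw new SAXException("consumer rejected " + qName);
            }
        }
    }

    /** Runs the producer and returns the recorded events, plus "exception" if one escaped. */
    static List<String> run(List<String> names) {
        Recorder recorder = new Recorder();
        try {
            produce(names, recorder);
        } catch (SAXException e) {
            recorder.events.add("exception");
        }
        return recorder.events;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("ok", "bad", "later")));
    }
}
```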

Note: It seems to me this covers Jeff's requirements a) to e) above and makes 
it clear that this not only applies to the standard configuration of calling 
IXmlReader.Parse() on a SAX parser - for which these requirements would not 
strictly be necessary - but more generally to any sequence of SAX events that 
represent an XML document.

Karl
