A sizable percentage of the cost of parsing is for
creating Strings that are part of the callback interfaces.
The cost can be reduced when strings are repeated (like
element and attribute names), by caching and re-using
Strings. But in cases where Strings are unlikely to be
reused, like attribute values, the cost is high in having
to return String objects. It would be nice if all instances
of String were replaced with a mutable version of
String, so the implementation had the option not to
create new Strings. If the client needed a real
java.lang.String, they could call toString().
Unfortunately, I'm not sure how this could be retrofitted
into the existing interfaces without breaking current
clients.
Maybe it would require alternate interfaces (a mirror set
of Interface2's) that break backward compatibility?
Logged In: YES
user_id=44117
Unlikely to fix this. The fundamental problem is
still that java.lang.String costs are excessive.
That's a JDK 1.0 language design issue.
SAX implementations can also cache attribute
strings in certain cases (enumerated types),
or control costs by smarter handling of CDATA
strings (either not caching them, since they're
not likely to repeat, or having a cache that's
better at purging unused data).
Logged In: YES
user_id=494857
I'm not suggesting that you fix java.lang.String. Rather I think
you could replace the use of type java.lang.String, especially
for costly instances like attribute values, with a new type
org.sax.MyString (or whatever you wish to name it), where
MyString has more or less the same functional interface as
java.lang.String. But instances would be reused, rather than
creating a new java.lang.String for every attribute value.
MyString would be mutable and backed by the underlying
SAX char buffer, so no new objects are created and no
copying of data would be performed. It would be assumed
that a MyString instance is valid only for the duration of the
callback. So if the client really needed
to save a java.lang.String (e.g. building DOM), then the client
could call MyString.toString().
This leaves the cost of building a heavyweight java.lang.String
up to the client.
The minimum interface change I would like to see:
org.sax.MyString Attribute.getValue().
To be consistent character content could also be changed to
ContentHandler.characters(MyString).
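A minimal sketch of what such a type could look like, assuming it implements java.lang.CharSequence. The name CharSlice and its reset() and contentEquals() methods are hypothetical and not part of any SAX release; the parser is assumed to call reset() before each callback.

```java
// Hypothetical sketch only: CharSlice is an invented name, not a SAX type.
// It is a mutable view over the parser's char buffer, valid only during a callback.
final class CharSlice implements CharSequence {
    private char[] buf = new char[0];
    private int offset;
    private int length;

    // Re-point the view at a new range. No allocation, no copying.
    void reset(char[] buf, int offset, int length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return buf[offset + index];
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException();
        }
        CharSlice slice = new CharSlice();
        slice.reset(buf, offset + start, end - start);
        return slice;
    }

    // Compare against any CharSequence (including a String) without creating a String.
    boolean contentEquals(CharSequence other) {
        if (other.length() != length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (buf[offset + i] != other.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // The only place a java.lang.String is built; the client pays for it on demand.
    @Override
    public String toString() {
        return new String(buf, offset, length);
    }
}
```

A client that only compares values would call contentEquals() and never allocate; a client building a DOM would call toString() to get a copy that outlives the callback.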
Logged In: YES
user_id=44117
As I said, unlikely to fix. Implementations can change
their performance behaviors by interning/caching strings
used for attribute values, or not ... some might even be
able to defer creating them, though they'd need to at
least report XML parsing errors for things like entity
refs inside those strings.
That is, you can have an implementation with your
desired performance curve without changing APIs.
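As a rough illustration of the caching idea, an implementation could look up a previously built String directly from the char range, so that a repeated value costs no new object. This is only a sketch: ValueCache and its methods are hypothetical names, and the one-entry-per-slot overwrite is just one simple way to purge unused data.

```java
// Hypothetical sketch only: ValueCache is an invented name, not taken from any parser.
// It returns a cached String for a char range when the same value was seen before.
final class ValueCache {
    private final String[] slots;

    ValueCache(int size) {
        slots = new String[size];
    }

    String get(char[] buf, int offset, int length) {
        // Hash the raw chars so a cache hit needs no allocation at all.
        int hash = 0;
        for (int i = 0; i < length; i++) {
            hash = 31 * hash + buf[offset + i];
        }
        int index = Math.floorMod(hash, slots.length);

        String cached = slots[index];
        if (cached != null && matches(cached, buf, offset, length)) {
            return cached;
        }
        // Miss: build the String and overwrite the slot, evicting the old entry.
        String created = new String(buf, offset, length);
        slots[index] = created;
        return created;
    }

    private static boolean matches(String s, char[] buf, int offset, int length) {
        if (s.length() != length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (s.charAt(i) != buf[offset + i]) {
                return false;
            }
        }
        return true;
    }
}
```

This helps for enumerated-type attributes whose values repeat. For values that rarely repeat, every lookup still pays for hashing plus a new String.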
Logged In: YES
user_id=494857
For most cases, the variability of attribute values makes the
cost of caching higher than not caching. Lazy creation only helps
for use cases that skip attribute values.
E.g., find me all elements that contain an attribute with the
value "dog". In this case SAX will create a new heavyweight
String for every attribute in the document. For this use
case, how do you propose I make {java.lang.String
Attribute.getValue()}, perform as well as {MyString
Attribute.getValue()}? Caching and lazy building will not do it.
If you feel it is not a big deal to represent attribute values with
java.lang.Strings, then why don't you do the same with
character data?
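For reference, the "dog" use case written against the existing SAX2 and JAXP APIs might look like the sketch below. Only the class name DogCounter and its count() helper are invented. Every value inspected here arrives as a java.lang.String from Attributes.getValue(int), whether or not it matches; whether the parser allocates a fresh String each time depends on the implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: counts elements that have at least one attribute whose value is "dog".
final class DogCounter extends DefaultHandler {
    int matches;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            // getValue(int) returns a java.lang.String for every attribute examined.
            if ("dog".equals(atts.getValue(i))) {
                matches++;
                break; // count each element once
            }
        }
    }

    // Helper (hypothetical name) that parses an in-memory document and returns the count.
    static int count(String xml) {
        try {
            DogCounter handler = new DogCounter();
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.matches;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```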
Logged In: YES
user_id=285221
There is a SAX2-like parser that has made the shift away
from String. The RealtimeParser javadoc describes the
changes that were necessary.
Project URL:
http://jade.dautelle.com/
SAX Parser URL:
http://jade.dautelle.com/api/jade/xml/sax/RealtimeParser.html
Logged In: YES
user_id=494857
Thanks Reid. I was actually aware of implementations that
avoided Strings, but I was more interested in having the
official SAX interface support light-weight strings. I'm quite
surprised that the SAX maintainers cannot see the value in
this. Hopefully some day they will see the light.