#31 Reduce the requirement to create Strings.


A sizable share of the cost of parsing goes to creating
the Strings that are passed through the callback interfaces.
When strings repeat (like element and attribute names),
the cost can be reduced by caching and reusing Strings.
But where Strings are unlikely to be reused, as with
attribute values, having to return String objects is
expensive. It would be nice if all uses of String were
replaced with a mutable version of String, so that the
implementation had the option not to create new Strings.
If the client needed a real java.lang.String, it could
call toString().

Unfortunately I'm not sure how this could be retrofitted
into the existing interfaces without breaking current

Maybe it would require alternate interfaces (a mirror set
of Interface2's) that break backward compatibility?


  • David Brownell

    Logged In: YES

    Unlikely to fix this. The fundamental problem is
    still that java.lang.String costs are excessive.
    That's a JDK 1.0 language design issue.

    SAX implementations can also cache attribute
    strings in certain cases (enumerated types),
    or control costs by smarter handling of CDATA
    strings (either not caching them, since they're
    not likely to repeat, or having a cache that's
    better at purging unused data).

  • David Brownell

    • priority: 5 --> 1
  • Marko

    Logged In: YES

    I'm not suggesting that you fix java.lang.String. Rather, I think
    you could replace the use of type java.lang.String, especially
    for costly instances like attribute values, with a new type
    org.sax.MyString (or whatever you wish to name it), where
    MyString has more or less the same functional interface as
    java.lang.String. But instances would be reused, rather than
    creating a new java.lang.String for every attribute value.
    MyString would be mutable and backed by the underlying
    SAX char buffer, so no new objects are created and no
    copying of data is performed. It would be assumed
    that a MyString is only valid for the duration of the
    callback. So if the client really needed
    to save a java.lang.String (e.g. when building a DOM), then
    the client could call MyString.toString().

    This leaves the cost of building a heavyweight java.lang.String
    up to the client.

    The minimum interface change I would like to see is
    org.sax.MyString Attribute.getValue().
    To be consistent character content could also be changed to
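
A minimal sketch of the kind of type described in this comment, assuming it implements java.lang.CharSequence. MyString is only the placeholder name used in this thread; no such class exists in SAX, and the reset() and contentEquals() methods are hypothetical names chosen for illustration.

```java
// Hypothetical sketch only: MyString is the placeholder name from this
// thread, not a real SAX type. It is a reusable view over a char buffer.
final class MyString implements CharSequence {
    private char[] buf;   // the parser's buffer; shared, never copied here
    private int start;    // offset of the first character of the value
    private int len;      // number of characters in the value

    // Re-point this instance at another region of the buffer. A parser
    // would do this before each callback instead of allocating a String.
    MyString reset(char[] buf, int start, int len) {
        this.buf = buf;
        this.start = start;
        this.len = len;
        return this;
    }

    @Override
    public int length() {
        return len;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= len) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return buf[start + index];
    }

    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > len || from > to) {
            throw new IndexOutOfBoundsException();
        }
        // Allocates a new view object, but still shares the buffer.
        return new MyString().reset(buf, start + from, to - from);
    }

    // Compares the current contents with another sequence without copying.
    boolean contentEquals(CharSequence other) {
        if (other.length() != len) {
            return false;
        }
        for (int i = 0; i < len; i++) {
            if (buf[start + i] != other.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // The only place where the characters are copied into a new String.
    @Override
    public String toString() {
        return new String(buf, start, len);
    }
}

class MyStringDemo {
    public static void main(String[] args) {
        char[] buffer = "cat dog bird".toCharArray();
        MyString value = new MyString().reset(buffer, 4, 3);
        System.out.println(value.contentEquals("dog"));

        // A client that wants to keep the value must copy it before the
        // view is reused for the next value.
        String saved = value.toString();
        value.reset(buffer, 0, 3);
        System.out.println(saved + " " + value);
    }
}
```

The design consequence is the one stated above: the view is only valid until the next reset(), so any client that stores values must call toString() first.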

  • David Brownell

    Logged In: YES

    As I said, unlikely to fix. Implementations can change
    their performance behaviors by interning/caching strings
    used for attribute values, or not ... some might even be
    able to defer creating them, though they'd need to at
    least report XML parsing errors for things like entity
    refs inside those strings.

    That is, you can have an implementation with your
    desired performance curve without changing APIs.

  • Marko

    Logged In: YES

    In most cases, the variability of attribute values makes the
    cost of caching higher than not caching. Lazy creation only
    helps for use cases that skip attribute values.

    E.g. "Find me all elements that contain an attribute with the
    value of "dog"." In this case SAX will create a new heavyweight
    String for every attribute in the document. For this use
    case, how do you propose I make {java.lang.String
    Attribute.getValue()} perform as well as {MyString
    Attribute.getValue()}? Caching and lazy building will not do it.

    If you feel it is not a big deal to represent attribute values with
    java.lang.Strings, then why don't you do the same with
    character data?
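
For reference, a rough sketch of the "dog" use case against the standard SAX 2 API as shipped in the JDK. It shows the asymmetry raised here: Attributes.getValue(int) returns a java.lang.String, while ContentHandler.characters(char[], int, int) hands over a slice of a char array. The DogFinder class, its fields, and the scan() helper are hypothetical names for illustration, and the sketch makes no claim about what any particular parser allocates internally.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts elements that have an attribute whose
// value is "dog", and counts character data without building Strings.
class DogFinder extends DefaultHandler {
    int matches;    // elements with at least one attribute equal to "dog"
    int textChars;  // total characters reported through characters()

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            // The API returns a java.lang.String for every value inspected.
            if ("dog".equals(atts.getValue(i))) {
                matches++;
                break;
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Character data arrives as a buffer slice; no String is needed
        // unless the client decides to create one.
        textChars += length;
    }

    // Parses the given XML text and returns the handler with its counts.
    static DogFinder scan(String xml) {
        DogFinder handler = new DogFinder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return handler;
    }
}

class DogFinderDemo {
    public static void main(String[] args) {
        String xml = "<pets><pet kind=\"dog\" name=\"Rex\"/>"
                + "<pet kind=\"cat\"/><pet nick=\"dog\">hi</pet></pets>";
        DogFinder result = DogFinder.scan(xml);
        System.out.println(result.matches + " matching elements, "
                + result.textChars + " text characters");
    }
}
```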

  • Marko

    Logged In: YES

    Thanks, Reid. I was actually aware of implementations that
    avoid Strings, but I was more interested in having the
    official SAX interface support lightweight strings. I'm quite
    surprised that the SAX maintainers cannot see the value in
    this. Hopefully some day they will see the light.