
#31 Reduce the requirement to create Strings.

Status: open
Owner: nobody
Labels: None
Priority: 1
Updated: 2014-07-13
Created: 2003-05-28
Creator: Marko
Private: No

A sizable percentage of the cost of parsing goes to
creating Strings that are part of the callback interfaces.
The cost can be reduced when strings are repeated (like
element and attribute names) by caching and re-using
Strings. But in cases where Strings are unlikely to be
reused, like attribute values, having to return String
objects is expensive. It would be nice if all instances
of String were replaced with a mutable version of
String, so the implementation had the option not to
create new Strings. If the client needed a real
java.lang.String, it could call toString().

Unfortunately, I'm not sure how this could be retrofitted
into the existing interfaces without breaking current
clients.

Maybe it would require alternate interfaces (a mirror set
of Interface2's) that break backward compatibility?

Discussion

  • Anonymous

    Anonymous - 2003-05-29

    Logged In: YES
    user_id=44117

    Unlikely to fix this. The fundamental problem is
    still that java.lang.String costs are excessive.
    That's a JDK 1.0 language design issue.

    SAX implementations can also cache attribute
    strings in certain cases (enumerated types),
    or control costs by smarter handling of CDATA
    strings (either not caching them, since they're
    not likely to repeat, or having a cache that's
    better at purging unused data).
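
The caching idea in the comment above can be illustrated with a small sketch. This is not taken from any SAX implementation; the class name CharRangeStringCache and its overwrite-on-collision purge policy are assumptions made for illustration. It shows how a parser could return a shared String for a repeated value (for example an enumerated attribute type) without allocating on a cache hit.

```java
/**
 * Hypothetical parser-side cache (illustrative only, not part of SAX).
 * Maps a range of the parser's char buffer to a shared String.
 */
final class CharRangeStringCache {
    private final String[] table;
    private final int mask;

    CharRangeStringCache(int sizePowerOfTwo) {
        if (Integer.bitCount(sizePowerOfTwo) != 1) {
            throw new IllegalArgumentException("size must be a power of two");
        }
        table = new String[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
    }

    /** Returns a cached String when the range matches, else creates and stores one. */
    String get(char[] buf, int offset, int length) {
        int h = 0;
        for (int i = offset; i < offset + length; i++) {
            h = 31 * h + buf[i];
        }
        int slot = (h ^ (h >>> 16)) & mask;
        String cached = table[slot];
        if (cached != null && matches(cached, buf, offset, length)) {
            return cached; // hit: no new String is created
        }
        String fresh = new String(buf, offset, length);
        table[slot] = fresh; // miss: overwrite the slot, which also purges stale entries
        return fresh;
    }

    private static boolean matches(String s, char[] buf, int offset, int length) {
        if (s.length() != length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (s.charAt(i) != buf[offset + i]) {
                return false;
            }
        }
        return true;
    }
}
```

A cache like this only pays off when values repeat; for highly variable values every lookup is a miss and the hashing is pure overhead.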

     
  • Anonymous

    Anonymous - 2003-05-29
    • priority: 5 --> 1
     
  • Marko

    Marko - 2003-05-30

    Logged In: YES
    user_id=494857

    I'm not suggesting that you fix java.lang.String. Rather, I think
    you could replace the use of type java.lang.String, especially
    for costly instances like attribute values, with a new type
    org.sax.MyString (or whatever you wish to name it), where
    MyString has more or less the same functional interface as
    java.lang.String. But instances would be reused, rather than
    creating a new java.lang.String for every attribute value.
    MyString would be mutable and backed by the underlying
    SAX char buffer, so no new objects would be created and no
    copying of data would be performed. It would be assumed
    that a MyString instance is valid only for the duration of the
    callback. So if the client really needed
    to save a java.lang.String (e.g. building DOM), then the client
    could call MyString.toString().

    This leaves the cost of building a heavyweight java.lang.String
    up to the client.

    The minimum interface change I would like to see is
    org.sax.MyString Attribute.getValue().
    To be consistent, character content could also be changed to
    ContentHandler.characters(MyString).
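
A minimal sketch of what such a type might look like, assuming it implements java.lang.CharSequence. Nothing here exists in SAX: the class MyString follows the name used in the comment above, and the reset and contentEquals methods are hypothetical. The parser would re-point one instance at its own char buffer before each callback, so no String is created unless the client calls toString().

```java
/**
 * Hypothetical flyweight string (not part of SAX): a mutable view over
 * the parser's char buffer, valid only for the duration of a callback.
 */
final class MyString implements CharSequence {
    private char[] buf = new char[0];
    private int offset;
    private int length;

    /** Re-points this view at a new range; no allocation and no copying. */
    MyString reset(char[] buf, int offset, int length) {
        this.buf = buf;
        this.offset = offset;
        this.length = length;
        return this;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return buf[offset + index];
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        // Copies, so the result stays valid after the callback returns.
        return new String(buf, offset + start, end - start);
    }

    /** Compares content character by character without creating a String. */
    boolean contentEquals(CharSequence other) {
        if (other.length() != length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (buf[offset + i] != other.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    /** The only place a java.lang.String is built; the client pays for it. */
    @Override
    public String toString() {
        return new String(buf, offset, length);
    }
}
```

A client that only inspects values would call contentEquals or charAt; a client that needs to keep a value past the callback (such as a DOM builder) would call toString() and store the result.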

     
  • Anonymous

    Anonymous - 2003-05-31

    Logged In: YES
    user_id=44117

    As I said, unlikely to fix. Implementations can change
    their performance behaviors by interning/caching strings
    used for attribute values, or not ... some might even be
    able to defer creating them, though they'd need to at
    least report XML parsing errors for things like entity
    refs inside those strings.

    That is, you can have an implementation with your
    desired performance curve without changing APIs.

     
  • Marko

    Marko - 2003-06-02

    Logged In: YES
    user_id=494857

    For most cases, the variability of attribute values makes the
    cost of caching higher than not caching. Lazy only helps for
    use cases that skip attribute values.

    E.g., find me all elements that contain an attribute with the
    value of "dog". In this case SAX will create a new
    heavyweight String for every attribute of the document. For this use
    case, how do you propose I make {java.lang.String
    Attribute.getValue()} perform as well as {MyString
    Attribute.getValue()}? Caching and lazy building will not do it.

    If you feel it is not a big deal to represent attribute values with
    java.lang.Strings, then why don't you do the same with
    character data?
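
For reference, the "dog" use case written against the existing SAX 2 API looks roughly like the sketch below. The class name DogFinder and its count helper are made up for illustration; the SAX and JAXP calls are the standard ones. Every call to Attributes.getValue(int) hands back a java.lang.String, even though the handler only compares it and throws it away.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative handler: counts elements having an attribute whose value is "dog". */
final class DogFinder extends DefaultHandler {
    int matches;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            // getValue returns a String for each attribute inspected.
            if ("dog".equals(atts.getValue(i))) {
                matches++;
                break; // count each element at most once
            }
        }
    }

    /** Hypothetical helper: parses the given XML text and returns the match count. */
    static int count(String xml) {
        DogFinder finder = new DogFinder();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), finder);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return finder.matches;
    }
}
```

Whether the parser actually allocates a fresh String per value is an implementation detail, but the String return type leaves it little choice for values that do not repeat.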

     
  • Reid M. Pinchback

    Logged In: YES
    user_id=285221

    There is a SAX2-like parser that has made the shift away
    from String. The RealtimeParser javadoc describes the
    changes that were necessary.

    Project URL:
    http://jade.dautelle.com/

    SAX Parser URL:
    http://jade.dautelle.com/api/jade/xml/sax/RealtimeParser.html

     
  • Marko

    Marko - 2004-09-08

    Logged In: YES
    user_id=494857

    Thanks, Reid. I was actually aware of implementations that
    avoided Strings, but I was more interested in having the
    official SAX interface support lightweight strings. I'm quite
    surprised that the SAX maintainers cannot see the value in
    this. Hopefully some day they will see the light.

     
