#67 InputSource character buffer size 2048 problem

closed-rejected
nobody
1
2007-08-28
2006-08-11
Anonymous
No

I suppose someone has submitted this bug already. When
the character content spans two "chunks of
character buffer", the content is split into two...

please follow... if anyone wants to =P

Discussion

  • scpk

    scpk - 2006-11-28

    Logged In: YES
    user_id=1655502
    Originator: NO

    Hello,
    I get the same problem. Is any solution available for this problem?

     
  • David Megginson

    David Megginson - 2007-08-28
    • status: open --> open-rejected
     
  • David Megginson

    David Megginson - 2007-08-28

    Logged In: YES
    user_id=232602
    Originator: NO

    SAX allows parsers to split character data into multiple events however they want. A client application has to concatenate all of the character events inside an element.
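
A minimal sketch of the concatenation described above, assuming a document whose text of interest sits in elements named `item`. The class name `ItemTextHandler`, the `item` element name, and the `parse` helper are hypothetical; only the `org.xml.sax` and `javax.xml.parsers` calls are standard API. The handler appends every `characters()` event to a buffer and only reads the result in `endElement()`, so it does not matter where the parser chooses to split the text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the full text of every <item> element.
class ItemTextHandler extends DefaultHandler {
    final List<String> items = new ArrayList<>();
    private StringBuilder current; // non-null only while inside an <item>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("item".equals(qName)) {
            current = new StringBuilder(); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called any number of times for one run of text: append, never assign.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("item".equals(qName)) {
            items.add(current.toString()); // the text is complete only here
            current = null;
        }
    }

    // Convenience wrapper for the sketch: parse a string and return the collected texts.
    static List<String> parse(String xml) {
        try {
            ItemTextHandler handler = new ItemTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.items;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The same handler also copes with text that a parser happens to split at a `[` inside a CDATA section, because the pieces are joined before they are used.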

     
  • David Megginson

    David Megginson - 2007-08-28
    • priority: 5 --> 1
     
  • David Megginson

    David Megginson - 2007-08-28
    • status: open-rejected --> closed-rejected
     
  • Nobody/Anonymous

    Logged In: NO

    Cute how the developers hide behind the spec.

    Yes, SAX allows the parser to split character data into multiple buffers. However, I'm sure that the originator of the spec intended the incoming offset and length values, and the buffer offset in particular, to be relative to the character buffer passed into the characters() method.

    Under the current implementations of SAX the value of 'offset' is always the offset from the start of the document. This is 100% useless when the incoming document is split, since it will regularly generate an ArrayIndexOutOfBoundsException, and the value of the buffer offset cannot be manually corrected because there is no way to know WHERE the incoming document was split.

    For example, an incoming document is 6850 characters long, so the parser splits it into multiple character buffers; let's say that each one is approx. 1280 characters long. The first buffer will be fine, but imagine the third: characters() will get a buffer 1280 characters long and an offset of 2560 plus the offset relative to the actual buffer received. This would be manageable if the buffers were all the same length, but it seems that such a guarantee is not forthcoming.
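
For reference, the `org.xml.sax.ContentHandler` Javadoc describes `start` as the start position in the array passed to `characters()` and `length` as the number of characters to read from that array. A minimal sketch under that reading follows; the class name `SliceHandler` and the `simulate` helper are hypothetical, and the calls are made by hand rather than by a real parser. It shows a handler reading exactly the reported slice from a buffer that also contains other data.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the slices it is given, in order.
class SliceHandler extends DefaultHandler {
    final StringBuilder seen = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Read exactly ch[start] .. ch[start + length - 1]. Do not index from 0
        // and do not assume that ch.length equals length.
        seen.append(new String(ch, start, length));
    }

    // Simulates a parser that keeps one internal buffer and reports two slices of it.
    static String simulate() {
        char[] buffer = "<item>hello[world</item>".toCharArray();
        SliceHandler handler = new SliceHandler();
        handler.characters(buffer, 6, 5);  // the slice "hello"
        handler.characters(buffer, 11, 6); // the slice "[world"
        return handler.seen.toString();
    }
}
```

If a particular parser reports a `start` and `length` that fall outside the array it passes, that would be worth reporting to that parser's developers with a small reproducing document.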

    When coupled with the parser's insistence that a single opening square brace is the end of a CDATA section's node value this little problem becomes an unmanageable nightmare.

    Add to this little problem that I've just got an example where the node that characters() is being called for actually has the character data SPLIT OVER TWO BUFFERS because the SAX parser insists that '[' indicates that the character data is done.

    This is shit. Not just any shit mind you, but horse shit. Decomposing horse shit on my living room floor.

    For the love of Christ people, either fix the parser so that the buffer offset is relevant to the buffer received or fix the parser so that '[' doesn't get recognised as the end of data tag for a CDATA section. Preferably do both but I won't hold my breath, I know how the OSS community operates.

     
  • David Megginson

    David Megginson - 2008-03-05

    Logged In: YES
    user_id=232602
    Originator: NO

    The rude comments from "nobody" seem to be about a specific parser, not about SAX itself. I suggest that you take your comments to the developers of the parser you're using, but FWIW, it sounds like they're implementing SAX correctly. We designed SAX to be fast, and were careful not to force extra buffering on anyone.

     