Hi.
I was trying to use HotSAX but I noticed some strange
behaviour in its handling of <!-- comments -->
and <![CDATA[]]> sections.
As a test I wrote a very simple ContentHandler &
LexicalHandler (attached) which just prints the SAX
event sequence to System.out.
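For anyone reading without the attachment, a comparable event logger might look roughly like the sketch below. It is not the attached code: it assumes the JDK's built-in SAX parser instead of HotSAX, and the class name EventLogger and the trace helper are made up for illustration. It also merges adjacent text chunks before logging, because SAX allows a parser to split one run of text across several characters() calls, and it prints newlines as \n so they are visible.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical logger: DefaultHandler2 implements both ContentHandler and LexicalHandler.
public class EventLogger extends DefaultHandler2 {
    private final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    // Emit any buffered text as a single "characters" entry.
    private void flushText() {
        if (text.length() > 0) {
            events.add("characters: [" + text.toString().replace("\n", "\\n") + "]");
            text.setLength(0);
        }
    }

    // Record a non-text event, flushing pending text first to keep the order.
    private void log(String event) {
        flushText();
        events.add(event);
    }

    @Override public void startDocument() { log("startDocument"); }
    @Override public void endDocument() { log("endDocument"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        log("startElement: [" + uri + ", " + localName + ", " + qName + "]");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        log("endElement: [" + uri + ", " + localName + ", " + qName + "]");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // buffer only; logged on the next non-text event
    }

    @Override
    public void comment(char[] ch, int start, int length) {
        log("comment: [" + new String(ch, start, length) + "]");
    }

    @Override public void startCDATA() { log("startCDATA"); }
    @Override public void endCDATA() { log("endCDATA"); }

    // Parse the given XML with the JDK parser (an assumption, not HotSAX) and return the trace.
    public static List<String> trace(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        EventLogger logger = new EventLogger();
        reader.setContentHandler(logger);
        // Standard SAX2 property for registering a LexicalHandler.
        reader.setProperty("http://xml.org/sax/properties/lexical-handler", logger);
        reader.parse(new InputSource(new StringReader(xml)));
        return logger.events;
    }

    public static void main(String[] args) throws Exception {
        trace("<x>\nText\n<!-- comment -->\n</x>").forEach(System.out::println);
        trace("<x>\nText\n<![CDATA[Cdata]]>\n</x>").forEach(System.out::println);
    }
}
```

Run against a conforming parser, the text before the comment or CDATA section should be reported once, the CDATA content should arrive between startCDATA and endCDATA, and the trailing newline before </x> should show up as its own characters entry.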
Input:
<x>
Text
<!-- comment -->
</x>
Resulting event sequence:
startDocument
startElement: [, x, ]
characters: [ Text ]
comment: [ comment ]
characters: [ Text ]
endElement: [, x, ]
endDocument
As you can see, the characters " Text " are getting
fired twice.
Input:
<x>
Text
<![CDATA[Cdata]]>
</x>
Resulting event sequence:
startDocument
startElement: [, x, ]
characters: [ Text ]
startCDATA
characters: [ Text ]
endElement: [, x, ]
endDocument
Again, very similar, but with the added problem that none
of the CDATA content is passed to the content handler,
and no endCDATA event appears in the sequence.
Parser code - includes the Content/Lexical Handler implementation.
You have a line break before your end tag.
If you try:
<x>
Text
<!-- comment --></x>
<x>
Text
<![CDATA[Cdata]]></x>
then you don't get the problem, which is shown by:
@Test
public void testSimpleComment() throws Exception {
    // Same input as in the report, with a line break before the end tag.
    String html = "<x>\n" +
                  "Text\n" +
                  "<!-- comment -->\n" +
                  "</x>";
    final List<String> text = new ArrayList<String>();
    // Collect every chunk of text passed to characters().
    ContentHandler ch = new DefaultContentHandler() {
        @Override
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            String t = new String(ch, start, length);
            text.add(t);
        }
    };
    parser.setContentHandler(ch);
    InputSource input = new InputSource(new StringReader(html));
    parser.parse(input);
    // Two text events: the text before the comment, and the newline after it.
    Assert.assertThat(text.size(), is(2));
    Assert.assertThat(text.get(0), is("\nText\n"));
    Assert.assertThat(text.get(1), is("\n"));
}
on version 0.1.2b, which I am running.