dmunguiam - 2007-05-31

Hi,

I'm having trouble parsing the following HTML file:

<html>
<head><script language="javascript">alert("<body>something</body>");</script></head>
<body><h1>Hello World!</h1></body>
</html>

When I ask the parser for the next body element, it returns "<body>something</body>" instead of the correct one, "<body><h1>Hello World!</h1></body>".

I tried to tell the parser to skip the head element and continue from there:

        int afterHEADPosition = source.findNextElement(0, HTMLElementName.HEAD).getEnd();
        Element bodyElement = source.findNextElement(afterHEADPosition, HTMLElementName.BODY);

However, the call to source.findNextElement(0, HTMLElementName.HEAD) returns only "<head><script language="javascript">alert("", and calling getEndTag() on that element returns null. Parsing of the head element is cut short when the parser reaches the <body> token inside the JavaScript string.

Does anybody know of a workaround for this? I'm guessing the JavaScript is not being parsed as part of the HTML, but in that case I'm wondering whether the parser should ignore everything inside <script> tags.
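
For reference, one possible workaround is to blank out the contents of every script element before handing the text to the parser. The following is only a minimal sketch of that idea using plain java.util.regex, not part of the parser library: the class name ScriptMasker and the method maskScriptContents are hypothetical names made up for this illustration. It assumes script start tags contain no ">" inside attribute values and that script elements are not nested.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any parser library.
final class ScriptMasker {
    // Matches a <script ...> start tag, its content (group 1), and the </script> end tag.
    // Assumption: no '>' appears inside the start tag's attribute values.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>(.*?)</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns a copy of html in which every character inside a script element
    // is replaced by a space. The length is unchanged, so any position found
    // in the masked text is also valid in the original text.
    static String maskScriptContents(String html) {
        StringBuilder masked = new StringBuilder(html);
        Matcher m = SCRIPT.matcher(html);
        while (m.find()) {
            for (int i = m.start(1); i < m.end(1); i++) {
                masked.setCharAt(i, ' ');
            }
        }
        return masked.toString();
    }

    public static void main(String[] args) {
        String html = "<html>\n"
                + "<head><script language=\"javascript\">alert(\"<body>something</body>\");</script></head>\n"
                + "<body><h1>Hello World!</h1></body>\n"
                + "</html>";
        String masked = maskScriptContents(html);

        // Search the masked copy, then cut the same range out of the original.
        int bodyStart = masked.indexOf("<body");
        int bodyEnd = masked.indexOf("</body>", bodyStart) + "</body>".length();
        System.out.println(html.substring(bodyStart, bodyEnd));
        // prints: <body><h1>Hello World!</h1></body>
    }
}
```

Because the masked string has the same length as the original, it should be possible to build the Source from the masked text and still use the resulting begin and end positions against the original file. I have not verified this against the parser itself.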

Thanks in advance for your help.

- Diego