Unmatched <script> tag eats the rest of the markup in the document
We're evaluating switching to Jericho from our own parser based on HTMLParser, as ours is sort of limited.
For this sample from our test cases:
Leading Line.
<script>Some broken script.<p>Line 1.</p><p>Line 2.</p>
Jericho currently decides that the last [tag lost in rendering] is the end of the <script>, resulting in our text extraction getting no content, as we omit scripts.
Our existing parser hits the <p> and decides that the script is over, so "Line 1." comes out as text.
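To make the two behaviours easier to compare, here is a rough, library-free sketch of a text extractor that omits script content and can recover from an unmatched <script> in two different ways. It is not Jericho's implementation and not our real parser; the class name `ScriptRecoverySketch`, the `Policy` enum and the rule "only recover when no </script> exists at all" are assumptions made up for illustration.

```java
/** Illustrative only: a toy text extractor, not Jericho and not a real HTML parser. */
final class ScriptRecoverySketch {

    /** Hypothetical recovery policies for a <script> start tag with no </script>. */
    enum Policy {
        SCRIPT_RUNS_TO_END_OF_INPUT, // everything after the unmatched <script> is script
        SCRIPT_ENDS_AT_NEXT_TAG      // the next tag after the unmatched <script> ends it
    }

    /** Returns the text outside tags, omitting the content of script elements. */
    static String extractText(String html, Policy policy) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < html.length()) {
            int lt = html.indexOf('<', pos);
            if (lt < 0) {                       // no more tags: the rest is text
                out.append(html, pos, html.length());
                break;
            }
            out.append(html, pos, lt);          // text before the tag
            int gt = html.indexOf('>', lt);
            if (gt < 0) {                       // "<" that never closes: keep it as text
                out.append(html, lt, html.length());
                break;
            }
            pos = gt + 1;                       // skip the tag itself
            if (!isScriptStart(html, lt)) {
                continue;
            }
            int close = indexOfIgnoreCase(html, "</script", pos);
            if (close >= 0) {                   // matched script: skip through </script>
                int closeGt = html.indexOf('>', close);
                pos = closeGt < 0 ? html.length() : closeGt + 1;
            } else if (policy == Policy.SCRIPT_RUNS_TO_END_OF_INPUT) {
                pos = html.length();            // unmatched: swallow the rest
            } else {                            // unmatched: script stops at the next tag
                int next = html.indexOf('<', pos);
                pos = next < 0 ? html.length() : next;
            }
        }
        return out.toString();
    }

    /** True if a <script ...> start tag begins at index lt (case-insensitive). */
    private static boolean isScriptStart(String html, int lt) {
        if (!html.regionMatches(true, lt, "<script", 0, 7)) {
            return false;
        }
        char c = html.charAt(lt + 7);           // safe: the caller found a '>' after lt
        return c == '>' || c == '/' || Character.isWhitespace(c);
    }

    private static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String sample = "Leading Line.\n"
                + "<script>Some broken script.<p>Line 1.</p><p>Line 2.</p>";
        for (Policy policy : Policy.values()) {
            System.out.println(policy + " -> [" + extractText(sample, policy) + "]");
        }
    }
}
```

With the sample above, the first policy leaves only "Leading Line." in the extracted text, while the second also yields "Line 1." and "Line 2.", which is the result we would like to keep.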
Rats. Markdown interpreted that tag as part of the content and now I can't fix it.
Hi Trejkaz,
I can't see exactly what your problem is because of the markup issue, but the documentation of the TagType.isValidPosition method includes a comprehensive discussion about how script elements are parsed in different circumstances, and how the behaviour compares to HTML4 and HTML5 specifications.
http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TagType.html#isValidPosition%28net.htmlparser.jericho.Source,%20int,%20int[]%29
Cheers
Martin