
#68 Unmatched <script> tag eats the rest of the markup in the document

Milestone: General
Status: closed-out-of-date
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-10-24
Created: 2013-09-10
Creator: Trejkaz
Private: No

We're evaluating switching to Jericho from our own parser based on HTMLParser, as ours is sort of limited.

For this sample from our test cases:

Leading Line.
<script>Some broken script.<p>Line 1.</p><p>Line 2.</p>

Jericho currently decides that the last </p> is the end of the <script>, resulting in our text extraction getting no content as we omit scripts.

Our existing parser hits the <p> and decides that the script is over, so "Line 1." comes out as text.
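
To make the two behaviours concrete, here is a rough, self-contained sketch in plain Java. It is not Jericho's algorithm and not our actual HTMLParser-based code. The class name, the enum and both method names are made up for illustration, and the assumption that recovery only kicks in when no </script> exists anywhere in the document is mine. It just contrasts "an unclosed <script> swallows the rest of the document" with "an unclosed <script> ends at the next <p>".

```java
// Hypothetical illustration only: a naive tag stripper, not a real HTML parser.
final class ScriptRecoverySketch {

    /** How to treat a <script> that has no closing tag (names are assumptions). */
    enum Recovery { SWALLOW_TO_END, END_AT_NEXT_P }

    /** Returns the text outside tags and outside script bodies, whitespace-normalised. */
    static String extractText(String html, Recovery recovery) {
        StringBuilder text = new StringBuilder();
        int pos = 0;
        while (pos < html.length()) {
            int lt = html.indexOf('<', pos);
            if (lt < 0) {                       // no more tags: the rest is text
                text.append(html, pos, html.length());
                break;
            }
            text.append(html, pos, lt);         // text before the tag
            int gt = html.indexOf('>', lt);
            if (gt < 0) break;                  // unterminated tag: drop the remainder
            pos = gt + 1;
            // Crude prefix check; it would also match e.g. <scripture>.
            if (html.regionMatches(true, lt, "<script", 0, 7)) {
                pos = endOfScript(html, pos, recovery);
            } else {
                text.append(' ');               // any other tag acts as a separator
            }
        }
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    /** Finds where the script body that starts at 'from' is considered to end. */
    static int endOfScript(String html, int from, Recovery recovery) {
        int close = indexOfIgnoreCase(html, "</script", from);
        if (close >= 0) {                       // well-formed: skip past </script>
            int gt = html.indexOf('>', close);
            return gt < 0 ? html.length() : gt + 1;
        }
        if (recovery == Recovery.END_AT_NEXT_P) {
            int p = indexOfIgnoreCase(html, "<p>", from);
            if (p >= 0) return p;               // resume parsing at the <p>
        }
        return html.length();                   // script swallows everything left
    }

    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        String sample = "Leading Line.\n"
                + "<script>Some broken script.<p>Line 1.</p><p>Line 2.</p>";
        System.out.println(extractText(sample, Recovery.SWALLOW_TO_END));
        System.out.println(extractText(sample, Recovery.END_AT_NEXT_P));
    }
}
```

With the sample above, the first mode keeps only the text before the <script>, while the second also yields "Line 1." and "Line 2.", which is the behaviour we would like to keep.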

Discussion

  • Trejkaz

    Trejkaz - 2013-09-10

    Rats. Markdown interpreted that tag as part of the content and now I can't fix it.

     
  • Martin Jericho

    Martin Jericho - 2013-09-10

    Hi Trejkaz,

    I can't see exactly what your problem is because of the markup issue, but the documentation of the TagType.isValidPosition method includes a comprehensive discussion about how script elements are parsed in different circumstances, and how the behaviour compares to HTML4 and HTML5 specifications.

    http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TagType.html#isValidPosition%28net.htmlparser.jericho.Source,%20int,%20int[]%29

    Cheers
    Martin

     
  • Martin Jericho

    Martin Jericho - 2013-09-10
    • status: unread --> pending
     
  • Martin Jericho

    Martin Jericho - 2015-10-24
    • status: pending --> closed-out-of-date
     
