
#68 Unmatched <script> tag eats the rest of the markup in the document

Milestone: General

Status: closed-out-of-date

Owner: nobody

Labels: None

Priority: 5

Updated: 2015-10-24

Created: 2013-09-10

Creator: Trejkaz

Private: No

We're evaluating switching to Jericho from our own parser based on HTMLParser, as ours is sort of limited.

For this sample from our test cases:

Leading Line.
<script>Some broken script.<p>Line 1.</p><p>Line 2.</p>

Jericho currently decides that the last </p> is the end of the <script>, resulting in our text extraction getting no content as we omit scripts.

Our existing parser hits the <p> and decides that the script is over, so "Line 1." comes out as text.
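To make the difference between the two behaviours concrete, here is a minimal, self-contained sketch in plain Java. It is not Jericho's algorithm and not our existing parser's code; the class `ScriptRecoveryDemo`, the method `extractText` and the `endScriptAtNextTag` flag are hypothetical names made up for illustration. It assumes a deliberately naive, regex-based text extractor that drops script content and differs only in how it recovers from a <script> that has no closing tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only: hypothetical names, not part of Jericho or HTMLParser.
final class ScriptRecoveryDemo {
    private static final Pattern SCRIPT_OPEN =
            Pattern.compile("<script\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern SCRIPT_CLOSE =
            Pattern.compile("</script\\s*>", Pattern.CASE_INSENSITIVE);
    // Anything that looks like the beginning of a start tag, e.g. "<p".
    private static final Pattern START_TAG = Pattern.compile("<[a-zA-Z]");

    /**
     * Extracts text while omitting script content.
     *
     * @param endScriptAtNextTag if true, an unmatched script ends at the next
     *        start tag; if false, it swallows the rest of the document.
     */
    public static String extractText(String html, boolean endScriptAtNextTag) {
        StringBuilder kept = new StringBuilder();
        Matcher open = SCRIPT_OPEN.matcher(html);
        int pos = 0;
        while (open.find(pos)) {
            // Keep everything before the script start tag.
            kept.append(html, pos, open.start());
            int bodyStart = open.end();
            Matcher close = SCRIPT_CLOSE.matcher(html);
            if (close.find(bodyStart)) {
                // Well-formed case: skip through the matching end tag.
                pos = close.end();
            } else if (endScriptAtNextTag) {
                // Lenient recovery: the script is over at the next start tag.
                Matcher next = START_TAG.matcher(html);
                pos = next.find(bodyStart) ? next.start() : html.length();
            } else {
                // Greedy recovery: the script runs to the end of the document.
                pos = html.length();
            }
        }
        kept.append(html, pos, html.length());
        // Strip remaining tags and normalise whitespace.
        return kept.toString()
                .replaceAll("<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }
}
```

With the sample above, `extractText(sample, false)` returns only "Leading Line.", while `extractText(sample, true)` returns "Leading Line. Line 1. Line 2.". The lenient variant is a heuristic: a real script containing something like `a<b` would be cut short by it, which is presumably why a parser might prefer the greedy rule.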

Discussion

Trejkaz - 2013-09-10

Rats. Markdown interpreted that tag as part of the content and now I can't fix it.


Martin Jericho - 2013-09-10

Hi Trejkaz,

I can't see exactly what your problem is because of the markup issue, but the documentation of the TagType.isValidPosition method includes a comprehensive discussion about how script elements are parsed in different circumstances, and how the behaviour compares to HTML4 and HTML5 specifications.

http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TagType.html#isValidPosition%28net.htmlparser.jericho.Source,%20int,%20int[]%29

Cheers
Martin


Martin Jericho - 2013-09-10

status: unread --> pending

Martin Jericho - 2015-10-24

status: pending --> closed-out-of-date
