We're evaluating switching to Jericho from our own parser based on HTMLParser, as ours is sort of limited.
For this sample from our test cases:
Leading Line. <script>Some broken script.<p>Line 1.</p><p>Line 2.</p>
Jericho currently decides that the last </p> is the end of the <script>, so our text extraction gets no content, because we omit scripts.
Our existing parser hits the <p> and decides that the script is over, so "Line 1." comes out as text.
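To make the difference concrete, here is a rough sketch in plain Java of the two recovery strategies for an unterminated <script>. It is neither Jericho nor HTMLParser code, and it only approximates what either library does internally. The class, enum, and method names (ScriptRecoverySketch, Recovery, extractText) are made up for illustration, and the scanner is deliberately naive: it knows nothing about comments, attributes containing '>', or entities.

```java
final class ScriptRecoverySketch {

    /** Hypothetical recovery strategies for a <script> with no closing tag. */
    enum Recovery {
        NEXT_TAG, // script ends at the next '<' (roughly what our existing parser does)
        LAST_TAG  // script runs through the last tag in the input (roughly what we see from Jericho)
    }

    /** Extracts visible text, omitting script content. Illustrative only. */
    static String extractText(String html, Recovery recovery) {
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            int lt = html.indexOf('<', i);
            if (lt < 0) {                       // no more tags: the rest is text
                text.append(html, i, html.length());
                break;
            }
            text.append(html, i, lt);           // text before the tag
            int gt = html.indexOf('>', lt);
            if (gt < 0) {
                break;                          // unterminated tag: drop the rest
            }
            boolean scriptStart = html.regionMatches(true, lt, "<script", 0, 7);
            i = gt + 1;
            if (!scriptStart) {
                text.append(' ');               // treat any other tag as whitespace
                continue;
            }
            int close = indexOfIgnoreCase(html, "</script", i);
            if (close >= 0) {                   // well-formed: skip to after </script>
                int closeGt = html.indexOf('>', close);
                i = closeGt < 0 ? html.length() : closeGt + 1;
            } else if (recovery == Recovery.NEXT_TAG) {
                int next = html.indexOf('<', i);
                i = next < 0 ? html.length() : next;
            } else {                            // LAST_TAG
                int lastGt = html.lastIndexOf('>');
                i = lastGt >= i ? lastGt + 1 : html.length();
            }
        }
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    private static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String sample =
            "Leading Line. <script>Some broken script.<p>Line 1.</p><p>Line 2.</p>";
        // prints: Leading Line. Line 1. Line 2.
        System.out.println(extractText(sample, Recovery.NEXT_TAG));
        // prints: Leading Line.
        System.out.println(extractText(sample, Recovery.LAST_TAG));
    }
}
```

With NEXT_TAG the paragraphs after the broken script survive; with LAST_TAG everything after the <script> start tag is swallowed. What we would like is a way to get the first behaviour out of Jericho for this kind of input.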