I believe you want to set the static member for strict remark parsing to false:
org.htmlparser.lexer.Lexer.STRICT_REMARKS = false;
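
To illustrate why that flag matters here, below is a small self-contained toy model. It is not HTML Parser's code, and the class and method names (RemarkDemo, remarkEnd) are made up for this sketch. It assumes, based on my reading of the Lexer documentation, that strict mode does not treat '--->' (more than two dashes before the '>') as the end of a remark, while lax mode stops at the first '-->' as browsers typically do. The sample input is shortened from the page in question.

```java
final class RemarkDemo {
    /**
     * Toy model only, not HTML Parser's implementation.
     * Returns the index just past the end of the comment that opens at
     * 'start' (which must point at "<!--"), or -1 if it never closes.
     */
    static int remarkEnd(String html, int start, boolean strict) {
        int bodyStart = start + 4;            // skip the opening "<!--"
        int from = bodyStart;
        while (true) {
            int i = html.indexOf("-->", from);
            if (i < 0) {
                return -1;                    // unterminated comment
            }
            // Assumed strict rule: a third dash right before "-->" means
            // this is "--->", which strict mode does not accept as a close.
            boolean extraDash = i > bodyStart && html.charAt(i - 1) == '-';
            if (!strict || !extraDash) {
                return i + 3;                 // index just past "-->"
            }
            from = i + 3;                     // keep scanning for a clean "-->"
        }
    }

    public static void main(String[] args) {
        String html = "<!------ End New Comments Box ------>\n"
                + "<div class=\"sidebarhack\"><b></b></div>\n"
                + "<!-- sphereit end -->\n"
                + "<br clear=\"all\">";
        for (boolean strict : new boolean[] {true, false}) {
            int end = remarkEnd(html, 0, strict);
            System.out.println("strict=" + strict
                    + " swallowed as comment:\n" + html.substring(0, end) + "\n");
        }
    }
}
```

Under this model, the strict pass runs on until the '-->' of the later 'sphereit end' comment and swallows the div in between, which matches what you are seeing. The lax pass stops at the end of the first line. In the real library the one-line setting above should be all you need, provided it is set before the page is parsed.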
----- Original Message ----
From: Subramanya Sastry <sa...@cs...>
To: htmlparser user list <htm...@li...>
Sent: Tuesday, December 11, 2007 5:02:08 PM
Subject: [Htmlparser-user] scanning / parsing bug?
For this url,
http://www.washingtonpost.com/wp-dyn/content/article/2007/12/10/AR2007121001600.html
(and maybe other washington post urls), I wonder if HTML Parser is
running into a bug.
The HTML source for this page has the following block of HTML in the
middle ..
<!---------------- End New Comments Box ------------------>
<div class="sidebarhack"><b></b></div>
....
....
</div>
<!-- sphereit end -->
<br clear="all">
The parser is ignoring all content from the start of the line 'End New
Comments Box' till 'sphereit end' ... I wonder if this is because of the
lack of a space before the '-->' closing comment string in the first
line ... I tested the code by adding a space manually at that point, and
sure enough, the block of HTML in the middle is correctly recognized.
Is there a workaround for this? I am also willing to download the
source code and incorporate a fix, if necessary.
Thanks,
Subbu.