Re: [Htmlparser-user] scanning / parsing bug?
From: Derrick O. <der...@ro...> - 2007-12-11 23:21:30
I believe you want to set the static member for strict remark parsing to false:

    org.htmlparser.lexer.Lexer.STRICT_REMARKS = false;

----- Original Message ----
From: Subramanya Sastry <sa...@cs...>
To: htmlparser user list <htm...@li...>
Sent: Tuesday, December 11, 2007 5:02:08 PM
Subject: [Htmlparser-user] scanning / parsing bug?

For this url,
http://www.washingtonpost.com/wp-dyn/content/article/2007/12/10/AR2007121001600.html
(and maybe other Washington Post urls), I wonder if HTML Parser is running into a bug.

The HTML source for this page has the following block of HTML in the middle:

    <!---------------- End New Comments Box ------------------>
    <div class="sidebarhack"><b></b></div>
    ....
    ....
    </div>
    <!-- sphereit end -->
    <br clear="all">

The parser is ignoring all content from the start of the line 'End New Comments Box' till 'sphereit end'. I wonder if this is because of the lack of a space before the '-->' closing comment string in the first line. I tested the code by adding a space manually at that point, and sure enough, the block of HTML in the middle is correctly recognized.

Is there a workaround for this? I am also willing to download the source code and incorporate a fix, if necessary.

Thanks,
Subbu.
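
To illustrate why the run of dashes before the '>' matters, here is a small standalone sketch in plain Java. It is not HTML Parser's lexer code. It is a simplified model that assumes strict mode refuses to end a remark when more than two dashes come directly before the '>', which is consistent with what Subbu observed. The class and method names (RemarkDemo, remarkEnd, stripRemarks) are made up for this example, and the sample markup uses shorter dash runs than the real page.

```java
// Simplified model of strict vs. lenient remark termination.
// All names here are hypothetical; this is not the HTML Parser implementation.
class RemarkDemo {

    // Returns the index just past the end of the remark that opens at
    // 'start' (which must point at "<!--"), or -1 if it never terminates.
    static int remarkEnd(String html, int start, boolean strict) {
        int from = start + 4; // skip the opening "<!--"
        while (true) {
            int i = html.indexOf("-->", from);
            if (i < 0) {
                return -1;
            }
            // Assumed strict rule: a terminator with an extra dash in front
            // of it ("--->", "---->", ...) does not close the remark.
            boolean extraDash = html.charAt(i - 1) == '-';
            if (!strict || !extraDash) {
                return i + 3;
            }
            from = i + 1; // keep scanning for a later terminator
        }
    }

    // Returns the input with all remarks removed under the given mode.
    static String stripRemarks(String html, boolean strict) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (true) {
            int open = html.indexOf("<!--", pos);
            if (open < 0) {
                out.append(html.substring(pos));
                break;
            }
            out.append(html, pos, open);
            int end = remarkEnd(html, open, strict);
            if (end < 0) {
                break; // unterminated remark: the rest of the input is dropped
            }
            pos = end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<!------ End New Comments Box -------->"
                + "<div class=\"sidebarhack\"><b></b></div>"
                + "<!-- sphereit end -->"
                + "<br clear=\"all\">";
        System.out.println("strict : " + stripRemarks(html, true));
        System.out.println("lenient: " + stripRemarks(html, false));
    }
}
```

Under this model the strict pass treats everything from the first remark through 'sphereit end' as one remark, so the div disappears, while the lenient pass ends the first remark at its own '-->' and keeps the div. With the real library, the workaround is the single assignment given in the reply, made before the page is parsed.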