Thread: Re: [Htmlparser-user] scanning / parsing bug?

I believe you want to set the static member for strict remark parsing to false:
  org.htmlparser.lexer.Lexer.STRICT_REMARKS = false;
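
To see why that flag would matter for the comment quoted below, here is a rough, self-contained sketch of a comment scanner with a strict and a lenient mode. The class name RemarkScanSketch, the method remarkEnd and the state numbering are made up for illustration and are not taken from the HTMLParser source. The sketch assumes that a strict scanner, once it has seen a closing "--", accepts only whitespace before the final ">", while a lenient one also tolerates extra '-' or '!' characters there.

```java
// Simplified model of an HTML comment ("remark") scanner.
// Illustrative assumption only; not code copied from HTMLParser.
final class RemarkScanSketch {

    /**
     * Scans a comment that starts at 'start' (which must point at "<!").
     * Returns the index just past the '>' that ends the comment,
     * or -1 if no end is found.
     */
    static int remarkEnd(String html, int start, boolean strict) {
        int state = 0;
        for (int i = start + 2; i < html.length(); i++) {
            char ch = html.charAt(i);
            switch (state) {
                case 0: // expecting the first '-' of the opening "--"
                case 1: // expecting the second '-' of the opening "--"
                    if (ch != '-') return -1; // not a comment in this model
                    state++;
                    break;
                case 2: // inside the comment body, looking for a '-'
                    if (ch == '-') state = 3;
                    break;
                case 3: // seen one '-', need a second one to form "--"
                    state = (ch == '-') ? 4 : 2;
                    break;
                case 4: // seen "--", now expecting the terminating '>'
                    if (ch == '>') return i + 1;
                    if (Character.isWhitespace(ch)) break;
                    // Lenient mode keeps waiting for '>' across extra dashes.
                    if (!strict && (ch == '-' || ch == '!')) break;
                    state = 2; // strict mode: treat "--" as body text again
                    break;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html =
            "<!---------------- End New Comments Box ------------------>\n"
          + "<div class=\"sidebarhack\"><b></b></div>\n"
          + "</div>\n"
          + "<!-- sphereit end -->\n"
          + "<br clear=\"all\">";
        // Print whatever each mode treats as the first comment.
        System.out.println("strict:\n" + html.substring(0, remarkEnd(html, 0, true)));
        System.out.println("lenient:\n" + html.substring(0, remarkEnd(html, 0, false)));
    }
}
```

In this model the lenient scan stops at the first '>', while the strict scan can run on to a later "-->" whenever the run of dashes before the '>' does not line up as "--" followed directly by '>'. Since STRICT_REMARKS is a static member, it presumably needs to be set before the page is parsed.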

----- Original Message ----
From: Subramanya Sastry <sa...@cs...>
To: htmlparser user list <htm...@li...>
Sent: Tuesday, December 11, 2007 5:02:08 PM
Subject: [Htmlparser-user] scanning / parsing bug?

For this url, 
http://www.washingtonpost.com/wp-dyn/content/article/2007/12/10/AR2007121001600.html 
(and maybe other washington post urls), I wonder if HTML Parser is 
running into a bug.

The HTML source for this page has the following block of HTML in the 
middle ..

    <!---------------- End New Comments Box ------------------>
    <div class="sidebarhack"><b></b></div>
    ....
    ....
    </div>
    <!-- sphereit end -->
    <br clear="all">

The parser is ignoring all content from the start of the line 'End New 
Comments Box' till 'sphereit end' ... I wonder if this is because of the 
lack of a space before the '-->' closing comment string in the first 
line ... I tested the code by adding a space manually at that point, and 
sure enough, the block of HTML in the middle is correctly recognized.

Is there a workaround for this?  I am also willing to download the 
source code and incorporate a fix, if necessary.

Thanks,
Subbu.
