For this URL,
http://www.washingtonpost.com/wp-dyn/content/article/2007/12/10/AR2007121001600.html
(and possibly other Washington Post URLs), I wonder if HTML Parser is
running into a bug.
The HTML source for this page has the following block of HTML in the
middle:
<!---------------- End New Comments Box ------------------>
<div class="sidebarhack"><b></b></div>
....
....
</div>
<!-- sphereit end -->
<br clear="all">
The parser ignores all content from the start of the 'End New Comments
Box' line through 'sphereit end'. I suspect this is because there is no
space before the closing '-->' in the first line. I tested this by
manually adding a space at that point, and sure enough, the block of
HTML in the middle is then recognized correctly.
Is there a workaround for this? I am also willing to download the
source code and incorporate a fix, if necessary.
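In the meantime, one workaround I am considering is to pre-process the
page text before handing it to the parser, so that every comment is
rewritten with exactly two dashes and a space at each end. Below is a
minimal sketch of that idea using only java.util.regex. It is not part
of HTML Parser; the class name CommentNormalizer and the method
normalize are hypothetical names of my own, and the regex does not
account for '<!--' appearing inside script or style blocks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of HTML Parser): rewrites every HTML comment
// as "<!-- body -->" so the closing delimiter is always preceded by a space.
final class CommentNormalizer {
    // Lazily matches from "<!--" up to the first "-->" that follows it.
    private static final Pattern COMMENT =
            Pattern.compile("<!--(.*?)-->", Pattern.DOTALL);

    static String normalize(String html) {
        Matcher m = COMMENT.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Strip runs of dashes and whitespace from both ends of the body.
            String body = m.group(1).replaceAll("^[-\\s]+|[-\\s]+$", "");
            m.appendReplacement(out,
                    Matcher.quoteReplacement("<!-- " + body + " -->"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html =
                "<!---------------- End New Comments Box ------------------>\n"
                + "<div class=\"sidebarhack\"><b></b></div>\n"
                + "<!-- sphereit end -->\n";
        System.out.println(normalize(html));
    }
}
```

The cleaned string would then be given to the parser in place of the
raw page. This only sidesteps the problem on my side, so I would still
like to know whether the parser itself can be made to handle such
comments.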
Thanks,
Subbu.