Re: [Htmlparser-user] Parsing malformed HTML whilst still leaving it intact


Your code only checks the top-level nodes (the nodes in list).
The toHtml() calls are recursive, so nested nodes never pass through your
loop. You need to put this logic in the definition of toHtml() itself,
probably only in TagNode.java and maybe CompositeTag.java.
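
Here is a minimal sketch of the idea. It does not use the real htmlparser
classes: SketchNode, SketchText, SketchTag, SketchComposite and
VirtualTagSketch are hypothetical stand-ins, and the positions are made up
for the snippet "<div><b>bold text". It assumes, as described in the quoted
message below, that a virtual tag has the same start and end position.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a parsed node (not htmlparser's Node). */
abstract class SketchNode {
    abstract String toHtml();
}

/** Plain text between tags. */
class SketchText extends SketchNode {
    private final String text;

    SketchText(String text) {
        this.text = text;
    }

    @Override
    String toHtml() {
        return text;
    }
}

/** A single tag together with its positions in the source. */
class SketchTag extends SketchNode {
    private final String raw;
    private final int start;
    private final int end;

    SketchTag(String raw, int start, int end) {
        this.raw = raw;
        this.start = start;
        this.end = end;
    }

    /** Assumption: a tag the parser inserted occupies no source characters. */
    boolean isVirtual() {
        return start == end;
    }

    @Override
    String toHtml() {
        // The check lives here, so it applies at every nesting depth.
        return isVirtual() ? "" : raw;
    }
}

/** A start tag, its children and its end tag, rendered recursively. */
class SketchComposite extends SketchNode {
    private final SketchTag startTag;
    private final SketchTag endTag;
    private final List<SketchNode> children = new ArrayList<>();

    SketchComposite(SketchTag startTag, SketchTag endTag) {
        this.startTag = startTag;
        this.endTag = endTag;
    }

    void add(SketchNode child) {
        children.add(child);
    }

    @Override
    String toHtml() {
        StringBuilder out = new StringBuilder(startTag.toHtml());
        for (SketchNode child : children) {
            out.append(child.toHtml());
        }
        out.append(endTag.toHtml());
        return out.toString();
    }
}

class VirtualTagSketch {
    /** Tree a parser might build for "<div><b>bold text" (both end tags virtual). */
    static SketchNode sampleTree() {
        SketchComposite bold = new SketchComposite(
                new SketchTag("<b>", 5, 8), new SketchTag("</b>", 17, 17));
        bold.add(new SketchText("bold text"));
        SketchComposite div = new SketchComposite(
                new SketchTag("<div>", 0, 5), new SketchTag("</div>", 17, 17));
        div.add(bold);
        return div;
    }

    public static void main(String[] args) {
        System.out.println(sampleTree().toHtml());
    }
}
```

The only top-level node in this sketch is the div, which spans real source
text, so a filter applied in the outer loop never sees the virtual </b> and
</div> nested inside it. Putting the check in the tag's own toHtml() covers
them.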

Marc Candle wrote:

> Hi,
>
> Thanks for your response. I tried this code below in an attempt to see 
> if it would work given your comment:
>
> StringBuffer finalContents = new StringBuffer();
>
> // Generate final output
> for (NodeIterator e = list.elements(); e.hasMoreNodes(); ) {
>     Node node = e.nextNode();
>
>     if (node.getEndPosition() == node.getStartPosition()) {
>         log.debug(" IGNORED node : " + node.toHtml());
>         continue;
>     }
>
>     if (node instanceof TagNode) {
>         if (((TagNode) node).getTagEnd() == ((TagNode) node).getTagBegin()) {
>             log.debug(" IGNORED node : " + node.toHtml());
>             continue;
>         }
>     }
>
>     finalContents.append(node.toHtml());
> }
>
> This didn't seem to make any difference. The positions of the virtual
> tags must have been corrected at an earlier stage in htmlparser. I have
> started looking at the htmlparser source to see where this occurs.
>
> Kind Regards,
>
> Mark
>
> -----Original Message-----
> From: htm...@li... 
> [mailto:htm...@li...] On Behalf Of 
> Derrick Oswald
> Sent: 23 January 2006 12:37
> To: htm...@li...
> Subject: Re: [Htmlparser-user] Parsing malformed HTML whilst still 
> leaving it intact
>
> This has been a requested task for two years now:
>
> http://sourceforge.net/pm/task.php?group_project_id=21601&group_id=24399&func=browse
>
> The virtual tags that are added have the start position the same as the
> end position, so a smarter toHtml() could recognize them that way and
> avoid outputting them.
>
> Marc Candle wrote:
>
>>Hi,
>>
>>I'm parsing snippets of HTML pages at a time, making some changes and then
>>outputting back to HTML. The problem with HTML snippets is that they will be
>>malformed since some closing tags, for example, will be missing.
>>
>>The Parser seems to automatically correct the malformed HTML by adding
>>closing tags. Is it possible to prevent it from doing so? Or can it at least
>>notify me when it does so, so that I can simply delete them before
>>reconstructing the modified HTML output?
>>
>>An alternative would be to use the Lexer, but then I lose all the
>>hierarchical features of the Parser, which is not an option.
>>
>>This is similar to the general problem brought up in
>>http://sourceforge.net/mailarchive/message.php?msg_id=12635550 .
>>
>>Kind Regards
>>
>>Mark