RE: [Htmlparser-user] Parsing malformed HTML whilst still leaving it intact

Hi,

Thanks for your response. Following your comment, I tried the code below to
see whether it would work:

StringBuffer finalContents = new StringBuffer();

// Generate final output
for (NodeIterator e = list.elements(); e.hasMoreNodes(); ) {
    Node node = e.nextNode();

    // Skip nodes whose start and end positions coincide
    if (node.getEndPosition() == node.getStartPosition()) {
        log.debug(" IGNORED node : " + node.toHtml());
        continue;
    }

    // Apply the same check to the tag-specific positions
    if (node instanceof TagNode) {
        TagNode tag = (TagNode) node;
        if (tag.getTagEnd() == tag.getTagBegin()) {
            log.debug(" IGNORED node : " + node.toHtml());
            continue;
        }
    }

    finalContents.append(node.toHtml());
}

This didn't seem to make any difference. The positions of the virtual tags
must've been corrected at an earlier stage in htmlparser. I have started
looking at the htmlparser source to see where this occurs.

Kind Regards,

Mark

-----Original Message-----
From: htm...@li...
[mailto:htm...@li...] On Behalf Of Derrick Oswald
Sent: 23 January 2006 12:37
To: htm...@li...
Subject: Re: [Htmlparser-user] Parsing malformed HTML whilst still leaving it intact

This has been a requested task for two years now:

http://sourceforge.net/pm/task.php?group_project_id=21601&group_id=24399&func=browse

The virtual tags that are added have the start position the same as the
end position, so a smarter toHtml() could recognize them that way and
avoid outputting them.
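
As a rough illustration of that idea, here is a minimal, self-contained sketch
of a serializer that skips zero-width nodes. SketchNode and VerbatimWriter are
hypothetical stand-ins invented for this example; they are not htmlparser
classes. The sketch assumes that a virtual closing tag hangs off its parent
element rather than appearing as its own entry in the node list, which, if it
holds for the real parser, would mean a loop over top-level nodes never sees
it and the check has to happen while walking the tree.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parsed node; NOT the htmlparser API.
class SketchNode {
    final String text;   // raw source text of this node (opening tag or plain text)
    final int start;     // offset where the node begins in the source
    final int end;       // offset where the node ends in the source
    final List<SketchNode> children = new ArrayList<>();
    SketchNode endTag;   // closing tag, or null for text nodes

    SketchNode(String text, int start, int end) {
        this.text = text;
        this.start = start;
        this.end = end;
    }

    // A node that occupies no source text is treated as parser-inserted.
    boolean isVirtual() {
        return start == end;
    }
}

// Hypothetical helper: rebuilds the markup, leaving out virtual nodes.
class VerbatimWriter {
    static String toVerbatimHtml(SketchNode node) {
        StringBuilder out = new StringBuilder();
        write(node, out);
        return out.toString();
    }

    private static void write(SketchNode node, StringBuilder out) {
        if (node.isVirtual()) {
            return; // the node was never in the source, so emit nothing
        }
        out.append(node.text);
        for (SketchNode child : node.children) {
            write(child, out);
        }
        // Emit the closing tag only if it really appeared in the source.
        if (node.endTag != null && !node.endTag.isVirtual()) {
            out.append(node.endTag.text);
        }
    }
}
```

For the snippet "<div><b>hi", where both closing tags were added by the parser
at offset 10, toVerbatimHtml on the div node gives back "<div><b>hi" unchanged,
while a closing tag that was present in the source is still written out.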

Marc Candle wrote:

>Hi,
>
>I'm parsing snippets of HTML pages at a time, making some changes and then
>outputting back to HTML. The problem with HTML snippets is that they will be
>malformed since some closing tags, for example, will be missing.
>
>The Parser seems to automatically correct the malformed HTML by adding
>closing tags. Is it possible to prevent it from doing so? Or at least, can it
>notify me when it does so, so that before reconstructing the modified HTML
>output I can simply delete them?
>
>An alternative would be to use the Lexer but then I lose all the
>hierarchical features of the Parser, which is not an option.
>
>This is similar to the general problem brought up in
>http://sourceforge.net/mailarchive/message.php?msg_id=12635550 .
>
>Kind Regards
>
>Mark
