RE: [Htmlparser-user] Parsing malformed HTML whilst still leaving it intact
From: Marc C. <mc...@ja...> - 2006-01-24 12:13:53
Hi,

Thanks for your response. I tried the code below to see whether it would work, given your comment:

    StringBuffer finalContents = new StringBuffer();

    // Generate final output
    for (NodeIterator e = list.elements(); e.hasMoreNodes(); ) {
        Node node = e.nextNode();
        // Skip any node whose start and end positions coincide
        if (node.getEndPosition() == node.getStartPosition()) {
            log.debug(" IGNORED node : " + node.toHtml());
            continue;
        }
        // Same check using the tag-specific accessors
        if (node instanceof TagNode) {
            if (((TagNode) node).getTagEnd() == ((TagNode) node).getTagBegin()) {
                log.debug(" IGNORED node : " + node.toHtml());
                continue;
            }
        }
        finalContents.append(node.toHtml());
    }

This didn't seem to make any difference. The positions of the virtual tags must have been corrected at an earlier stage in htmlparser. I have started looking at the htmlparser source to see where this occurs.

Kind Regards,
Mark

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of Derrick Oswald
Sent: 23 January 2006 12:37
To: htm...@li...
Subject: Re: [Htmlparser-user] Parsing malformed HTML whilst still leaving it intact

This has been a requested task for two years now:
http://sourceforge.net/pm/task.php?group_project_id=21601&group_id=24399&func=browse

The virtual tags that are added have the start position the same as the end position, so a smarter toHtml() could recognize them that way and avoid outputting them.

Marc Candle wrote:

>Hi,
>
>I'm parsing snippets of HTML pages at a time, making some changes and then
>outputting back to HTML. The problem with HTML snippets is that they will be
>malformed since some closing tags, for example, will be missing.
>
>The Parser seems to automatically correct the malformed HTML by adding
>closing tags. Is it possible to prevent it from doing so? Or can it at least
>notify me when it does so, so that before reconstructing the modified HTML
>output I can simply delete them?
>
>An alternative would be to use the Lexer, but then I lose all the
>hierarchical features of the Parser, which is not an option.
>
>This is similar to the general problem brought up in
>http://sourceforge.net/mailarchive/message.php?msg_id=12635550 .
>
>Kind Regards
>
>Mark
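
The idea suggested in this thread, skipping tags whose start position equals their end position, can be sketched in isolation. The block below does not use the htmlparser API: the Node class and the isVirtual, render and sample names are hypothetical stand-ins invented for illustration. It also shows one possible reason (an assumption, not something confirmed in the thread) why a check applied only to top-level nodes might have no effect: if a virtual end tag is held by its parent tag, the parent's own output would still include it, so the check would need to be applied recursively.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of position-based filtering of virtual tags.
 * Hypothetical stand-in model, NOT the htmlparser classes.
 */
final class VirtualTagSketch {

    /** A node with source positions, child nodes and an optional end tag. */
    static final class Node {
        final String text;                       // markup or text of this node
        final int start;                         // start offset in the source
        final int end;                           // end offset in the source
        final List<Node> children = new ArrayList<>();
        Node endTag;                             // closing tag, or null if none

        Node(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        /** Assumption from the thread: an inserted tag has zero width. */
        boolean isVirtual() {
            return start == end;
        }
    }

    /** Serializes a tree, omitting zero-width nodes at every depth. */
    static String render(Node root) {
        StringBuilder out = new StringBuilder();
        append(root, out);
        return out.toString();
    }

    private static void append(Node node, StringBuilder out) {
        if (node.isVirtual()) {
            return;                              // skip tags the parser invented
        }
        out.append(node.text);
        for (Node child : node.children) {
            append(child, out);
        }
        if (node.endTag != null) {
            append(node.endTag, out);            // the end tag is checked too
        }
    }

    /** Tree for the snippet <div><p>text</div>, where </p> is missing. */
    static Node sample() {
        Node div = new Node("<div>", 0, 5);
        Node p = new Node("<p>", 5, 8);
        p.children.add(new Node("text", 8, 12));
        p.endTag = new Node("</p>", 12, 12);     // virtual: start == end
        div.children.add(p);
        div.endTag = new Node("</div>", 12, 18); // present in the source
        return div;
    }

    public static void main(String[] args) {
        System.out.println(render(sample()));
    }
}
```

Here the only top-level node is the div, which has real positions, so a loop that tests top-level nodes alone would never see the zero-width </p> nested inside it. Whether htmlparser stores its virtual tags this way would need to be checked against its source.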