I see that View Source in Lobo shows what was read in, not necessarily what resulted from the use of open/write or other DOM functions.
Is there some code I'm not seeing for dumping out the HTML as text? Or would you recommend TrAX?
If you parse the document with Cobra, you can call getOuterHTML() on the elements just below the document to get essentially that result.
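On the TrAX option raised in the question: a minimal sketch, using only the JDK, of serializing a W3C DOM back to HTML text after it has been modified. The class name `DomToHtml` and the helper `toHtml` are made up for illustration, and the stand-in document is built with the JDK's own XML parser rather than Cobra. It is an assumption, not something verified here, that a Cobra-produced document can be passed to `DOMSource` the same way.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomToHtml {
    /** Serializes a W3C DOM node to HTML text with an identity TrAX transform. */
    public static String toHtml(Node node) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "html");              // HTML rules, not XML
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes"); // no <?xml ...?> prolog
        t.setOutputProperty(OutputKeys.INDENT, "no");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in input: well-formed markup parsed by the JDK XML parser.
        // In practice the Document would come from an HTML parser instead.
        String markup = "<html><body><p>original</p></body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));

        // Mutate the DOM, then dump the current state rather than the source text.
        doc.getElementsByTagName("p").item(0).setTextContent("changed");
        System.out.println(toHtml(doc));
    }
}
```

The point of the sketch is that the output reflects the DOM as it stands after modification, which is what View Source does not show.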
Could you offer any comparison between your parser and TagSoup in terms of tolerance of messy-looking inputs?
I haven't used TagSoup, but if you find Cobra is unable to parse any HTML the way a browser would be expected to, I would ask you to post a bug report.
I have to decide whether to move from TagSoup to Cobra. TagSoup is focused on getting XML from sloppy HTML. That works well for me until the point where I might want HTML back instead of XML. The outerHTML approach hadn't occurred to me; I'll try it out at some point.
If you want to get an idea of what kind of DOM Cobra might generate for any particular document, try the Cobra Test Tool or the Parser Test program. It gives you a TreeView representation of the HTML DOM.