From: Jenny B. <sk...@gm...> - 2008-04-17 23:28:33
On Thu, Apr 17, 2008 at 7:05 PM, Jacob Kjome <ho...@vi...> wrote:
> Thanks for reporting back with your solution, and I'm glad you found one! Two
> questions, though...
>
> 1. Why do you use both JTidy and NekoHTML? Normally one would use one or the
> other. Both JTidy and NekoHTML allow you to generate a DOM from HTML. So why not
> choose one and be done?

They tackle different problems. NekoHTML gives me an excellent ability to
manipulate the DOM tree: adding and removing nodes, rewriting attributes,
getting text content out of them, etc. JTidy has poor DOM manipulation because
its implementation is incomplete (for example, getTextContent() throws an
abstract method error) and its API is narrower. But NekoHTML fails on a lot of
the HTML oddities I was encountering, which JTidy deals with just fine and can
clean up automatically.

If I tried to use Neko alone, it failed on a significant portion of my test
documents. If I put JTidy in front of it, Neko always got an input it could
read. I needed to clean first, then make fairly invasive changes to the
contents of the DOM, then save the result.

> 2. Did you find out whether Xalan's Serializer has special handling for XHTML
> -vs- other types of XML documents? Browsers are quirky about how they parse
> XHTML. For instance, some browsers deal better with <script></script> than
> <script/>, especially browsers that don't really understand XHTML at all like IE
> (they treat it as HTML).

I didn't ask. Though I do still have a question there: should I be concerned
about seeing this at the beginning of my output file?

<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">

I suspect there's another setting somewhere I need to flip to make that go away
(and I am guessing that it should go away), but I'm not especially familiar
with XML, XHTML, namespaces, and doctypes. I'm still learning on this piece.

Jenny Brown
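
For reference on the doctype question, here is a minimal sketch of how the DOCTYPE line can be switched on or off through the standard JAXP output properties. It uses only the JDK's built-in Transformer, not Xalan's Serializer class directly, so whether the setup in this thread goes through the same properties is an assumption. The class name DoctypeDemo, the helper methods, and the tiny sample DOM are all hypothetical stand-ins; in the thread the DOM would come from NekoHTML.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DoctypeDemo {
    // Identifiers for XHTML 1.0 Transitional as published by the W3C.
    static final String PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Transitional//EN";
    static final String SYSTEM_ID =
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd";

    // Hypothetical stand-in for the DOM that the real pipeline would produce.
    static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        body.setTextContent("hello");
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    // Serialize the DOM as XML, optionally asking for a DOCTYPE declaration.
    static String serialize(Document doc, boolean withDoctype) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "xml");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        if (withDoctype) {
            // Setting both identifiers requests a DOCTYPE with PUBLIC and system parts.
            t.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, PUBLIC_ID);
            t.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, SYSTEM_ID);
        }
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = sampleDocument();
        System.out.println(serialize(doc, true));   // with DOCTYPE requested
        System.out.println(serialize(doc, false));  // no DOCTYPE properties set
    }
}
```

If the doctype properties are left unset, this sketch emits no DOCTYPE line at all. Whether the line should be dropped or replaced with a matching public and system identifier depends on how the output is meant to be consumed, which the thread leaves open.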