Re: [nekohtml-user] Getting html back out of the DOM after manipulation

On Thu, Apr 17, 2008 at 7:05 PM, Jacob Kjome <ho...@vi...> wrote:
> Thanks for reporting back with your solution, and I'm glad you found one!  Two
>  questions, though...
>
>  1.  Why do you use both JTidy and NekoHTML?  Normally one would use one or the
>  other.  Both JTidy and NekoHTML allow you to generate a DOM from HTML.  So why not
>  choose one and be done?

They tackle different problems.  NekoHTML gives me an excellent ability
to manipulate the DOM tree: adding and removing nodes, rewriting
attributes, getting text content out of nodes, and so on.  JTidy is poor
at DOM manipulation because its DOM implementation is incomplete (for
example, calling getTextContent() throws an AbstractMethodError) and
its API is narrower.  But NekoHTML fails on a lot of the HTML oddities I
was encountering, which JTidy handles just fine and can clean up
automatically.  When I tried to use Neko alone, it failed on a
significant portion of my test documents.  With JTidy in front of it,
Neko always got input it could read.

I needed to clean the input first, then make fairly invasive changes
to the contents of the DOM, and then save the result.
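
A minimal sketch of the edit-then-save step, using only the org.w3c.dom
and javax.xml.transform classes that ship with the JDK.  It assumes the
markup is already well-formed, so the JDK parser stands in here for the
JTidy-then-NekoHTML stage.  The class name DomEditSketch and its helper
methods are hypothetical, made up for illustration; they are not part of
either library.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomEditSketch {

    // Stand-in for the cleaning/parsing stage: only accepts well-formed markup.
    static Document parse(String wellFormedMarkup) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedMarkup)));
    }

    // Rewrite the href of every <a> whose value starts with oldPrefix.
    static void rewriteLinks(Document doc, String oldPrefix, String newPrefix) {
        NodeList links = doc.getElementsByTagName("a");
        for (int i = 0; i < links.getLength(); i++) {
            Element a = (Element) links.item(i);
            String href = a.getAttribute("href");
            if (href.startsWith(oldPrefix)) {
                a.setAttribute("href", newPrefix + href.substring(oldPrefix.length()));
            }
        }
    }

    // Remove every element with the given tag name.  The NodeList is live,
    // so walk it backwards to keep the remaining indices valid.
    static void removeAll(Document doc, String tagName) {
        NodeList nodes = doc.getElementsByTagName(tagName);
        for (int i = nodes.getLength() - 1; i >= 0; i--) {
            Node n = nodes.item(i);
            n.getParentNode().removeChild(n);
        }
    }

    // Serialize the DOM back to a string with the "html" output method.
    static String toHtml(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "html");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

The same rewriteLinks, removeAll, and toHtml calls should work on any
org.w3c.dom.Document, whichever parser produced it.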

>  2.  Did you find out whether Xalan's Serializer has special handling for XHTML
>  -vs- other types of XML documents?  Browsers are quirky about how they parse
>  XHTML.  For instance, some browsers deal better with <script></script> than
>  <script/>, especially browsers that don't really understand XHTML at all like IE
>  (they treat it as HTML).

I didn't ask.  I do still have a question there, though: should I be
concerned about seeing this at the beginning of my output file?
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">

I suspect there's another setting somewhere that I need to flip to make
that go away (and I am guessing that it should go away), but I'm not
especially familiar with XML, XHTML, namespaces, and doctypes.  I'm
still learning this piece.
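
For reference, a small sketch of the serializer settings that bear on
both points, assuming the output is written through the standard JAXP
Transformer bundled with the JDK (which may differ from a standalone
Xalan Serializer).  OutputKeys.METHOD selects "html" or "xml" output,
and OutputKeys.DOCTYPE_PUBLIC controls the public identifier in an
emitted DOCTYPE.  The class SerializerSketch and its methods are
hypothetical names for illustration.  Whether the DOCTYPE in my output
comes from such a property or from a doctype node already in the DOM is
not something this sketch settles.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class SerializerSketch {

    // Build <html><body><script src="app.js"/></body></html> with no doctype node.
    static Document emptyScriptPage() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element script = doc.createElement("script");
        script.setAttribute("src", "app.js");
        body.appendChild(script);
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    // Serialize with the given output method ("html" or "xml").
    // Pass null for doctypePublic to leave the DOCTYPE property unset.
    static String serialize(Document doc, String method, String doctypePublic)
            throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, method);
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        if (doctypePublic != null) {
            t.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, doctypePublic);
        }
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

With the "html" method the empty script element is written with an
explicit end tag, while the "xml" method collapses it to a
self-closing tag.  For a document with no doctype node, leaving
DOCTYPE_PUBLIC unset produces no DOCTYPE line at all.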

Jenny Brown