From: Jenny B. <sk...@gm...> - 2008-04-18 21:38:12
I'll try to sort through my data files to figure out exactly what's breaking Neko, but it may take a while. These are typically big, complex pages (that I didn't write :) ), so it's sometimes hard to track down what broke.

I do have one example that I just traced, though I'm sure there are others. In the following case, Neko doesn't add a parent UL or OL, which makes the result hard to handle in a DOM tree (those LIs have no parent list-grouping tag, and I was expecting one). Sure, this is kind of 'dumb' HTML, but that's what the real world out there gives me.

<div><li class="listNoImage"><a class="fooLink" href="http://blah.blah.com">Blah blah blah blah</a></li></div>
<div><li class="listNoImage"><a class="fooLink" href="http://foo.foo.com/">Foo foo foo foo</a></li></div>

I'll keep my eyes open for other cases. If I get a reasonably traceable list, I'll put in a more formal report.

Jenny Brown

On Thu, Apr 17, 2008 at 10:21 PM, Jacob Kjome <ho...@vi...> wrote:
>
> Jenny Brown wrote:
> > On Thu, Apr 17, 2008 at 7:05 PM, Jacob Kjome <ho...@vi...> wrote:
> >> Thanks for reporting back with your solution, and I'm glad you found one! Two
> >> questions, though...
> >>
> >> 1. Why do you use both JTidy and NekoHTML? Normally one would use one or the
> >> other. Both JTidy and NekoHTML allow you to generate a DOM from HTML. So why not
> >> choose one and be done?
> >
> > They tackle different problems. NekoHTML gives me excellent ability
> > to manipulate the DOM tree, adding and removing nodes, rewriting
> > attributes, getting text content out of them, etc. JTidy has poor DOM
> > manipulation due to incomplete implementation (such as
> > getTextContent() throws an abstract method error) and a narrower API.
> > But NekoHTML fails on a lot of html oddities I was encountering, which
> > JTidy deals with just fine and can clean up automatically. If I tried
> > to use Neko alone, it failed on a significant portion of my test
> > documents. If I put JTidy in front of it, Neko always got an input it
> > could read.
> >
> > I needed clean first, and then I needed to do fairly invasive changes
> > to the contents of the dom, then save the result.
> >
>
> I would encourage you to post a bug report and attach sample HTML files that
> NekoHTML fails to parse properly. The whole point of NekoHTML is to parse HTML of
> any kind, clean or messy. If it can't parse some HTML, then it should be enhanced
> to do so. You shouldn't need two tools.
>
> Jake
>
> _______________________________________________
> nekohtml-user mailing list
> nek...@li...
> https://lists.sourceforge.net/lists/listinfo/nekohtml-user
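
For anyone hitting the same missing-parent case, one possible stopgap is to post-process the DOM and wrap each LI that has no UL or OL parent. The sketch below is only an illustration under assumptions: the class and method names (OrphanListItemWrapper, wrapOrphanListItems, isNamed, parse) are made up for this example, it uses only the standard org.w3c.dom API, and the JDK's XML parser stands in for the tree a real HTML parser would produce. It does not show or change anything NekoHTML does itself.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of NekoHTML or JTidy.
public class OrphanListItemWrapper {

    // Case-insensitive name check, since parsers differ in the case they report.
    static boolean isNamed(Node node, String name) {
        return node != null
                && node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase(name);
    }

    // Wraps every li whose parent is neither ul nor ol in a new ul.
    // Directly adjacent orphan siblings share one wrapper.
    // Returns the number of ul elements that were created.
    static int wrapOrphanListItems(Document doc) {
        // Snapshot first: getElementsByTagName returns a live list.
        NodeList all = doc.getElementsByTagName("*");
        List<Element> orphans = new ArrayList<>();
        for (int i = 0; i < all.getLength(); i++) {
            Node n = all.item(i);
            Node parent = n.getParentNode();
            if (isNamed(n, "li") && !isNamed(parent, "ul") && !isNamed(parent, "ol")) {
                orphans.add((Element) n);
            }
        }

        // Identity-based set of the wrappers created by this call.
        Set<Node> created =
                Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        for (Element li : orphans) {
            Node prev = li.getPreviousSibling();
            if (prev != null && created.contains(prev)) {
                prev.appendChild(li);            // join the wrapper just before it
            } else {
                Element ul = doc.createElement("ul");
                li.getParentNode().insertBefore(ul, li);
                ul.appendChild(li);              // appendChild moves the node
                created.add(ul);
            }
        }
        return created.size();
    }

    // Stand-in for a real HTML parser: builds a DOM from well-formed markup.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<body>"
                + "<div><li class=\"listNoImage\"><a href=\"http://blah.blah.com\">Blah</a></li></div>"
                + "<div><li class=\"listNoImage\"><a href=\"http://foo.foo.com/\">Foo</a></li></div>"
                + "</body>");
        System.out.println("wrappers added: " + wrapOrphanListItems(doc));
    }
}
```

Whether UL is the right wrapper (rather than OL) can't be known from the markup alone, so treat that choice as an assumption too. Whitespace text nodes between sibling LIs would also split them into separate wrappers in this version.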