From: Jenny B. <sk...@gm...> - 2008-04-18 21:51:37
Blah - just a point to say how confusing tracing this is, that might be a
misidentified one... I think I was looking at the wrong portion of the
html. So I still don't know which area was causing my html results to be
weird. I'll have to set up more limited test cases.

On Fri, Apr 18, 2008 at 4:38 PM, Jenny Brown <sk...@gm...> wrote:
> I'll try to sort out my data files to figure out exactly what's
> breaking Neko, but it may be a bit. These are typically big pages
> (that I didn't write :) ) with a lot of complexity so sometimes it's
> hard to trace down what broke. I do have one example I just traced;
> I'm sure there are others and I'll keep my eyes open.
>
> In the following case, Neko doesn't add a parent UL or OL resulting in
> difficulty handling the result in a dom tree (since those li's have no
> parent list-grouping tag, and I was expecting one). Sure, this is
> kind of 'dumb' html but that's what the real world out there gives me.
>
> <div><li class="listNoImage"><a class="fooLink"
> href="http://blah.blah.com">Blah blah blah blah</a></li></div>
> <div><li class="listNoImage"><a class="fooLink"
> href="http://foo.foo.com/">Foo foo foo foo</a></li></div>
>
> I'll keep my eyes open for other cases. If I get a reasonably
> traceable list I'll put in a more formal report.
>
> Jenny Brown
>
> On Thu, Apr 17, 2008 at 10:21 PM, Jacob Kjome <ho...@vi...> wrote:
> >
> > Jenny Brown wrote:
> > > On Thu, Apr 17, 2008 at 7:05 PM, Jacob Kjome <ho...@vi...> wrote:
> > >> Thanks for reporting back with your solution, and I'm glad you found one! Two
> > >> questions, though...
> > >>
> > >> 1. Why do you use both JTidy and NekoHTML? Normally one would use one or the
> > >> other. Both JTidy and NekoHTML allow you to generate a DOM from HTML. So why not
> > >> choose one and be done?
> > >
> > > They tackle different problems. NekoHTML gives me excellent ability
> > > to manipulate the DOM tree, adding and removing nodes, rewriting
> > > attributes, getting text content out of them, etc. JTidy has poor DOM
> > > manipulation due to incomplete implementation (such as
> > > getTextContent() throws an abstract method error) and a narrower API.
> > > But NekoHTML fails on a lot of html oddities I was encountering, which
> > > JTidy deals with just fine and can clean up automatically. If I tried
> > > to use Neko alone, it failed on a significant portion of my test
> > > documents. If I put JTidy in front of it, Neko always got an input it
> > > could read.
> > >
> > > I needed clean first, and then I needed to do fairly invasive changes
> > > to the contents of the dom, then save the result.
> > >
> >
> > I would encourage you to post a bug report and attach sample HTML files that
> > NekoHTML fails to parse properly. The whole point of NekoHTML is to parse HTML of
> > any kind, clean or messy. If it can't parse some HTML, then it should be enhanced
> > to do so. You shouldn't need two tools.
> >
> > Jake
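
For the orphaned <li> case quoted above, one possible workaround is to patch the DOM after parsing rather than rely on the parser to add the list container. The sketch below uses only the standard org.w3c.dom API. The class and method names (OrphanListItems, wrapOrphans, parse) are made up for illustration and are not part of NekoHTML or JTidy, and the parse helper is just the JDK XML parser standing in for whatever parser actually produced the tree. Tag names are compared case-insensitively because element-name case depends on how the parser is configured.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of NekoHTML or JTidy.
final class OrphanListItems {

    /** Stand-in for the real HTML parser: parses well-formed markup with the JDK XML parser. */
    static Document parse(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    /** True if the node is a UL or OL element, ignoring case. */
    static boolean isListContainer(Node node) {
        if (node == null || node.getNodeType() != Node.ELEMENT_NODE) {
            return false;
        }
        String name = node.getNodeName();
        return name.equalsIgnoreCase("ul") || name.equalsIgnoreCase("ol");
    }

    /**
     * Wraps every LI whose parent is not a UL/OL in a new UL of its own.
     * Returns the number of LI elements that were wrapped.
     */
    static int wrapOrphans(Document doc) {
        // getElementsByTagName returns a live list, so collect the orphans
        // first and only then start changing the tree.
        NodeList all = doc.getElementsByTagName("*");
        List<Element> orphans = new ArrayList<>();
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            if (el.getNodeName().equalsIgnoreCase("li")
                    && !isListContainer(el.getParentNode())) {
                orphans.add(el);
            }
        }
        for (Element li : orphans) {
            Element ul = doc.createElement("ul");
            // Put the new UL where the LI was, then move the LI inside it.
            li.getParentNode().replaceChild(ul, li);
            ul.appendChild(li);
        }
        return orphans.size();
    }
}
```

Applied to the two <div> lines from the example, each <li> ends up inside its own <ul> within its <div>. Adjacent orphaned items are not merged into one list; that would need extra sibling handling and is left out to keep the sketch short.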