From: Andy C. <an...@cy...> - 2008-04-18 23:07:11
Quick question: does NekoHTML barf and die on your input documents? or
does it just produce strange output?

Jenny Brown wrote:
> Blah - just a point to say how confusing tracing this is, that might
> be a misidentified one... I think I was looking at the wrong portion
> of the html. So I still don't know which area was causing my html
> results to be weird. I'll have to set up more limited test cases.
>
>
> On Fri, Apr 18, 2008 at 4:38 PM, Jenny Brown <sk...@gm...> wrote:
>> I'll try to sort out my data files to figure out exactly what's
>> breaking Neko, but it may be a bit. These are typically big pages
>> (that I didn't write :) ) with a lot of complexity so sometimes it's
>> hard to trace down what broke. I do have one example I just traced;
>> I'm sure there are others and I'll keep my eyes open.
>>
>> In the following case, Neko doesn't add a parent UL or OL resulting in
>> difficulty handling the result in a dom tree (since those li's have no
>> parent list-grouping tag, and I was expecting one). Sure, this is
>> kind of 'dumb' html but that's what the real world out there gives me.
>>
>> <div><li class="listNoImage"><a class="fooLink"
>> href="http://blah.blah.com">Blah blah blah blah</a></li></div>
>> <div><li class="listNoImage"><a class="fooLink"
>> href="http://foo.foo.com/">Foo foo foo foo</a></li></div>
>>
>> I'll keep my eyes open for other cases. If I get a reasonably
>> traceable list I'll put in a more formal report.
>>
>> Jenny Brown
>>
>>
>> On Thu, Apr 17, 2008 at 10:21 PM, Jacob Kjome <ho...@vi...> wrote:
>> >
>> > Jenny Brown wrote:
>> > > On Thu, Apr 17, 2008 at 7:05 PM, Jacob Kjome <ho...@vi...> wrote:
>> > >> Thanks for reporting back with your solution, and I'm glad you found one! Two
>> > >> questions, though...
>> > >>
>> > >> 1. Why do you use both JTidy and NekoHTML? Normally one would use one or the
>> > >> other. Both JTidy and NekoHTML allow you to generate a DOM from HTML. So why not
>> > >> choose one and be done?
>> > >
>> > > They tackle different problems. NekoHTML gives me excellent ability
>> > > to manipulate the DOM tree, adding and removing nodes, rewriting
>> > > attributes, getting text content out of them, etc. JTidy has poor DOM
>> > > manipulation due to incomplete implementation (such as
>> > > getTextContent() throws an abstract method error) and a narrower API.
>> > > But NekoHTML fails on a lot of html oddities I was encountering, which
>> > > JTidy deals with just fine and can clean up automatically. If I tried
>> > > to use Neko alone, it failed on a significant portion of my test
>> > > documents. If I put JTidy in front of it, Neko always got an input it
>> > > could read.
>> > >
>> > > I needed clean first, and then I needed to do fairly invasive changes
>> > > to the contents of the dom, then save the result.
>> > >
>> >
>> > I would encourage you to post a bug report and attach sample HTML files that
>> > NekoHTML fails to parse properly. The whole point of NekoHTML is to parse HTML of
>> > any kind, clean or messy. If it can't parse some HTML, then it should be enhanced
>> > to do so. You shouldn't need two tools.
>> >
>> > Jake
>> >
>> > -------------------------------------------------------------------------
>> > This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
>> > Don't miss this year's exciting event. There's still time to save $100.
>> > Use priority code J8TL2D2.
>> > http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
>> > _______________________________________________
>> > nekohtml-user mailing list
>> > nek...@li...
>> > https://lists.sourceforge.net/lists/listinfo/nekohtml-user
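
For the orphaned-li case quoted above, one possible workaround is to repair the tree after parsing rather than pre-cleaning the input. The following is only a minimal sketch, assuming the parser hands back a standard org.w3c.dom.Document. The names OrphanListItems, parse and wrapOrphanListItems are hypothetical, and the JDK's own XML parser stands in for NekoHTML purely so the example is self-contained; none of this is NekoHTML API, and nobody in the thread proposed it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class OrphanListItems {

    /** Stand-in for the HTML parser: builds a DOM from a well-formed fragment. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /**
     * Wraps every li whose parent is neither ul nor ol in a new ul of its own.
     * Returns the number of li elements that were wrapped.
     */
    static int wrapOrphanListItems(Document doc) {
        // getElementsByTagName returns a live list, so collect the orphans
        // first and only modify the tree afterwards.
        NodeList all = doc.getElementsByTagName("*");
        List<Element> orphans = new ArrayList<>();
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (!e.getTagName().equalsIgnoreCase("li")) {
                continue;
            }
            Node parent = e.getParentNode();
            String parentName = parent == null ? "" : parent.getNodeName();
            // Compare case-insensitively, since an HTML DOM may report
            // element names in upper case.
            if (!parentName.equalsIgnoreCase("ul") && !parentName.equalsIgnoreCase("ol")) {
                orphans.add(e);
            }
        }
        for (Element li : orphans) {
            Element ul = doc.createElement("ul");
            li.getParentNode().replaceChild(ul, li); // put the ul where the li was
            ul.appendChild(li);                      // then move the li inside it
        }
        return orphans.size();
    }

    public static void main(String[] args) throws Exception {
        // Same shape as the quoted sample: each li sits directly inside a div.
        Document doc = parse(
                "<body>"
                + "<div><li class='listNoImage'><a class='fooLink' "
                + "href='http://blah.blah.com'>Blah blah blah blah</a></li></div>"
                + "<div><li class='listNoImage'><a class='fooLink' "
                + "href='http://foo.foo.com/'>Foo foo foo foo</a></li></div>"
                + "</body>");
        System.out.println("wrapped: " + wrapOrphanListItems(doc));
    }
}
```

Note the simplification: each orphaned li gets its own ul, which matches the sample (one li per div) but would not merge adjacent orphaned siblings into a single list. The case of the created element name ("ul") is also an assumption and may need adjusting to whatever the parser's DOM uses.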