From: Jenny B. <sk...@gm...> - 2008-04-18 23:09:42
Strange output is what I'm struggling with but I am having a heck of a
time tracing exactly where the differences are occurring, because the
code (of mine) that comes immediately after it is quite complex.
Careful use of test cases is getting me closer to understanding but
I'm not there yet; something subtle is tripping up my code somewhere.

On Fri, Apr 18, 2008 at 6:07 PM, Andy Clark <an...@cy...> wrote:
> Quick question: does NekoHTML barf and die on your
> input documents? or does it just produce strange
> output?
>
> Jenny Brown wrote:
> > Blah - just a point to say how confusing tracing this is, that might
> > be a misidentified one... I think I was looking at the wrong portion
> > of the html. So I still don't know which area was causing my html
> > results to be weird. I'll have to set up more limited test cases.
> >
> > On Fri, Apr 18, 2008 at 4:38 PM, Jenny Brown <sk...@gm...> wrote:
> > > I'll try to sort out my data files to figure out exactly what's
> > > breaking Neko, but it may be a bit. These are typically big pages
> > > (that I didn't write :) ) with a lot of complexity so sometimes it's
> > > hard to trace down what broke. I do have one example I just traced;
> > > I'm sure there are others and I'll keep my eyes open.
> > >
> > > In the following case, Neko doesn't add a parent UL or OL resulting in
> > > difficulty handling the result in a dom tree (since those li's have no
> > > parent list-grouping tag, and I was expecting one). Sure, this is
> > > kind of 'dumb' html but that's what the real world out there gives me.
> > >
> > > <div><li class="listNoImage"><a class="fooLink"
> > > href="http://blah.blah.com">Blah blah blah blah</a></li></div>
> > > <div><li class="listNoImage"><a class="fooLink"
> > > href="http://foo.foo.com/">Foo foo foo foo</a></li></div>
> > >
> > > I'll keep my eyes open for other cases. If I get a reasonably
> > > traceable list I'll put in a more formal report.
> > >
> > > Jenny Brown
> > >
> > > On Thu, Apr 17, 2008 at 10:21 PM, Jacob Kjome <ho...@vi...> wrote:
> > > > Jenny Brown wrote:
> > > > > On Thu, Apr 17, 2008 at 7:05 PM, Jacob Kjome <ho...@vi...> wrote:
> > > > >> Thanks for reporting back with your solution, and I'm glad you
> > > > >> found one! Two questions, though...
> > > > >>
> > > > >> 1. Why do you use both JTidy and NekoHTML? Normally one would
> > > > >> use one or the other. Both JTidy and NekoHTML allow you to
> > > > >> generate a DOM from HTML. So why not choose one and be done?
> > > > >
> > > > > They tackle different problems. NekoHTML gives me excellent ability
> > > > > to manipulate the DOM tree, adding and removing nodes, rewriting
> > > > > attributes, getting text content out of them, etc. JTidy has poor DOM
> > > > > manipulation due to incomplete implementation (such as
> > > > > getTextContent() throws an abstract method error) and a narrower API.
> > > > > But NekoHTML fails on a lot of html oddities I was encountering, which
> > > > > JTidy deals with just fine and can clean up automatically. If I tried
> > > > > to use Neko alone, it failed on a significant portion of my test
> > > > > documents. If I put JTidy in front of it, Neko always got an input it
> > > > > could read.
> > > > >
> > > > > I needed clean first, and then I needed to do fairly invasive changes
> > > > > to the contents of the dom, then save the result.
> > > >
> > > > I would encourage you to post a bug report and attach sample HTML
> > > > files that NekoHTML fails to parse properly. The whole point of
> > > > NekoHTML is to parse HTML of any kind, clean or messy. If it can't
> > > > parse some HTML, then it should be enhanced to do so. You shouldn't
> > > > need two tools.
> > > >
> > > > Jake
> > > >
> > > > -------------------------------------------------------------------------
> > > > This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
> > > > Don't miss this year's exciting event. There's still time to save $100.
> > > > Use priority code J8TL2D2.
> > > > http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
> > > > _______________________________________________
> > > > nekohtml-user mailing list
> > > > nek...@li...
> > > > https://lists.sourceforge.net/lists/listinfo/nekohtml-user
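For the <li>-inside-<div> case quoted in this thread, one possible stopgap is to patch the DOM after parsing rather than wait on a parser fix. The following is only a rough sketch, not anything NekoHTML or JTidy provides: the class name OrphanLiWrapper and its methods are made up for illustration, and the JDK's own XML parser stands in for the NekoHTML-built DOM (which works here only because the sample snippet happens to be well-formed once wrapped in a root element). It assumes the parser hands back a standard org.w3c.dom.Document, and it compares tag names case-insensitively since an HTML parser may report them in upper case.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: wraps every <li> that lacks a <ul>/<ol> parent in a new <ul>.
class OrphanLiWrapper {

    // Stand-in for the real HTML parser: parses well-formed markup with the JDK parser.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // True if the node is a <ul> or <ol> element (case-insensitive).
    static boolean isListContainer(Node n) {
        if (n == null || n.getNodeType() != Node.ELEMENT_NODE) {
            return false;
        }
        String name = n.getNodeName();
        return name.equalsIgnoreCase("ul") || name.equalsIgnoreCase("ol");
    }

    // Wraps each orphaned <li> in its own <ul>; returns how many were wrapped.
    static int wrapOrphanListItems(Document doc) {
        // Collect first: the NodeList is live, so the tree must not change while scanning.
        List<Element> orphans = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.getNodeName().equalsIgnoreCase("li")
                    && !isListContainer(e.getParentNode())) {
                orphans.add(e);
            }
        }
        for (Element li : orphans) {
            Element ul = doc.createElement("ul");
            li.getParentNode().replaceChild(ul, li); // put <ul> where the <li> was
            ul.appendChild(li);                      // then move the <li> inside it
        }
        return orphans.size();
    }

    public static void main(String[] args) throws Exception {
        String html = "<body>"
                + "<div><li class=\"listNoImage\"><a class=\"fooLink\" "
                + "href=\"http://blah.blah.com\">Blah blah blah blah</a></li></div>"
                + "<div><li class=\"listNoImage\"><a class=\"fooLink\" "
                + "href=\"http://foo.foo.com/\">Foo foo foo foo</a></li></div>"
                + "</body>";
        Document doc = parse(html);
        System.out.println("wrapped " + wrapOrphanListItems(doc) + " orphaned li element(s)");
    }
}
```

Each orphan gets its own <ul> here, which suits the sample because every <li> sits alone in its <div>. Runs of adjacent orphaned siblings would probably be better grouped under a single <ul>, and that is left out of this sketch.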