From: Stephen G. W. <sg...@no...> - 2008-07-29 00:05:39

I'll even include the code I'm testing with.

    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathExpression;
    import javax.xml.xpath.XPathFactory;
    import org.cyberneko.html.parsers.DOMParser;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class Test4 {
        public static void main(String[] args) {
            try {
                XPathFactory xpFactory = XPathFactory.newInstance();
                XPath xpath = xpFactory.newXPath();
                String expression = "//title";
                XPathExpression xpathExpression = xpath.compile(expression);

                DOMParser parser = new DOMParser();
                parser.setFeature("http://xml.org/sax/features/namespaces", false);
                parser.parse("./test2.html");
                Document doc = parser.getDocument();

                Object result = xpathExpression.evaluate(doc, XPathConstants.NODESET);
                NodeList nodes = (NodeList) result;
                for (int i = 0; i < nodes.getLength(); i++) {
                    System.out.println(nodes.item(i).getNodeValue());
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Test HTML:

    <html>
    <head>
    <title>Test Page</title>
    </head>
    </body>
    <p>Foo</p>
    </body>
    </html>

This returns an empty node set regardless of what I use for the expression. I was originally using more complex HTML but figured I'd simplify things until I got something working. I've also tried the XPath method used in the sample application ApplyXPathDOM in the Xalan package, but the compiled-expression method is better suited to my application.

Thanks

-----------------------------------------------------------
- stephen.g.walizer - http://node777.net - sg...@no...
-----------------------------------------------------------

On Jul 28, 2008, at 8:15 PM, Jacob Kjome wrote:

> It would help if you provide an example document, an XPath
> expression, and the node you expect it to match.
>
> Jake

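Editorial note: two things in the code above are worth checking independently of NekoHTML itself. First, XPath 1.0 name tests are case-sensitive, and NekoHTML reports element names in upper case by default (configurable through its "http://cyberneko.org/html/properties/names/elems" property, e.g. set to "lower"). Second, getNodeValue() on an element node always returns null; getTextContent() is usually what is wanted. A minimal sketch, using only the JDK's XML parser as a stand-in for the NekoHTML-produced DOM, shows the case-sensitivity effect:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CaseDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for a DOM whose element names were upper-cased by the parser.
        String domAsXml = "<HTML><HEAD><TITLE>Test Page</TITLE></HEAD><BODY/></HTML>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(domAsXml)));

        XPathFactory xpf = XPathFactory.newInstance();
        // Lower-case name test: matches nothing against upper-case element names.
        NodeList lower = (NodeList) xpf.newXPath().compile("//title")
                .evaluate(doc, XPathConstants.NODESET);
        // Upper-case name test: matches, and getTextContent() yields the text.
        NodeList upper = (NodeList) xpf.newXPath().compile("//TITLE")
                .evaluate(doc, XPathConstants.NODESET);

        System.out.println(lower.getLength() + " " + upper.getLength()
                + " " + upper.item(0).getTextContent());
    }
}
```

Running this prints "0 1 Test Page": the lower-case expression finds nothing while the upper-case one finds the element.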
From: Jacob K. <ho...@vi...> - 2008-07-28 23:13:49

It would help if you provide an example document, an XPath expression, and the node you expect it to match.

Jake

Stephen G. Walizer wrote:
> Is there some incompatibility between NekoHTML and XPath as
> implemented by Xalan? I have tried several different methods of
> getting XPath expressions to work on NekoHTML-produced documents and
> am having no luck. I can traverse the generated DOM tree, but XPath
> expressions never produce any results.
>
> I have tried using both a compiled XPathExpression as well as an
> XPathEvaluator, with no luck.
>
> I am using NekoHTML 1.9.8 and Xalan-J 2.7.1.

From: Stephen G. W. <sg...@no...> - 2008-07-28 20:46:48

Is there some incompatibility between NekoHTML and XPath as implemented by Xalan? I have tried several different methods of getting XPath expressions to work on NekoHTML-produced documents and am having no luck. I can traverse the generated DOM tree, but XPath expressions never produce any results.

I have tried using both a compiled XPathExpression as well as an XPathEvaluator, with no luck.

I am using NekoHTML 1.9.8 and Xalan-J 2.7.1.

Thank you,

-----------------------------------------------------------
- stephen.g.walizer - http://node777.net - sg...@no...
-----------------------------------------------------------

From: Marc G. <mgu...@ya...> - 2008-07-22 15:50:31

Hi all,

Release 1.9.8 of NekoHTML is now available: http://nekohtml.sourceforge.net

This release contains a number of bug fixes and many improvements, particularly in the handling of malformed HTML. Thanks to everyone who contributed to this release.

The Maven bundle has been uploaded to the NekoHTML repository and should become available in the main repository within a few hours.

Enjoy!

Marc.

--
Blog: http://mguillem.wordpress.com

From: Verena M. <ver...@ti...> - 2008-07-19 11:56:06

Hello,

I'd like to extract a subtree out of the HTML code and get the code of the subtree (as a String, if possible). For example, given HTML like this:

    <html>
    <body>
    <div class="rightPattern">get <i>m</i>e</div>
    <div class="head1 title">Title</div>
    <div class="content">Many text.</div>
    </body>
    </html>

I want to split it into this part:

    <html>
    <body>
    <div class="head1 title">Title</div>
    <div class="content">Many text.</div>
    </body>
    </html>

and this:

    <div class="rightPattern">get <i>m</i>e</div>

Is there a method which can do this for me? Thank you.

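Editorial note: one common approach, sketched here with only JAXP classes from the JDK (plain XML parsing stands in for the HTML parser's DOM output; the class name SubtreeExtract and the sample markup are illustrative): locate the node with XPath, serialize just that node with an identity Transformer, and detach it from the document with removeChild().

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class SubtreeExtract {
    // Serialize a single DOM node (and its subtree) to a String.
    static String serialize(Node node) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body>"
                + "<div class=\"rightPattern\">get <i>m</i>e</div>"
                + "<div class=\"content\">Many text.</div>"
                + "</body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(html)));

        // Find the subtree to pull out, then detach it from the document.
        Node target = (Node) XPathFactory.newInstance().newXPath()
                .compile("//div[@class='rightPattern']")
                .evaluate(doc, XPathConstants.NODE);
        target.getParentNode().removeChild(target);

        System.out.println(serialize(target)); // the extracted part
        System.out.println(serialize(doc));    // the rest of the document
    }
}
```

The first println yields the rightPattern div on its own; the second yields the original document without it, i.e. exactly the two-way split asked for above.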
From: Arshan D. <ars...@gm...> - 2008-07-05 03:42:56

These are fairly serious issues and it's been a month since reporting them - just looking for some kind of acknowledgement that they're in progress.

Thanks,
Arshan

From: Justin N. <jus...@gm...> - 2008-06-03 00:10:25

I'm new to Neko, so this may be a silly question. I'm trying to use Neko in combination with XPath and I'm running into a few problems. Specifically, I'm trying to upgrade from an unknown, very old version of NekoHTML to 1.9.7. My code worked fine on the older version, but fails on 1.9.7. I would expect the following code to be able to find the tr tag as it does in the older version, but with 1.9.7 it cannot find the tag. I'm using Xalan-Java 2.7.1 and Xerces-Java 2.9.0. Am I doing something wrong?

    public static void main(String[] args) {
        try {
            String html = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
                + "<title>Test</title></head><body><table>"
                + "<TR class=\"rowData\"></TR></table></body></html>";
            DOMParser parser = new DOMParser();
            parser.parse(new InputSource(new StringReader(html)));
            Node root = parser.getDocument();
            CachedXPathAPI xpathAPI = new CachedXPathAPI();
            NodeList hotelList = xpathAPI.selectNodeList(root, "//TR[@class='rowData']");
            System.out.println("Size=" + hotelList.getLength());
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }

--
************************************
Justin Nixon
jus...@gm...
************************************

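Editorial note: the xmlns="http://www.w3.org/1999/xhtml" declaration in that markup is a likely culprit. When the parse is namespace-aware, the elements end up in the XHTML namespace, and in XPath 1.0 an unprefixed name test always means "no namespace". A JDK-only sketch (no NekoHTML involved; the namespace-aware XML parse below stands in for the HTML parse) shows the effect and the usual fix of binding a prefix:

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceXPath {
    public static void main(String[] args) throws Exception {
        String html = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<table><TR class=\"rowData\"/></table></body></html>";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // like a namespace-aware HTML parse
        Document doc = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(html)));

        XPathFactory xpf = XPathFactory.newInstance();
        // Unprefixed name: "no namespace" in XPath 1.0, so this finds nothing.
        NodeList plain = (NodeList) xpf.newXPath()
                .compile("//TR[@class='rowData']")
                .evaluate(doc, XPathConstants.NODESET);

        // Bind a prefix to the XHTML namespace and use it in the expression.
        XPath xp = xpf.newXPath();
        xp.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "x".equals(prefix) ? "http://www.w3.org/1999/xhtml"
                                          : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) { return null; }
            public Iterator<String> getPrefixes(String uri) { return null; }
        });
        NodeList prefixed = (NodeList) xp
                .compile("//x:TR[@class='rowData']")
                .evaluate(doc, XPathConstants.NODESET);

        System.out.println(plain.getLength() + " " + prefixed.getLength());
    }
}
```

This prints "0 1". The other route, visible in Stephen's code earlier in this archive, is to disable namespace reporting on the parser via the "http://xml.org/sax/features/namespaces" feature.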
From: Guy V. d. B. <gu...@gm...> - 2008-05-18 23:38:49

Hey,

Just wanted to let you know that I'm using NekoHTML in my HTML diffing library, DaisyDiff: http://code.google.com/p/daisydiff/

Cheers,
Guy

From: Andy C. <an...@cy...> - 2008-04-18 23:25:26

Jenny Brown wrote:
> Strange output is what I'm struggling with but I

Strange output is definitely better than throwing an exception. But it all depends on what kind of output you expect. We've tried to make NekoHTML highly performant while producing an HTML document with a structure as close as possible to what the major browsers produce.

In the <li> example you were mentioning before, NekoHTML does *not* insert a parent <ul>/<ol> for this element because the browsers don't. Compare the DOM generated for the following two files:

    <!-- test1.html -->
    <li>Hello

    <!-- test2.html -->
    <ul>
    <li>Hello
    </ul>

The structure is different even though they both display as a bulleted item. But even the presentation is different, too...

> am having a heck of a time tracing exactly where the
> differences are occurring, because the code (of mine)
> that comes immediately after it is quite complex.
> Careful use of test cases is getting me closer to
> understanding but I'm not there yet; something subtle
> is tripping up my code somewhere.

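Editorial note: a quick way to see exactly where two parses diverge is to dump each tree's element structure and diff the dumps. A small helper sketch using only JAXP (the XML literal below merely stands in for whatever DOM your HTML parser produced; names are illustrative):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TreeDump {
    // Print the element hierarchy with indentation; comparing the dumps of
    // two parses pinpoints where the structures diverge.
    static void dump(Node node, String indent) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            System.out.println(indent + node.getNodeName());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            dump(c, indent + "  ");
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parsed tree, e.g. a <div><li>... structure as
        // discussed above (note: no UL/OL parent around the LI).
        String xml = "<DIV><LI><A>Hello</A></LI></DIV>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        dump(doc.getDocumentElement(), "");
    }
}
```

Run once per parser (or per input variant) and diff the two text dumps; a missing or extra inserted parent element shows up immediately.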
From: Jenny B. <sk...@gm...> - 2008-04-18 23:09:42

Strange output is what I'm struggling with, but I am having a heck of a time tracing exactly where the differences are occurring, because the code (of mine) that comes immediately after it is quite complex. Careful use of test cases is getting me closer to understanding, but I'm not there yet; something subtle is tripping up my code somewhere.

On Fri, Apr 18, 2008 at 6:07 PM, Andy Clark <an...@cy...> wrote:
> Quick question: does NekoHTML barf and die on your
> input documents? or does it just produce strange
> output?

From: Andy C. <an...@cy...> - 2008-04-18 23:07:11

Quick question: does NekoHTML barf and die on your input documents? Or does it just produce strange output?

Jenny Brown wrote:
> Blah - just a point to say how confusing tracing this is, that might
> be a misidentified one... I think I was looking at the wrong portion
> of the html. So I still don't know which area was causing my html
> results to be weird. I'll have to set up more limited test cases.

From: Jenny B. <sk...@gm...> - 2008-04-18 21:51:37

Blah - just a point to say how confusing tracing this is: that might be a misidentified one... I think I was looking at the wrong portion of the html. So I still don't know which area was causing my html results to be weird. I'll have to set up more limited test cases.

On Fri, Apr 18, 2008 at 4:38 PM, Jenny Brown <sk...@gm...> wrote:
> I'll try to sort out my data files to figure out exactly what's
> breaking Neko, but it may be a bit. These are typically big pages
> (that I didn't write :) ) with a lot of complexity so sometimes it's
> hard to trace down what broke. I do have one example I just traced;
> I'm sure there are others and I'll keep my eyes open.
>
> In the following case, Neko doesn't add a parent UL or OL, resulting in
> difficulty handling the result in a dom tree (since those li's have no
> parent list-grouping tag, and I was expecting one). Sure, this is
> kind of 'dumb' html but that's what the real world out there gives me.
>
> <div><li class="listNoImage"><a class="fooLink"
> href="http://blah.blah.com">Blah blah blah blah</a></li></div>
> <div><li class="listNoImage"><a class="fooLink"
> href="http://foo.foo.com/">Foo foo foo foo</a></li></div>
>
> I'll keep my eyes open for other cases. If I get a reasonably
> traceable list I'll put in a more formal report.
>
> Jenny Brown

From: Jenny B. <sk...@gm...> - 2008-04-18 21:38:12

I'll try to sort out my data files to figure out exactly what's breaking Neko, but it may be a bit. These are typically big pages (that I didn't write :) ) with a lot of complexity, so sometimes it's hard to trace down what broke. I do have one example I just traced; I'm sure there are others and I'll keep my eyes open.

In the following case, Neko doesn't add a parent UL or OL, resulting in difficulty handling the result in a dom tree (since those li's have no parent list-grouping tag, and I was expecting one). Sure, this is kind of 'dumb' html, but that's what the real world out there gives me.

    <div><li class="listNoImage"><a class="fooLink" href="http://blah.blah.com">Blah blah blah blah</a></li></div>
    <div><li class="listNoImage"><a class="fooLink" href="http://foo.foo.com/">Foo foo foo foo</a></li></div>

I'll keep my eyes open for other cases. If I get a reasonably traceable list I'll put in a more formal report.

Jenny Brown

On Thu, Apr 17, 2008 at 10:21 PM, Jacob Kjome <ho...@vi...> wrote:
> I would encourage you to post a bug report and attach sample HTML files that
> NekoHTML fails to parse properly. The whole point of NekoHTML is to parse HTML of
> any kind, clean or messy. If it can't parse some HTML, then it should be enhanced
> to do so. You shouldn't need two tools.
>
> Jake

From: Jacob K. <ho...@vi...> - 2008-04-18 02:21:32

Jenny Brown wrote:
> They tackle different problems. NekoHTML gives me excellent ability
> to manipulate the DOM tree, adding and removing nodes, rewriting
> attributes, getting text content out of them, etc. JTidy has poor DOM
> manipulation due to incomplete implementation (such as
> getTextContent() throwing an abstract-method error) and a narrower API.
> But NekoHTML fails on a lot of html oddities I was encountering, which
> JTidy deals with just fine and can clean up automatically. If I tried
> to use Neko alone, it failed on a significant portion of my test
> documents. If I put JTidy in front of it, Neko always got an input it
> could read.
>
> I needed to clean first, and then I needed to do fairly invasive changes
> to the contents of the dom, then save the result.

I would encourage you to post a bug report and attach sample HTML files that NekoHTML fails to parse properly. The whole point of NekoHTML is to parse HTML of any kind, clean or messy. If it can't parse some HTML, then it should be enhanced to do so. You shouldn't need two tools.

Jake

From: Jenny B. <sk...@gm...> - 2008-04-17 23:50:10

Correction - I found where I'm setting this and can simply turn it off. Easy.

On Thu, Apr 17, 2008 at 6:28 PM, Jenny Brown <sk...@gm...> wrote:
> Though I do have a question still remaining there on
> whether I should be concerned with seeing this at the beginning of my
> output file:
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
>
> I suspect there's another setting somewhere I need to flip to make
> that go away (and I am guessing that it should go away), but I'm not
> especially familiar with xml, xhtml, namespaces, and doctypes - still
> learning on this piece.

From: Jenny B. <sk...@gm...> - 2008-04-17 23:28:33

On Thu, Apr 17, 2008 at 7:05 PM, Jacob Kjome <ho...@vi...> wrote:
> Thanks for reporting back with your solution, and I'm glad you found one! Two
> questions, though...
>
> 1. Why do you use both JTidy and NekoHTML? Normally one would use one or the
> other. Both JTidy and NekoHTML allow you to generate a DOM from HTML. So why not
> choose one and be done?

They tackle different problems. NekoHTML gives me excellent ability to manipulate the DOM tree: adding and removing nodes, rewriting attributes, getting text content out of them, etc. JTidy has poor DOM manipulation due to an incomplete implementation (such as getTextContent() throwing an abstract-method error) and a narrower API. But NekoHTML fails on a lot of the html oddities I was encountering, which JTidy deals with just fine and can clean up automatically. If I tried to use Neko alone, it failed on a significant portion of my test documents. If I put JTidy in front of it, Neko always got an input it could read.

I needed to clean first, then do fairly invasive changes to the contents of the dom, then save the result.

> 2. Did you find out whether Xalan's Serializer has special handling for XHTML
> -vs- other types of XML documents? Browsers are quirky about how they parse
> XHTML. For instance, some browsers deal better with <script></script> than
> <script/>, especially browsers that don't really understand XHTML at all, like IE
> (they treat it as HTML).

I didn't ask. Though I do have a question still remaining there on whether I should be concerned with seeing this at the beginning of my output file:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">

I suspect there's another setting somewhere I need to flip to make that go away (and I am guessing that it should go away), but I'm not especially familiar with xml, xhtml, namespaces, and doctypes - still learning on this piece.

Jenny Brown

From: Jacob K. <ho...@vi...> - 2008-04-17 23:05:13
|
Thanks for reporting back with your solution, and I'm glad you found one! Two questions, though... 1. Why do you use both JTidy and NekoHTML? Normally one would use one or the other. Both JTidy and NekoHTML allow you to generate a DOM from HTML. So why not choose one and be done? 2. Did you find out whether Xalan's Serializer has special handling for XHTML -vs- other types of XML documents? Browsers are quirky about how they parse XHTML. For instance, some browsers deal better with <script></script> than <script/>, especially browsers that don't really understand XHTML at all like IE (they treat it as HTML). Jake Jenny Brown wrote: > I'm reporting back in on the final solution to this, so it's in the > archives if someone hits a similar issue in the future. The Xalan > people helped me out, by observing there was a namespace being set on > the <html> element, like so: > <HTML xmlns="http://www.w3.org/1999/xhtml" lang="en"> > > That was forcing the Transformer to do an xml output, hence the xml > style empty tags I was getting. The solution was to figure out where > that namespace was coming from. My steps were: > 1. Messy HTML is a string in memory > 2. Send messy HTML through JTidy parsing > 3. Call JTidy pprint (pretty-print) to get it back as a String > (validated and cleaned up) > (Observed that after this point the namespace was visible) > 4. Send validated HTML to NekoHTML (misled Neko into using the namespace) > 5. Neko's DOM got sent to the Transformer for output, and the > transformer responded to the namespace > > So I needed to tell JTidy to use a different config... I had been wrongly using > tidy.setXHTML(true); > due to a misunderstanding of its requirements. I changed it to this > and the problem namespace cleared up: > tidy.setXHTML(false); > tidy.setXmlOut(false); > > That meant Neko got handed clean html with no namespace in it, and > thus output was in html not xml. 
Incidentally, the way I could > observe the namespace from Java code was either to print the html > string, or to call this to read it out of the HTML node of the dom: > > System.out.println("NAMESPACE: " + > documentRoot.getFirstChild().getNamespaceURI()); > > Hope that helps someone else someday. :) Thanks for your help here > too. My code works now. > > Jenny Brown > > > On Wed, Apr 16, 2008 at 8:42 PM, Jacob Kjome <ho...@vi...> wrote: >> Jenny Brown wrote: >> > Also, if it's any help -- I'm running my code under Eclipse 3.3.1.1 >> > and from a JUnit test case. I don't know if there's any chance of a >> > jar conflict within Eclipse itself. >> >> I would recommend running this in a clean environment. I've seen lots of cases >> where people say it doesn't work when running under their IDE and it usually ends >> up being the IDE's fault. It seems like it's probably a classpath issue. You >> might even try putting Xerces, Xalan, and Serializer jars into >> JAVA_HOME/jre/lib/endorsed, just to make sure you are actually using Apache's >> Xalan and not the old buggy one included in the JDK. Beyond that, I really don't >> have any other suggestions. >> >> For further help, you should probably ping the Xalan-user list, as they are the >> experts on the Transformer and Serializer stuff. >> >> >> Jake >> >> >> >> > >> > >> > On Wed, Apr 16, 2008 at 11:26 AM, Jenny Brown <sk...@gm...> wrote: >> >> On Tue, Apr 15, 2008 at 11:49 PM, Jacob Kjome <ho...@vi...> wrote: >> >> > Yeah, if you have "html" as the output type, then it should use the HTML >> >> > serializer. Although, why do you specify "no" to OutputKeys.OMIT_XML_DECLARATION? >> >> > You don't want that for HTML and, actually, not even for XHTML. Browsers >> >> > don't handle HTML/XHTML documents with the XML declaration very well. >> >> >> >> > Also, I don't recommend using the StringWriter for output in a servlet. I would >> >> > think you'd want to pass in the ServletOutputStream into the StreamResult.
>> >> >> >> Ok I just fixed the xml declaration thing. >> >> >> >> I'm using StringWriter because this is actually used in a batch mode >> >> (not servlet at all) and it's in the middle of the pipeline of modules >> >> handling the data. I know for sure that the incoming data is a String >> >> in UTF-8 and that the next item in line will also want it as a String >> >> in UTF-8. (And I'm making sure the meta tag charset also says UTF-8 >> >> even if I make it so myself.) Eventually in the long term a browser >> >> may see the html that results, but not immediately; many other things >> >> happen to the data first. I need the html in memory for a while yet >> >> after, so, String. >> >> >> >> >> >> >> >> > > I'm using Xalan 2.7.1 and Xerces 2.9.1, and this is a small enough >> >> > > code base I'm pretty sure there are no jar conflicts sneaking in old >> >> > > versions. Rather I suspect I'm misunderstanding something about the >> >> > > serialization process or xml / html specifications. >> >> > > >> >> > >> >> > Are you sure you have serializer.jar from the Xalan-2.7.1 distribution in the >> >> > classpath? >> >> >> >> I just double checked that this morning, and I'm seeing the same >> >> behavior (XML style serialization) after specifically putting Xalan's >> >> copies of everything in place. >> >> >> >> Any more ideas? Are there get methods or debug info that I can use >> >> with the Transformer to find out what it thinks it's using / supposed >> >> to be using, so I can see if something specific is going wrong? >> >> >> >> Thank you. >> >> >> >> >> >> Jenny Brown >> >> >> > >> >> >>> ------------------------------------------------------------------------- >> > This SF.net email is sponsored by the 2008 JavaOne(SM) Conference >> > Don't miss this year's exciting event. There's still time to save $100. >> > Use priority code J8TL2D2. 
>> > http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone >> > _______________________________________________ >> > nekohtml-user mailing list >> > nek...@li... >> > https://lists.sourceforge.net/lists/listinfo/nekohtml-user |
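The <script></script> versus <script/> behavior Jake mentions is the same namespace effect diagnosed in this thread: the html output method applies its HTML tag rules only to elements in no namespace, and elements bound to the XHTML namespace generally fall back to XML-style serialization. A minimal JAXP-only sketch (illustrative names and markup, not code from the thread) showing the difference:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NamespaceEffectDemo {

    // Build <div><strong></strong></div> with the elements either in no
    // namespace or in the given namespace, then serialize with method="html".
    static String serialize(String ns) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = (ns == null) ? doc.createElement("div") : doc.createElementNS(ns, "div");
        Element strong = (ns == null) ? doc.createElement("strong") : doc.createElementNS(ns, "strong");
        root.appendChild(strong);
        doc.appendChild(root);
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "html");
        t.setOutputProperty(OutputKeys.INDENT, "no");
        StringWriter sw = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(sw));
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        // No namespace: HTML rules apply, so the empty <strong> keeps its end tag.
        System.out.println(serialize(null));
        // XHTML namespace: an xmlns declaration appears and the element is
        // typically serialized XML-style, the symptom seen in this thread.
        System.out.println(serialize("http://www.w3.org/1999/xhtml"));
    }
}
```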
From: Jenny B. <sk...@gm...> - 2008-04-17 22:40:16
|
I'm reporting back in on the final solution to this, so it's in the archives if someone hits a similar issue in the future. The Xalan people helped me out, by observing there was a namespace being set on the <html> element, like so: <HTML xmlns="http://www.w3.org/1999/xhtml" lang="en"> That was forcing the Transformer to do an xml output, hence the xml style empty tags I was getting. The solution was to figure out where that namespace was coming from. My steps were: 1. Messy HTML is a string in memory 2. Send messy HTML through JTidy parsing 3. Call JTidy pprint (pretty-print) to get it back as a String (validated and cleaned up) (Observed that after this point the namespace was visible) 4. Send validated HTML to NekoHTML (misled Neko into using the namespace) 5. Neko's DOM got sent to the Transformer for output, and the transformer responded to the namespace So I needed to tell JTidy to use a different config... I had been wrongly using tidy.setXHTML(true); due to a misunderstanding of its requirements. I changed it to this and the problem namespace cleared up: tidy.setXHTML(false); tidy.setXmlOut(false); That meant Neko got handed clean html with no namespace in it, and thus output was in html not xml. Incidentally, the way I could observe the namespace from Java code was either to print the html string, or to call this to read it out of the HTML node of the dom: System.out.println("NAMESPACE: " + documentRoot.getFirstChild().getNamespaceURI()); Hope that helps someone else someday. :) Thanks for your help here too. My code works now. Jenny Brown On Wed, Apr 16, 2008 at 8:42 PM, Jacob Kjome <ho...@vi...> wrote: > Jenny Brown wrote: > > Also, if it's any help -- I'm running my code under Eclipse 3.3.1.1 > > and from a JUnit test case. I don't know if there's any chance of a > > jar conflict within Eclipse itself. > > I would recommend running this in a clean environment. 
I've seen lots of cases > where people say it doesn't work when running under their IDE and it usually ends > up being the IDE's fault. It seems like it's probably a classpath issue. You > might even try putting Xerces, Xalan, and Serializer jars into > JAVA_HOME/jre/lib/endorsed, just to make sure you are actually using Apache's > Xalan and not the old buggy one included in the JDK. Beyond that, I really don't > have any other suggestions. > > For further help, you should probably ping the Xalan-user list, as they are the > experts on the Transformer and Serializer stuff. > > > Jake > > > > > > > > > On Wed, Apr 16, 2008 at 11:26 AM, Jenny Brown <sk...@gm...> wrote: > >> On Tue, Apr 15, 2008 at 11:49 PM, Jacob Kjome <ho...@vi...> wrote: > >> > Yeah, if you have "html" as the output type, then it should use the HTML > >> > serializer. Although, why do you specify "no" to OutputKeys.OMIT_XML_DECLARATION? > >> > You don't want that for HTML and, actually, not even for XHTML. Browsers > >> > don't handle HTML/XHTML documents with the XML declaration very well. > >> > >> > Also, I don't recommend using the StringWriter for output in a servlet. I would > >> > think you'd want to pass in the ServletOutputStream into the StreamResult. > >> > >> Ok I just fixed the xml declaration thing. > >> > >> I'm using StringWriter because this is actually used in a batch mode > >> (not servlet at all) and it's in the middle of the pipeline of modules > >> handling the data. I know for sure that the incoming data is a String > >> in UTF-8 and that the next item in line will also want it as a String > >> in UTF-8. (And I'm making sure the meta tag charset also says UTF-8 > >> even if I make it so myself.) Eventually in the long term a browser > >> may see the html that results, but not immediately; many other things > >> happen to the data first. I need the html in memory for a while yet > >> after, so, String.
> >> > >> > >> > >> > > I'm using Xalan 2.7.1 and Xerces 2.9.1, and this is a small enough > >> > > code base I'm pretty sure there are no jar conflicts sneaking in old > >> > > versions. Rather I suspect I'm misunderstanding something about the > >> > > serialization process or xml / html specifications. > >> > > > >> > > >> > Are you sure you have serializer.jar from the Xalan-2.7.1 distribution in the > >> > classpath? > >> > >> I just double checked that this morning, and I'm seeing the same > >> behavior (XML style serialization) after specifically putting Xalan's > >> copies of everything in place. > >> > >> Any more ideas? Are there get methods or debug info that I can use > >> with the Transformer to find out what it thinks it's using / supposed > >> to be using, so I can see if something specific is going wrong? > >> > >> Thank you. > >> > >> > >> Jenny Brown > >> |
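Jenny's namespace check can be reduced to a self-contained JAXP snippet: with a namespace-aware parse, getNamespaceURI() on the root element reveals whether an xmlns declaration snuck in (it returns null when there is none). The helper below is an illustrative sketch; depending on how the DOM was produced, doc.getDocumentElement() is often a more direct handle than documentRoot.getFirstChild():

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NamespaceCheck {

    // Parse a document string and return the namespace URI of the root
    // element (null when no xmlns declaration is in force).
    static String rootNamespace(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // without this, getNamespaceURI() is always null
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getNamespaceURI();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootNamespace(
                "<html lang=\"en\"><body/></html>"));                     // null
        System.out.println(rootNamespace(
                "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\"><body/></html>"));
    }
}
```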
From: Jenny B. <sk...@gm...> - 2008-04-17 01:00:15
|
On Wed, Apr 16, 2008 at 8:42 PM, Jacob Kjome <ho...@vi...> wrote: > Jenny Brown wrote: > > Also, if it's any help -- I'm running my code under Eclipse 3.3.1.1 > > and from a JUnit test case. I don't know if there's any chance of a > > jar conflict within Eclipse itself. > > I would recommend running this in a clean environment. I've seen lots of cases > where people say it doesn't work when running under their IDE and it usually ends > up being the IDE's fault. It's seems like it's probably a classpath issue. You > might even try putting Xerces, Xalan, and Serializer jars into > JAVA_HOME/jre/lib/endorsed, just to make sure you are actually using Apache's > Xalan and not the old buggy one included in the JDK. Beyond that, I really don't > have any other suggestions. > > For further help, you should probably ping the Xalan-user list, as they are the > experts on the Transformer and Serializer stuff. I ran in a clean environment (command line, where I exactly control the classpath) and get the same behavior, so I'll move on over to the Xalan list and pursue more details there. Thanks for your help - I'm farther along in my project as a result! :) Jenny |
From: Jacob K. <ho...@vi...> - 2008-04-17 00:43:05
|
Jenny Brown wrote: > Also, if it's any help -- I'm running my code under Eclipse 3.3.1.1 > and from a JUnit test case. I don't know if there's any chance of a > jar conflict within Eclipse itself. I would recommend running this in a clean environment. I've seen lots of cases where people say it doesn't work when running under their IDE and it usually ends up being the IDE's fault. It seems like it's probably a classpath issue. You might even try putting Xerces, Xalan, and Serializer jars into JAVA_HOME/jre/lib/endorsed, just to make sure you are actually using Apache's Xalan and not the old buggy one included in the JDK. Beyond that, I really don't have any other suggestions. For further help, you should probably ping the Xalan-user list, as they are the experts on the Transformer and Serializer stuff. Jake > > > On Wed, Apr 16, 2008 at 11:26 AM, Jenny Brown <sk...@gm...> wrote: >> On Tue, Apr 15, 2008 at 11:49 PM, Jacob Kjome <ho...@vi...> wrote: >> > Yeah, if you have "html" as the output type, then it should use the HTML >> > serializer. Although, why do you specify "no" to OutputKeys.OMIT_XML_DECLARATION? >> > You don't want that for HTML and, actually, not even for XHTML. Browsers >> > don't handle HTML/XHTML documents with the XML declaration very well. >> >> > Also, I don't recommend using the StringWriter for output in a servlet. I would >> > think you'd want to pass in the ServletOutputStream into the StreamResult. >> >> Ok I just fixed the xml declaration thing. >> >> I'm using StringWriter because this is actually used in a batch mode >> (not servlet at all) and it's in the middle of the pipeline of modules >> handling the data. I know for sure that the incoming data is a String >> in UTF-8 and that the next item in line will also want it as a String >> in UTF-8. (And I'm making sure the meta tag charset also says UTF-8 >> even if I make it so myself.)
Eventually in the long term a browser >> may see the html that results, but not immediately; many other things >> happen to the data first. I need the html in memory for a while yet >> after, so, String. >> >> >> >> > > I'm using Xalan 2.7.1 and Xerces 2.9.1, and this is a small enough >> > > code base I'm pretty sure there are no jar conflicts sneaking in old >> > > versions. Rather I suspect I'm misunderstanding something about the >> > > serialization process or xml / html specifications. >> > > >> > >> > Are you sure you have serializer.jar from the Xalan-2.7.1 distribution in the >> > classpath? >> >> I just double checked that this morning, and I'm seeing the same >> behavior (XML style serialization) after specifically putting Xalan's >> copies of everything in place. >> >> Any more ideas? Are there get methods or debug info that I can use >> with the Transformer to find out what it thinks it's using / supposed >> to be using, so I can see if something specific is going wrong? >> >> Thank you. >> >> >> Jenny Brown >> > > ------------------------------------------------------------------------- > This SF.net email is sponsored by the 2008 JavaOne(SM) Conference > Don't miss this year's exciting event. There's still time to save $100. > Use priority code J8TL2D2. > http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone > _______________________________________________ > nekohtml-user mailing list > nek...@li... > https://lists.sourceforge.net/lists/listinfo/nekohtml-user > > > |
From: Jenny B. <sk...@gm...> - 2008-04-16 16:36:48
|
Also, if it's any help -- I'm running my code under Eclipse 3.3.1.1 and from a JUnit test case. I don't know if there's any chance of a jar conflict within Eclipse itself. On Wed, Apr 16, 2008 at 11:26 AM, Jenny Brown <sk...@gm...> wrote: > On Tue, Apr 15, 2008 at 11:49 PM, Jacob Kjome <ho...@vi...> wrote: > > Yeah, if you have "html" as the output type, then it should use the HTML > > serializer. Although, why do you specify "no" to OutputKeys.OMIT_XML_DECLARATION? > > You don't want that for HTML and, actually, not even for XHTML. Browsers > > don't handle HTML/XHTML documents with the XML declaration very well. > > > Also, I don't recommend using the StringWriter for output in a servlet. I would > > think you'd want to pass in the ServletOutputStream into the StreamResult. > > Ok I just fixed the xml declaration thing. > > I'm using StringWriter because this is actually used in a batch mode > (not servlet at all) and it's in the middle of the pipeline of modules > handling the data. I know for sure that the incoming data is a String > in UTF-8 and that the next item in line will also want it as a String > in UTF-8. (And I'm making sure the meta tag charset also says UTF-8 > even if I make it so myself.) Eventually in the long term a browser > may see the html that results, but not immediately; many other things > happen to the data first. I need the html in memory for a while yet > after, so, String. > > > > > > I'm using Xalan 2.7.1 and Xerces 2.9.1, and this is a small enough > > > code base I'm pretty sure there are no jar conflicts sneaking in old > > > versions. Rather I suspect I'm misunderstanding something about the > > > serialization process or xml / html specifications. > > > > > > > Are you sure you have serializer.jar from the Xalan-2.7.1 distribution in the > > classpath? > > I just double checked that this morning, and I'm seeing the same > behavior (XML style serialization) after specifically putting Xalan's > copies of everything in place. 
> > Any more ideas? Are there get methods or debug info that I can use > with the Transformer to find out what it thinks it's using / supposed > to be using, so I can see if something specific is going wrong? > > Thank you. > > > Jenny Brown > |
From: Jenny B. <sk...@gm...> - 2008-04-16 16:26:02
|
On Tue, Apr 15, 2008 at 11:49 PM, Jacob Kjome <ho...@vi...> wrote: > Yeah, if you have "html" as the output type, then it should use the HTML > serializer. Although, why do you specify "no" to OutputKeys.OMIT_XML_DECLARATION? > You don't want that for HTML and, actually, not even for XHTML. Browsers > don't handle HTML/XHTML documents with the XML declaration very well. > Also, I don't recommend using the StringWriter for output in a servlet. I would > think you'd want to pass in the ServletOutputStream into the StreamResult. Ok I just fixed the xml declaration thing. I'm using StringWriter because this is actually used in a batch mode (not servlet at all) and it's in the middle of the pipeline of modules handling the data. I know for sure that the incoming data is a String in UTF-8 and that the next item in line will also want it as a String in UTF-8. (And I'm making sure the meta tag charset also says UTF-8 even if I make it so myself.) Eventually in the long term a browser may see the html that results, but not immediately; many other things happen to the data first. I need the html in memory for a while yet after, so, String. > > I'm using Xalan 2.7.1 and Xerces 2.9.1, and this is a small enough > > code base I'm pretty sure there are no jar conflicts sneaking in old > > versions. Rather I suspect I'm misunderstanding something about the > > serialization process or xml / html specifications. > > > > Are you sure you have serializer.jar from the Xalan-2.7.1 distribution in the > classpath? I just double checked that this morning, and I'm seeing the same behavior (XML style serialization) after specifically putting Xalan's copies of everything in place. Any more ideas? Are there get methods or debug info that I can use with the Transformer to find out what it thinks it's using / supposed to be using, so I can see if something specific is going wrong? Thank you. Jenny Brown |
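On Jenny's question about debug info: there is no official "which serializer am I using" API, but printing the concrete implementation classes and the effective output properties usually answers it. A hedged sketch (the class and method names below are illustrative, not from the thread):

```java
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;

public class WhichTransformer {

    // Report the concrete implementation classes and the effective output method.
    static String report() throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer(); // identity transformer
        t.setOutputProperty(OutputKeys.METHOD, "html");
        StringBuilder sb = new StringBuilder();
        sb.append("factory:     ").append(tf.getClass().getName()).append('\n');
        sb.append("transformer: ").append(t.getClass().getName()).append('\n');
        sb.append("method:      ").append(t.getOutputProperty(OutputKeys.METHOD));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // A factory class named org.apache.xalan... (rather than com.sun...)
        // indicates the standalone Apache jars are winning over the JDK copy.
        System.out.println(report());
    }
}
```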
From: Jacob K. <ho...@vi...> - 2008-04-16 03:48:54
|
Jenny Brown wrote: > On Wed, Apr 9, 2008 at 1:02 AM, Jacob Kjome <ho...@vi...> wrote: >> You should avoid the direct use of implementation classes. Go through standard >> API's. And if you put xalan-2.7.1.jar and serializer.jar (and, I suggest, >> xercesImpl-2.9.1.jar) in the classpath, you will end up using the very latest >> implementations (better than the buggy versions that ship with the JVM). >> >> //Using String writer for output for convenience. >> //Usually better to use an OutputStream. >> StringWriter sw = new StringWriter(); >> >> >> //JAXP Transformer API >> Transformer t = TransformerFactory.newInstance().newTransformer(); >> >> //for HTML output >> t.setOutputProperty(OutputKeys.METHOD, "html"); >> t.setOutputProperty(OutputKeys.MEDIA_TYPE, "text/html"); >> t.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1"); > > > > Thanks for the tip there. That approach is working for me now. I'm > running into a quirk of how it's output some things though, that I'm > not sure how to interpret. (Background: I've got a lot of Java > servlet and web programming experience, but less with xml and the > various versions of xhtml and related specifications. So I'm a bit > lost on what to expect of the browser from this.) > > I have a dom document. I've passed it through JTidy and NekoHTML for > cleanup, and the result is pretty nice. However, in the original html > I was parsing, there were some situations like this: > > <P>Some text goes <strong></strong> here making a paragraph.</P> > > When that's coming back out of the serializer, it's coming out as > this, which Firefox chokes on: > > <P>Some text goes <strong/> here making a paragraph.</P> > > Likewise for <textarea /> and some other tags - Firefox rendering gets > completely thrown off when it encounters a few certain tags in > empty-tag XML style rather than html style. 
The code I'm using to set > up the transformer for output is this: > Seems like it's using an XML serializer > transformer.setOutputProperty(OutputKeys.METHOD, "html"); > transformer.setOutputProperty(OutputKeys.MEDIA_TYPE, "text/html"); > transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); > transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no"); > transformer.transform(new DOMSource(domDocument), new > StreamResult(stringWriter)); > > So I'm not sure why I'm getting things that look like XML when the tag is empty. > Yeah, if you have "html" as the output type, then it should use the HTML serializer. Although, why do you specify "no" to OutputKeys.OMIT_XML_DECLARATION? You don't want that for HTML and, actually, not even for XHTML. Browsers don't handle HTML/XHTML documents with the XML declaration very well. Also, I don't recommend using the StringWriter for output in a servlet. I would think you'd want to pass in the ServletOutputStream into the StreamResult. > I'm using Xalan 2.7.1 and Xerces 2.9.1, and this is a small enough > code base I'm pretty sure there are no jar conflicts sneaking in old > versions. Rather I suspect I'm misunderstanding something about the > serialization process or xml / html specifications. > Are you sure you have serializer.jar from the Xalan-2.7.1 distribution in the classpath? > Thanks for any help you can provide. > > > Jenny Brown Jake |
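Jake's point about the XML declaration can be checked with a few lines of JAXP: the xml output method writes <?xml ...?> unless OMIT_XML_DECLARATION is "yes", while the html output method writes no declaration at all. An illustrative sketch (sample markup is mine, not from the thread):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XmlDeclDemo {

    static String serialize(String method, String omitDecl) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<p>hi</p>")));
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, method);
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, omitDecl);
        StringWriter sw = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(sw));
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize("xml", "no"));   // starts with <?xml ...?>
        System.out.println(serialize("xml", "yes"));  // declaration suppressed
        System.out.println(serialize("html", "no"));  // html method writes no declaration
    }
}
```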
From: Jenny B. <sk...@gm...> - 2008-04-15 20:08:09
|
On Wed, Apr 9, 2008 at 1:02 AM, Jacob Kjome <ho...@vi...> wrote: > You should avoid the direct use of implementation classes. Go through standard > API's. And if you put xalan-2.7.1.jar and serializer.jar (and, I suggest, > xercesImpl-2.9.1.jar) in the classpath, you will end up using the very latest > implementations (better than the buggy versions that ship with the JVM). > > //Using String writer for output for convenience. > //Usually better to use an OutputStream. > StringWriter sw = new StringWriter(); > > > //JAXP Transformer API > Transformer t = TransformerFactory.newInstance().newTransformer(); > > //for HTML output > t.setOutputProperty(OutputKeys.METHOD, "html"); > t.setOutputProperty(OutputKeys.MEDIA_TYPE, "text/html"); > t.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1"); Thanks for the tip there. That approach is working for me now. I'm running into a quirk in how it outputs some things, though, that I'm not sure how to interpret. (Background: I've got a lot of Java servlet and web programming experience, but less with xml and the various versions of xhtml and related specifications. So I'm a bit lost on what to expect of the browser from this.) I have a dom document. I've passed it through JTidy and NekoHTML for cleanup, and the result is pretty nice. However, in the original html I was parsing, there were some situations like this: <P>Some text goes <strong></strong> here making a paragraph.</P> When that's coming back out of the serializer, it's coming out as this, which Firefox chokes on: <P>Some text goes <strong/> here making a paragraph.</P> Likewise for <textarea /> and some other tags - Firefox rendering gets completely thrown off when it encounters certain tags in empty-tag XML style rather than html style.
The code I'm using to set up the transformer for output is this: transformer.setOutputProperty(OutputKeys.METHOD, "html"); transformer.setOutputProperty(OutputKeys.MEDIA_TYPE, "text/html"); transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no"); transformer.transform(new DOMSource(domDocument), new StreamResult(stringWriter)); So I'm not sure why I'm getting things that look like XML when the tag is empty. I'm using Xalan 2.7.1 and Xerces 2.9.1, and this is a small enough code base I'm pretty sure there are no jar conflicts sneaking in old versions. Rather I suspect I'm misunderstanding something about the serialization process or xml / html specifications. Thanks for any help you can provide. Jenny Brown |
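For the archive, the empty-tag symptom above is easy to reproduce with just JAXP when no namespace is in play: the html output method keeps the explicit </strong> end tag, while the xml output method collapses the empty element to <strong/>. (In Jenny's case, as diagnosed later in the thread, the XHTML namespace on the elements made even method="html" behave like the xml case.) An illustrative sketch:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EmptyTagDemo {

    static String serialize(String method) throws Exception {
        String in = "<p>Some text goes <strong></strong> here making a paragraph.</p>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(in)));
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, method);
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        t.setOutputProperty(OutputKeys.INDENT, "no");
        StringWriter sw = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(sw));
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize("html")); // <strong></strong> preserved
        System.out.println(serialize("xml"));  // collapsed to <strong/>
    }
}
```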
From: Michelle H. <cs...@us...> - 2008-04-12 03:45:27
|
Here is the font problem. I use the parser to parse https://www.google.com/accounts/ServiceLoginBox?service=analytics&nui=1&hl=en&continue=http://www.google.com/analytics/home/%3Fet%3Dreset%26hl%3D. I get the text node "Sign in to Google Analytics with your"; in the HTML code I see <font size="-1">Sign in to Google Analytics with your</font> But when I call textnode.getParentNode().getNodeName(), I get TR, which is the parent of the font element. Would you please help to check? Regards, Michelle On Sat, Apr 12, 2008 at 1:14 AM, Andy Clark <an...@cy...> wrote: > Michelle Hong wrote: > > I am quite interested in your neko parser. I now have several questions: > > > > 1. How do you handle the font and span? I try to parse a document using DOMParser parser = new DOMParser(); > > parser.setFeature("http://cyberneko.org/html/features/scanner/script/strip-comment-delims", true); > > parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", false); > > parser.setProperty("http://cyberneko.org/html/properties/default-encoding", "UTF-8"); > > > > And I found that, after the parse, the font tag is gone. Can I keep all the tags? > > > > Are you saying that <font> tags are gone after > you parse a document? This should not happen. Can > you send a small sample document that demonstrates > the problem? > > > > 2. About the encoding. I found a popular Hong Kong webpage which declares its encoding as big5; however, it uses utf-8 in practice. Can you handle this problem? > > > > If you don't want the parser to switch the > encoding when it finds a <meta> tag with a charset, > then you should use a Reader object to parse the > document. When you do this, you are responsible > for picking the correct encoding for reading.
For > example: > > InputStream stream = new FileInputStream("index.html"); > Reader reader = new InputStreamReader(stream, "big5"); > > InputSource source = new InputSource(reader); > parser.parse(source); > > > > Thank you very much for your help. > > > > You're welcome. > > If you are going to have more questions about > NekoHTML, please send them to the mailing list > (nek...@li...) so that > everyone has a chance to answer (and learn from) > your questions. > > -AndyC > |
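Andy's advice about choosing the Reader encoding yourself matters because the same bytes decode differently under different charsets; a page whose <meta> lies about its encoding produces mojibake if the declared charset is trusted. A small JDK-only sketch (ISO-8859-1 stands in here for the big5/utf-8 mismatch described above):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Read all characters from a Reader into a String.
    static String readAll(Reader r) throws Exception {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) sb.append((char) c);
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8); // "café"

        // Correct charset: the é survives.
        String good = readAll(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
        // Wrong charset (as when a page lies about its encoding): mojibake.
        String bad = readAll(new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1));

        System.out.println(good); // café
        System.out.println(bad);  // cafÃ©
    }
}
```

This is exactly why passing an InputStreamReader with the charset you determined yourself (big5, utf-8, etc.) into the InputSource, as Andy shows above, overrides whatever the document claims.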