From: Jacob K. <ho...@vi...> - 2008-04-17 23:05:13
Thanks for reporting back with your solution, and I'm glad you found one!
Two questions, though...

1. Why do you use both JTidy and NekoHTML? Normally one would use one or the
other. Both JTidy and NekoHTML allow you to generate a DOM from HTML. So why
not choose one and be done?

2. Did you find out whether Xalan's Serializer has special handling for XHTML
-vs- other types of XML documents? Browsers are quirky about how they parse
XHTML. For instance, some browsers deal better with <script></script> than
<script/>, especially browsers that don't really understand XHTML at all like
IE (they treat it as HTML).

Jake

Jenny Brown wrote:
> I'm reporting back in on the final solution to this, so it's in the
> archives if someone hits a similar issue in the future. The Xalan
> people helped me out, by observing there was a namespace being set on
> the <html> element, like so:
> <HTML xmlns="http://www.w3.org/1999/xhtml" lang="en">
>
> That was forcing the Transformer to do an xml output, hence the xml
> style empty tags I was getting. The solution was to figure out where
> that namespace was coming from. My steps were:
> 1. Messy HTML is a string in memory
> 2. Send messy HTML through JTidy parsing
> 3. Call JTidy pprint (pretty-print) to get it back as a String
> (validated and cleaned up)
> (Observed that after this point the namespace was visible)
> 4. Send validated HTML to NekoHTML (misled Neko into using the namespace)
> 5. Neko's DOM got sent to the Transformer for output, and the
> transformer responded to the namespace
>
> So I needed to tell JTidy to use a different config... I had been wrongly using
> tidy.setXHTML(true);
> due to a misunderstanding of its requirements. I changed it to this
> and the problem namespace cleared up:
> tidy.setXHTML(false);
> tidy.setXmlOut(false);
>
> That meant Neko got handed clean html with no namespace in it, and
> thus output was in html not xml.
> Incidentally, the way I could
> observe the namespace from Java code was either to print the html
> string, or to call this to read it out of the HTML node of the dom:
>
> System.out.println("NAMESPACE: " +
> documentRoot.getFirstChild().getNamespaceURI());
>
> Hope that helps someone else someday. :) Thanks for your help here
> too. My code works now.
>
> Jenny Brown
>
>
> On Wed, Apr 16, 2008 at 8:42 PM, Jacob Kjome <ho...@vi...> wrote:
>> Jenny Brown wrote:
>> > Also, if it's any help -- I'm running my code under Eclipse 3.3.1.1
>> > and from a JUnit test case. I don't know if there's any chance of a
>> > jar conflict within Eclipse itself.
>>
>> I would recommend running this in a clean environment. I've seen lots of cases
>> where people say it doesn't work when running under their IDE and it usually ends
>> up being the IDE's fault. It seems like it's probably a classpath issue. You
>> might even try putting Xerces, Xalan, and Serializer jars into
>> JAVA_HOME/jre/lib/endorsed, just to make sure you are actually using Apache's
>> Xalan and not the old buggy one included in the JDK. Beyond that, I really don't
>> have any other suggestions.
>>
>> For further help, you should probably ping the Xalan-user list, as they are the
>> experts on the Transformer and Serializer stuff.
>>
>>
>> Jake
>>
>> >
>> >
>> > On Wed, Apr 16, 2008 at 11:26 AM, Jenny Brown <sk...@gm...> wrote:
>> >> On Tue, Apr 15, 2008 at 11:49 PM, Jacob Kjome <ho...@vi...> wrote:
>> >> > Yeah, if you have "html" as the output type, then it should use the HTML
>> >> > serializer. Although, why do you specify "no" to OutputKeys.OMIT_XML_DECLARATION?
>> >> > You don't want that for HTML and, actually, not even for XHTML. Browsers
>> >> > don't handle HTML/XHTML documents with the XML declaration very well.
>> >>
>> >> > Also, I don't recommend using the StringWriter for output in a servlet. I would
>> >> > think you'd want to pass in the ServletOutputStream into the StreamResult.
>> >>
>> >> Ok I just fixed the xml declaration thing.
>> >>
>> >> I'm using StringWriter because this is actually used in a batch mode
>> >> (not servlet at all) and it's in the middle of the pipeline of modules
>> >> handling the data. I know for sure that the incoming data is a String
>> >> in UTF-8 and that the next item in line will also want it as a String
>> >> in UTF-8. (And I'm making sure the meta tag charset also says UTF-8
>> >> even if I make it so myself.) Eventually in the long term a browser
>> >> may see the html that results, but not immediately; many other things
>> >> happen to the data first. I need the html in memory for a while yet
>> >> after, so, String.
>> >>
>> >>
>> >> > > I'm using Xalan 2.7.1 and Xerces 2.9.1, and this is a small enough
>> >> > > code base I'm pretty sure there are no jar conflicts sneaking in old
>> >> > > versions. Rather I suspect I'm misunderstanding something about the
>> >> > > serialization process or xml / html specifications.
>> >> > >
>> >> >
>> >> > Are you sure you have serializer.jar from the Xalan-2.7.1 distribution in the
>> >> > classpath?
>> >>
>> >> I just double checked that this morning, and I'm seeing the same
>> >> behavior (XML style serialization) after specifically putting Xalan's
>> >> copies of everything in place.
>> >>
>> >> Any more ideas? Are there get methods or debug info that I can use
>> >> with the Transformer to find out what it thinks it's using / supposed
>> >> to be using, so I can see if something specific is going wrong?
>> >>
>> >> Thank you.
>> >>
>> >>
>> >> Jenny Brown
>> >>
>> >
>> >
>> > -------------------------------------------------------------------------
>> > This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
>> > Don't miss this year's exciting event. There's still time to save $100.
>> > Use priority code J8TL2D2.
>> > http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
>> > _______________________________________________
>> > nekohtml-user mailing list
>> > nek...@li...
>> > https://lists.sourceforge.net/lists/listinfo/nekohtml-user
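
For anyone who lands on this thread from the archives, here is a minimal
sketch of the diagnosis described above: read the namespace off the root
element, then serialize with the "html" output method and compare the results
for a document with and without the XHTML namespace. It is only an
illustration under stated assumptions, not the code from this thread. It uses
the JDK's built-in JAXP parser and transformer as stand-ins for NekoHTML and
Apache Xalan 2.7.1, so the exact serialized output may differ from what was
reported here. The class name NamespaceCheck and the helpers parse and toHtml
are hypothetical names made up for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of JTidy, NekoHTML, or Xalan.
class NamespaceCheck {

    // Parses well-formed markup into a namespace-aware DOM.
    // Assumption: this stands in for the NekoHTML parsing step in the thread.
    static Document parse(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    // Serializes the DOM to a String with the "html" output method and
    // without an XML declaration, as recommended earlier in the thread.
    static String toHtml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        String plain = "<html><body><script></script><br/></body></html>";
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                     + "<body><script></script><br/></body></html>";
        for (String markup : new String[] {plain, xhtml}) {
            Document doc = parse(markup);
            // Same check as in the thread, applied to the root element.
            System.out.println("NAMESPACE: " + doc.getDocumentElement().getNamespaceURI());
            // Compare how the empty elements are written in each case.
            System.out.println(toHtml(doc));
        }
    }
}
```

If the root element reports a non-null namespace, that is the first thing to
chase down, since the XSLT 1.0 rules for the "html" output method only apply
HTML-specific treatment to elements with no namespace.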