From: Jenny B. <sk...@gm...> - 2008-04-17 22:40:16
I'm reporting back in on the final solution to this, so it's in the archives if someone hits a similar issue in the future.

The Xalan people helped me out, by observing there was a namespace being set on the <html> element, like so:

    <HTML xmlns="http://www.w3.org/1999/xhtml" lang="en">

That was forcing the Transformer to do an xml output, hence the xml style empty tags I was getting. The solution was to figure out where that namespace was coming from. My steps were:

1. Messy HTML is a string in memory
2. Send messy HTML through JTidy parsing
3. Call JTidy pprint (pretty-print) to get it back as a String (validated and cleaned up)
   (Observed that after this point the namespace was visible)
4. Send validated HTML to NekoHTML (misled Neko into using the namespace)
5. Neko's DOM got sent to the Transformer for output, and the transformer responded to the namespace

So I needed to tell JTidy to use a different config... I had been wrongly using

    tidy.setXHTML(true);

due to a misunderstanding of its requirements. I changed it to this and the problem namespace cleared up:

    tidy.setXHTML(false);
    tidy.setXmlOut(false);

That meant Neko got handed clean html with no namespace in it, and thus output was in html not xml.

Incidentally, the way I could observe the namespace from Java code was either to print the html string, or to call this to read it out of the HTML node of the dom:

    System.out.println("NAMESPACE: " + documentRoot.getFirstChild().getNamespaceURI());

Hope that helps someone else someday. :) Thanks for your help here too. My code works now.

Jenny Brown

On Wed, Apr 16, 2008 at 8:42 PM, Jacob Kjome <ho...@vi...> wrote:
> Jenny Brown wrote:
> > Also, if it's any help -- I'm running my code under Eclipse 3.3.1.1
> > and from a JUnit test case. I don't know if there's any chance of a
> > jar conflict within Eclipse itself.
>
> I would recommend running this in a clean environment.
> I've seen lots of cases
> where people say it doesn't work when running under their IDE and it usually ends
> up being the IDE's fault. It seems like it's probably a classpath issue. You
> might even try putting Xerces, Xalan, and Serializer jars into
> JAVA_HOME/jre/lib/endorsed, just to make sure you are actually using Apache's
> Xalan and not the old buggy one included in the JDK. Beyond that, I really don't
> have any other suggestions.
>
> For further help, you should probably ping the Xalan-user list, as they are the
> experts on the Transformer and Serializer stuff.
>
>
> Jake
>
> >
> > On Wed, Apr 16, 2008 at 11:26 AM, Jenny Brown <sk...@gm...> wrote:
> >> On Tue, Apr 15, 2008 at 11:49 PM, Jacob Kjome <ho...@vi...> wrote:
> >> > Yeah, if you have "html" as the output type, then it should use the HTML
> >> > serializer. Although, why do you specify "no" to OutputKeys.OMIT_XML_DECLARATION?
> >> > You don't want that for HTML and, actually, not even for XHTML. Browsers
> >> > don't handle HTML/XHTML documents with the XML declaration very well.
> >>
> >> > Also, I don't recommend using the StringWriter for output in a servlet. I would
> >> > think you'd want to pass in the ServletOutputStream into the StreamResult.
> >>
> >> Ok I just fixed the xml declaration thing.
> >>
> >> I'm using StringWriter because this is actually used in a batch mode
> >> (not servlet at all) and it's in the middle of the pipeline of modules
> >> handling the data. I know for sure that the incoming data is a String
> >> in UTF-8 and that the next item in line will also want it as a String
> >> in UTF-8. (And I'm making sure the meta tag charset also says UTF-8
> >> even if I make it so myself.) Eventually in the long term a browser
> >> may see the html that results, but not immediately; many other things
> >> happen to the data first. I need the html in memory for a while yet
> >> after, so, String.
> >>
> >> > > I'm using Xalan 2.7.1 and Xerces 2.9.1, and this is a small enough
> >> > > code base I'm pretty sure there are no jar conflicts sneaking in old
> >> > > versions. Rather I suspect I'm misunderstanding something about the
> >> > > serialization process or xml / html specifications.
> >> > >
> >> >
> >> > Are you sure you have serializer.jar from the Xalan-2.7.1 distribution in the
> >> > classpath?
> >>
> >> I just double checked that this morning, and I'm seeing the same
> >> behavior (XML style serialization) after specifically putting Xalan's
> >> copies of everything in place.
> >>
> >> Any more ideas? Are there get methods or debug info that I can use
> >> with the Transformer to find out what it thinks it's using / supposed
> >> to be using, so I can see if something specific is going wrong?
> >>
> >> Thank you.
> >>
> >>
> >> Jenny Brown
> >>
> >
> > -------------------------------------------------------------------------
> > This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
> > Don't miss this year's exciting event. There's still time to save $100.
> > Use priority code J8TL2D2.
> > http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
> > _______________________________________________
> > nekohtml-user mailing list
> > nek...@li...
> > https://lists.sourceforge.net/lists/listinfo/nekohtml-user
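
For anyone who finds this thread in the archives: the namespace check and the "html" output settings discussed above can be tried with nothing but the JDK. What follows is only a rough sketch under some assumptions, not the code from the original pipeline. The class name NamespaceCheck and its methods are made up for illustration, and the JDK's own namespace-aware XML parser stands in for the JTidy + NekoHTML steps, so the input has to be well-formed. It uses getDocumentElement() rather than getFirstChild(), since the first child of a Document can be a doctype node instead of the <html> element.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; none of these names come from the thread.
final class NamespaceCheck {

    // Builds a namespace-aware DOM from well-formed markup.
    // Assumption: this replaces the JTidy + NekoHTML steps for demo purposes.
    static Document parse(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    // Returns the namespace URI of the root element, or null if it has none.
    static String rootNamespace(Document doc) {
        return doc.getDocumentElement().getNamespaceURI();
    }

    // Serializes the DOM to a String with the "html" output method and no
    // XML declaration, as recommended earlier in the thread.
    static String serializeAsHtml(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        String withNamespace =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en\"><body><br/></body></html>";
        String withoutNamespace = "<html lang=\"en\"><body><br/></body></html>";

        for (String markup : new String[] { withNamespace, withoutNamespace }) {
            Document doc = parse(markup);
            System.out.println("NAMESPACE: " + rootNamespace(doc));
            System.out.println(serializeAsHtml(doc));
        }
    }
}
```

Running main prints the root namespace and the serialized markup for both inputs, so the two cases can be compared side by side with whichever Transformer implementation is on the classpath. If the namespace line is not null for your real DOM, that is the cue to look upstream (in this thread, the JTidy setXHTML setting) for where it is being added.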