From: Derek A. <d.a...@ls...> - 2009-07-24 12:59:02
Hi,

I am processing some invalid XHTML files that aren't even well-formed, and am hoping NekoHTML can help. My main aim is to make them well-formed with the minimum possible changes.

I've written a simple test app that uses org.cyberneko.html.filters.Writer to process one of the XHTML source files and output a cleaned version. It currently does this:

    public static void main(String[] args) throws Exception {
        XMLParserConfiguration parser = new HTMLConfiguration();
        parser.setFeature("http://apache.org/xml/features/scanner/notify-char-refs", true);
        parser.setFeature("http://cyberneko.org/html/features/scanner/notify-builtin-refs", true);
        parser.setFeature("http://cyberneko.org/html/features/report-errors", true);
        parser.setFeature("http://cyberneko.org/html/features/balance-tags", true);
        parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");

        String iencoding = null;            // let the parser detect the input encoding
        String oencoding = "Windows-1252";  // encoding of the cleaned output

        // Filter chain: Purifier first, then Writer, which serializes to System.out.
        java.util.Vector filtersVector = new java.util.Vector(2);
        filtersVector.addElement(new Purifier());
        filtersVector.addElement(new Writer(System.out, oencoding));
        XMLDocumentFilter[] filters = new XMLDocumentFilter[filtersVector.size()];
        filtersVector.copyInto(filters);
        parser.setProperty("http://cyberneko.org/html/properties/filters", filters);

        // args[0] is the system id (path or URL) of the source file.
        XMLInputSource source = new XMLInputSource(null, args[0], null);
        source.setEncoding(iencoding);
        parser.parse(source);
    }

A few problems with the output from this that I need to resolve:

1. The doctype from the source file doesn't appear in the target file:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

It is the first line in the source. I also get an odd error reported about this:

    [Error] source.xhtml:1:110: DOCTYPE declaration found inside document content.

2. The main problem with the source files I am trying to fix is that they contain attribute values with bare ampersands in them. This causes normal XML parsing with Xerces to fail.
Here's an example:

    href="http://somewhere.com/form?this=that&foo=baa"

I get these warnings for it:

    [Warning] source.xhtml:476:108: Bare ampersand found.
    [Warning] source.xhtml:476:108: Unknown general entity "email".

I would have thought this should be an error. However, all that is important to me is to find a way to have these fixed in the output, e.g.:

    href="http://somewhere.com/form?this=that&amp;foo=baa"

I tried setting the http://cyberneko.org/html/features/scanner/normalize-attrs feature to true, but that just caused an ArrayIndexOutOfBoundsException, so I removed it.

3. I must also be missing something obvious in my usage of NekoHTML, as the output file contains unbalanced <br> elements.

I would appreciate any advice on fixing these things.

Thanks,
Derek
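
To make point 2 concrete outside NekoHTML, here is a minimal sketch of the bare-ampersand problem and one possible text-level workaround. It is not NekoHTML's own mechanism. It assumes the JDK's bundled JAXP parser as a stand-in for Xerces, and the class and method names (BareAmpersandDemo, escapeBareAmpersands, isWellFormed) are hypothetical. The regex runs on raw text, so it would also touch ampersands inside CDATA sections or comments, and it treats any "&name;" as a valid reference whether or not the entity is defined.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class BareAmpersandDemo {

    // Matches '&' that does NOT start a named (&name;), decimal (&#38;)
    // or hexadecimal (&#x26;) reference.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    // Replaces every bare ampersand with the &amp; entity reference.
    static String escapeBareAmpersands(String text) {
        return BARE_AMP.matcher(text).replaceAll("&amp;");
    }

    // Returns true if the JDK's XML parser accepts the string as well-formed.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String broken =
                "<a href=\"http://somewhere.com/form?this=that&foo=baa\">link</a>";
        String fixed = escapeBareAmpersands(broken);

        System.out.println(isWellFormed(broken)); // false
        System.out.println(isWellFormed(fixed));  // true
        System.out.println(fixed);
    }
}
```

Running the escape as a pre-pass would change only the offending ampersands, which fits the "minimum possible changes" goal, but it would not address the doctype or <br> issues in points 1 and 3.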