HTML Parser / Discussion / Help: Using XMLReader to convert HTML to SAX events

Trejkaz - 2007-03-09

The following example code attempts to convert an HTML file to XSLFO via the CSSToXSLFO library.

The code completes successfully, but the resulting FO file is empty. To investigate, I added a check: if the document starts with "<?xml", it goes through the normal Java XML parser instead.

It turns out that using Java's XML parser makes it work, so I'm led to believe HTMLParser's XMLReader class is outputting the wrong sequence of SAX events.

EXAMPLE CODE:

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.DataInputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;

import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.parsers.ParserConfigurationException;

import be.re.css.CSSToXSLFOFilter;
import be.re.css.CSSToXSLFOException;
import org.xml.sax.XMLReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class TestHtmlRendering {

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: java TestHtmlRendering infile");
            System.exit(1);
        }

        try {
            render(new File(args[0]), System.out);
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(2);
        }
    }

    public static void render(File inputFile, OutputStream out)
            throws IOException, CSSToXSLFOException, ParserConfigurationException,
                   TransformerConfigurationException, SAXException {

        InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFile));
        try {
            // Check the first few bytes of the file. If it's XML then use a normal XML parser.
            // Otherwise use the HTMLParser project's XMLReader implementation which handles legacy HTML.
            byte[] buf = new byte[5];
            inputStream.mark(5);
            new DataInputStream(inputStream).readFully(buf);
            inputStream.reset();

            XMLReader underlyingParser = null;
            if (new String(buf, Charset.forName("US-ASCII")).equals("<?xml")) {
                System.err.println("Using standard XML parser");
                SAXParserFactory parserFactory = SAXParserFactory.newInstance();
                parserFactory.setNamespaceAware(true);
                underlyingParser = parserFactory.newSAXParser().getXMLReader();
            } else {
                System.err.println("Using legacy HTML parser");
                underlyingParser = new org.htmlparser.sax.XMLReader();
            }

            // Create a filter to convert the (X)HTML+CSS to XSL-FO.
            CSSToXSLFOFilter filter = new CSSToXSLFOFilter(inputFile.toURI().toURL(), underlyingParser);

            // Create an identity transformer handler. These convert SAX events to a Result.
            //SAXTransformerFactory transformerFactory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            SAXTransformerFactory transformerFactory = (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler transformerHandler = transformerFactory.newTransformerHandler();

            // Wire the result of the transform up to the output file.
            transformerHandler.setResult(new StreamResult(out));

            // Wire the filter so that the parsed elements are passed to the transform.
            filter.setContentHandler(transformerHandler);

            // Parse the input stream.
            filter.parse(new InputSource(inputStream));
        } finally {
            inputStream.close();
        }
    }
}

EXAMPLE DOCUMENT:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
 <title>Test document</title>
</head>
<body>
 <div>
 Here is some green text
 Here is some bold text
 </div>
</body>
</html>
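
To see which SAX events a reader actually emits, it can help to log them before they reach the CSSToXSLFO filter. The following is a minimal sketch of such a tracer built on the JDK's `org.xml.sax.helpers.XMLFilterImpl`. The class name `SaxEventTracer` and its `getTrace()` method are hypothetical and not part of HTMLParser or CSSToXSLFO.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper (not part of HTMLParser or CSSToXSLFO): records each
// SAX event as one line of text while passing it on unchanged.
class SaxEventTracer extends XMLFilterImpl {
    private final StringBuilder trace = new StringBuilder();

    SaxEventTracer() {
        super();
    }

    SaxEventTracer(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        trace.append("startElement uri=").append(uri).append(" qName=").append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            trace.append(" @").append(atts.getQName(i)).append('=').append(atts.getValue(i));
        }
        trace.append('\n');
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        trace.append("endElement uri=").append(uri).append(" qName=").append(qName).append('\n');
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Make line breaks visible so whitespace-only events are easy to spot.
        String text = new String(ch, start, length).replace("\r", "\\r").replace("\n", "\\n");
        trace.append("characters \"").append(text).append("\"\n");
        super.characters(ch, start, length);
    }

    String getTrace() {
        return trace.toString();
    }
}
```

Since an `XMLFilter` is itself an `XMLReader`, the tracer could presumably be wrapped around either parser in `render()` (for example `new SaxEventTracer(underlyingParser)`) and passed to the `CSSToXSLFOFilter` constructor in its place. Printing `getTrace()` after the parse for both parsers would show where the two event sequences differ.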

- Trejkaz - 2007-03-09
 
 Issues confirmed so far:
 
 1. The SAX events report element names in uppercase. Ideally there would be a setting for this, and perhaps also a setting for the namespace to output.
 2. Bogus attributes called "#text" are found in the Attributes object.
 3. Values in the text are not properly unescaped. (SAX callbacks are not supposed to contain escapes such as &amp;.)
 
 The above _three_ issues can be worked around by creating a custom XMLFilter implementation to fix each of them.
 
 4. When the source HTML contains only a start tag, the XMLReader doesn't output the end tag. This is extremely bad and there's no easy way to work around the issue. IMHO, something which outputs SAX events should insert the missing end tags.
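
A workaround filter for issues 1 to 3 could look roughly like the sketch below. It extends the JDK's `XMLFilterImpl`, lowercases element names, drops attributes named "#text", and decodes a handful of entity references in character data. The class name `HtmlSaxCleanupFilter` is hypothetical. The sketch assumes the bogus attribute is reported under the name "#text" and that an entity reference is never split across two `characters()` calls. It passes the namespace URI through unchanged.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical workaround filter for issues 1-3; not part of HTMLParser.
class HtmlSaxCleanupFilter extends XMLFilterImpl {
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z]+);");

    HtmlSaxCleanupFilter() {
        super();
    }

    HtmlSaxCleanupFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Issue 2: copy every attribute except the bogus "#text" ones.
        AttributesImpl cleaned = new AttributesImpl();
        for (int i = 0; i < atts.getLength(); i++) {
            if ("#text".equals(atts.getQName(i)) || "#text".equals(atts.getLocalName(i))) {
                continue;
            }
            cleaned.addAttribute(atts.getURI(i), atts.getLocalName(i), atts.getQName(i),
                    atts.getType(i), atts.getValue(i));
        }
        // Issue 1: report element names in lowercase.
        super.startElement(uri, lower(localName), lower(qName), cleaned);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        super.endElement(uri, lower(localName), lower(qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Issue 3: decode entity references before passing the text on.
        String text = unescape(new String(ch, start, length));
        super.characters(text.toCharArray(), 0, text.length());
    }

    private static String lower(String name) {
        return name == null ? null : name.toLowerCase(Locale.ROOT);
    }

    static String unescape(String s) {
        if (s.indexOf('&') < 0) {
            return s;
        }
        Matcher m = ENTITY.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String decoded = decode(m.group(1));
            // Unknown references are left exactly as they were.
            m.appendReplacement(out, Matcher.quoteReplacement(decoded == null ? m.group() : decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String decode(String ref) {
        try {
            if (ref.startsWith("#x") || ref.startsWith("#X")) {
                return new String(Character.toChars(Integer.parseInt(ref.substring(2), 16)));
            }
            if (ref.startsWith("#")) {
                return new String(Character.toChars(Integer.parseInt(ref.substring(1))));
            }
        } catch (IllegalArgumentException e) {
            return null; // not a valid code point
        }
        switch (ref) {
            case "amp": return "&";
            case "lt": return "<";
            case "gt": return ">";
            case "quot": return "\"";
            case "apos": return "'";
            case "nbsp": return "\u00A0";
            default: return null; // only a small subset of HTML entities is covered
        }
    }
}
```

The named-entity table is deliberately tiny. Real HTML defines many more entities, so a fuller version would need a complete lookup table.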
 
 - Derrick Oswald - 2007-03-11
 
 For 1), there is already a request for enhancement #958708 Add Simple API for XML (SAX) support, and #886885 namespace support.
 For 2), these represent the whitespace between elements (not just attributes), which is needed to convert back to HTML while maintaining the same format.
 For 3), log it as a bug.
 For 4), is there an end tag in the source? If so log a bug. If not, are the end tags supposed to be generated?
 
- Trejkaz - 2007-03-11
 
 2):
 
 What I mean is that in the Attributes object passed into startElement() there is an attribute called "#text". (I've noticed that if there are two real attributes, it also generates two #text attributes.) This attribute's value seems to always be empty, and it is different from the characters() event I get between tags, where I get "\r\n " as expected.

 Related to the whitespace, I opened a feature request to output ignorableWhitespace() instead where possible, but that isn't going to be easy to implement (if it were, I would do it myself). Luckily my intended output is SAX events for a document which should be nearly-valid XHTML, so whitespace doesn't matter for me.
 
 4):
 
 There isn't an end tag in the source, but since the events are supposed to be SAX events and not SAS (I'm sure this doesn't really exist) events, it should at least generate the same number of start tags and end tags.
 
 I tried to fix this myself by not checking if (null == end) and just using the name of the start tag to generate the matching endElement() call, but the problem is inline tags like ... don't match up, i.e. the text inside them isn't a child of the start tag. It generates a , then the text, then the ... which is really hard to work with but probably possible in some fashion. I'm still trying anyway.
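
One generic way to guarantee that every startElement() gets a matching endElement() is to track open elements on a stack in a downstream filter. The sketch below does that. It is a standard balancing technique and not how HTMLParser works internally. The class name `EndTagBalancingFilter` and the short list of empty elements are assumptions, and it relies on the reader emitting its events in document order. It only makes the event stream well-formed. It does not apply HTML's rules for implied end tags, so an unclosed paragraph, for example, stays open until its parent closes.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter that synthesizes missing endElement() events.
class EndTagBalancingFilter extends XMLFilterImpl {
    // Elements assumed never to have content; they are closed immediately.
    private static final Set<String> EMPTY_ELEMENTS = new HashSet<>(Arrays.asList(
            "area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"));

    // Each entry holds {uri, localName, qName} of an element still open.
    private final Deque<String[]> open = new ArrayDeque<>();

    EndTagBalancingFilter() {
        super();
    }

    EndTagBalancingFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startDocument() throws SAXException {
        open.clear();
        super.startDocument();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        super.startElement(uri, localName, qName, atts);
        if (EMPTY_ELEMENTS.contains(qName.toLowerCase(Locale.ROOT))) {
            super.endElement(uri, localName, qName);
        } else {
            open.push(new String[] {uri, localName, qName});
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (!isOpen(qName)) {
            return; // stray end tag (or one already synthesized): drop it
        }
        // Close everything opened after the matching start tag, then the tag itself.
        while (!open.isEmpty()) {
            String[] top = open.pop();
            super.endElement(top[0], top[1], top[2]);
            if (top[2].equalsIgnoreCase(qName)) {
                break;
            }
        }
    }

    @Override
    public void endDocument() throws SAXException {
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            String[] top = open.pop();
            super.endElement(top[0], top[1], top[2]);
        }
        super.endDocument();
    }

    private boolean isOpen(String qName) {
        for (String[] element : open) {
            if (element[2].equalsIgnoreCase(qName)) {
                return true;
            }
        }
        return false;
    }
}
```

Such a filter could be chained after the reader and before any other consumer, so that downstream handlers always see balanced events.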
 