Sorry I haven't replied to this one earlier. Still catching up with the
> -----Original Message-----
> From: saxon-help-admin@...
> [mailto:saxon-help-admin@... Behalf Of
> Sent: 03 September 2001 21:56
> To: saxon-help@...
> Subject: [saxon] International Characters and EUC support
> The recent topic on this list about Big-5 got me thinking.
> We're in the process of converting our server to use Saxon, and some
> potential customers are asking about Japanese EUC encoding.
> we've done Japanese with UTF-8 or UTF-16 encodings, and I
> don't think that
> we will have any problems with Unicode encodings.
> Our environment includes Sun JRE 1.3, the XML4J (which
> includes an older
> version of Xerces) parser and Java internationalization support
> (i18n.jar), on both NT/2000 and Solaris.
> I have a few questions:
> 1) The Saxon documentation lists the standard Java encodings
> as supported.
> As you know, EUC isn't one of the standard ones, support for
> it is included
> with i18n.jar. What do we have to do to make Saxon generate
> EUC encoded output?
There are two aspects to the problem. Firstly, Saxon tries to create a Java
Writer using the encoding specified in the xsl:output statement. So your
Java VM must support this encoding. Secondly, before passing characters to
the Java Writer, Saxon needs to know which characters are supported in that
encoding, because characters that aren't supported will be represented by
character references (e.g. &#x1234;). So if the encoding isn't one of those
built in to Saxon, you need to write a Java module that implements the
com.icl.saxon.charcode.PluggableCharacterSet interface. It needs to
implement two methods, inCharSet(), which determines whether a given
character is supported by the encoding, and getEncodingName(), which returns
the name of the encoding as known to the Java VM (which is not always the
same as the ISO encoding name). To tell Saxon where to load the
PluggableCharacterSet implementation from, you need to set up a system
property - details are in the extensibility.html file.
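
As a rough illustration of the two methods described above, here is a minimal sketch. It makes several assumptions: the small interface declared at the top is only a stand-in so that the example compiles without Saxon on the classpath (real code would implement com.icl.saxon.charcode.PluggableCharacterSet, whose exact signatures should be checked against the Saxon documentation), the class name EncoderBackedCharacterSet is hypothetical, and it relies on java.nio.charset, which needs a newer JVM than JRE 1.3. It asks the JVM's own encoder whether a character is representable instead of hard-coding the EUC tables.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Stand-in (assumption) for com.icl.saxon.charcode.PluggableCharacterSet,
// declaring only the two methods named in the text above.
interface PluggableCharacterSetSketch {
    boolean inCharSet(int ch);
    String getEncodingName();
}

// Hypothetical class: decides character support by asking the JVM's encoder.
class EncoderBackedCharacterSet implements PluggableCharacterSetSketch {
    private final String javaEncodingName;
    private final CharsetEncoder encoder;

    // javaEncodingName is the name as known to the Java VM, which may
    // differ from the ISO/IANA name used in xsl:output.
    EncoderBackedCharacterSet(String javaEncodingName) {
        this.javaEncodingName = javaEncodingName;
        this.encoder = Charset.forName(javaEncodingName).newEncoder();
    }

    // True if the character can be written directly in this encoding;
    // anything else would have to be emitted as a character reference.
    // Synchronized because a CharsetEncoder is not thread-safe.
    public synchronized boolean inCharSet(int ch) {
        if (ch < 0 || ch > Character.MAX_CODE_POINT) {
            return false;
        }
        if (ch <= 0xFFFF) {
            return encoder.canEncode((char) ch);
        }
        // Supplementary characters must be tested as a surrogate pair.
        return encoder.canEncode(new String(Character.toChars(ch)));
    }

    public String getEncodingName() {
        return javaEncodingName;
    }
}
```

Since Saxon loads the implementation by class name, a real version would presumably need to be a public class with a no-argument constructor, for example one that fixes the encoding name to the JVM's name for Japanese EUC (historically "EUC_JP"). That detail is an assumption here; extensibility.html is the authority.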
> 2) Your reply about the AElfred parser got me thinking. In our
> environment, we build the XML document with Xerces, and we
> can build any
> encoding (including EUC) that Xerces supports. We pass the
> XML document to
> Saxon as a DOMSource, so no character conversion or parsing should be
> required. However, we're passing the stylesheet to Saxon as
> a StreamSource
> (actually, we make a javax.xml.transform.Templates object,
> which we store
> in our internal cache, when the user calls for a stylesheet,
> we look up the
> saved Templates object and generate a new Transformer from it). If the
> Japanese data is in the stylesheet, not the XML document,
> what can we do to
> ensure that it gets converted properly? If the XSL file is
> unicode(UTF-8), AElfred should work, right?
Yes, AElfred will handle UTF-8 without problem. And you're right, if you
pass the document as a DOMSource or SAXSource (or even as a StreamSource
using a Reader) then Saxon doesn't need to know anything about the original
encoding. Of course, you can use any parser you like for the stylesheet,
just as you can for the source document.
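
To make the Reader route concrete, here is a minimal sketch using only standard JAXP calls. The class and method names (StylesheetCache, compile, run) are hypothetical, and it uses whichever TransformerFactory JAXP locates on the classpath, which may or may not be Saxon. The caller names the encoding when decoding the stylesheet bytes, so the XML parser only ever sees characters and never has to recognise the encoding itself.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.Charset;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: compile a stylesheet once, reuse it many times.
final class StylesheetCache {

    // Decode the stylesheet bytes ourselves and hand the parser a Reader,
    // so the byte encoding is handled by the JVM rather than the parser.
    static Templates compile(InputStream bytes, Charset encoding) {
        try {
            Reader reader = new InputStreamReader(bytes, encoding);
            return TransformerFactory.newInstance()
                    .newTemplates(new StreamSource(reader));
        } catch (TransformerException e) {
            throw new IllegalStateException("Stylesheet did not compile", e);
        }
    }

    // Create a fresh Transformer from the cached Templates for each request
    // and return the result as a String.
    static String run(Templates templates, String sourceXml) {
        try {
            Transformer transformer = templates.newTransformer();
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(sourceXml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```

A caller would pass something like Charset.forName("EUC-JP") as the encoding, assuming the JVM in use provides that charset. Note that a stylesheet supplied through a bare Reader has no system identifier, so relative xsl:import or xsl:include references would need one to be set on the StreamSource.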
> Will using a DOMSource for the stylesheet slow the process down any?
> Templates generation should happen very infrequently (only at startup
Using a SAXSource is better than a DOMSource. But for the stylesheet it
doesn't make much difference. It makes a bigger difference for the source
document.
> Alternatively, what's the best way to tell Saxon to use
> Xerces instead of
> AElfred? I've successfully used
> TransformerFactory.setAttribute() in the past.
If you want to use the same parser for the source document and the
stylesheet (and any other documents loaded using document(), for example),
the simplest way is to set the system properties
javax.xml.parsers.SAXParserFactory or ...DocumentBuilderFactory. You can do
this from your source code (System.setProperty() method) or from the command
line. Saxon's TransformerFactory.setAttribute() method is an alternative
when you want to use different parsers for the source document and