I'm using Saxon to convert from XML into a textual CSV format. In that process I also want to convert to the ISO-8859-1/Windows-1252 encoding, and have therefore set the "encoding" property to one of these encodings. Unfortunately, my XML input may contain Unicode characters that do not exist in the chosen output character set. This is not a fatal error to me, but it is to Saxon! Is there any way of catching these errors and outputting e.g. question marks instead of the illegal characters? I'm calling Saxon from Java using the transform() method.
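You can use the "double-translate" method:
translate($x, $allowed-chars, '') gives you the characters that are NOT allowed, and
translate($x, translate($x, $allowed-chars, ''), '?') translates the disallowed characters into "?" characters. Note that translate() deletes any character in its second argument that has no counterpart in the third, so with a single '?' only the first disallowed character listed becomes "?" and the others are removed; pass a string of '?' characters at least as long as the list of disallowed characters if every one of them should be replaced.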
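To see what the double translate actually produces, here is a small sketch that evaluates it with the XPath 1.0 engine bundled with the JDK (javax.xml.xpath) rather than with Saxon. The class name DoubleTranslateDemo, the helper eval() and the sample strings are made up for illustration. It compares a single '?' as the third argument with a padded run of '?' characters.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK's XPath 1.0 engine, not Saxon.
class DoubleTranslateDemo {

    // Evaluates an XPath expression with $x and $allowed-chars bound to the given strings.
    static String eval(String expr, String x, String allowed) {
        Map<String, String> vars = new HashMap<>();
        vars.put("x", x);
        vars.put("allowed-chars", allowed);
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setXPathVariableResolver(name -> vars.get(name.getLocalPart()));
        try {
            // A dummy document serves as context; the expressions only use variables.
            return xpath.evaluate(expr, new InputSource(new StringReader("<dummy/>")));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String x = "a\u00e9\u20acb";   // 'a', e-acute, euro sign, 'b'
        String allowed = "ab";         // pretend only 'a' and 'b' are allowed
        // Single '?': the first disallowed character maps to '?', the second is deleted.
        System.out.println(eval("translate($x, translate($x, $allowed-chars, ''), '?')", x, allowed));
        // Padded '?' run: every disallowed character maps to '?'.
        System.out.println(eval("translate($x, translate($x, $allowed-chars, ''), '????')", x, allowed));
    }
}
```

The padded form relies on translate() ignoring surplus characters in its third argument, so the run of '?' only has to be long enough, not exact.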
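Alternatively, you could do something using regular expressions and replace() in Saxon 7.4.
Another 7.4 option is
for $c in string-to-unicode($x)
return (if ($c < 256) then $c else 63)
which gives the code points of $x with everything above 255 replaced by 63 ("?"); the resulting sequence then has to be turned back into a string. (In final XPath 2.0 the corresponding functions are string-to-codepoints() and codepoints-to-string().)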
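The same code-point idea can also be applied on the Java side, before or after the transformation. This is only a sketch: the class and method names are invented, and the "< 256" test matches ISO-8859-1 only. Windows-1252 maps the range 0x80-0x9F differently, so it would need its own check.

```java
// Hypothetical helper; mirrors the "if ($c < 256) then $c else 63" logic in plain Java.
class Latin1Filter {

    // Replaces every code point outside ISO-8859-1 (U+0000..U+00FF) with '?'.
    static String replaceNonLatin1(String input) {
        StringBuilder out = new StringBuilder(input.length());
        // codePoints() yields one int per character, so a surrogate pair
        // becomes a single '?' rather than two.
        input.codePoints().forEach(c -> out.appendCodePoint(c < 256 ? c : '?'));
        return out.toString();
    }

    public static void main(String[] args) {
        // e-acute is kept, the euro sign (U+20AC) is replaced.
        System.out.println(replaceNonLatin1("caf\u00e9 costs \u20ac5"));
    }
}
```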
Thank you for the reply. However, I was really looking for a way to catch and ignore the error in Java. Maybe by adding a custom character encoder that replaces the standard encoder? Would it be possible to inherit from the ISO-8859-1/Windows-1252 encoder?
If you supply a Writer as the output destination, Saxon will not attempt to do any encoding of the characters, and you can then do anything you like with them.
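As a rough sketch of what such a Writer could look like using only java.io and java.nio: the encoder below is told to substitute its replacement byte ('?' for ISO-8859-1) instead of failing on unmappable characters. The class name LenientWriterDemo and its methods are invented for this example. The Writer would be handed to the transformer as new StreamResult(writer), which is not shown here, and how Saxon treats the "encoding" property in that case is not covered by this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; only standard JDK classes are used.
class LenientWriterDemo {

    // Builds a Writer whose encoder replaces unencodable characters instead of reporting an error.
    static Writer lenientWriter(OutputStream out, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .onMalformedInput(CodingErrorAction.REPLACE);
        return new OutputStreamWriter(out, encoder);
    }

    // Convenience method for the demo: encodes a string through the lenient Writer.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = lenientWriter(bytes, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] encoded = encode("caf\u00e9 \u20ac5", StandardCharsets.ISO_8859_1);
        // Decode again only to display the result; the euro sign comes back as '?'.
        System.out.println(new String(encoded, StandardCharsets.ISO_8859_1));
    }
}
```

Passing the configured CharsetEncoder to OutputStreamWriter makes the replacement policy explicit rather than relying on defaults.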
Thank you, that was exactly what I wanted.
I tried to use a Writer as the output destination in the transform method, but
it still defaulted to Saxon's implementation. Is there any other way to make
this work? The following is a code snippet, based on the recommendation above:
FileOutputStream fos = new FileOutputStream(tmpFile);
Charset charset = Charset.forName("US-ASCII");
CharsetEncoder encoder = charset.newEncoder();
OutputStreamWriter dos = new OutputStreamWriter(fos, charset);
transformer.transform(new StreamSource(xmlFile), new StreamResult(dos));
I'm not sure that adding to an 8-year-old thread is a particularly smart thing
to do, but never mind...
Please be more specific about what you are trying to do and how the observed
effect differs from your expected effect. Ideally, please supply a
freestanding JUnit test case that I can run and that fails one of its
assertions. I can then determine whether Saxon is doing something wrong, or
whether your expectations of what it should do are wrong. (It's worth noting,
however, that this is almost nothing to do with Saxon - when Saxon writes
things to a Writer, it isn't going to behave differently from anyone else
writing the same things to the same Writer.)