How to catch output character encoding errors

Help
2003-03-13
2012-10-08
  • Bjarne Hald

    Bjarne Hald - 2003-03-13

    I'm using Saxon to convert XML into a textual CSV format. As part of that I also want to convert to the ISO-8859-1 or Windows-1252 encoding, so I have set the "encoding" output property to one of these. Unfortunately, my XML input may contain Unicode characters that do not exist in the chosen output character set. This is not a fatal error to me, but it is to Saxon! Is there any way of catching these errors and outputting, for example, question marks in place of the unencodable characters? I'm calling Saxon from Java using the transform() method.

     
    • Michael Kay

      Michael Kay - 2003-03-13

      You can use the "double-translate" method:

      <xsl:variable name="allowed-chars">abcde..ABCDE..123...</xsl:variable>

      then:

      translate($x, $allowed-chars, '') gives you the characters that are NOT allowed, and

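      translate($x, translate($x, $allowed-chars, ''), $qmarks) translates the disallowed characters into "?" characters, where $qmarks is a string of "?" characters at least as long as the list of disallowed characters. (With a single '?' as the third argument, only the first disallowed character becomes "?" and the others are deleted, because translate() removes characters that have no counterpart in the third argument.)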
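The double-translate idea can be tried out from Java without a stylesheet. The following is only a sketch: it uses the JDK's built-in XPath 1.0 engine (javax.xml.xpath) rather than Saxon, and the class name, method name and the $qmarks variable are made up for illustration. The third argument is padded to one "?" per input character, since translate() deletes any character that has no counterpart in its third argument.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of Saxon or of the original thread.
class DoubleTranslateDemo {

    // Replaces every character of input that is not listed in allowedChars with '?'.
    static String replaceDisallowed(String input, String allowedChars) {
        // One '?' per input character, so every disallowed character has a counterpart.
        char[] marks = new char[input.length()];
        Arrays.fill(marks, '?');
        String qmarks = new String(marks);

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Supply $x, $allowed-chars and $qmarks to the expression.
        xpath.setXPathVariableResolver(name -> {
            switch (name.getLocalPart()) {
                case "x": return input;
                case "allowed-chars": return allowedChars;
                case "qmarks": return qmarks;
                default: return null;
            }
        });
        try {
            // The inner translate() yields the disallowed characters;
            // the outer translate() maps each of them to '?'.
            return xpath.evaluate(
                "translate($x, translate($x, $allowed-chars, ''), $qmarks)",
                new InputSource(new StringReader("<dummy/>")));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(replaceDisallowed("caf\u00e9 \u20ac5", "caf 5"));
    }
}
```

The dummy document only provides a context node; the expression itself depends on the variables alone.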

      Alternatively, you could do something using regular expressions and replace() in Saxon 7.4.

      Another 7.4 option is

      unicode-to-string(
      for $c in string-to-unicode($x)
      return if ($c < 256) then $c else 63)

      Michael Kay
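For readers who would rather apply the code-point filter on the Java side, the same logic can be sketched as below. This is an illustration, not Saxon code: the class and method names are invented, and the cut-off of 255 mirrors the `$c < 256` test above, which matches ISO-8859-1 but not exactly Windows-1252.

```java
// Hypothetical helper; names are illustrative only.
class CodePointFilter {

    // Returns input with every code point above maxCodePoint replaced by replacement.
    static String replaceAbove(String input, int maxCodePoint, char replacement) {
        StringBuilder out = new StringBuilder(input.length());
        // Iterate over code points rather than chars so that a surrogate pair
        // is replaced by a single replacement character.
        input.codePoints().forEach(cp ->
            out.appendCodePoint(cp <= maxCodePoint ? cp : replacement));
        return out.toString();
    }

    public static void main(String[] args) {
        // 63 in the XPath expression is the code point of '?'.
        System.out.println(replaceAbove("na\u00efve \u20ac", 255, '?'));
    }
}
```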

       
    • Bjarne Hald

      Bjarne Hald - 2003-03-15

      Thank you for the reply. However, I was really looking for a way to catch and ignore the error in Java, maybe by adding a custom character encoder that replaces the standard encoder. Would it be possible to inherit from the ISO-8859-1/Windows-1252 encoder?

      Bjarne Hald

       
      • Michael Kay

        Michael Kay - 2003-03-15

        If you supply a Writer as the output destination, Saxon will not attempt to do any encoding of the characters, and you can then do anything you like with them.

        Michael Kay
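A minimal sketch of that approach follows, assuming the standard JAXP API: the transformation writes characters to a StringWriter, and the encoding step is done afterwards with a CharsetEncoder set to replace unmappable characters. The stylesheet, class name and method names are invented for the example, and TransformerFactory.newInstance() returns whichever XSLT processor is configured (Saxon if it is on the classpath, otherwise the JDK's built-in one).

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; the stylesheet below is a stand-in for a real XML-to-CSV stylesheet.
class WriterDestinationDemo {

    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'><xsl:value-of select='/row/cell'/></xsl:template>"
        + "</xsl:stylesheet>";

    // Runs the transformation into a Writer, so the result stays as characters.
    static String transformToString(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter chars = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(chars));
            return chars.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    // Encodes text, substituting the charset's replacement byte for unmappable characters.
    static byte[] encodeLeniently(String text, Charset charset) {
        try {
            ByteBuffer bytes = charset.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .onMalformedInput(CodingErrorAction.REPLACE)
                .encode(CharBuffer.wrap(text));
            byte[] result = new byte[bytes.remaining()];
            bytes.get(result);
            return result;
        } catch (CharacterCodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = transformToString("<row><cell>caf\u00e9 \u20ac5</cell></row>");
        byte[] latin1 = encodeLeniently(text, StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

Buffering the whole result in memory keeps the sketch short; for large outputs, a Writer that encodes as it goes would avoid holding the full text.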

         
    • Bjarne Hald

      Bjarne Hald - 2003-04-09

      Thank you, that was exactly what I wanted.

      Bjarne Hald

       
  • Kavita Cardoz

    Kavita Cardoz - 2011-10-19

    I tried to use a Writer as the output destination in the transform method, but
    it still defaulted to Saxon's implementation. Is there any other way to make
    this work? The following is a code snippet, based on the recommendation above:

    FileOutputStream fos = new FileOutputStream(tmpFile);
    Charset charset = Charset.forName("US-ASCII");
    // Encoder that writes "?" for characters US-ASCII cannot represent
    CharsetEncoder encoder = charset.newEncoder();
    encoder.replaceWith("?".getBytes());
    encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
    // Pass the configured encoder (not just the charset) so its settings are used
    OutputStreamWriter dos = new OutputStreamWriter(fos, encoder);
    transformer.transform(new StreamSource(xmlFile), new StreamResult(dos));

     
  • Michael Kay

    Michael Kay - 2011-10-20

    I'm not sure that adding to an 8-year-old thread is a particularly smart thing
    to do, but never mind...

    Please be more specific about what you are trying to do and how the observed
    effect differs from your expected effect. Ideally, please supply a
    freestanding JUnit test case that I can run and that fails one of its
    assertions. I can then determine whether Saxon is doing something wrong, or
    whether your expectations of what it should do are wrong. (It's worth noting,
    however, that this is almost nothing to do with Saxon - when Saxon writes
    things to a Writer, it isn't going to behave differently from anyone else
    writing the same things to the same Writer.)
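The Writer behaviour can be checked in isolation, with no XSLT processor involved. The sketch below (class and method names are invented) wraps an OutputStream in an OutputStreamWriter whose CharsetEncoder is set to replace unmappable characters, then writes a string through it. Whether a given processor also escapes or rejects characters itself, based on its "encoding" output property, is outside what this sketch shows.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not taken from the thread.
class ReplacingWriterDemo {

    // Returns a Writer that encodes to charset and substitutes the encoder's
    // replacement byte for anything the charset cannot represent.
    static Writer replacingWriter(OutputStream out, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
            .onUnmappableCharacter(CodingErrorAction.REPLACE)
            .onMalformedInput(CodingErrorAction.REPLACE);
        return new OutputStreamWriter(out, encoder);
    }

    // Writes text through such a Writer and returns the bytes that came out.
    static byte[] writeThrough(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = replacingWriter(sink, charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = writeThrough("price: \u20ac5", StandardCharsets.US_ASCII);
        System.out.println(new String(bytes, StandardCharsets.US_ASCII));
    }
}
```

A Writer built this way can be handed to new StreamResult(writer) in place of the ByteArrayOutputStream-backed one used here.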

     
