Non-UTF-8 characters in documents

Help
2003-12-16
2003-12-19
  • Hagop Chemedikian

    Hi,
    I'm using babeldoc 1.2 to convert documents from EDIFACT (plain text files). The conversion consists of two stages. First I convert the input EDIFACT document into XML, and then I convert the XML document into another EDIFACT document using the XSLTransform pipeline stage.
    The intermediate XML document is stored in the system for backup purposes, so I absolutely need it.

    The problem is that some EDIFACT input documents contain non-UTF-8 characters, so babeldoc raises an error and stops the conversion. I attach the error trace at the end of this post.

    Does anybody know if there is a way to escape the non-UTF-8 characters, or just to replace them with blank spaces, using some feature of babeldoc?
    Thanks, Hagop

    ERROR TRACE ..

    xml2xml-2 Error: com.babeldoc.core.pipeline.PipelineException: [XslTransformPipelineStage.process]
    <2003-12-16 17:04:59,159> ERROR [Thread-0] :  [AsynchronousFeeder$1.run]
    com.babeldoc.core.pipeline.PipelineException: [XslTransformPipelineStage.process]
            at com.babeldoc.core.pipeline.stage.XslTransformPipelineStage.process(Unknown Source)
            at com.babeldoc.core.pipeline.PipelineStage.processStage(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStage(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.process(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStageResult(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStageResults(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStage(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.process(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStageResult(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStageResults(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStage(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.process(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStageResult(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStageResults(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStage(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.process(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStageResult(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStageResults(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.processPipelineStage(Unknown Source)
            at com.babeldoc.core.pipeline.processor.SyncPipelineStageProcessor.process(Unknown Source)
            at com.babeldoc.core.pipeline.PipelineStageFactory.process(Unknown Source)
            at com.babeldoc.core.pipeline.PipelineFactory.process(Unknown Source)
            at com.babeldoc.core.pipeline.PipelineFactoryFactory.process(Unknown Source)
            at com.babeldoc.core.pipeline.feeder.SynchronousFeeder.process(Unknown Source)
            at com.babeldoc.core.pipeline.feeder.AsynchronousFeeder.actuallyProcess(Unknown Source)
            at com.babeldoc.core.pipeline.feeder.AsynchronousFeeder$1.run(Unknown Source)
            at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown Source)
            at java.lang.Thread.run(Thread.java:534)
    Caused by: javax.xml.transform.TransformerException: java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence.
            at org.apache.xalan.transformer.TransformerImpl.fatalError(TransformerImpl.java:741)
            at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:715)
            at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:1129)
            at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:1107)
            at com.babeldoc.core.pipeline.stage.XslTransformPipelineStage.transformInputStream(Unknown Source)
            at com.babeldoc.core.pipeline.stage.XslTransformPipelineStage.transformDocument(Unknown Source)
            ... 28 more
    Caused by: java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence.
            at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
            at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
            at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
            at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
            at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
            at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
            at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
            at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
            at org.apache.xerces.parsers.DTDConfiguration.parse(Unknown Source)
            at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
            at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
            at org.apache.xml.dtm.ref.DTMManagerDefault.getDTM(DTMManagerDefault.java:495)
            at org.apache.xalan.transformer.TransformerImpl.transform(TransformerImpl.java:658)
            ... 32 more
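
The substitution asked about above (replacing bytes that are not valid UTF-8 with blank spaces) can be sketched in plain Java, outside babeldoc. This is only an illustration using the JDK's `CharsetDecoder`; the class and method names are hypothetical and are not part of babeldoc, and it assumes the raw input bytes can be pre-processed before they reach the pipeline.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a babeldoc class.
class Utf8Sanitizer {

    /** Decodes raw bytes as UTF-8, replacing every malformed sequence with one blank. */
    static String decodeReplacingInvalid(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(" ");
        try {
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Not expected: with REPLACE the decoder substitutes instead of failing.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // An EDIFACT-like segment whose accented letter is a single ISO-8859-1 byte (0xE9).
        byte[] raw = "NAD+BY+Caf\u00e9::91'".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decodeReplacingInvalid(raw));
    }
}
```

Note that this loses the original character. If the real encoding of the input is known, decoding with that charset keeps the data intact.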

     
    • Hagop Chemedikian

      I found the cause of the problem. The FLATTOXML pipeline stage always creates documents with UTF-8 encoding. If the input file contains non-UTF-8 characters, the output of the FLATTOXML stage is not consistent, because it declares itself to be UTF-8 but contains other characters.
      Is there a way to declare the encoding of the output document in the FLATTOXML stage, as in XlsToXml, which has an encoding pipeline parameter, for example?

      Thanks in advance for the help
      Hagop
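
To spot documents with the mismatch described above (declared as UTF-8 but containing other bytes), a strict decoding pass is one option. The sketch below uses only the JDK; the class and method names are hypothetical, and it is meant as a stand-alone check, not as something babeldoc provides.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a babeldoc class.
class Utf8Check {

    /** Returns true only if every byte sequence in the data is well-formed UTF-8. */
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder hit a byte sequence that is not legal UTF-8.
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "<name>Caf\u00e9</name>";
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.UTF_8)));
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```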

       
    • Dejan Krsmanovic

      I haven't used this pipeline stage, but I guess you should be able to set the encoding in the XML conversion document. As far as I can see, the FlatToXML stage uses the DigesterConversionMarshaller class (com.babeldoc.conversion.flatfile.digester.DigesterConversionUnmarshaller).
      From the source code, UTF-8 is the default encoding, but it isn't hardcoded, and you can specify it in the header of the conversion XML document.

      I guess you can get more information from Bruce, but he is pretty busy these days with other things. Try asking this question on the mailing list.

       
    • Hagop Chemedikian

      I found the solution.
      In the header of the flattoxml conversion document, I have to set the encoding tag to the encoding I need.

      Thanks for the help
      Hagop
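
Why a matching encoding declaration resolves the error can be illustrated with the JDK's own XML parser. The sketch below does not use babeldoc or its conversion file format (which is not shown in this thread); it only assumes that the intermediate XML ends up with a header naming the real encoding of its bytes. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of babeldoc.
class DeclaredEncodingDemo {

    /** Builds a small XML document whose bytes are always ISO-8859-1, whatever the header claims. */
    static byte[] latin1Xml(String declaredEncoding) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\"?>"
                + "<name>Caf\u00e9</name>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Returns the root element's text, or empty if the parser rejects the document. */
    static Optional<String> rootText(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return Optional.of(doc.getDocumentElement().getTextContent());
        } catch (SAXException | IOException e) {
            // Raised when the bytes do not match the declared encoding.
            return Optional.empty();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Header says UTF-8 but the bytes are ISO-8859-1: the parser rejects the document.
        System.out.println(rootText(latin1Xml("UTF-8")).isPresent());
        // Header matches the bytes: the document parses and the accented letter survives.
        System.out.println(rootText(latin1Xml("ISO-8859-1")).orElse("(rejected)"));
    }
}
```

The same bytes are accepted or rejected depending only on the header, which is consistent with the `Invalid byte 1 of 1-byte UTF-8 sequence` error in the trace above.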

       
    • Mitch Christensen

      I also need to process EDI (820) documents, and was considering FlatToXml.  Would you be willing to share pointers and/or post your mapping file?

       
