XML serialization: invalid character

2013-01-31
2013-04-25
  • Jörg Tiedemann

    Jörg Tiedemann - 2013-01-31

    For many PDF files, the conversion fails during XML serialization. The following error message is generated:

    Couldn't serialize document: The character '^B' is an invalid XML character
    Exception in thread "main" java.io.IOException: The character '^B' is an invalid XML character
            at org.apache.xml.serialize.BaseMarkupSerializer.fatalError(Unknown Source)
            at org.apache.xml.serialize.BaseMarkupSerializer.surrogates(Unknown Source)
            at org.apache.xml.serialize.XMLSerializer.printText(Unknown Source)
            at org.apache.xml.serialize.BaseMarkupSerializer.characters(Unknown Source)
            at org.apache.xml.serialize.BaseMarkupSerializer.serializeNode(Unknown Source)
            at org.apache.xml.serialize.XMLSerializer.serializeElement(Unknown Source)
            at org.apache.xml.serialize.BaseMarkupSerializer.serializeNode(Unknown Source)
            at org.apache.xml.serialize.XMLSerializer.serializeElement(Unknown Source)
            at org.apache.xml.serialize.BaseMarkupSerializer.serializeNode(Unknown Source)
            at org.apache.xml.serialize.XMLSerializer.serializeElement(Unknown Source)
            at org.apache.xml.serialize.BaseMarkupSerializer.serializeNode(Unknown Source)
            at org.apache.xml.serialize.BaseMarkupSerializer.serializeNode(Unknown Source)
            at org.apache.xml.serialize.BaseMarkupSerializer.serialize(Unknown Source)
            at at.ac.tuwien.dbai.pdfwrap.ProcessFile.serializeXML(ProcessFile.java:683)
            at at.ac.tuwien.dbai.pdfwrap.ProcessFile.main(ProcessFile.java:552)

    I can provide example PDF files if necessary. Here is a link to one of them:

    http://bookshop.europa.eu/is-bin/INTERSHOP.enfinity/WFS/EU-Bookshop-Site/en_GB/-/EUR/ViewPublication-Start;pgid=y8dIS7GUWMdSR0EAlMEUUsWb0000HtZnxf0p;sid=Hz_BHcXrVlrBCIlroXFfuqfOvfy9TcI4Gew=?PublicationKey=A13008132&CatalogCategoryID=QN4KABste0YAAAEjFZEY4e5L

     
  • Jörg Tiedemann

    Jörg Tiedemann - 2013-02-01

    I changed src/at/ac/tuwien/dbai/pdfwrap/model/document/TextSegment.java to strip control characters from the text, which makes it more robust:

    128c128,130
    <         this.text = text;
    ---
    >         String cleanText = text.replaceAll("\\p{Cntrl}", "");
    >         this.text = cleanText;

    Now I sometimes get invalid characters in the output (for example U+0E00), but at least the tool produces output and no longer crashes.
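
A possible refinement of the patch above, offered as a sketch rather than a tested fix for pdfwrap: `\p{Cntrl}` also removes tab, line feed and carriage return, which XML 1.0 allows, while it leaves other characters that XML 1.0 forbids (unpaired surrogates, U+FFFE, U+FFFF). The helper below keeps exactly the characters in the XML 1.0 `Char` production. The class and method names (`XmlCharFilter`, `stripInvalidXmlChars`) are hypothetical and not part of the project. Note that U+0E00 lies inside the allowed range, so this filter would not remove it; that character probably points to a separate text-extraction or font-mapping issue.

```java
// Hypothetical helper, not part of pdfwrap: drops every code point that is
// outside the XML 1.0 Char production
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** Returns true if the code point may appear in an XML 1.0 document. */
    public static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Removes all code points that XML 1.0 does not allow; null stays null. */
    public static String stripInvalidXmlChars(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // codePointAt returns a lone surrogate as-is, so unpaired
            // surrogates fall into the forbidden range and are dropped.
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Assuming the assignment shown in the diff, the call site could then read `this.text = XmlCharFilter.stripInvalidXmlChars(text);`. The `^B` from the stack trace is U+0002, which this filter removes, while tabs and line breaks inside a segment are preserved.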

     
