It's often useful to start addressing this kind of problem by a change
of terminology and mind-set: instead of referring to it as an XML file,
start calling it a non-XML file. This corrects the perspective on the
problem: it's the data that's wrong, not the parser, and it's the data
that needs to be fixed. It's also worth realizing that Saxon is reading
the file using an off-the-shelf XML parser (you can choose which one to
use), and that the content is being rejected by the XML parser, not by
Saxon itself.

There are various ways you can attempt to repair bad XML (that is, turn
non-XML into XML). The best way, of course, is to fix the program that
is producing the bad data in the first place. If you can't do that, then
your strategy has to be focused on the kind of problems you need to
repair. If it's only invalid characters like this, the best way is
probably to write an implementation of InputStream that filters an
underlying InputStream to remove or substitute characters that are not
legal in XML. You can then use this filtered input stream as the input
to your document builder.
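
A minimal sketch of such a filter follows. It rests on some assumptions: the class name XmlControlCharFilter is made up for illustration, the file is in an ASCII-compatible encoding such as UTF-8 or ISO-8859-1 (so a byte below 0x20 always stands for that control character), and replacing each illegal control character with a space is acceptable. It only handles the C0 control characters that XML 1.0 forbids, such as the 0xB in the reported error; it does not repair other kinds of bad input.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical filter: replaces control bytes that XML 1.0 forbids with a space. */
public class XmlControlCharFilter extends FilterInputStream {

    public XmlControlCharFilter(InputStream in) {
        super(in);
    }

    // XML 1.0 allows only tab (0x09), line feed (0x0A) and carriage return (0x0D)
    // below 0x20. Bytes >= 0x80 arrive here as negative values or as values
    // above 0x1F, so they are never touched.
    private static boolean isIllegalControl(int b) {
        return b >= 0 && b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();              // -1 at end of stream passes through
        return isIllegalControl(b) ? ' ' : b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len); // n is -1 at end of stream
        for (int i = off; i < off + n; i++) {
            if (isIllegalControl(buf[i])) {
                buf[i] = ' ';
            }
        }
        return n;
    }
}
```

With the s9api code quoted below, the filtered stream could then be supplied as something like builder.build(new StreamSource(new XmlControlCharFilter(new FileInputStream("testFile.xml")))), since DocumentBuilder.build also accepts a javax.xml.transform.Source.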
On 12/12/2012 09:58, Rajath.Sakkari@... wrote:
> I am using Saxon 9 to parse an XML file, and reading the XML is done
> in the following way:
> Processor proc = new Processor(false);
> DocumentBuilder builder = proc.newDocumentBuilder();
> XdmNode doc = builder.build(new File("testFile.xml"));
> and querying in the following way:
> // XPath query
> xpath = proc.newXPathCompiler();
> XPathSelector selector = xpath.compile(anXPathExpr).load();
> This throws an error if XML file being read has non unicode characters,
> 1) is there a possibility (in saxon) to create temp XML in proper
> format considering I do not have rights on the original XML files
> being read.
> I also tried something like this,
> FileInputStream fis = new FileInputStream(fileIN);
> byte[] contents = new byte[fis.available()];
> fis.read(contents, 0, contents.length);
> String asString = new String(contents, "ISO8859_1");
> byte[] newBytes = asString.getBytes("UTF8");
> File tempXML = new File("temp.xml");
> FileOutputStream fos = new FileOutputStream(tempXML.getAbsolutePath());
> but it throws the error "An invalid XML character (Unicode: 0xb) was
> found in the element content of the document."
> when selector.setContextItem(getDoc()); is called.
> is there a possibility to check if there are any characters of these
> kind in XML file and modify them,
> considering i have rights on original XML files being read.
> When I open XML file i see as below
> Thanks in advance.
> saxon-help mailing list archived at http://saxon.markmail.org/