#2 Exception in client returning XML with accented characters

I_CAN'T_WORK
open
nobody
None
5
2004-10-15
2004-10-15
Roberto Giaccio
No

I ran into problems when performing remote searches that return XML
records containing accented characters.
The stack trace was the following:

java.io.UTFDataFormatException: Invalid byte 2 of 3-byte UTF-8 sequence.
    at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
    at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
    at com.k_int.IR.Syntaxes.XMLRecord.getDocument(XMLRecord.java:114)

The problem occurred in com.k_int.IR.Syntaxes.XMLRecord.
Further analysis showed that the XML string was sent correctly,
that is, the "orig" variable in XMLRecord.getDocument() held the
correct text, so the problem occurred in the DOM conversion.
The DOM conversion is performed in the following line of
XMLRecord.getDocument():

doc = docBuilder.parse(new ByteArrayInputStream(orig.getBytes()));

What happens is that the XML metadata record (a String) is
converted to a byte array by orig.getBytes() using the platform's
default character encoding, which in my case is not UTF-8. Then
docBuilder.parse() assumes from the XML metadata header that
the document is UTF-8, but the earlier conversion to a byte array
produced byte sequences that are not valid UTF-8 for the accented
characters, which causes the exception while parsing.
This can be solved by requesting UTF-8 explicitly, as follows:

doc = docBuilder.parse(new ByteArrayInputStream(orig.getBytes("UTF8")));
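For illustration, here is a minimal standalone sketch of the mismatch described above. It is not the library's code: the class name EncodingMismatchDemo, its helper methods and the sample record are hypothetical, and ISO-8859-1 merely stands in for a non-UTF-8 platform default. The last helper shows a possible alternative, parsing from a character stream so that no byte conversion is involved at all.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the library discussed in this ticket.
public class EncodingMismatchDemo {

    // Encodes the XML string with the given charset, then parses the bytes.
    // Returns the root element's text, or null if the parser rejects the input.
    static String parseWith(String xml, Charset charset) throws Exception {
        DocumentBuilder docBuilder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            Document doc = docBuilder.parse(
                    new ByteArrayInputStream(xml.getBytes(charset)));
            return doc.getDocumentElement().getTextContent();
        } catch (IOException | SAXException e) {
            // Malformed UTF-8 input is reported through one of these types.
            return null;
        }
    }

    // Alternative: hand the parser characters instead of bytes, so the
    // encoding named in the XML declaration is not used for decoding.
    static String parseFromReader(String xml) throws Exception {
        DocumentBuilder docBuilder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = docBuilder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Sample record (made up) that declares UTF-8 and contains an accented letter.
        String orig = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<title>citt\u00e0</title>";

        // Bytes produced with a non-UTF-8 charset do not match the declaration.
        System.out.println("ISO-8859-1 bytes: "
                + parseWith(orig, StandardCharsets.ISO_8859_1));
        // Bytes produced with UTF-8 match the declaration.
        System.out.println("UTF-8 bytes:      "
                + parseWith(orig, StandardCharsets.UTF_8));
        // No byte conversion at all.
        System.out.println("StringReader:     " + parseFromReader(orig));
    }
}
```

In the ISO-8859-1 case the accented letter becomes the single byte 0xE0, which a UTF-8 decoder treats as the start of a 3-byte sequence, so the following byte is rejected. That is consistent with the "Invalid byte 2 of 3-byte UTF-8 sequence" message in the stack trace.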

I tested the fix: the XML metadata string is now correctly converted
to and from a byte array, and all records are returned to the client.
Hence, if you agree with the analysis, I suggest applying the fix in
the next release of your library.
Thanks

Discussion