Hi,
I'm using XMLUnit primarily for HTMLDocumentBuilder and
TolerantSaxDocumentBuilder (nice tools, btw!). If I build a Document
from HTML that contains &nbsp;, the String contents of the Node in
question have unexpected bytes where the space should be. I ran into
this while trying to split the resulting string on whitespace.
For example, with <body>test&nbsp;after</body>, the body text string I
get has the following UTF-8 bytes:
bytes: 116 101 115 116 -62 -96 97 102 116 101 114
I was expecting to find 32 where the -62 and -96 are. Bug?
I'm using the latest version with Java 1.6.0.16.
Thanks,
Tony Rozga
Here is a test (not JUnit though :):
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.custommonkey.xmlunit.HTMLDocumentBuilder;
import org.custommonkey.xmlunit.TolerantSaxDocumentBuilder;
import org.custommonkey.xmlunit.XMLUnit;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
public class XmlUnitBug {
    public static void main(String[] args) {
        try {
            // HTML fragment with a non-breaking space entity between the words
            String html = "test&nbsp;after";
            TolerantSaxDocumentBuilder tolerantSaxDocumentBuilder =
                new TolerantSaxDocumentBuilder(XMLUnit.newTestParser());
            HTMLDocumentBuilder builder =
                new HTMLDocumentBuilder(tolerantSaxDocumentBuilder);
            Document doc = builder.parse(html);
            XPathFactory factory = XPathFactory.newInstance();
            XPath xpath = factory.newXPath();
            XPathExpression expr = xpath.compile("/html/body");
            String body = ((NodeList) expr.evaluate(doc,
                XPathConstants.NODESET)).item(0).getTextContent();
            System.out.println("body: " + body);
            // Print the UTF-8 bytes of the body text, one per token
            System.out.print("bytes: ");
            byte[] bytes = body.getBytes("UTF-8");
            for (byte b : bytes) {
                System.out.print(b);
                System.out.print(" ");
            }
            System.out.println("");
        } catch (Exception ex) {
            System.out.println("whoops: " + ex);
        }
    }
}
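
A side note on the byte values quoted above: -62 and -96 are the signed forms of 0xC2 and 0xA0, which is the UTF-8 encoding of U+00A0 (NO-BREAK SPACE), the character that the HTML entity &nbsp; denotes. The sketch below only decodes those reported bytes and shows one possible way to split on both ordinary whitespace and U+00A0. It does not use XMLUnit at all, and the class name NbspCheck and the helper splitOnSpaces are hypothetical names made up for this illustration.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class NbspCheck {
    // Hypothetical helper: split on ASCII whitespace and on U+00A0.
    // Java's default \s class does not include the no-break space.
    static String[] splitOnSpaces(String text) {
        return text.split("[\\s\\u00A0]+");
    }

    public static void main(String[] args) {
        // The bytes reported in the post
        byte[] reported = {116, 101, 115, 116, -62, -96, 97, 102, 116, 101, 114};
        String decoded = new String(reported, Charset.forName("UTF-8"));

        // Code point of the character between "test" and "after"
        System.out.println("code point: " + (int) decoded.charAt(4));

        // Plain \s+ leaves the string in one piece
        System.out.println(Arrays.toString(decoded.split("\\s+")));

        // Including U+00A0 in the character class yields two tokens
        System.out.println(Arrays.toString(splitOnSpaces(decoded)));
    }
}
```

If the goal is simply to treat the no-break space like a normal space, replacing it first with body.replace('\u00A0', ' ') and then splitting on \s+ should work equally well.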