From: Tony R. <ton...@ni...> - 2009-10-20 22:10:42
Hi,

I'm using XMLUnit primarily for HTMLDocumentBuilder and TolerantSaxDocumentBuilder (nice tools btw!). If I build a Document from HTML with &nbsp; in it, the String contents of the Node in question have weird bytes where the space should be. I ran into this while trying to split a resulting string on whitespace.

For example, with <body>test&nbsp;after</body>, the body text string I get has the following UTF-8 bytes:

bytes: 116 101 115 116 -62 -96 97 102 116 101 114

I was expecting to find 32 where the -62 and -96 are. Bug? I'm using the latest version with Java 1.6.0.16.

Thanks,
Tony Rozga

Here is a test (not JUnit though :):

    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathExpression;
    import javax.xml.xpath.XPathFactory;

    import org.custommonkey.xmlunit.HTMLDocumentBuilder;
    import org.custommonkey.xmlunit.TolerantSaxDocumentBuilder;
    import org.custommonkey.xmlunit.XMLUnit;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class XmlUnitBug {
        public static void main(String[] args) {
            try {
                String html = "test&nbsp;after";

                // Build a DOM Document from the HTML snippet
                TolerantSaxDocumentBuilder tolerantSaxDocumentBuilder =
                        new TolerantSaxDocumentBuilder(XMLUnit.newTestParser());
                HTMLDocumentBuilder builder = new HTMLDocumentBuilder(tolerantSaxDocumentBuilder);
                Document doc = builder.parse(html);

                // Pull the text content of the body element
                XPathFactory factory = XPathFactory.newInstance();
                XPath xpath = factory.newXPath();
                XPathExpression expr = xpath.compile("/html/body");
                String body = ((NodeList) expr.evaluate(doc, XPathConstants.NODESET))
                        .item(0).getTextContent();
                System.out.println("body: " + body);

                // Dump the UTF-8 bytes of the text
                System.out.print("bytes: ");
                byte[] bytes = body.getBytes("UTF-8");
                for (byte b : bytes) {
                    System.out.print(b);
                    System.out.print(" ");
                }
                System.out.println("");
            } catch (Exception ex) {
                System.out.println("whoops: " + ex);
            }
        }
    }
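
A side note on the bytes listed above: read as unsigned values, -62 and -96 are 0xC2 and 0xA0, which is the UTF-8 encoding of U+00A0 (NO-BREAK SPACE), the character that &nbsp; stands for. The sketch below only illustrates that arithmetic and one possible way to split on both kinds of space. It does not involve XMLUnit at all, and the class and method names (NbspSplitSketch, decodeReportedBytes, splitOnWhitespaceAndNbsp) are made up for the example.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class NbspSplitSketch {
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Decode the two signed bytes from the report (0xC2 0xA0 when read as
    // unsigned) and return the code point of the resulting character.
    public static int decodeReportedBytes() {
        byte[] reported = { (byte) -62, (byte) -96 };
        String decoded = new String(reported, UTF8);
        return decoded.codePointAt(0);
    }

    // Split on runs of ASCII whitespace or U+00A0. By default, \s in
    // java.util.regex does not match the no-break space, so it is listed
    // explicitly in the character class.
    public static String[] splitOnWhitespaceAndNbsp(String text) {
        return text.split("[\\s\\u00A0]+");
    }

    public static void main(String[] args) {
        // Prints "a0", i.e. U+00A0
        System.out.println(Integer.toHexString(decodeReportedBytes()));

        String body = "test\u00A0after";
        // Plain \s+ leaves the string in one piece
        System.out.println(body.split("\\s+").length); // prints 1
        // Including U+00A0 gives the two words
        System.out.println(Arrays.toString(splitOnWhitespaceAndNbsp(body))); // prints [test, after]
    }
}
```

Whether a no-break space should be treated as a word separator depends on the use case, so the character class here is an assumption about what the split is meant to do.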