From: john d. <jo...@ma...> - 2004-01-31 04:36:54
i'm trying to use jtidy (r7) to clean up/sanitize user-inputted html before displaying it on a website (in this case www.indymedia.org). i have run into some serious character encoding issues which i have been unable to resolve. specifically, i want utf8 input text to remain utf8 after being passed through jtidy. at the moment i cannot get this to happen: all utf8 characters *and* entity references like é show up in the output as ?

more detail:

basically the approach i am using is to parse/clean using jtidy, then walk the dom tree, outputting tags/attributes if they are approved. the code which sets up the walk is as follows:

    ByteArrayOutputStream result = new ByteArrayOutputStream();
    PrintWriter out = new PrintWriter(result);
    Tidy tidy = new Tidy();
    ByteArrayInputStream in = new ByteArrayInputStream(aText.getBytes());
    tidy.setMakeClean(true);
    tidy.setXmlOut(true);
    print(tidy.parseDOM(in, null), out);
    return result.toString();

and the "print" method which recursively processes nodes looks like this (cut down to just the part dealing with text nodes for brevity):

    private void print(Node node, PrintWriter out) {
        int type = node.getNodeType();
        <snip>
        case Node.TEXT_NODE:
            out.print(node.getNodeValue());
            break;
        }
        <snip>
    }

what am I doing wrong? i've tried numerous combinations of jtidy options with no success... perhaps I missed something?

thanks for any help you can offer,

john duda
john--at--manifestor.org

(please cc me on any replies as i am not subscribed to the list)

--
this is where my public key can be found:
gpg --keyserver pgp.mit.edu --recv-keys 03817826
Key fingerprint = 6C11 8D70 2ADE EFA9 498D 72CB 77EA 391A 0381 7826
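
note on the snippet in this post: aText.getBytes(), new PrintWriter(result) and result.toString() each convert between characters and bytes using the platform default charset, and the PrintWriter is never flushed before the result is read. the following is only a minimal sketch, using the plain JDK and no jtidy calls, of what pinning each of those steps to utf8 might look like. the class name Utf8RoundTrip and the helper roundTrip are hypothetical, and the read loop is just a stand-in for the parse and dom walk. jtidy would presumably also need to be told the input/output encoding through its own configuration, which is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {

    // Hypothetical helper: mirrors the shape of the setup code in the post,
    // but names UTF-8 at every byte<->char conversion.
    public static String roundTrip(String text) throws IOException {
        // Encode explicitly instead of relying on text.getBytes().
        ByteArrayInputStream in =
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));

        ByteArrayOutputStream result = new ByteArrayOutputStream();
        // Wrap the byte stream in a writer with an explicit charset.
        PrintWriter out = new PrintWriter(
                new OutputStreamWriter(result, StandardCharsets.UTF_8));

        // Placeholder for the parse + DOM walk: decode the bytes as UTF-8
        // and print each character to the writer.
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        int c;
        while ((c = reader.read()) != -1) {
            out.print((char) c);
        }

        // Flush before reading the buffer, otherwise pending output is lost.
        out.flush();

        // Decode explicitly instead of relying on result.toString().
        return new String(result.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String sample = "caf\u00e9";
        // Reports whether the text survived the round trip unchanged.
        System.out.println(sample.equals(roundTrip(sample)));
    }
}
```

if any one of these steps falls back to a default charset that cannot represent the character, the jdk substitutes ? for it, which would match the symptom described above. whether that is the only cause in this case is an assumption.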