Is there a way to tell the parser to handle international characters?
If I parse this into a DOM:
<td valign="top" class="tdbgsr"><b>Durée du contrat :</b></td>
And then get the text back out with this, using the org.w3c.dom.Document methods:
if (curNode.getNodeType() == Node.TEXT_NODE
        && curNode.getNodeValue() != null
        && curNode.getNodeValue().trim().length() > 0) {
    String xx = curNode.getNodeValue().trim();
    System.out.println(xx);
}
I get
Dur
e du contrat :
This indicates that the parser split the text into two nodes and dropped the accented "é", which was encoded as an HTML character entity in the source.
I got around this by replacing the entity with "é" before parsing, but this seems hokey.
Thanks
Bob
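For comparison, here is a minimal sketch of the same extraction using the XML parser that ships with the JDK, which is an assumption: the thread does not say which parser is in use, and it may behave differently. The class and method names are hypothetical. The sketch writes the accented character as the numeric reference `&#233;`, which any XML parser resolves without a DTD (a named entity such as `&eacute;` is only defined for HTML), and it reads the text with `getTextContent()`, which concatenates all descendant text nodes, so a split into several text nodes does not matter.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class AccentedText {
    /** Parses a small well-formed XML fragment and returns the text of its first <b> element. */
    static String firstBoldText(String xml) {
        try {
            // JDK built-in parser (assumption: not the parser discussed in the thread).
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Encode explicitly as UTF-8, the XML default when no declaration is present.
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element bold = (Element) doc.getElementsByTagName("b").item(0);
            // getTextContent() joins every descendant text node into one string.
            return bold.getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        // &#233; is the numeric character reference for U+00E9 (e with acute accent).
        String xml = "<td valign=\"top\" class=\"tdbgsr\"><b>Dur&#233;e du contrat :</b></td>";
        System.out.println(firstBoldText(xml));
    }
}
```

If the parser in question does split the text around the entity, calling `normalize()` on the parent node before walking the tree merges adjacent text nodes, but it cannot restore a character the parser has already dropped.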
Have you found a solution for this yet? I have the same issue and haven't found a workaround. I tried with
Document d = _parser.parse( myString.getBytes(), "utf-8" );
but it seems that the UTF-8 character encoding doesn't work.
Let me know if you have found a solution.
Regards,
-Hoang Long
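One thing that may be worth checking in the call above: `myString.getBytes()` with no argument encodes with the platform default charset, which was not necessarily UTF-8 on JDKs before 18. If the second argument to `parse` names the encoding of the byte array (an assumption, since the parser's API is not shown in the thread), the bytes and the declared encoding can disagree. The sketch below uses only JDK classes, with hypothetical names, and simulates a Latin-1 platform default to show the mismatch.

```java
import java.nio.charset.StandardCharsets;

class EncodingMismatch {
    /** Encodes the text as ISO-8859-1 (a simulated platform default) and decodes it as UTF-8. */
    static String mismatchedRoundTrip(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        // The single byte 0xE9 is not valid UTF-8 on its own, so the accent is lost.
        return new String(latin1, StandardCharsets.UTF_8);
    }

    /** Encodes and decodes with the same explicit charset. */
    static String matchedRoundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "Dur\u00e9e du contrat :";
        System.out.println(text.equals(mismatchedRoundTrip(text))); // false
        System.out.println(text.equals(matchedRoundTrip(text)));    // true
    }
}
```

Passing `myString.getBytes(StandardCharsets.UTF_8)` instead of `myString.getBytes()` removes this particular source of mismatch, though it would not help if the parser itself mishandles the entity.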