International characters

Help
Bob Sparks
2008-09-09
2013-04-17
  • Bob Sparks
    2008-09-09

    Is there a way to tell the parser to handle international characters?

    If I parse this into a DOM:

    <td valign="top" class="tdbgsr"><b>Dur&#233;e du contrat :</b></td>

    And then read the text back out using the org.w3c.dom.Document methods:

    if (curNode.getNodeType() == Node.TEXT_NODE && curNode.getNodeValue() != null && curNode.getNodeValue().trim().length() >0) {
        String xx = curNode.getNodeValue().trim();
        System.out.println(xx);
    }

    I get

       Dur
       e du contrat :

    This indicates that the parser split the text into two nodes and dropped
    the accented "é", which was encoded as "&#233;".

    I got around this by replacing the "&#233;" with "é", but this seems hokey.

    Thanks

    Bob
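
For comparison, here is a minimal sketch of the same text-node walk run through the JDK's built-in JAXP parser. This is an assumption: the thread does not name the parser in use, and JAXP only accepts well-formed XML, so it is not a drop-in replacement for an HTML parser. The class and method names (EntityTextDemo, textNodes) are made up for illustration. It shows that a DOM parser which resolves numeric character references yields one text node containing "é", and that calling normalize() merges adjacent text nodes if a parser does split them.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class EntityTextDemo {
    // Same test as in the post: keep text nodes whose trimmed value is non-empty.
    private static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE
                && node.getNodeValue() != null
                && node.getNodeValue().trim().length() > 0) {
            out.add(node.getNodeValue().trim());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    // Parses a well-formed fragment with JAXP (a stand-in for the unnamed parser)
    // and returns its non-empty text nodes in document order.
    static List<String> textNodes(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            doc.normalize(); // merge adjacent text nodes into one
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<td valign=\"top\" class=\"tdbgsr\"><b>Dur&#233;e du contrat :</b></td>";
        for (String text : textNodes(xml)) {
            System.out.println(text);
        }
    }
}
```

If the text prints correctly here but not with the other parser, the entity handling of that parser is the likely culprit. Note that what the console displays for "é" also depends on the terminal's encoding.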

     
  • Have you found a solution yet? I have the same issue and haven't found a workaround. I tried:

    Document d  =  _parser.parse( myString.getBytes(), "utf-8" );  

    but passing the character encoding "utf-8" doesn't seem to work.

    Let me know if you have found a solution.

    Regards,

    -Hoang Long
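
One thing worth checking in the call above: String.getBytes() with no argument encodes using the platform default charset, which may not be UTF-8, while the parser is told the bytes are "utf-8". The sketch below only illustrates that mismatch with plain JDK classes. The class name CharsetMismatchDemo is hypothetical, and it says nothing about how the unnamed `_parser` treats the bytes. It may also be unrelated to the original problem, since a numeric reference such as "&#233;" is pure ASCII and encodes identically in both charsets.

```java
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Decodes bytes as UTF-8, the way a consumer told "utf-8" would.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "Dur\u00e9e du contrat :";

        // Stand-in for a non-UTF-8 platform default: "é" becomes the single byte 0xE9.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        // Explicit UTF-8: "é" becomes the two bytes 0xC3 0xA9.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(text.equals(decodeAsUtf8(latin1))); // false: the accented letter is lost
        System.out.println(text.equals(decodeAsUtf8(utf8)));   // true: encoding and decoding agree
    }
}
```

Under that assumption, encoding explicitly with myString.getBytes(StandardCharsets.UTF_8) (or getBytes("UTF-8") on older JDKs) keeps the bytes consistent with the declared encoding.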