Java Mozilla Html Parser / Discussion / Help: International characters

International characters

Forum: Help

Creator: Bob Sparks

Created: 2008-09-09

Updated: 2013-04-17

Bob Sparks - 2008-09-09

Is there a way to tell the parser to handle international characters?

If I parse this into a DOM:

<td valign="top" class="tdbgsr"><b>Durée du contrat :</b></td>

and then read it back out with the org.w3c.dom methods:

// Print the trimmed value of every non-blank text node.
String value = curNode.getNodeValue();
if (curNode.getNodeType() == Node.TEXT_NODE && value != null && value.trim().length() > 0) {
    String xx = value.trim();
    System.out.println(xx);
}

I get

   Dur
   e du contrat :

This indicates that the parser split the text into two nodes and dropped the
accented "é", which was encoded "é".
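
For reference, here is a self-contained sketch of the traversal above. It does not use the Mozilla parser. Instead it builds a small DOM by hand with the JDK's own DocumentBuilderFactory, with two adjacent text nodes under the b element to mimic the split described. The class and method names (TextNodeWalker, collectText, sampleDocument) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class TextNodeWalker {
    // Recursively collect the trimmed value of every non-blank text node.
    static void collectText(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String value = node.getNodeValue();
            if (value != null && !value.trim().isEmpty()) {
                out.add(value.trim());
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    // Hand-built stand-in for the parsed page (an assumption, not parser output):
    // <td><b>[text node][text node]</b></td>
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element td = doc.createElement("td");
            Element b = doc.createElement("b");
            b.appendChild(doc.createTextNode("Dur"));
            b.appendChild(doc.createTextNode("e du contrat :"));
            td.appendChild(b);
            doc.appendChild(td);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> texts = new ArrayList<>();
        collectText(sampleDocument(), texts);
        for (String t : texts) {
            System.out.println(t);
        }
    }
}
```

Walking from the document root this way visits both text nodes separately, which is why the output appears on two lines.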

I got around this by replacing the "é" with "é" but this seems hokey.
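
A minimal sketch of that kind of pre-parse replacement follows. It assumes the source markup used named entities such as &eacute; (the exact forms involved are not recoverable from this post). The class name, method name and entity table are hypothetical and only cover a few French characters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EntityWorkaround {
    // Hypothetical table of named entities to swap out; extend as needed.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&eacute;", "\u00E9"); // e with acute accent
        ENTITIES.put("&egrave;", "\u00E8"); // e with grave accent
        ENTITIES.put("&agrave;", "\u00E0"); // a with grave accent
    }

    // Replace each known named entity with its literal character
    // before the markup is handed to the parser.
    static String replaceNamedEntities(String html) {
        String result = html;
        for (Map.Entry<String, String> entry : ENTITIES.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<b>Dur&eacute;e du contrat :</b>";
        System.out.println(replaceNamedEntities(html));
    }
}
```

This only sidesteps the problem for the entities listed in the table, so it remains a workaround and not a fix in the parser.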

Thanks

Bob

Hoang Long Nguyen - 2010-09-23

Have you found a solution yet? I have the same issue and haven't found a workaround. I tried:

Document d = _parser.parse( myString.getBytes(), "utf-8" );

but it seems that the UTF-8 character encoding doesn't work.
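
One thing that may be worth ruling out (an assumption, not a confirmed cause): String.getBytes() with no argument uses the JVM's default charset, so on a JVM whose default is not UTF-8 the bytes would not match the "utf-8" label passed alongside them. The sketch below encodes explicitly with StandardCharsets.UTF_8. The class and helper names are made up, and the parse(byte[], String) call in the comment is taken from the line above as quoted, not checked against the library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetMismatch {
    // Encode the markup explicitly as UTF-8 so the bytes match the declared charset.
    static byte[] toUtf8(String html) {
        return html.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String html = "<b>Dur\u00E9e du contrat :</b>";

        byte[] platformBytes = html.getBytes(); // platform default charset
        byte[] utf8Bytes = toUtf8(html);        // always UTF-8

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("bytes identical: " + Arrays.equals(platformBytes, utf8Bytes));

        // The call from the post would then become (signature as quoted above, unverified):
        // Document d = _parser.parse(toUtf8(myString), "utf-8");
    }
}
```

If the two byte arrays already match on your machine, the encoding mismatch is not the culprit and the problem lies elsewhere.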

Let me know if you have found a solution.

Regards,

-Hoang Long
