Unicode chars lost?

Brought to you by: derrickoswald

Forum: Help

Creator: Hoang Long Nguyen

Created: 2010-09-19

Updated: 2013-04-27

Hoang Long Nguyen - 2010-09-19

Hi all,

I created a DOM document like this:

MozillaParser parser = new MozillaParser();
doc = parser.parse( htmlContent.getBytes(), htmlEncoding );

Then, after running a query to extract content (Node, NodeList, ...), I use getTextContent() to see the content. The problem is that if htmlContent contains Unicode characters, the result of getTextContent() contains "?" characters, even though htmlEncoding is the same as the encoding of the HTML document. Do you see any possible issue related to the HTML parser?

Thank you for your kind help,

regards

-Hoang Long
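
One possible cause, not confirmed anywhere in this thread: htmlContent.getBytes() with no argument encodes the string using the platform default charset, not htmlEncoding. If that default charset cannot represent a character, Java replaces it (with "?" for charsets such as ISO-8859-1 or US-ASCII) before the parser ever sees the bytes. The sketch below shows only that standard-library behavior; the class name, the roundTrip helper, and the sample text are illustrative assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {
    // Encodes the text to bytes and decodes it again with the same charset.
    // Characters the charset cannot represent are replaced during encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Sample text with one character outside ISO-8859-1 (U+1EC7).
        String text = "Vi\u1EC7t Nam";

        // ISO-8859-1 has no mapping for U+1EC7, so it is lost here.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1));

        // UTF-8 can encode every Unicode character, so nothing is lost.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));
    }
}
```

If this is what happens in your case, passing the same encoding to both calls, for example parser.parse( htmlContent.getBytes( htmlEncoding ), htmlEncoding ), may be worth trying, provided htmlEncoding names a charset that can represent the text. String.getBytes(String) throws UnsupportedEncodingException for unknown charset names.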

Hoang Long Nguyen - 2010-09-20

I am thinking of a workaround: encode all the text content except the HTML tags, and then, after performing the XPath query, decode the result of getTextContent() to retrieve the original text. It seems like a complex solution.

And I am still looking for the true cause of my problem.
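
A minimal sketch of the encode/decode workaround described above, assuming a simple ASCII-only escape scheme (a backslash, the letter u, and four hex digits per UTF-16 unit). The class and method names are hypothetical and the scheme is only one of many possible choices. Because HTML tag names and most markup are plain ASCII, escaping every non-ASCII character in the document leaves the tags untouched.

```java
public class AsciiEscape {
    // Replaces each non-ASCII UTF-16 unit, and each literal backslash,
    // with a six-character ASCII token so the result is pure ASCII.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 128 || c == '\\') {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Reverses escape(). Assumes the input was produced by escape(),
    // so every backslash starts a well-formed token.
    static String unescape(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '\\' && i + 6 <= text.length() && text.charAt(i + 1) == 'u') {
                out.append((char) Integer.parseInt(text.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Vi\u1EC7t Nam</p>";
        String safe = escape(html);          // pass this to the parser
        String restored = unescape(safe);    // apply to getTextContent() results
        System.out.println(safe);
        System.out.println(restored.equals(html));
    }
}
```

The idea would be to call escape on htmlContent before parsing and unescape on each getTextContent() result. This hides the symptom rather than fixing the encoding mismatch, so it is probably only worth it as a last resort.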

Hoang Long Nguyen - 2010-09-24

Oops, this forum is for htmlparser. Anyway, htmlParser is the answer to my question.