Hi all,
I created a DOM doc like this:
Then, after a query to extract content (Node, NodeList, ...), I use getTextContent() to see the content. The problem is that if the htmlContent has Unicode characters, the result of getTextContent() contains "?" characters, even though htmlEncoding is set to the same encoding as the HTML document. Do you see any possible issue related to the HTML parser?
Thank you for your kind help,
regards
-Hoang Long
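One way to narrow this down is to check whether the "?" is really inside the String returned by getTextContent(), or only appears when the String is written out. The sketch below is not the original setup: it assumes the JDK's built-in XML parser as a stand-in for the HTML parser, well-formed markup, and a hypothetical class name `EncodingCheck`. It parses UTF-8 bytes with an explicit encoding, then shows that encoding the same text to a charset that cannot represent it (US-ASCII here) is enough to produce "?".

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the JDK XML parser stands in for the HTML parser.
final class EncodingCheck {

    // Parse the bytes with an explicitly declared encoding and return
    // the text content of the root element.
    static String parseText(byte[] bytes, String encoding) {
        try {
            InputSource source = new InputSource(new ByteArrayInputStream(bytes));
            source.setEncoding(encoding);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(source);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    // Encode to a charset that cannot represent the text, roughly what a
    // console or writer with a narrow charset does, then read it back.
    static String lossyView(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u escapes keep the source file itself free of encoding problems.
        String html = "<html><body><p>Ti\u1EBFng Vi\u1EC7t</p></body></html>";
        String text = parseText(html.getBytes(StandardCharsets.UTF_8), "UTF-8");

        // Hex code points can be printed safely on any console.
        text.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();

        System.out.println(lossyView(text)); // prints "Ti?ng Vi?t"
    }
}
```

If the code points printed for the real document are correct, the DOM text is probably fine and the "?" is likely introduced later, for example by System.out or by a writer using the platform default charset. If U+003F shows up in the code points, the loss happened before or during parsing.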
I am thinking of a workaround: encode all the text content except the HTML tags, and then, after performing the XPath query, decode the result of getTextContent() to retrieve the original text. It seems a complex solution.
And I am still looking for the true cause of my problem.
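For reference, the encode/decode workaround could look roughly like the sketch below. It assumes numeric character references (`&#NNNN;`) as the encoding and that the tags themselves are plain ASCII, so only text is affected; the class name `NcrCodec` is hypothetical. Note that a conforming parser normally resolves such references on its own, in which case the decode step may not be needed at all.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: round-trips non-ASCII text through decimal
// numeric character references such as &#7879;.
final class NcrCodec {

    private static final Pattern NCR = Pattern.compile("&#(\\d{1,7});");

    // Replace every non-ASCII code point with a decimal reference.
    // ASCII characters, including the markup itself, pass through unchanged.
    static String encode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    // Turn decimal references back into characters. References that do not
    // name a valid code point are left as they are.
    static String decode(String text) {
        Matcher m = NCR.matcher(text);
        StringBuilder out = new StringBuilder(text.length());
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            int cp = Integer.parseInt(m.group(1));
            if (Character.isValidCodePoint(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }
}
```

With this, `NcrCodec.encode("Vi\u1EC7t")` gives `Vi&#7879;t`, and `decode` restores the original. One limitation of the sketch: `decode` cannot tell references it produced from references that were already in the document as literal text.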
Oops, this forum is for htmlparser. Anyway, htmlParser is the answer to my question.