Unicode chars lost?

Help
2010-09-19
2013-04-27
  • Hoang Long Nguyen

    Hi all,

    I created a DOM document like this:

    MozillaParser parser = new MozillaParser();
    doc = parser.parse( htmlContent.getBytes(),  htmlEncoding );

    Then, after a query to extract content (Node, NodeList, ...), I use getTextContent() to see the content. The problem is that if htmlContent contains Unicode characters, the result of getTextContent() contains "?" characters, even though the htmlEncoding passed to the parser is the same as the encoding of the HTML document. Do you see any possible issue related to the HTML parser?

    Thank you for your kind help,

    regards

    -Hoang Long
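
One possible cause, offered as an assumption since the thread does not confirm it: htmlContent.getBytes() with no argument encodes the string with the JVM's default charset, which may differ from htmlEncoding. Characters that the charset cannot represent are replaced with "?" before the parser ever sees the bytes. The sketch below shows only that String behavior; the class and method names are hypothetical and MozillaParser is not involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how String.getBytes can lose characters.
class GetBytesDemo {
    // Encode the text to bytes with the given charset, then decode the
    // bytes again with the same charset.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // One non-ASCII character (U+1EC7) inside a small HTML fragment.
        String html = "<p>Vi\u1EC7t</p>";

        // US-ASCII cannot represent U+1EC7, so it is replaced by '?'.
        String viaAscii = roundTrip(html, StandardCharsets.US_ASCII);
        System.out.println(viaAscii.equals("<p>Vi?t</p>")); // prints true

        // UTF-8 can represent every Unicode character, so nothing is lost.
        String viaUtf8 = roundTrip(html, StandardCharsets.UTF_8);
        System.out.println(viaUtf8.equals(html)); // prints true
    }
}
```

If this is what happens here, passing the bytes as htmlContent.getBytes(htmlEncoding), so that encoding and parsing use the same charset, may be enough to avoid the "?" characters.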

     
  • Hoang Long Nguyen

    I am thinking of a workaround: encode all the text content except the HTML tags, and then, after performing the XPath query, decode the result of getTextContent() to retrieve the original text. It seems a complex solution.

    And I am still looking for the true cause of my problem.
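
The encode/decode workaround described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the tags are assumed to be pure ASCII, so escaping every non-ASCII character in the whole document leaves them untouched, and the parser is assumed to pass backslash sequences through unchanged. The class and method names are hypothetical.

```java
// Hypothetical helper: makes text ASCII-only before parsing and restores
// the original characters afterwards.
class TextEscapeDemo {
    // Replace each non-ASCII UTF-16 code unit, and each backslash, with a
    // backslash-u escape of four hex digits. ASCII tags stay as they are.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 127 || c == '\\') {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // True if the four characters starting at 'from' are all hex digits.
    static boolean isHex4(String text, int from) {
        if (from + 4 > text.length()) {
            return false;
        }
        for (int i = from; i < from + 4; i++) {
            if (Character.digit(text.charAt(i), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    // Reverse escapeNonAscii: turn each escape back into its character.
    static String unescape(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("\\u", i) && isHex4(text, i + 2)) {
                out.append((char) Integer.parseInt(text.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Vi\u1EC7t</p>";
        String escaped = escapeNonAscii(html);          // ASCII-only, safe to pass as bytes
        System.out.println(escaped);
        System.out.println(unescape(escaped).equals(html)); // prints true
    }
}
```

In use, escapeNonAscii would run on htmlContent before parsing and unescape on each getTextContent() result. Backslashes are escaped as well so that text which already contains a backslash-u sequence survives the round trip.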

     
  • Hoang Long Nguyen

    Oops, this forum is for htmlparser. Anyway, htmlParser is the answer to my question.

     
