VTD-XML: The Future of XML Processing / Discussion / Open Discussion: Non-English characters not supported

Non-English characters not supported

Forum: Open Discussion

Creator: Pallav Sipani

Created: 2009-04-29

Updated: 2013-05-15

Pallav Sipani - 2009-04-29

Hi,
I tried parsing a UTF-8 encoded XML file using the RSSReader.java example from the SourceForge website, and I get an exception which reads:
"com.ximpleware.ParseException: UTF 8 encoding error: should never happen"

The line on which it fails is:

<description>The Pentagon has cultivated â€œmilitary analystsâ€? in a campaign to generate favorable news coverage of the Bush administrationâ€™s wartime performance.</description>

As is obvious, the problem is due to "â€".

I got the above UTF-8 encoded XML as a response from a website. When I checked the original content on the website, I noticed that the quotation marks ("") had been replaced with (â€) due to the UTF-8 encoding.

The original content was:
<description>The Pentagon has cultivated “military analysts” in a campaign to generate favorable news coverage of the Bush administration’s wartime performance.</description>

Another example of the occurrence of the above exception is while parsing the line:

<nick>à°ªà±?à°°à°µà±€à°£à±?</nick>

In this case the original content was in a regional language (Kannada, an Indian language, I think).

Any idea why this exception should occur? What is the fix for this?
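
The garbling described above has the typical shape of UTF-8 bytes being read with a single-byte charset. The sketch below reproduces it for the curly quote and apostrophe from the original content; it assumes the wrong charset was windows-1252, which the thread does not confirm, and `MojibakeDemo` and `misdecode` are names made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the text as UTF-8, then decode those bytes with the wrong charset.
    // windows-1252 is an assumption; the thread does not say which charset was involved.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // \u201C is the left curly quote, \u2019 is the curly apostrophe.
        String original = "\u201Cmilitary analysts, administration\u2019s";
        // Each curly character (3 bytes in UTF-8) turns into 3 separate characters.
        System.out.println(misdecode(original));
    }
}
```

With these inputs the left quote becomes the three characters "â€œ" and the apostrophe becomes "â€™", which matches the failing line quoted in the post.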

- jimmy zhang - 2009-04-29
  
  That does not seem to be valid UTF-8. According to the XML spec, if you don't declare the encoding, the default is UTF-8,
  so you have to declare the encoding of the XML as the right one (e.g. ISO-8859), and
  the problem should go away. Let me know whether it works or not.
  
- Pallav Sipani - 2009-04-30
  
  Please find my comments on the same in the bugs section.
  
- Pallav Sipani - 2009-05-08
  
  I think you were right about the encoding not being proper. Thanks for the help. Really appreciate it.
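
Since the thread concludes that the file was not properly encoded, one way to confirm this before handing the bytes to the parser is a strict UTF-8 check. This is a minimal sketch using only the JDK's `CharsetDecoder`; it is not part of VTD-XML, and `Utf8Check`, `isValidUtf8`, and the sample byte arrays are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns true only if the whole byte array is well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Curly quotes properly encoded as UTF-8.
        byte[] good = "\u201Cok\u201D".getBytes(StandardCharsets.UTF_8);
        // The same quotes as single bytes (0x93, 0x94), as a single-byte charset would store them.
        byte[] bad = { (byte) 0x93, 'o', 'k', (byte) 0x94 };
        System.out.println(isValidUtf8(good)); // true
        System.out.println(isValidUtf8(bad));  // false
    }
}
```

If the check fails on the raw file bytes, the document is in some other encoding, and the XML declaration needs to name that encoding rather than fall back on the UTF-8 default.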
  
