TinyXML / Discussion / Developer: Accessing non-standard english characters

Anonymous - 2006-01-07

I am using TinyXML as the base for my localisation code and am having trouble when reading files that use characters outside the English alphabet.

For example I have the string
<String Name="DesignedFor">Entworfen für eine Auflösung von 1024x768 oder grösser</String>

The problem comes when TinyXML is trying to access the ü and ö.

Now I use the following code to access the string (obviously this is simplified, but shows the basics).

stringElement = static_cast<TiXmlElement*>( tableRoot->FirstChild("String") );

// Save the actual string
tmpStr = stringElement->GetText();

Now when tmpStr is assigned a value, instead of it being what is between the tags, I get the following

"Entworfen fÃ¼r eine AuflÃ¶sung von 1024x768 oder grÃ¶sser"

As you can see, the ü and ö have come out as different characters. But instead of just being one different character, there are actually two!

Forgive my ignorance, but I tried to step through the parsing code, but failed miserably. Is there a quick fix to this? Am I doing something wrong or is it a bug in the code?

Hopefully this can be a quick fix, as the code needs to be completed as soon as possible.

Thanks
Spree

- Ellers - 2006-01-07
  
  I'm not an expert on different encodings but some comments in case they're helpful...
  
  TinyXml, being tiny, isn't intended to cover all encodings. Is the document you are parsing *definitely* in UTF-8? Have you supplied an encoding='UTF-8' declaration? Does the file display correctly in both IE and Firefox? (The more browsers that parse it correctly, the stronger the indication that the file itself is good.)
  
  IIRC the characters you are referring to, in UTF-8, *are* sent as two bytes. Which implies that the decoder isn't attempting to convert the two bytes to one logical char, which implies that the encoding isn't set.
  
  (Mind you, it could be something totally different...)
  
  Also, as a tip, this:
  
  stringElement = static_cast<TiXmlElement*>( tableRoot->FirstChild("String"))
  
  should be:
  
  stringElement = tableRoot->FirstChild("String")->ToElement()
  
  or
  
  stringElement = tableRoot->FirstChildElement("String")
  
  HTH
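  
  To illustrate the point above that these characters occupy two bytes in UTF-8, here is a minimal sketch of a UTF-8 encoder for a single code point. The name `encodeUtf8` is a hypothetical helper made up for this example; it is not part of TinyXML.

  ```cpp
  #include <cstdint>
  #include <string>

  // Hypothetical helper (not a TinyXML function): encode one Unicode
  // code point as UTF-8. Returns an empty string for values that are
  // not valid Unicode scalar values.
  std::string encodeUtf8(std::uint32_t cp) {
      std::string out;
      if (cp < 0x80) {
          // Plain ASCII: one byte.
          out += static_cast<char>(cp);
      } else if (cp < 0x800) {
          // U+0080..U+07FF: two bytes. Both umlauts land here.
          out += static_cast<char>(0xC0 | (cp >> 6));
          out += static_cast<char>(0x80 | (cp & 0x3F));
      } else if (cp < 0x10000) {
          if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are invalid
          out += static_cast<char>(0xE0 | (cp >> 12));
          out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
          out += static_cast<char>(0x80 | (cp & 0x3F));
      } else if (cp <= 0x10FFFF) {
          out += static_cast<char>(0xF0 | (cp >> 18));
          out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
          out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
          out += static_cast<char>(0x80 | (cp & 0x3F));
      }
      return out;
  }
  ```

  With this sketch, `encodeUtf8(0xFC)` (ü) yields the bytes 0xC3 0xBC and `encodeUtf8(0xF6)` (ö) yields 0xC3 0xB6, which is why each umlaut takes two `char` values in a UTF-8 string.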
  
- Anonymous - 2006-01-07
  
  I hadn't supplied an encoding='UTF-8' declaration, but even when this is included in the header, the return value is still as I described.
  
  Both IE and FF parse the file correctly (they are the only browsers I have on my system), but as both ü and ö are part of the ASCII character set, I would be surprised if Tiny was unable to parse them.
  
  Obviously GetText returns a const char*, and that may be the problem (maybe internally the code is converting any negative character values into something else, hence the result)?
  
  Spree
  
- Ellers - 2006-01-07
  
  That reminds me... I *think* if 'encoding=...' is not included, that it defaults to UTF-8 anyway.
  
  Note that *no* umlaut characters are part of ASCII.
  There is 'extended' ASCII, which IIRC is the same (or almost the same) as ISO-8859-1, which allows those chars to be stored as one byte. See http://www.lookuptables.com/
  
  If you hexdump the contents of the buffer I think you'll see that there are two bytes that correspond exactly to each of the umlaut chars that you have.
  
  What I don't know is whether TinyXml should convert the bytes to a different representation of that char or whether it leaves it in the src (UTF-8) format.
  
  What I can say for sure is that one time I was stumped for a week with a parser not converting umlaut chars right (not TinyXml) and it turned out the parser was right all along, but the console was not able to print chars in that encoding.
  
  I recommend a hexdump of the string, and verifying what Tiny does (converts chars to another representation, or leaves them in the source form).
  
  HTH
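  
  A minimal sketch of the hexdump suggested above, assuming the string has already been copied into a `std::string`. The function name `hexDump` is made up for this example. Each byte is cast to `unsigned char` first, so "negative" `char` values print as 80..ff rather than being sign-extended.

  ```cpp
  #include <cstddef>
  #include <cstdio>
  #include <string>

  // Hypothetical helper: return the bytes of `s` as lowercase hex pairs
  // separated by single spaces, e.g. "66 c3 bc 72".
  std::string hexDump(const std::string& s) {
      std::string out;
      char buf[4];  // two hex digits plus the terminating NUL
      for (std::size_t i = 0; i < s.size(); ++i) {
          // Go through unsigned char so bytes >= 0x80 are not sign-extended.
          unsigned value = static_cast<unsigned char>(s[i]);
          std::snprintf(buf, sizeof buf, "%02x", value);
          if (i != 0) out += ' ';
          out += buf;
      }
      return out;
  }
  ```

  If the dump of the word "für" shows `66 c3 bc 72`, the buffer holds UTF-8. If it shows `66 fc 72`, the buffer holds a single-byte encoding such as ISO-8859-1. Note that GetText() can return a null pointer, so check the result before constructing a `std::string` from it.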
  
- Ellers - 2006-01-07
  
  Have a look at the unit test code, xmltest.cpp. Search for "UTF-8". At a quick glance that indicates to me that Tiny keeps the string in the source encoding.
  
  I think if you hexdump (or view) your input XML file you'll see that the umlaut chars are stored in 2 bytes each. In which case Tiny is just loading your data exactly as you store it.
  
  WDYT?
  
- Anonymous - 2006-01-07
  
  I have viewed the source XML file in various editors, and in all cases, the text is displayed as I expect, rather than the dual characters I am getting.
  
  I have found a solution to the problem though, and it is due to the encoding of the file.
  
  Instead of using encoding="UTF-8", I have to use encoding="ISO-8859-1". This allows the file to be parsed correctly.
  
  The funny thing is that even with the UTF-8 encoding stated, all the other parsers were able to parse the file correctly, except for TinyXML.
  
  Maybe this is a bug in TinyXML?
  
  Thanks for the advice, I found the solution in the xmltest.cpp file.
  
  Spree
  
- Ellers - 2006-01-07
  
  I'm glad you found a solution.
  But I don't think you understood my points.
  
  Did you look at your original UTF-8 XML file in a *hex viewer*? I am fairly sure you will see that the umlaut chars are *two* bytes each. Naturally, any good editor will show those two bytes in the user form, which appears to you as a single umlaut char. Nonetheless, in UTF-8, those chars are stored with 2 bytes each. The editor is correctly hiding this from you.
  
  If I understand it correctly, TinyXml is doing exactly the *correct* thing, but your program is not able to use the UTF-8 strings Tiny gives you. Perhaps your program is expecting the UTF-8 strings to be converted to something else, like extASCII/8859?
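  
  The symptom in the original post can be reproduced without TinyXML. The sketch below treats every byte of its input as one ISO-8859-1 character, which is roughly what a program does when it displays UTF-8 bytes in a single-byte code page. The name `latin1ToUtf8` is made up for this example.

  ```cpp
  #include <string>

  // Hypothetical helper: interpret each byte of `bytes` as one
  // ISO-8859-1 character and re-encode the result as UTF-8.
  std::string latin1ToUtf8(const std::string& bytes) {
      std::string out;
      for (unsigned char c : bytes) {
          if (c < 0x80) {
              // ASCII is identical in both encodings.
              out += static_cast<char>(c);
          } else {
              // U+0080..U+00FF always needs two bytes in UTF-8.
              out += static_cast<char>(0xC0 | (c >> 6));
              out += static_cast<char>(0x80 | (c & 0x3F));
          }
      }
      return out;
  }
  ```

  Feeding it the UTF-8 bytes of "für" (66 C3 BC 72) produces the UTF-8 encoding of "fÃ¼r": the byte 0xC3 is read as Ã and the byte 0xBC as ¼. That matches the output Spree reported, which suggests the bytes were intact and only their interpretation was wrong.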
  
- igorv007 - 2006-01-07
  
  SpreeTree, if I may make a suggestion.
  
  Why don't you use a CDATA section? It should resolve your issue.
  
  - Ronald Fenner Jr - 2006-01-08
    
    Or you could do what I did to support other-language strings in an XML file, which was to encode them in something like base64 so that the string isn't really messed with. Then you just decode the string and recast it to whatever you need.
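    
    A minimal sketch of that base64 round trip using the standard alphabet with `=` padding. The names `base64Encode` and `base64Decode` are made up for this example and are not TinyXML functions; the decoder simply stops at the first `=` and skips characters outside the alphabet.

    ```cpp
    #include <cstddef>
    #include <string>

    static const std::string kBase64Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Hypothetical helper: encode arbitrary bytes as padded base64 text.
    std::string base64Encode(const std::string& in) {
        std::string out;
        unsigned buffer = 0;  // bit accumulator; only the low bits are used
        int bits = 0;         // number of unread bits in the accumulator
        for (unsigned char c : in) {
            buffer = (buffer << 8) | c;
            bits += 8;
            while (bits >= 6) {
                bits -= 6;
                out += kBase64Alphabet[(buffer >> bits) & 0x3F];
            }
        }
        // Flush leftover bits, zero-padded on the right to six bits.
        if (bits > 0) out += kBase64Alphabet[(buffer << (6 - bits)) & 0x3F];
        while (out.size() % 4 != 0) out += '=';
        return out;
    }

    // Hypothetical helper: decode base64 text back into the original bytes.
    std::string base64Decode(const std::string& in) {
        std::string out;
        unsigned buffer = 0;
        int bits = 0;
        for (char c : in) {
            if (c == '=') break;  // padding marks the end of the data
            std::size_t pos = kBase64Alphabet.find(c);
            if (pos == std::string::npos) continue;  // skip whitespace etc.
            buffer = (buffer << 6) | static_cast<unsigned>(pos);
            bits += 6;
            if (bits >= 8) {
                bits -= 8;
                out += static_cast<char>((buffer >> bits) & 0xFF);
            }
        }
        return out;
    }
    ```

    The element text then contains only ASCII, so no parser or code page can alter it. The cost is that the XML file is no longer readable or editable by translators without a tool.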
    
- otan - 2006-03-10
  
  Hello,
  
  for a project I am working on, I have made an extension to the TinyXml sources (using v2.4.3), so that umlauts and some special characters can be written as entities such as "&auml;" in the XML document to load. After loading an XML file, you get an "ä" when calling ...->Value(). (In the saved file, it is converted back to "&auml;".)
  
  In our project, we need this for reading and writing XHTML files, and it works fine as far as I have tested it. The non-breaking space entity "&nbsp;" can also be used.
  
  The files I changed are here:
  http://www.lrz-muenchen.de/~tobiasnadler/qa/tinyext/tinyxml.h
  http://www.lrz-muenchen.de/~tobiasnadler/qa/tinyext/tinyxml.cpp
  http://www.lrz-muenchen.de/~tobiasnadler/qa/tinyext/tinyxmlparser.cpp
  
  To use the extensions, you have to define TIXML_XHTML.
  
- TopFire3 - 2006-04-14
  
  This is a typical encoding conversion issue. TinyXML works well. Your problem is that you are getting the UTF-8 value of your string. You must implement a function to convert your string to the 1252 encoding if you want to see it displayed correctly on a Windows system that uses code page 1252.
  
  newString = ConvertFromUTF8(oldString).
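  
  `ConvertFromUTF8` is not defined in the post, so here is a minimal sketch of what such a function might look like. It is a stand-in under stated assumptions: it handles only code points up to U+00FF, where ISO-8859-1 and Windows-1252 agree except for the 0x80..0x9F range, and it replaces everything else with `?`. The name `utf8ToLatin1` is made up for this example.

  ```cpp
  #include <cstddef>
  #include <string>

  // Hypothetical stand-in for ConvertFromUTF8: convert UTF-8 text to
  // ISO-8859-1. Characters above U+00FF and malformed sequences are
  // replaced with `replacement`.
  std::string utf8ToLatin1(const std::string& utf8, char replacement = '?') {
      std::string out;
      std::size_t i = 0;
      while (i < utf8.size()) {
          unsigned char c = static_cast<unsigned char>(utf8[i]);
          bool hasContinuation =
              i + 1 < utf8.size() &&
              (static_cast<unsigned char>(utf8[i + 1]) & 0xC0) == 0x80;
          if (c < 0x80) {
              // ASCII passes through unchanged.
              out += static_cast<char>(c);
              ++i;
          } else if ((c == 0xC2 || c == 0xC3) && hasContinuation) {
              // Lead bytes C2 and C3 cover U+0080..U+00FF.
              out += static_cast<char>(((c & 0x03) << 6) | (utf8[i + 1] & 0x3F));
              i += 2;
          } else {
              // Not representable: emit one replacement and skip the
              // rest of the sequence.
              out += replacement;
              ++i;
              while (i < utf8.size() &&
                     (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80) {
                  ++i;
              }
          }
      }
      return out;
  }
  ```

  On Windows, the usual route is a pair of calls to the system conversion functions rather than a hand-written loop; this sketch only shows the byte arithmetic for the umlaut case discussed in this thread.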
  