Accessing non-standard english characters

Developer
Anonymous
2006-01-07
2013-05-20
  • Anonymous

    Anonymous - 2006-01-07

    I am using TinyXML as the base for my localisation code and am having trouble when creating files that use characters outside the English language.

    For example I have the string
    <String Name="DesignedFor">Entworfen für eine Auflösung von 1024x768 oder grösser</String>

    The problem comes when TinyXML is trying to access the ü and ö.

    Now I use the following code to access the string (obviously this is simplified, but shows the basics).

    stringElement = static_cast<TiXmlElement*>( tableRoot->FirstChild("String") );

    // Save the actual string
    tmpStr = stringElement->GetText();

    Now when tmpStr is assigned a value, instead of it being what is between the tags, I get the following

    "Entworfen fÃ¼r eine AuflÃ¶sung von 1024x768 oder grÃ¶sser"

    As you can see, the ü and ö have come out as different characters.  But instead of just being one different character, there are actually two!

    Forgive my ignorance, but I tried to step through the parsing code, but failed miserably.  Is there a quick fix to this?  Am I doing something wrong or is it a bug in the code?

    Hopefully this can be a quick fix, as the code needs to be completed as soon as possible.

    Thanks
    Spree

     
    • Ellers

      Ellers - 2006-01-07

      I'm not an expert on different encodings but some comments in case they're helpful...

      TinyXml, being tiny, isn't intended to cover all encodings. Is the document you are parsing *definitely* in UTF-8? Have you supplied an encoding='UTF-8' declaration? Does the file view correctly in both IE and Firefox? (The more browsers that parse it correctly, the stronger the indication that the file itself is good.)

      IIRC the characters you are referring to, in UTF-8, *are* sent as two bytes. Which implies that the decoder isn't attempting to convert the two bytes to one logical char, which implies that the encoding isn't set.

      (Mind you, it could be something totally different...)

      Also, as a tip, this:

      stringElement = static_cast<TiXmlElement*>( tableRoot->FirstChild("String"))

      should be:

      stringElement = tableRoot->FirstChild("String")->ToElement()

      or

      stringElement = tableRoot->FirstChildElement("String")

      HTH

       
    • Anonymous

      Anonymous - 2006-01-07

      I hadn't supplied an encoding='UTF-8' declaration, but even when this is included in the header, the return value is still as I described.

      Both IE and FF parse the file correctly (they are the only browsers I have on my system), but as both ü and ö are part of the ASCII character set, I would be surprised if Tiny was unable to parse them.

      Obviously GetText returns a const char*, and that may be the problem (maybe internally the code is converting any negative character values into something else, hence the result)?

      Spree

       
    • Ellers

      Ellers - 2006-01-07

      That reminds me... I *think* if 'encoding=...' is not included, that it defaults to UTF-8 anyway.

      Note that *no* umlaut characters are part of ASCII.
      There is 'extended' ASCII, which IIRC is the same (or almost the same) as ISO-8859-1, which allows those chars to be stored as one byte. See http://www.lookuptables.com/

      If you hexdump the contents of the buffer I think you'll see that there are two bytes that correspond exactly to each of the umlaut chars that you have.

      What I don't know is whether TinyXml should convert the bytes to a different representation of that char or whether it leaves it in the src (UTF-8) format.

      What I can say for sure is that one time I was stumped for a week with a parser not converting umlaut chars right (not TinyXml) and it turned out the parser was right all along, but the console was not able to print chars in that encoding.

      I recommend a hexdump of the string, and verifying what Tiny does (convert chars to another rep, or leave them in src form).
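
A minimal sketch of the hexdump suggested above, in plain standard C++. The name HexDump is hypothetical and not part of TinyXML; the idea is to pass it the string returned by the parser and look at the raw bytes instead of trusting how a console or debugger renders them.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of TinyXML): returns the bytes of `text`
// as space-separated lowercase hex pairs.
std::string HexDump(const std::string& text) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < text.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned value = static_cast<unsigned char>(text[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

If the dump of the word "für" shows 66 c3 bc 72, the string is UTF-8 (ü stored as the two bytes c3 bc). If it shows 66 fc 72, the string is in a single-byte encoding such as ISO-8859-1.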

      HTH

       
    • Ellers

      Ellers - 2006-01-07

      Have a look at the unit test code, xmltest.cpp. Search for "UTF-8". At a quick glance that indicates to me that Tiny keeps the string in the source encoding.

      I think if you hexdump (or view) your input XML file you'll see that the umlaut chars are stored in 2 bytes each. In which case Tiny is just loading your data exactly as you store it.

      WDYT?

       
    • Anonymous

      Anonymous - 2006-01-07

      I have viewed the source XML file in various editors, and in all cases, the text is displayed as I expect, rather than the dual characters I am getting.

      I have found a solution to the problem though, and it is due to the encoding of the file.

      Instead of using encoding="UTF-8", I have to use encoding="ISO-8859-1".  This allows the file to be parsed correctly.

      Funny thing is, though, that even with the UTF-8 encoding stated, all the other parsers were able to parse the file correctly, except for TinyXML.

      This may be a bug in TinyXML?

      Thanks for the advice, I found the solution in the xmltest.cpp file.

      Spree

       
    • Ellers

      Ellers - 2006-01-07

      I'm glad you found a solution.
      But I don't think you understood my points.

      Did you look at your original UTF-8 XML file in a *hex viewer*? I am fairly sure you will see that the umlaut chars are *two* bytes. Naturally, any good editor will show those two bytes in the user form, which appears to you as a single umlaut char. Nonetheless, in UTF-8, those chars are stored with 2 bytes each. The editor is correctly hiding this from you.

      If I understand it correctly, TinyXml is doing exactly the *correct* thing, but your program is not able to use the UTF-8 strings Tiny gives you. Perhaps your program is expecting the UTF-8 strings to be converted to something else, like extASCII/8859?
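
To illustrate the point about bytes versus visible characters, here is a small sketch in standard C++. CountUtf8CodePoints is a hypothetical name, and the function assumes its input is valid UTF-8. It counts every byte that is not a continuation byte (bit pattern 10xxxxxx), which gives the number of characters an editor would show.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by skipping
// continuation bytes (10xxxxxx). Assumes the input is valid UTF-8.
std::size_t CountUtf8CodePoints(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

For the UTF-8 bytes of "für" (66 c3 bc 72), size() reports 4 bytes while this function reports 3 characters. A program that treats each byte as one character will therefore print two symbols where the ü should be.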

       
    • igorv007

      igorv007 - 2006-01-07

      SpreeTree, if I may make a suggestion.

      Why don't you use a CDATA section? It should resolve your issue.

       
      • Ronald Fenner Jr

        Or you could do what I did to support other-language strings in an XML file, which was to encode them in something like base64 so that the string isn't really messed with. Then you just decode the string and recast it to whatever.

         
    • otan

      otan - 2006-03-10

      Hello,

      for a project I am working on, I have made an extension to the TinyXml sources (using v2.4.3), so that umlauts and some special characters can be written like "&auml;" in the XML document to load. After loading an XML file, you get an "ä" when calling ...->Value(). (In the saved file, it is converted back to "&auml;".)

      In our project, we need this for reading/writing XHTML-files and it works fine as far as I have tested it. Also the space character "&nbsp;" can be used.

      The files I changed are here:
      http://www.lrz-muenchen.de/~tobiasnadler/qa/tinyext/tinyxml.h
      http://www.lrz-muenchen.de/~tobiasnadler/qa/tinyext/tinyxml.cpp
      http://www.lrz-muenchen.de/~tobiasnadler/qa/tinyext/tinyxmlparser.cpp

      To use the extensions, you have to define TIXML_XHTML.

       
    • TopFire3

      TopFire3 - 2006-04-14

      This is a typical encoding conversion issue. TinyXML works well. Your problem is that you are getting the UTF-8 value of your string. You must implement a function to convert the string to the 1252 encoding if you want to see it displayed correctly on a Windows system using code page 1252.

      newString = ConvertFromUTF8(oldString).
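
One possible shape for such a conversion function, as a sketch only. Utf8ToLatin1 is a hypothetical name, and it targets ISO-8859-1 rather than Windows-1252 proper. The two encodings agree on the umlauts discussed in this thread but differ in the 0x80 to 0x9F range, so characters such as the euro sign are not handled here and come out as '?'.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-8 to ISO-8859-1 (Latin-1).
// Code points above U+00FF and malformed sequences are replaced by '?'.
std::string Utf8ToLatin1(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        const bool has_trail =
            i + 1 < utf8.size() &&
            (static_cast<unsigned char>(utf8[i + 1]) & 0xC0) == 0x80;
        if (lead < 0x80) {
            // Plain ASCII is identical in both encodings.
            out += static_cast<char>(lead);
            ++i;
        } else if ((lead == 0xC2 || lead == 0xC3) && has_trail) {
            // Two-byte sequence encoding U+0080..U+00FF: 110xxxxx 10xxxxxx.
            const unsigned char trail = static_cast<unsigned char>(utf8[i + 1]);
            out += static_cast<char>(((lead & 0x1F) << 6) | (trail & 0x3F));
            i += 2;
        } else {
            // Not representable in Latin-1 (or malformed): emit one '?'
            // and skip any continuation bytes belonging to the sequence.
            out += '?';
            ++i;
            while (i < utf8.size() &&
                   (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80) {
                ++i;
            }
        }
    }
    return out;
}
```

With this, the UTF-8 bytes c3 bc for ü become the single byte fc. On Windows, the same job can also be done with the system functions MultiByteToWideChar and WideCharToMultiByte, which cover the full 1252 code page.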

       
