By far, the biggest drawback to tinyxml at this point is that it doesn't correctly (or at least fully) support non-English languages.
XML is, by default, represented in UTF-8 (a multibyte encoding of Unicode). The next version of tinyxml will make the switch, and parse files assuming they are UTF-8, both in and out.
For English XML, you'll never know the difference. The encoding will be exactly the same. For non-English users, the parser will correctly output high-ASCII (non-English) characters and won't have intermittent parsing errors.
In either case, there's nothing you'll need to do differently to use the library - it should just fix bugs. It is possible that, for non-English users, your code is actually relying on errors in the output; but it's an important change and worth making.
The code should be done sometime this summer.
lee
Hi!
I was just thinking about this the other day - about using other languages in TiXML and how it would affect my surrounding code. I've never coded in Swedish or Portuguese, but I'm guessing that wchar would be the way to go over char. So what would happen to all the functions returning/taking const char*? Or std::string?
Well I stumbled upon this article:
http://www.joelonsoftware.com/articles/Unicode.html
It was very good, but I'm still stumped. Maybe you already know everything in the article; nevertheless, it's good reading during a coffee break =)
The advantage of UTF-8 is that ASCII is a subset of it. So for English, you don't have to do anything at all.
For glyphs in other languages - Swedish, Portuguese, Simplified Chinese, Swahili - they are encoded into an 8-bit format (a superset of ASCII). So an individual glyph (character, if you prefer) can be multiple bytes. The net result is that it is all based on char*. The TinyXml interface will not change at all for UTF-8 support.
UTF-8 is the multibyte encoding for Unicode. UTF-16 is the "wide character" encoding. Often, "Unicode" is used to refer to UTF-16 (Java does this). This is imprecise, in my opinion. wchar is usually not UTF-16... but that depends on the OS.
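To make that concrete, here is a minimal C++ sketch (illustrative only - not TinyXml code) showing that a UTF-8 string sits happily in a plain char buffer, and that the size of wchar differs by platform:

    #include <cstdio>
    #include <cstring>

    int main()
    {
        // "hu-umlaut" encoded as UTF-8: 'h' is one byte, the u-umlaut is two (0xC3 0xBC).
        const char* utf8 = "h\xC3\xBC";

        // strlen counts bytes, not glyphs: this prints 3, even though
        // a human sees only 2 characters.
        printf( "bytes: %u\n", (unsigned) strlen( utf8 ) );

        // wchar_t is 2 bytes on Windows but typically 4 on Linux,
        // which is why "wide characters == UTF-16" is not a safe assumption.
        printf( "sizeof(wchar_t): %u\n", (unsigned) sizeof( wchar_t ) );
        return 0;
    }

On Windows sizeof(wchar_t) is typically 2 (hence the UTF-16 association); on Linux it is typically 4.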
lee
John-Philip -
Joel's article is good, but I'll see if I can distill the main points about charsets. I feel like a challenge, so here is...
"Ellers' guide to Why Charsets are Needed"
Some of this will sound really dumb but bear with me...
The basic building block of all files is a byte. You get 8 bits allowing values 0 .. 255.
Until the next revolution in computing, all files will be built on this basic building block. Whether the document is English, Russian, Chinese or Klingon, it will be stored as a sequence of bytes.
Say you have an English phrase like "Hello, World" that you need to store in a file. No computer inherently knows what a letter is. So the computer geniuses of old mapped letters to bytes. But what byte should the letter "H" get mapped to? At some point they made some arbitrary choices and came up with ASCII, where "H" gets mapped to 0x48, or 72 decimal. (There were mappings/encodings before ASCII, like EBCDIC, but let's ignore that for now.)
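You can see that mapping from any C++ compiler (a trivial sketch, nothing more):

    #include <cstdio>

    int main()
    {
        // In ASCII, 'H' is the byte 0x48 (72 decimal); the file stores
        // the mapped byte value, not the glyph itself.
        printf( "'H' = 0x%02X = %d\n", 'H', 'H' );   // prints: 'H' = 0x48 = 72
        return 0;
    }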
ASCII allows any document written in English to be saved to disk (and much more of course).
But say you're German, French or Norwegian - you have letters that aren't in ASCII. You simply can't store letters like "ü" and "ø" in any files!
Someone realised this, and realised that ASCII only goes up to 0x7F (127), leaving values >= 0x80 up for grabs. "Extended ASCII" allowed "ü" to be stored as 0xFC (252). But this is not "proper" ASCII.
Remember the old DOS screens with lines and boxes? That charset defined boxes/lines using byte values above 0x7F. I'm guessing you can't display a "ü" letter on those screens.
So already ASCII is limited, and European languages have sort of been squeezed in, more or less (I recall from somewhere that certain French characters never made it in, but I could be wrong).
But what about other languages? "Ahh", I hear the old designers of ASCII say, "there ARE other languages?" Chinese, Thai, Russian, Arabic, etc. Even if you throw out ASCII, these languages have alphabets that laugh at the number "256". Any file storing letters in those languages will immediately need more than one byte per letter.
But do you store each letter in 2 bytes (65536 possible values), 3 bytes, 4 bytes, etc.?
Like so many computer science problems, there are many valid answers. The problem is picking AN answer, not figuring out THE answer.
Unicode (the character set) is one answer; UTF-8 (an encoding of Unicode) is another, and a very clever one at that. Joel's Unicode article may be a good read at this point.
UTF-8 supports the ASCII one-byte letters as-is. The clever thing is that UTF-8 also allows you to slot in 2, 3 and 4 byte letters in the same stream.
ISO-8859-1 is, it seems to me, the "extended ASCII" most of us know, with "ü" as 0xFC. It is a one-byte-per-letter encoding - simple in that respect, but it can't handle letters outside that range.
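To make the contrast concrete, here is an illustrative C++ sketch (the function name is my own invention, not any library's API): the same letter "ü" is one byte in ISO-8859-1 but two bytes in UTF-8, and the high bits of a UTF-8 lead byte tell you how many bytes the sequence has:

    #include <cstdio>

    // Sketch: number of bytes in a UTF-8 sequence, judged from its lead byte.
    int Utf8SequenceLength( unsigned char lead )
    {
        if ( lead < 0x80 )  return 1;   // 0xxxxxxx: plain ASCII
        if ( lead < 0xC0 )  return 0;   // 10xxxxxx: continuation byte, not a lead
        if ( lead < 0xE0 )  return 2;   // 110xxxxx
        if ( lead < 0xF0 )  return 3;   // 1110xxxx
        return 4;                       // 11110xxx
    }

    int main()
    {
        const unsigned char latin1 = 0xFC;   // "ü" in ISO-8859-1: one byte
        const char* utf8 = "\xC3\xBC";       // the same "ü" in UTF-8: two bytes
        printf( "Latin-1: 0x%02X\n", latin1 );
        printf( "UTF-8: 0x%02X 0x%02X (sequence length %d)\n",
                (unsigned char) utf8[0], (unsigned char) utf8[1],
                Utf8SequenceLength( (unsigned char) utf8[0] ) );
        return 0;
    }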
Microsoft came up with its own answer(s), and I'm fairly sure Apple had a very thorough answer very early on (no surprises there).
The future as I see it is UTF-8 in particular and Unicode in general.
In terms of _programming_, at the end of the day you're still dealing with a stream of bytes. The main thing to understand is that one byte MIGHT = one letter, or two bytes MIGHT = one letter, or 3, or 4. It depends on how the language has been encoded into bytes.
If you have taken a UTF-8 file and loaded it into a char buffer in C, you've effectively got a UTF-8 string in memory and it should be treated as such.
One solution is to do all your coding with Unicode strings, and ensure you decode/encode to disk as you want (e.g. UTF-8 -> Unicode when you load; Unicode -> UTF-8 when you save).
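As a rough sketch of that decode step (hand-rolled, with a made-up function name, and assuming well-formed input - real code must validate the byte sequences):

    #include <cstdio>
    #include <string>
    #include <vector>

    // Sketch: decode a well-formed UTF-8 string into Unicode code points.
    // Real code must reject overlong/truncated sequences; this does not.
    std::vector<unsigned long> DecodeUtf8( const std::string& s )
    {
        std::vector<unsigned long> points;
        for ( size_t i = 0; i < s.size(); )
        {
            unsigned char b = s[i];
            int len;
            unsigned long cp;
            if      ( b < 0x80 ) { len = 1; cp = b; }          // ASCII
            else if ( b < 0xE0 ) { len = 2; cp = b & 0x1F; }   // 2-byte lead
            else if ( b < 0xF0 ) { len = 3; cp = b & 0x0F; }   // 3-byte lead
            else                 { len = 4; cp = b & 0x07; }   // 4-byte lead

            // Fold in the 6 payload bits of each continuation byte.
            for ( int k = 1; k < len; ++k )
                cp = ( cp << 6 ) | ( (unsigned char) s[i + k] & 0x3F );

            points.push_back( cp );
            i += len;
        }
        return points;
    }

    int main()
    {
        std::vector<unsigned long> pts = DecodeUtf8( "h\xC3\xBC" );  // "hü"
        for ( size_t i = 0; i < pts.size(); ++i )
            printf( "U+%04lX\n", pts[i] );   // prints U+0068 then U+00FC
        return 0;
    }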
Currently, this is where my understanding of encodings stops. I know that Unicode is often referred to as 2-bytes-per-letter, but I'm not so sure it's that simple.
Anyway, that's all for now. Hope it's useful.
Ellers
I think that what Microsoft refers to when it says "Unicode" is what is known in the XML world as "UTF-16". It is an encoding where characters (at least those in the most common alphabets) are stored in 2 bytes.
There's a weird game to play with the first character in a UTF-16 file though (the Byte Order Mark); otherwise it's not a valid UTF-16 XML file.
More on this on http://www.w3.org/TR/2004/REC-xml11-20040204/#charencoding
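Here is an illustrative sketch of that BOM check in C++ (the byte values come from the Unicode standard; the function name and file name are just examples):

    #include <cstdio>

    // Sketch: peek at a file's first bytes to spot a Byte Order Mark.
    const char* DetectBom( FILE* fp )
    {
        unsigned char b[3] = { 0, 0, 0 };
        size_t n = fread( b, 1, 3, fp );
        rewind( fp );   // leave the stream where the parser expects it

        if ( n >= 2 && b[0] == 0xFE && b[1] == 0xFF ) return "UTF-16 big-endian";
        if ( n >= 2 && b[0] == 0xFF && b[1] == 0xFE ) return "UTF-16 little-endian";
        if ( n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF ) return "UTF-8 (with BOM)";
        return "no BOM (assume UTF-8 or a legacy encoding)";
    }

    int main()
    {
        FILE* fp = fopen( "test.xml", "rb" );   // hypothetical input file
        if ( fp )
        {
            printf( "%s\n", DetectBom( fp ) );
            fclose( fp );
        }
        return 0;
    }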
I just re-read Joel's article and I realised I covered a lot of ground he already discussed.
I guess my main point is that ultimately all data boils down to a sequence of plain boring bytes, and to be aware that the ASCII encoding of letters into bytes is not the only (nor best) way.
:)
Ellers
I still think everyone is best off using UTF-8...but tinyxml will also support "legacy encoding" which is what it does today.
lee
I definitely agree that UTF-8 as the minimal, out-of-the-box supported encoding is a very good choice.