
aliens strike back - _UNICODE compile

Developer
2004-06-10
2004-06-16
  • Jerker Bäck

    Jerker Bäck - 2004-06-10

    It's regrettable that tinyXML (and many other "OS-independent" projects, e.g. Scintilla) is not "UNICODE compliant". There also seems to be some confusion about what this really means.

    For a Windows developer it usually means whether or not to compile with the _UNICODE flag. It may seem like a minor issue, but it is in fact a crucial performance choice. In this sense it has nothing to do with language or encoding.

    Put simply, it means that the expression:
    #define _UNICODE
    MessageBox("Using tinyXML"); will cause an error

    In the Windows NT family of operating systems (NT4/2K/XP/2003) all strings are handled (at kernel level) as a structure like this:
    typedef struct _UNICODE_STRING {
        unsigned short   Length;
        unsigned short   MaximumLength;
        unsigned short*  Buffer;
    } UNICODE_STRING;
    This includes filenames, registry, I/O, string resources etc.

    At user level (the Win32 library) all strings are zero-terminated "C-strings", and all library calls/structures containing strings are present in two versions: the -A (ANSI) version and the -W (UNICODE) version. At some point all strings must be translated AND allocated as a UNICODE_STRING. This is true whether you use Visual Basic, COM, .NET, MFC, STL, the C runtime, the console or some other very smart thing. Assuming that a wide string is the least painful to translate (probably true), the obvious choice is to always compile with the _UNICODE flag. Since the Win32 libraries are compiled with the _UNICODE flag, some translations may also be avoided. The ANSI version is not even supported in Windows CE and will probably be considered obsolete in future versions of Windows. The MessageBox function above will then ALWAYS cause an error.
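The translation step described above can be sketched in portable C++. This is only an illustration, not Windows code: `CountedWideString` and `WidenAscii` are hypothetical names, `char16_t` stands in for the 16-bit buffer type, and the sketch assumes 7-bit ASCII input, whereas a real -A entry point converts from the active code page.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the kernel's counted string. As in
// UNICODE_STRING, both lengths are byte counts, not character counts.
struct CountedWideString {
    unsigned short Length;         // bytes in use
    unsigned short MaximumLength;  // bytes available in Buffer
    char16_t*      Buffer;         // 16-bit code units, not zero-terminated
};

// Widen a zero-terminated narrow string into caller-owned storage.
// The loop is the extra allocation and copy that a narrow-string call
// pays before its data can be handed on as a counted wide string.
inline CountedWideString WidenAscii(const char* narrow,
                                    std::vector<char16_t>& storage)
{
    storage.clear();
    for (const char* p = narrow; *p != '\0'; ++p) {
        // Assumption of this sketch: ASCII only, no code-page handling.
        assert(static_cast<unsigned char>(*p) < 0x80);
        storage.push_back(static_cast<char16_t>(*p));
    }
    // The byte count must fit in an unsigned short.
    assert(storage.size() * sizeof(char16_t) <= 0xFFFF);

    CountedWideString s;
    s.Length        = static_cast<unsigned short>(storage.size() * sizeof(char16_t));
    s.MaximumLength = s.Length;
    s.Buffer        = storage.data();
    return s;
}
```

A caller that already holds wide data can fill in the structure directly and skip the loop, which is the saving being argued for here.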

    To make things more confusing, it is common to save text streams as ANSI text even if your application internally handles all strings as wide strings, e.g. NOTEPAD.EXE. That is done to save space.

    So, to be really useful in Windows, wide character support in the sense I have described here is vital, and it is not that much work to implement.
    Proposal for tinyXML:
    #include <tchar.h>    // or OS-independent replacement for C-runtime library calls
    #include <string>
    #if defined(_UNICODE)
    #define TIXML_CHAR    wchar_t
    #define TIXML_STRING    std::wstring
    #else
    #define TIXML_CHAR    char
    #define TIXML_STRING    std::string
    #endif
    + a little bit more
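A minimal sketch of what the "little bit more" might involve, under some assumptions: it avoids `<tchar.h>` so it also builds outside Windows, it uses typedefs rather than the `#define`s proposed above, and `TIXML_TEXT` and `MakeEmptyElement` are hypothetical names that are not part of tinyXML. The main point is that every string literal in the library would need a width-aware wrapper.

```cpp
#include <cassert>
#include <string>

#if defined(_UNICODE)
typedef wchar_t      TIXML_CHAR;
typedef std::wstring TIXML_STRING;
#define TIXML_TEXT(s) L##s   // hypothetical literal macro, in the spirit of _T()
#else
typedef char         TIXML_CHAR;
typedef std::string  TIXML_STRING;
#define TIXML_TEXT(s) s
#endif

// Hypothetical helper: builds "<name/>" in whichever character width
// was selected at compile time.
inline TIXML_STRING MakeEmptyElement(const TIXML_STRING& name)
{
    TIXML_STRING out = TIXML_TEXT("<");
    out += name;
    out += TIXML_TEXT("/>");
    return out;
}
```

The same source compiles with or without `_UNICODE` defined, but both configurations have to be built and tested separately.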

    To see an example of the impact of translations - look at the (bad) implementation of the registry script parser (rgs) in ATL (STATIC_REGISTRY). You can even have a lethal insight at 
    <http://www.joelonsoftware.com/articles/fog0000000319.html>

    Jerker Bäck, La Suède

     
    • Lee Thomason

      Lee Thomason - 2004-06-10

      Okay - before this spirals out of control - I'm simply not going to put UTF-16 support into the project. (Which is more or less what Windows means by _UNICODE). Why?

      Too hard to test.

      I'd have to write the 16-bit support, tests, and test with the #define on and off. Too much work for the benefit. Put another way - it takes the "Tiny" out of "TinyXml".

      Other issues:

      - _UNICODE encodes in UTF-16 the same code points as UTF-8, which tinyxml now supports.

      - UTF-16 is very useful for programming - I agree with you fully on that - but is utterly useless for network transmission. Network transmission is pretty important for an XML processor.

      A really good utility class to convert from UTF-8 to UTF-16 and back would be a way to use tinyxml with a 16-bit text application. You'd have to convert on input and output, but it's quite solvable.
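As a rough illustration of such a converter, here is a sketch in standard C++11. The names `Utf8ToUtf16` and `Utf16ToUtf8` are hypothetical and not part of tinyxml. It assumes mostly well-formed input: malformed sequences and unpaired surrogates become U+FFFD, and overlong UTF-8 encodings are not rejected.

```cpp
#include <cassert>
#include <string>

// Decode UTF-8 bytes into UTF-16 code units.
inline std::u16string Utf8ToUtf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else                         { cp = 0xFFFD;   extra = 0; }
        ++i;
        for (; extra > 0; --extra) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                cp = 0xFFFD;  // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {  // above the BMP: emit a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Encode UTF-16 code units as UTF-8 bytes.
inline std::string Utf16ToUtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // consumed the low surrogate as well
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

An application working in 16-bit text would call `Utf16ToUtf8` on strings it passes into the parser and `Utf8ToUtf16` on strings it reads back out.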

      Also, the previous post, or at least the title, implies that the language support in TinyXml is deficient or broken. I do understand text encoding - quite well - and while you point out a programming issue (16-bit strings are easier to use) there isn't anything deficient in the capability of the UTF-8 TinyXml. It can fully represent and process the entire range of Unicode code points. (Well...character entities are limited to the first 64K code points. Nothing is perfect.) So while the interface may not be what you want, the functionality is correct.

      lee

       
    • Jerker Bäck

      Jerker Bäck - 2004-06-11

      No, you don't understand what I mean. I'm saying that, to be useful in Windows development, wide character support is vital. It is also rather easy to implement - at least in a Windows NT environment. The long story about the internal structure of the Windows NT OS was an attempt to explain WHY. Again, this is NOT about encoding text streams - it's simply about whether you store the strings (whatever they contain) in a wide buffer or a char buffer. Simple as that - the size of the buffer.

      I really can't see how this would have any impact on your work other than making it more portable and adding more functionality. It won't make it more complex or bigger in any way.

      Jerker

       
    • Jerker Bäck

      Jerker Bäck - 2004-06-11

      By the way - sorry about the title - I was just spinning on a previous post in this forum.

       
    • Ellers

      Ellers - 2004-06-11

      On the one hand...

      - I use TinyXml fine on Win2000 without unicode - i.e. it is clearly NOT "vital".

      - If TinyXml is adjusted to have unicode builds it doesn't really make sense that it "won't make it complex or bigger in any way". Perhaps not "that" much bigger nor "that" much more complex, but definitely more complex.

      On the other hand...

      - while I haven't needed unicode, my project sticks strictly to Latin (English) encoding.

      - I don't think it was clear at first that you did not mean UTF-16, just that the build would use "wide chars" not char.

      Why don't you make the changes for unicode yourself, keep them really tight and minimal, then contribute a patch?

      Then Lee (who I guess is the main developer?) can make a judgement as to whether it's useful for the project.

      That way you've got it for your stuff and given the project the opportunity to expand in a direction others may find useful.

      Ellers

       
    • Ellers

      Ellers - 2004-06-11

      An addendum -

      I don't think "too hard to test" is a good criterion for judging whether to work on a feature.

      I can understand "bloats library", "distracts from tiny objective" etc.

      But "too hard to test"... hmm, not really. Not to be a nag, but it indicates that the test setup is too fragile/inflexible.

       
    • Lee Thomason

      Lee Thomason - 2004-06-16

      Actually, that's great feedback. The "too hard to test" issue has come up a bunch over the years, and really deserves an answer from me. I posted "TinyXml Development Secrets" in "Developer" to give a little context on that.

      A patch is a *great* way to support stuff like this. It keeps the main code simple, but helps out other people.

      lee

       
