From: Ole L. <ol...@ha...> - 2003-02-25 22:04:58

Stefan Seefeld <se...@sy...> writes:

> Ole Laursen wrote:
>
> >>Please don't abuse std::string in such a horrible way.
>
> > You are exaggerating, really.
> > Remember that many string algorithms can be used without any trouble
> > if the std::string contains UTF-8. Copying, concatenation and even
> > replacement (of ASCII characters) all work.
>
> No, replacement will *not* work, precisely because there is no simple
> mapping of 'character' to 'byte'.

Replacement of ASCII characters?

> > That covers 95%-100% of
> > the cases for typical GUI usage, I think.
>
> I don't know how you come up with these numbers, and what you mean
> by 'typical'. As soon as you get into non-ascii regions you'll be
> in deep trouble.

std::string is useless if you're programming a GUI library. But my
point is that if you are just using one, as an application programmer,
most of the time you'll be copying complete strings back and forth,
perhaps doing some concatenations or replacing a few ASCII strings in
the interest of good i18n. That's what my experience tells me. You
seldom need to consider individual characters. Of course, if you do,
you must find yourself a proper UTF-8 string class.

For the Jabber client I'm writing, I use std::string for the back-end
library, which uses libxml, and Glib::ustring for everything above
that in the hierarchy, including the GUI. It works fine - the XML
parsing is really a minor part of the program, and one that doesn't
require working with individual characters at all.

> You are ignoring my argument: the std::string API is completely
> inappropriate to deal with the utf8 encoding. All but data() and
> length() will cause undefined behavior.

Why? UTF-8 is not magic. Granted, it's been a couple of years since I
studied the subject, but if I understand the issue correctly, UTF-8
simply encodes ASCII characters as ASCII characters and all other
characters as a sequence of bytes with the high bit set, the first one
with some extra magic to distinguish it. So a Latin-1 character such
as 'æ' (U+00E6) would be the two bytes

   1100 0011  1010 0110
   ^^^        ^^
   high-order bit magic

As long as you don't split these byte sequences, you're fine. That's
how I understand it, at least. I even found a definition if you want
the accurate explanation:

   In UTF-8, characters are encoded using sequences of 1 to 6 octets.
   The only octet of a "sequence" of one has the higher-order bit set
   to 0, the remaining 7 bits being used to encode the character
   value. In a sequence of n octets, n>1, the initial octet has the n
   higher-order bits set to 1, followed by a bit set to 0. The
   remaining bit(s) of that octet contain bits from the value of the
   character to be encoded. The following octet(s) all have the
   higher-order bit set to 1 and the following bit set to 0, leaving 6
   bits in each to contain bits from the character to be encoded.

   [ftp://ftp.rfc-editor.org/in-notes/rfc2279.txt]

So there. :-) I do apologize if you already knew this, but your
statement "All but data() and length() will cause undefined behavior."
leads me to believe that perhaps you didn't. I do agree that
std::string is inappropriate as a UTF-8 character string. But you can
think of it as a UTF-8 byte string. Think about it. That's what I
meant when I compared std::string to 'char *'.

-- 
Ole Laursen
http://www.cs.auc.dk/~olau/
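
P.S. In case a concrete example helps, here is roughly what I mean by
safe replacement of ASCII substrings in a UTF-8 std::string. Just a
sketch typed straight into the mail (the helper replace_ascii is my
own invention, not from any library), but the idea should be sound:
every byte of a multi-byte UTF-8 sequence has its high bit set, so an
all-ASCII needle can never match in the middle of one.

   #include <iostream>
   #include <string>

   // Replace every occurrence of an ASCII-only needle in a UTF-8
   // encoded std::string. Safe because an ASCII needle can never
   // match inside a multi-byte sequence.
   void replace_ascii(std::string &s, const std::string &needle,
                      const std::string &replacement)
   {
       if (needle.empty())
           return;
       std::string::size_type pos = 0;
       while ((pos = s.find(needle, pos)) != std::string::npos) {
           s.replace(pos, needle.size(), replacement);
           pos += replacement.size(); // skip past what we inserted
       }
   }

   int main()
   {
       // "København" in UTF-8; the 'ø' is the two bytes C3 B8
       std::string s = "Hej %1, velkommen til K\xc3\xb8" "benhavn!";
       replace_ascii(s, "%1", "Ole");
       std::cout << s << '\n'; // the 'ø' bytes pass through untouched
   }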
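
P.P.S. The byte patterns above also mean you can count characters
rather than bytes in a few lines, if you ever do need to. Again just a
sketch of mine: continuation bytes are exactly the ones matching
10xxxxxx, so you count everything else.

   #include <string>

   // Count code points in a UTF-8 string by counting every byte that
   // is not a continuation byte. Continuation bytes have the form
   // 10xxxxxx, i.e. (byte & 0xC0) == 0x80.
   std::string::size_type utf8_length(const std::string &s)
   {
       std::string::size_type n = 0;
       for (std::string::size_type i = 0; i < s.size(); ++i)
           if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
               ++n;
       return n;
   }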