From: Ole L. <ol...@ha...> - 2003-02-25 22:04:58

Stefan Seefeld <se...@sy...> writes:

> Ole Laursen wrote:
>
> >>Please don't abuse std::string in such a horrible way.
>
> > You are exaggerating, really.
> > Remember that many string algorithms can be used without any trouble
> > if the std::string contains UTF-8. Copying, concatenation and even
> > replacement (of ASCII characters) all work.
>
> No, replacement will *not* work, precisely because there is no simple
> mapping of 'character' to 'byte'.

Replacement of ASCII characters?

> > That covers 95%-100% of
> > the cases for typical GUI usage, I think.
>
> I don't know how you come up with these numbers, and what you mean
> by 'typical'. As soon as you get into non-ascii regions you'll be
> in deep trouble.

std::string is useless if you're programming a GUI library. But my
point is that if you are just using one, as an application programmer,
most of the time you'll be copying complete strings back and forth,
perhaps doing some concatenations or replacing a few ASCII strings in
the interest of good i18n. That's what my experience tells me. You
seldom need to consider individual characters. Of course, if you do,
you must find yourself a proper UTF-8 string class.

For the Jabber client I'm writing, I use std::string for the back-end
library, which uses libxml, and Glib::ustring for everything above
that in the hierarchy, including the GUI. It works fine - the XML
parsing is really a minor part of the program, and one that doesn't
require working with individual characters at all.

> You are ignoring my argument: the std::string API is completely
> inappropriate to deal with the utf8 encoding. All but data() and
> length() will cause undefined behavior.

Why? UTF-8 is not magic. Granted, it's been a couple of years since I
studied the subject, but if I understand the issue correctly, UTF-8
simply encodes ASCII characters as ASCII characters and all other
characters as a sequence of bytes with the high bit set, the first one
with some extra magic to distinguish it. So a Latin-1 character such
as 'æ' (U+00E6) would be the two bytes

   1100 0011  1010 0110
   ^^^        ^^
   high-order bit magic

As long as you don't split these byte sequences, you're fine. That's
how I understand it, at least. I even found a definition if you want
the accurate explanation:

   In UTF-8, characters are encoded using sequences of 1 to 6 octets.
   The only octet of a "sequence" of one has the higher-order bit set
   to 0, the remaining 7 bits being used to encode the character
   value. In a sequence of n octets, n>1, the initial octet has the n
   higher-order bits set to 1, followed by a bit set to 0. The
   remaining bit(s) of that octet contain bits from the value of the
   character to be encoded. The following octet(s) all have the
   higher-order bit set to 1 and the following bit set to 0, leaving 6
   bits in each to contain bits from the character to be encoded.

   [ftp://ftp.rfc-editor.org/in-notes/rfc2279.txt]

So there. :-) I do apologize if you already knew this, but your
statement "All but data() and length() will cause undefined behavior."
leads me to believe that perhaps you didn't. I do agree that
std::string is inappropriate as a UTF-8 character string. But you can
think of it as a UTF-8 byte string. Think about it. That's what I
meant when I compared std::string to 'char *'.

-- 
Ole Laursen
http://www.cs.auc.dk/~olau/
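
P.S. In case a concrete example helps, here is roughly what I mean by
safe replacement of ASCII substrings in a UTF-8 std::string. Just a
sketch typed straight into the mail (the helper replace_ascii is my
own invention, not from any library), but the idea should be sound:
every byte of a multi-byte UTF-8 sequence has its high bit set, so an
all-ASCII needle can never match in the middle of one.

   #include <iostream>
   #include <string>

   // Replace every occurrence of an ASCII-only needle in a UTF-8
   // encoded std::string. Safe because an ASCII needle can never
   // match inside a multi-byte sequence.
   void replace_ascii(std::string &s, const std::string &needle,
                      const std::string &replacement)
   {
       if (needle.empty())
           return;
       std::string::size_type pos = 0;
       while ((pos = s.find(needle, pos)) != std::string::npos) {
           s.replace(pos, needle.size(), replacement);
           pos += replacement.size(); // skip past what we inserted
       }
   }

   int main()
   {
       // "København" in UTF-8; the 'ø' is the two bytes C3 B8
       std::string s = "Hej %1, velkommen til K\xc3\xb8" "benhavn!";
       replace_ascii(s, "%1", "Ole");
       std::cout << s << '\n'; // the 'ø' bytes pass through untouched
   }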
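
P.P.S. The byte patterns above also mean you can count characters
rather than bytes in a few lines, if you ever do need to. Again just a
sketch of mine: continuation bytes are exactly the ones matching
10xxxxxx, so you count everything else.

   #include <string>

   // Count code points in a UTF-8 string by counting every byte that
   // is not a continuation byte. Continuation bytes have the form
   // 10xxxxxx, i.e. (byte & 0xC0) == 0x80.
   std::string::size_type utf8_length(const std::string &s)
   {
       std::string::size_type n = 0;
       for (std::string::size_type i = 0; i < s.size(); ++i)
           if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
               ++n;
       return n;
   }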