From: Kai <ka...@Ra...> - 2007-06-25 12:25:19
Hi Sean, hi Eric,

I fully agree that there are way too many string classes already - I guess the reason is that there are so many different forces influencing string class design.

Some thoughts about the issues:

Interface structure:
Keeping the string classes themselves as lean as possible and adding additional functionality and convenience via nonmember nonfriend functions is probably the best way to go. It nicely allows a layered approach by providing e.g. the classes and different sets of free functions in separate headers.
I am sure you are aware of GotW #84: Monoliths "Unstrung" (http://www.gotw.ca/gotw/084.htm) or a similar dissection of std::basic_string<> into a minimal class plus free functions.

Level of functionality:
I'd say that at least complete string concatenation, insert, delete and replace must be provided to promote real use of the classes. (An alternative would be a strictly immutable design with all its benefits, but that would defeat movability and therefore necessitate reference counting: a very different approach.)
So there should be (at least) one general replace() function (see below about UTF-8 concerns). Adding separate insert(), delete() or append() member functions may or may not make sense performance-wise; the functionality can of course be provided by free functions mapping to replace().
(By the way, using append() plus std::rotate() to emulate replace() would need non-const iterators, wouldn't it? So far only const versions of begin() et al. are supplied.)
Compare and find should be implementable as free functions without problems. This makes it straightforward to provide different variants of this functionality in different headers: a simple byte-compare version, one using C++ char_traits and locales, or one using IBM's ICU.
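
To make the layering idea concrete, here is a minimal sketch of a lean class with a single mutating replace() member plus free functions built on top of it. Everything in it is an assumption for illustration: mini_string, its replace() signature, npos and the free functions are hypothetical names and not the actual string_t interface. The erase operation is called erase() because delete is a reserved word in C++.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical minimal byte string; NOT the real string_t interface.
class mini_string {
public:
    mini_string() = default;
    explicit mini_string(const char* s) : data_(s, s + std::strlen(s)) {}

    std::size_t size() const { return data_.size(); }
    const char* begin() const { return data_.data(); }
    const char* end() const { return data_.data() + data_.size(); }

    // The single mutating primitive: replace the bytes in [pos, pos + count)
    // with the bytes in [first, last). The source range must not point into
    // this string; aliasing is not handled in this sketch.
    void replace(std::size_t pos, std::size_t count,
                 const char* first, const char* last) {
        assert(pos <= data_.size() && count <= data_.size() - pos);
        auto from = data_.begin() + static_cast<std::ptrdiff_t>(pos);
        auto where = data_.erase(from, from + static_cast<std::ptrdiff_t>(count));
        data_.insert(where, first, last);
    }

private:
    std::vector<char> data_;
};

// Free functions layered on replace(); they could live in a separate header.
inline void insert(mini_string& s, std::size_t pos, const mini_string& t) {
    s.replace(pos, 0, t.begin(), t.end());
}

inline void erase(mini_string& s, std::size_t pos, std::size_t count) {
    const char* none = "";  // empty source range
    s.replace(pos, count, none, none);
}

inline void append(mini_string& s, const mini_string& t) {
    s.replace(s.size(), 0, t.begin(), t.end());
}

// Simple byte-compare find that only needs the const begin()/end().
constexpr std::size_t npos = static_cast<std::size_t>(-1);

inline std::size_t find(const mini_string& s, const mini_string& needle) {
    const char* hit = std::search(s.begin(), s.end(), needle.begin(), needle.end());
    return hit == s.end() ? npos : static_cast<std::size_t>(hit - s.begin());
}
```

A locale-aware or ICU-based find could be offered in another header with the same shape, without touching the class itself. Note that all positions here are byte offsets, so nothing in this sketch protects against splitting a multi-byte UTF-8 sequence.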

Regarding operator use, I agree that this has been decided by std::string. What I am not sure about (I didn't look into your move library yet): wouldn't the move semantics in operator+ be surprising for some users? Is there any provision to make it obvious (at least at runtime) that one of the source strings is gone? (Not that I am against move; getting rid of the temporaries is very important, even if only for psychological acceptance.)

Unicode:
As a European with a German umlaut (ü) in my name, I fully agree that the only viable text encoding should be Unicode, period. Nevertheless I doubt that users of the library will use the string classes with Unicode only, if not forced in some way, even if only to pass text in other encodings around for conversion to Unicode.
Therefore the assumption that string_t always contains UTF-8 is not valid in my view: as you say, it is basically a vector<char>, nothing "better".
Even if UTF-8 is used, the developer will need to be aware of this in more cases than insert()/replace(). One example is that string_t::size() does not return the number of characters in the string. I'm afraid that as long as the string class should behave basically as vector<char>, you can't shield users from making mistakes regarding UTF-8 (or UTF-16 either, where the mistakes are even more hidden due to the relative rareness of characters outside of the BMP).
It might be valuable to provide a set of free functions or wrapper classes which impose a Unicode character-level interface on string_t and string16_t, possibly including checks for the well-formedness of incoming data.
Unfortunately such an interface will need either some internal caching or a restricted set of functionality to avoid immediate introduction of quadratic complexity by innocent users.

I have been contemplating for a while a Unicode string class which dynamically changes the width of its elements to accommodate needs between 8, 16 or 32 bit.
This would provide a one-to-one mapping between Unicode code points and the underlying vector, but of course Unicode characters can again be composed of multiple code points, therefore the value of such a class is unclear.

Hope this makes some sense,
Kai

>Hi Kai,
>
>On Jun 22, 2007, at 9:05 AM, Kai Brüning wrote:
>
>> knowing the problem of binary compatibility from
>> hard-learned experience, I am very interested in
>> your version_0 approach.
>
>Great! Check it out and ask lots of questions - I'm not going to have
>much of a chance to work on it for 1.0.29 (neither is Mat, we made
>lots of promises but we're both on vacation...) - but finishing
>version_0 is our top priority (from 1.0.28 on we will be maintaining
>binary compatibility).
>
>> And string16_t is one reason more to look at it,
>> because std::wstring is rather unusable due to
>> the different ideas of the size of wchar_t on
>> different platforms/compilers.
>
>I'm a fan of using UTF-8, it has many advantages over UTF-16
>including being byte order independent and sorting lexicographically
>the same as UTF-32. Eric Berdahl did the work for string[16]_t though
>and he had a need for UTF-16 - going forward we'll keep string_t and
>string16_t on par (we may add a string32_t if needed/desired).
>
>>
>> After a first glance at string[16]_t I have the
>> impression that the functionality is very basic.
>> Nothing against keeping these classes simple and
>> not copying the monolithic design of
>> std::basic_string, but shouldn't there be at
>> least a little bit more? Like concatenation, for
>> instance?
>
>I wouldn't mind adding a bit to the interface - we intentionally kept
>it very minimal on the first release.
>I'd like to try and strike a
>balance on the interface between a minimal and sufficient interface
>(meaning providing just enough so you can write stand-alone functions
>that can fully manipulate the string) and providing more
>compatibility with std::basic_string<> to make it easier to migrate.
>
>Concatenation is a good example - some options:
>
>1) provide insert() - very powerful but runs into the abuse problem
>for people who don't know the issues with inserting into a UTF string.
>2) provide append() - this is a safer option than insert() and for
>those who really need insert() append can be used with std::rotate
>fairly effectively.
>3) name append += - as a nod to std::string.
>4) provide an operator+(). Because of the move semantics operator+()
>on string_t doesn't need to suffer the efficiency problems of
>std::string - written like this it can eliminate all unnecessary
>temporary copies:
>
>string_t operator+(string_t x, const string_t& y)
>{ x += y; return move(x); }
>
>Any input on which of the above 4 options (they aren't mutually
>exclusive) you would like to see?
>
>It is interesting to note that even without move(), std::basic_string
>could eliminate unnecessary copies by implementing + like this:
>basic_string operator+(basic_string x, const basic_string& y)
>{ basic_string result; swap(result, x); result += y; return result; }
>
>I don't think the library writers have yet figured out how to leverage
>RVO.
>
>I hate the fact that std::string calls this operator+() because +
>should always be commutative but that battle has long been lost.
>
>>
>> What are your plans regarding string[16]_t? Will
>> they stay as they are, which would make them in
>> my opinion more suited for interchange across
>> module borders and less for use as internal
>> representation?
>
>As I said, I'm open to extending the interface. I'd like to do so
>with some amount of caution though and not recreate the basic_string
>mess.
>
>Sean
>

--
Kai Brüning
RagTime GmbH * http://www.ragtime.de
Neustraße 69 * 40721 Hilden * Germany
Tel: [49](0)2103 9657-0 * Fax: [49](0)2103 9657-96
Registered office: Hilden * Amtsgericht Düsseldorf HRB 45697
Managing director: Helmut Tschemernjak