Re: [Adobe-source-devel] version_0 string[16]_t future?

Hi Sean, hi Eric,

I fully agree that there are way too many string classes already - I guess the reason is that there are so many different forces influencing string class design.

Some thoughts about the issues:

Interface structure:

Keeping the string classes themselves as lean as possible and adding additional functionality and convenience via nonmember nonfriend functions is probably the best way to go. It nicely allows a layered approach, e.g. by providing the classes and different sets of free functions in separate headers.

I am sure you are aware of GotW #84: Monoliths "Unstrung" (http://www.gotw.ca/gotw/084.htm) or a similar dissection of std::basic_string<> into a minimal class plus free functions.

Level of functionality:

I'd say that at least complete string concatenation, insert, delete and replace must be provided to promote real use of the classes. (An alternative would be a strictly immutable design with all its benefits, but that would defeat movability and therefore necessitate reference counting: a very different approach.)

So there should be (at least) one general replace() function (see below about UTF-8 concerns). Adding separate insert(), delete() or append() member functions may or may not make sense performance-wise; the functionality can of course be provided by free functions mapping to replace().
(By the way, using append() plus std::rotate() to emulate replace() would need non-const iterators, wouldn't it? So far only const versions of begin() et al. are supplied.)
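
To make the "one replace(), everything else as free functions" idea concrete, here is a minimal sketch. Note that lean_string is a hypothetical stand-in built on std::vector<char>, not the actual string_t, and the function names and signatures are my own assumptions for illustration only:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for string_t: a thin wrapper around a byte vector.
class lean_string {
public:
    lean_string() {}
    explicit lean_string(const char* s) { while (*s) data_.push_back(*s++); }

    std::size_t size() const { return data_.size(); }
    const char* begin() const { return data_.empty() ? 0 : &data_[0]; }
    const char* end() const { return begin() + data_.size(); }

    // The only mutating member: replace [pos, pos + count) with [first, last).
    // The source range must not point into this string.
    void replace(std::size_t pos, std::size_t count,
                 const char* first, const char* last) {
        assert(pos <= data_.size() && count <= data_.size() - pos);
        data_.erase(data_.begin() + pos, data_.begin() + pos + count);
        data_.insert(data_.begin() + pos, first, last);
    }

private:
    std::vector<char> data_;
};

// Convenience layer: nonmember nonfriend functions that all map to replace().
inline void insert(lean_string& s, std::size_t pos, const lean_string& t) {
    s.replace(pos, 0, t.begin(), t.end());
}

inline void erase(lean_string& s, std::size_t pos, std::size_t count) {
    const char* none = 0;  // empty source range
    s.replace(pos, count, none, none);
}

inline void append(lean_string& s, const lean_string& t) {
    s.replace(s.size(), 0, t.begin(), t.end());
}

// Equality needs only the const interface, so it can live outside the class too.
inline bool operator==(const lean_string& a, const lean_string& b) {
    return a.size() == b.size() && std::equal(a.begin(), a.end(), b.begin());
}
```

The free functions could sit in a separate header from the class, so users who only pass strings across module borders never see them. Whether dedicated members would be faster than going through replace() is exactly the open performance question.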

Compare and find should be implementable as free functions without problems, which makes it straightforward to provide different variants of this functionality in different headers: a simple byte-compare version, one using C++ char_traits and locales, or one using IBM's ICU.

Regarding operator use, I agree that this has been decided by std::string. What I am not sure about (I haven't looked into your move library yet): wouldn't the move semantics in operator+ be surprising for some users? Is there any provision to make it obvious (at least at runtime) that one of the source strings is gone? (Not that I am against move; getting rid of the temporaries is very important, even if only for psychological acceptance.)

Unicode:

As a European with a German umlaut (ü) in my name, I fully agree that the only viable text encoding should be Unicode, period. Nevertheless, I doubt that users of the library will use the string classes with Unicode only unless they are forced to in some way, if only because they need to pass text in other encodings around for conversion to Unicode.
Therefore the assumption that string_t always contains UTF-8 is not valid in my view: as you say, it is basically a vector<char>, nothing "better".
Even if UTF-8 is used, the developer will need to be aware of this in more cases than insert()/replace(). One example is that string_t::size() does not return the number of characters in the string. I'm afraid that as long as the string class should behave basically as vector<char>, you can't shield users from making mistakes regarding UTF-8 (or UTF-16, where the mistakes are even better hidden due to the relative rarity of characters outside of the BMP).
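
A small sketch of the size() problem, using std::string as a stand-in for string_t. The helper utf8_code_point_count is hypothetical and assumes its input is already well-formed UTF-8; it does no validation:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in well-formed UTF-8 by skipping continuation bytes,
// i.e. bytes of the form 10xxxxxx. Performs no validation.
std::size_t utf8_code_point_count(const std::string& s) {
    std::size_t n = 0;
    for (std::string::const_iterator it = s.begin(); it != s.end(); ++it) {
        if ((static_cast<unsigned char>(*it) & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For my own name encoded as UTF-8, "Br\xC3\xBCning", size() reports 8 bytes while the function above reports 7 code points. And even that is not necessarily the number of characters a user sees, since a character can be composed of several code points.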

It might be valuable to provide a set of free functions or wrapper classes which impose a Unicode character-level interface on string_t and string16_t, possibly including checks for the well-formedness of incoming data.
Unfortunately such an interface will need either some internal caching or a restricted set of functionality to keep innocent users from immediately introducing quadratic complexity.
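
To show where the quadratic behaviour comes from, here is a sketch of the two access styles over UTF-8 bytes. Both helpers are hypothetical, use std::string as a stand-in for string_t, and assume well-formed input:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Steps from the start of one code point to the start of the next.
// Constant work per code point, so a full forward traversal is linear.
std::size_t utf8_next(const std::string& s, std::size_t offset) {
    assert(offset < s.size());
    ++offset;
    while (offset < s.size() &&
           (static_cast<unsigned char>(s[offset]) & 0xC0) == 0x80) {
        ++offset;  // skip continuation bytes (10xxxxxx)
    }
    return offset;
}

// Byte offset of the code point with the given index, found by scanning from
// the start. Each call is O(n), so calling it for i = 0..n-1 is O(n^2).
std::size_t utf8_offset_of(const std::string& s, std::size_t index) {
    std::size_t offset = 0;
    while (offset < s.size() && index > 0) {
        offset = utf8_next(s, offset);
        --index;
    }
    return offset;
}
```

An innocent "for each i, get character i" loop built on utf8_offset_of is quadratic, whereas a restricted, forward-only interface in the style of utf8_next stays linear. Offering random access without that cost would need the internal caching mentioned above.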
I have been contemplating for a while a Unicode string class which dynamically changes the width of its elements to accommodate needs between 8, 16 or 32 bits. This would provide a one-to-one mapping between Unicode code points and the underlying vector, but of course Unicode characters can again be composed of multiple code points, so the value of such a class is unclear.

Hope this makes some sense,
Kai

>Hi Kai,
>
>On Jun 22, 2007, at 9:05 AM, Kai Brüning wrote:
>
>> knowing the problem of binary compatibility from
>> hard learned experience, I am very interested in
>> your version_0 approach.
>
>Great! Check it out and ask lots of questions - I'm not going to have 
>much of a chance to work on it for 1.0.29 (neither is Mat, we made 
>lots of promises but we're both on vacation...) - but finishing 
>version_0 is our top priority (from 1.0.28 on we will be maintaining 
>binary compatibility).
>
>> And string16_t is one reason more to look at it,
>> because std::wstring is rather unusable due to
>> the different ideas of the size of wchar_t on
>> different platforms/compilers.
>
>I'm a fan of using UTF-8, it has many advantages over UTF-16 
>including being byte order independent and sorting lexicographically 
>the same as UTF-32. Eric Berdahl did the work for string[16]_t though 
>and he had a need for UTF-16 - going forward we'll keep string_t and 
>string16_t on par (we may add a string32_t if needed/desired).
>
>>
>> After a first glance at string[16]_t I have the
>> impression that the functionality is very basic.
>> Nothing against keeping these classes simply and
>> not copying the monolithic design of
>> std::basic_string, but shouldn't there be at
>> least a little bit more? Like concatenation, for
>> instance?
>
>I wouldn't mind adding a bit to the interface - we intentionally kept 
>it very minimal on the first release. I'd like to try and strike a 
>balance on the interface between a minimal and sufficient interface 
>(meaning providing just enough so you can write stand-alone functions
>that can fully manipulate the string) and providing more
>compatibility with std::basic_string<> to make it easier to migrate.
>
>Concatenation is a good example - some options:
>
>1) provide insert() - very powerful but runs into the abuse problem 
>for people who don't know the issues with inserting into a UTF string.
>2) provide append() - this is a safer option than insert() and for
>those who really need insert() append can be used with std::rotate
>fairly effectively.
>3) name append += - as a nod to std::string.
>4) provide an operator+(). Because of the move semantics operator+() 
>on string_t doesn't need to suffer the efficiency problems of 
>std::string - written like this it can eliminate all unnecessary 
>temporary copies:
>
>string_t operator+(string_t x, const string_t& y)
>{ x += y; return move(x); }
>
>Any input on which of the above 4 options (they aren't mutually 
>exclusive) you would like to see?
>
>It is interesting to note that even without move(), std::basic_string 
>could eliminate unnecessary copies by implementing + like this:
>basic_string operator+(basic_string x, const basic_string& y)
>{ basic_string result; swap(result, x); result += y; return result; }
>
>I don't think the library writers have yet figured out how to leverage
>RVO.
>
>I hate the fact that std::string calls this operator+() because +
>should always be commutative but that battle has long been lost.
>
>>
>> What are your plans regarding string[16]_t? Will
>> they stay as they are, which would make them in
>> my opinion more suited for interchange across
>> module borders and less for use as internal
>> representation?
>
>As I said, I'm open to extending the interface. I'd like to do so 
>with some amount of caution though and not recreate the basic_string 
>mess.
>
>Sean
>

-- 
Kai Brüning

RagTime GmbH * http://www.ragtime.de

Neustraße 69 * 40721 Hilden * Deutschland
Tel: [49](0)2103 9657-0 * Fax: [49](0)2103 9657-96

Registered office: Hilden * Amtsgericht Düsseldorf HRB 45697
Managing director: Helmut Tschemernjak