From: Donal K. F. <don...@ma...> - 2013-01-18 15:13:13
On 18/01/2013 12:22, Martin Lemburg wrote:
> I have an eventually naive opinion to this topic:
>
> 1. string commands don't manipulate the given data, but return results
>    of string operations on the given data, right?

Ah, but what is a string and what is a manipulation?

> 2. string commands could work on the UTF-8 string representations of
>    Tcl_Objs or on the hold Unicode string within of Tcl_Objs, right?

There are a few other representations they can work with in the
current implementation. Right now, we have:

 * UTF-8 representation: can coexist with any other representation.
 * "Unicode" representation: 16 bits per character; replaces any other
   non-UTF-8 representation if requested.
 * Byte-array representation: 8 bits per character, only handles a
   restricted range of characters; replaces any other non-UTF-8
   representation if requested.

All fundamental string operations can be defined on any of the above
(though we might have omitted actually doing that for some) and user
code is not supposed to care about the difference under normal
circumstances. The cost of conversion between representations is
normally linear in the length of the string.

Oh, and there's also the empty string. That's a special case as Tcl
uses the empty string so often to mean other things (like the "null"
result).

> 3. if a string representation does not exist, it is (re)created, right?

Yes. This is potentially an expensive operation.

> 4. so why the string commands need to shimmer the Tcl_Objs anyway?

Depends on what the operation is, of course. If you're asking for the
last character of the string ([string index $str end]) you need the
string to be able to get that information. Computing the last character
of a UTF-8 string isn't that expensive really, but handling the general
case of an arbitrary index is.
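
To make that asymmetry concrete, here is a minimal C sketch. It is not the actual Tcl source: the function names are made up for illustration, and it assumes the buffer holds well-formed UTF-8. Finding the last character only has to step back over a few continuation bytes, whereas finding character number N has to walk over every character before it.

```c
#include <assert.h>
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: byte offset of the last character.
 * Steps backwards over continuation bytes only, so the cost is bounded
 * by the encoded length of one character, not by the string length. */
size_t utf8_last_char_offset(const char *bytes, size_t length) {
    assert(bytes != NULL && length > 0);
    size_t i = length - 1;
    while (i > 0 && is_continuation((unsigned char)bytes[i])) {
        i--;
    }
    return i;
}

/* Hypothetical helper: byte offset of character number `index`
 * (0-based), or `length` if the string has fewer characters.
 * Every earlier character has to be stepped over, so the cost is
 * linear in `index`. */
size_t utf8_index_offset(const char *bytes, size_t length, size_t index) {
    assert(bytes != NULL || length == 0);
    size_t i = 0;
    while (i < length && index > 0) {
        i++;  /* skip the lead byte */
        while (i < length && is_continuation((unsigned char)bytes[i])) {
            i++;  /* skip its continuation bytes */
        }
        index--;
    }
    return i;
}
```

With a fixed-width array (16 bits per character, say) the second function collapses to a multiplication, which is the whole attraction of converting.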
Other operations also need indexing (e.g., taking a substring) or
computing the length of the string, and this is information that the
UTF-8 representation does not give us cheaply (the 'length' field of
the Tcl_Obj record describes the number of bytes used, less 1 for the
terminating NUL, and not the number of characters those bytes encode).

As it happens (and all IIRC) the presence of a "string" representation
record doesn't imply that we have actually allocated a buffer for the
"Unicode" data array. It could just be recording the fact that the data
is really ASCII-and-not-NUL, so that the 'bytes' field is directly
usable and the 'length' field is meaningful as a character count.

> 5. shimmering costs runtime, and working on the UTF-8 encoded string
>    representation too. Is it sure, that shimmering costs less runtime?

It's usually the case that if someone uses a value as if it were of
some type (e.g., "integer", "list" or "string") then they will do so in
the future as well. This is the principle on which the Tcl_Obj
mechanism is built. The only curious thing is that there's a separate
"string" representation as well as the standard UTF-8 rep; that was
introduced in 8.2 or 8.3 (I forget which) to address the killer
performance problems of 8.1.

> 6. developer habits, like always using "[string length $t] == 0"
>    instead of "$t eq {}" are killing internal representations, but this
>    is not expected by many devs and probably counterproductive.
>    Especially if as example the internal representation is a 3D body of
>    facets (data of facets after a tessellation from within a CAD system,
>    with a name or token (like the file handles/tokens) as string
>    representation).

The really irritating thing is that we can't currently detect such
tests and replace them with more efficient checks.

> 7. Why not simplifying all this by …
>    1.
>       … defaulting the operation base for string operations to the
>       string representation, if not of the type string, and otherwise
>       to the typed string?

I don't really understand what you mean by this. The problem is that
some operations that developers tend to assume are O(1) are actually
O(N) when dealing with pure UTF-8 data, the key operations being "get
string length" and "get pointer to X'th character". The changes that
made those two operations normally constant-time were a key part of how
we addressed the performance problems of the switch to Unicode
characters (switch in Tcl 8.1, additional representation work in 8.2 or
8.3).

The big deal about changing O(1) to O(N)? Easy; algorithms that use the
ops go from O(N) to O(N**2), or O(N**2) to O(N**3), and that's *awful*!

>    2. … preventing any "string-shimmering" to prevent the continuous
>       destruction and reconstruction of internal structures (e.g.
>       nested dict structures), that may happen because developers
>       don't care about shimmering, don't know about reasons of
>       shimmering, or are not willing/able to change habits, or are
>       C(++) developers, which a bit of work in tcl - so no deeply
>       involved tcl'ers!

A constraint we had was that no string operation could fail; the type
signature of the public string operations in the Tcl library has no
option for failure, no mechanism for reporting a problem. The other
constraint was that we could not alter the size of the Tcl_Obj
structure in any way (particularly in a non-debugging build) due to the
widespread use of that structure in third-party code.

Aside from that, I really don't understand what you're looking for
there.

Donal.