Re: [TCLCORE] Agressive shimmering

On 18/01/2013 12:22, Martin Lemburg wrote:
> I have a possibly naive opinion on this topic:
>
>  1. string commands don't manipulate the given data, but return results
>     of string operations on the given data, right?

Ah, but what is a string and what is a manipulation?

>  2. string commands could work on the UTF-8 string representations of
>     Tcl_Objs or on the Unicode string held within Tcl_Objs, right?

There are a few other representations they can work with in the current
implementation. Right now, we have:

   * UTF-8 representation — can coexist with any other representation
   * "Unicode" representation — 16 bits per character, replaces any
     other non-UTF8 representation if requested.
   * Byte-array representation — 8 bits per character, only handles
     restricted range of characters, replaces any other non-UTF8
     representation if requested.

All fundamental string operations can be defined on any of the above
(though we might have omitted actually doing that for some) and user
code is not supposed to care about the difference under normal
circumstances. The cost of conversion between representations is
normally linear in the length of the string.
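
To illustrate why conversion is linear, here is a minimal sketch of going
from UTF-8 to a 16-bits-per-character array. The function name
utf8_to_u16 is hypothetical and this is not Tcl's actual code; it assumes
well-formed input restricted to sequences of one to three bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of Tcl's API. Decodes UTF-8 sequences of
 * 1 to 3 bytes into 16-bit units and returns the number of units written.
 * Every input byte is visited once, so the cost is linear in nbytes. */
size_t utf8_to_u16(const unsigned char *src, size_t nbytes,
                   uint16_t *dst, size_t dstcap)
{
    size_t i = 0, n = 0;

    assert(src != NULL && dst != NULL);
    while (i < nbytes && n < dstcap) {
        unsigned char b = src[i];

        if (b < 0x80) {                                    /* 0xxxxxxx */
            dst[n++] = b;
            i += 1;
        } else if ((b & 0xE0) == 0xC0 && i + 1 < nbytes) { /* 110xxxxx */
            dst[n++] = (uint16_t)(((b & 0x1F) << 6) | (src[i + 1] & 0x3F));
            i += 2;
        } else if ((b & 0xF0) == 0xE0 && i + 2 < nbytes) { /* 1110xxxx */
            dst[n++] = (uint16_t)(((b & 0x0F) << 12)
                                  | ((src[i + 1] & 0x3F) << 6)
                                  | (src[i + 2] & 0x3F));
            i += 3;
        } else {
            /* Anything else (malformed or longer sequences) is copied
             * through byte by byte; a real implementation would do more. */
            dst[n++] = b;
            i += 1;
        }
    }
    return n;
}
```

The point is only that there is no shortcut: the whole buffer has to be
walked once per conversion.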

Oh, and there's also the empty string. That's a special case as Tcl uses
the empty string so often to mean other things (like the "null" result).

>  3. if a string representation does not exist, it is (re)created, right?

Yes. This is potentially an expensive operation.

>  4. so why do the string commands need to shimmer the Tcl_Objs anyway?

Depends on what the operation is, of course. If you're asking for the
last character of the string ([string index $str end]) you need the
string to be able to get that information. Computing the last character
of a UTF-8 string isn't that expensive really, but handling the general
case of an arbitrary index is. Other operations also need indexing
(e.g., taking a substring) or computing the length of the string, and
this is information that the UTF-8 representation does not provide us
with cheaply (the 'length' field of the Tcl_Obj record describes the
number of bytes used, excluding the terminating NUL, and not the number
of characters those bytes encode).
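
A rough sketch of the three cases above, on a raw UTF-8 buffer. These
helper names are made up for illustration and are not Tcl functions; the
code assumes well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Continuation bytes have the form 10xxxxxx. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Character count: every byte must be inspected, so this is O(N). */
size_t utf8_char_count(const unsigned char *s, size_t nbytes)
{
    size_t i, count = 0;

    assert(s != NULL);
    for (i = 0; i < nbytes; i++) {
        if (!is_continuation(s[i])) {
            count++;
        }
    }
    return count;
}

/* Byte offset of the last character: step back over the continuation
 * bytes at the end of the buffer, which is cheap. */
size_t utf8_last_char_offset(const unsigned char *s, size_t nbytes)
{
    size_t i = nbytes;

    assert(s != NULL && nbytes > 0);
    do {
        i--;
    } while (i > 0 && is_continuation(s[i]));
    return i;
}

/* Byte offset of character number idx: walk from the start, O(idx).
 * Returns nbytes when idx is past the end. */
size_t utf8_offset_of(const unsigned char *s, size_t nbytes, size_t idx)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < nbytes; i++) {
        if (!is_continuation(s[i])) {
            if (idx == 0) {
                return i;
            }
            idx--;
        }
    }
    return nbytes;
}
```

Note how the byte length (6 for a buffer holding "h", U+00E9, U+20AC)
and the character count (3) differ, which is why the 'length' field
alone cannot answer [string length].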

As it happens (and all IIRC) the presence of a "string" representation
record doesn't imply that we have actually allocated a buffer for the
"Unicode" data array. It could just be recording the fact that the data
is really ASCII-and-not-NUL, so the 'bytes' field is directly usable and
the 'length' field is meaningful as a character count.
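
The kind of check implied here could look roughly like the following.
The function name is invented for this sketch and I am not claiming this
is how the core does it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical check, not a Tcl function. Returns 1 when every byte is
 * in the range 1..127. In that case each byte is exactly one character,
 * so the byte length is the character count and character index k lives
 * at byte offset k. */
int is_ascii_without_nul(const unsigned char *bytes, size_t length)
{
    size_t i;

    assert(bytes != NULL);
    for (i = 0; i < length; i++) {
        if (bytes[i] == 0 || bytes[i] >= 0x80) {
            return 0;
        }
    }
    return 1;
}
```

The scan itself is still linear, but it only has to be done once and
needs no second buffer.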

>  5. shimmering costs runtime, and working on the UTF-8 encoded string
>     representation too. Is it certain that shimmering costs less runtime?

It's usually the case that if someone uses a value as if it were of some
type (e.g., "integer", "list" or "string") then they will do so in the
future as well. This is the principle on which the Tcl_Obj mechanism is
built. The only curious thing is that there's a separate "string"
representation as well as the standard UTF-8 rep; that was introduced in
8.2 or 8.3 (I forget which) to address the killer performance problems
of 8.1.
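
A toy model of that principle, to make the bet concrete. ToyValue and
toy_get_int are invented for this sketch; the real Tcl_Obj is more
involved and is not reproduced here.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy value with a string form plus a cached typed form. This is an
 * illustration only, not the layout of Tcl_Obj. */
typedef struct {
    const char *bytes;  /* string form, always present in this sketch */
    int has_int;        /* 1 once the integer form has been cached */
    long int_value;     /* cached integer form */
    int conversions;    /* how often we had to parse, for demonstration */
} ToyValue;

/* Use the value as an integer. The first call pays for parsing; later
 * calls reuse the cached result, which is the bet that a value used as
 * some type once will be used that way again. */
long toy_get_int(ToyValue *v)
{
    assert(v != NULL && v->bytes != NULL);
    if (!v->has_int) {
        v->int_value = strtol(v->bytes, NULL, 10);
        v->has_int = 1;
        v->conversions++;
    }
    return v->int_value;
}
```

Shimmering is what happens when that bet loses: the cached form gets
thrown away for a different one, then has to be rebuilt later.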

>  6. developer habits, like always using "[string length $t] == 0"
>     instead of "$t eq {}" are killing internal representations, but this
>     is not expected by many devs and probably counterproductive.
>     Especially if, for example, the internal representation is a 3D body
>     of facets (data of facets after a tessellation from within a CAD
>     system, with a name or token (like the file handles/tokens) as string
>     representation).

The really irritating thing is that we can't currently detect such tests
and replace them with more efficient checks.

>  7. Why not simplify all this by …
>      1. … defaulting the operation base for string operations to the
>         string representation, if not of the type string, and otherwise
>         to the typed string?

I don't really understand what you mean by this. The problem is that
some operations that developers tend to assume to be O(1) in complexity
are actually O(N) when dealing with pure UTF-8 data, the key operations
being "get string length" and "get pointer to X'th character". The
changes that made those two operations be normally constant time were a
key part of how we addressed the performance problems of the switch to
Unicode characters (switch in Tcl 8.1, additional representation work in
8.2 or 8.3).

The big deal about changing O(1) to O(N)? Easy: algorithms that use the
ops go from O(N) to O(N**2), or O(N**2) to O(N**3), and that's *awful*!
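
To put numbers on that, here is a small sketch that counts how many
bytes get examined when a loop fetches each character by index from
pure UTF-8 data. The function names are hypothetical and the code is
only meant to show the growth rate.

```c
#include <assert.h>
#include <stddef.h>

/* Find the byte offset of character idx by walking from the start,
 * adding the number of bytes examined to *steps. */
static size_t offset_of(const unsigned char *s, size_t nbytes,
                        size_t idx, size_t *steps)
{
    size_t i;

    for (i = 0; i < nbytes; i++) {
        (*steps)++;
        if ((s[i] & 0xC0) != 0x80) {   /* not a continuation byte */
            if (idx == 0) {
                return i;
            }
            idx--;
        }
    }
    return nbytes;
}

/* Simulate "for each k, get the k'th character" over nchars characters
 * and return the total number of bytes examined. */
size_t steps_for_indexed_loop(const unsigned char *s, size_t nbytes,
                              size_t nchars)
{
    size_t k, steps = 0;

    assert(s != NULL);
    for (k = 0; k < nchars; k++) {
        (void)offset_of(s, nbytes, k, &steps);
    }
    return steps;
}
```

For an N-character ASCII string this examines N(N+1)/2 bytes (10 for
N=4, 36 for N=8), where a fixed-width array would need just N lookups.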

>      2. … preventing any "string-shimmering" to prevent the continuous
>         destruction and reconstruction of internal structures (e.g.
>         nested dict structures), that may happen because developers
>         don't care about shimmering, don't know about reasons of
>         shimmering, or are not willing/able to change habits, or are
>         C(++) developers who do a bit of work in Tcl - so not deeply
>         involved Tcl'ers!

A constraint we had was that no string operation could fail; the type 
signature of the public string operations in the Tcl library has no 
option for failure, no mechanism for reporting a problem. The other 
constraint was that we could not alter the size of the Tcl_Obj structure 
in any way (particularly in a non-debugging build) due to the widespread 
use of that structure in third-party code.

Aside from that, I really don't understand what you're looking for there.

Donal.
