Re: [Tcl9-cloverfield] Happy new year!
From: Andy G. <unu...@ai...> - 2009-01-15 17:08:48
"Frédéric Bonnet" <fre...@fr...> wrote:
> Still alive ;-)

"I'm doing science and I'm still alive!"

> What you describe sounds like skiplists.

Yeah, I think you're right.

> OTOH my implementation is "container" based (if I understand what
> you mean by container),

I was thinking of a "classical" tree implementation wherein the tree
nodes are malloc'ed on the heap, contain pointers to each other, and
have pointers to individually malloc'ed string data. I was contrasting
that with the string pointers pointing into flat data already allocated
elsewhere in large chunks, so that the number of string allocations is
(or can be) much smaller than the number of leaf nodes.

> I get very good performances thanks to the allocator and GC. With the
> stock allocator I think I'll get the locality problems that you
> mention.

I will definitely have to check out your code. Perhaps tonight after
work. Or maybe if I get sufficiently bored, I won't wait until I get
home. :^) That's not likely to happen; I have to port a huge
application from Qt3 to Qt4, and it uses Qt Designer and custom
widgets. By "that's not likely to happen", I of course mean that it's
not likely I will ever get home. ;^)

> There would be no difference between a rope-based string and a
> btree-based list, aside from memory management: as lists contain
> objects, they must be traversed as well by a mark-and-sweep GC.

I suspect performance would suffer a little bit if the garbage
collector had to separately track each character. ;^) I'm wondering if
the GC can be extended to treat sequential blocks of objects as a
single entity, so that it can handle both a character string and a
single large object without special casing. This might not be
worthwhile if character strings are the only place where the number of
objects in the block is greater than one. Try to think of cases where
it would be useful for the GC to treat an array of something other
than characters as a single logical unit.
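The contrast I have in mind can be sketched as a toy rope whose leaves
reference spans of a shared, already-allocated buffer instead of owning
per-leaf copies. Illustrative Python only; all the names here are mine,
not from either of our implementations:

```python
# Sketch: rope leaves hold (buffer, start, length) views into one
# large shared chunk, so leaf count can greatly exceed allocation count.

class Leaf:
    """A leaf references a span of a shared buffer -- no private copy."""
    def __init__(self, buf, start, length):
        self.buf, self.start, self.length = buf, start, length

    def text(self):
        return self.buf[self.start:self.start + self.length]

class Concat:
    """An interior node; its length is the sum of its children's."""
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.length = left.length + right.length

    def text(self):
        return self.left.text() + self.right.text()

# One chunk allocated once; both leaves share it.
chunk = "hello, rope world"
rope = Concat(Leaf(chunk, 0, 5), Leaf(chunk, 5, 12))
print(rope.text())  # hello, rope world
```

A mark-and-sweep GC in this layout would only have one buffer object to
track per chunk, rather than one allocation per leaf.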
> Regarding performances, some benchmarking I've run shows a real bonus
> when handling large objects:

Interesting. How will the block size limit impact the extension
programmer when dealing with large objects? I guess it will be possible
to subdivide large objects into smaller objects with some kind of
master object that points to them. And arrays (which I find are the
main reason why objects get to be large) can be handled by this new
code you're about to write.

> - traversal is near linear in most case because we can keep contextual
> info in iterator structures and directly access the leaves that
> contain actual data

Can you explain this for me?

> Since memory block size is bounded, heap fragmentation is less
> critical, as large data can be spread over several disjoint blocks.

Awesome. ;^) I remember dealing with a certain much-hated image
generator that sometimes would SIGBUS on startup when run on IRIX.
Eventually I figured out that this was due to a debug option that
caused it to malloc a single block of memory many hundreds of
megabytes in size. Sometimes IRIX couldn't find a place in memory for
it to go, so the malloc failed, but the image generator assumed
success. To fix it, I had to reduce the size of the array, but this
also reduced the period of time during which the system could record
debug data. Since the developers apparently had never heard of ring
buffers, after this time expired it just stopped capturing data, so I
had to work fast. It would also collect data during all the time it
spent "sitting on the runway" while waiting for the host system to
come up to speed, the operator to create an exercise, and the pilot to
preflight the jet, ignite the engines, and take off. For some reason,
having the image generator guys fix their stupid code was out of the
question, even though they should have been paying me for (black-box!)
debugging and profiling their system for them. Shrug!
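For reference, the ring buffer I'm alluding to is trivial: a
fixed-capacity recorder that overwrites the oldest samples never stops
capturing and never grows. Illustrative Python, obviously nothing like
the image generator's actual code:

```python
from collections import deque

# Fixed-capacity recorder: once full, each new sample evicts the
# oldest, so capture continues indefinitely in bounded memory.
class DebugRecorder:
    def __init__(self, capacity):
        self.samples = deque(maxlen=capacity)

    def record(self, sample):
        self.samples.append(sample)

rec = DebugRecorder(capacity=3)
for t in range(10):
    rec.record(t)
print(list(rec.samples))  # the three most recent samples: [7, 8, 9]
```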
> memory allocation is actually much faster with ropes than with flat
> lists, which more than compensates for the overhead.

The primary overhead is additional burden placed on the programmer to
keep everything within size limits.

> So in practice flat allocations delegate the algorithmic complexity to
> the heap allocator.

The traditional approach is to abstract stuff like this away so that
the programmer doesn't have to worry about it. But the costs don't go
away; they just get hidden and forgotten. Some of the most interesting
designs I've seen arose from reëvaluating the historical layers of
abstraction. Simple example: simplify the CPU by requiring that things
be aligned, use the extra die area for registers or cache or execution
units, and improve the compiler to statically deal with alignment.
Basically relocate complexity from a place where it is expensive (in
the CPU) to a place where it is cheap (in the compiler).

> Moreover a tree-based rope means that several ropes can share the same
> data.

I'm counting on this. ;^) One rope will contain the entire script,
another will contain a procedure body, a third will contain a list, a
fourth will contain a string. The procedure is a subset of the script,
the list is a literal entity in the procedure, and the string is an
element in the list. Yet across all these ropes there is only one copy
of the string I mentioned. Even after being concatenated with
something else, there is only one copy of it, even if that
concatenation is used as a second procedure body in programmatically
generated code.

> Regarding Tcl, this means that the string rep of a container can
> share a large part of its elements' and take up much less memory
> than flat strings (besides being faster to allocate and build).

Will there be heuristics to decide when to join lots of tiny strings
into a new array? If the program concatenates thousands of
single-character strings, will this be efficient with your current
rope code?
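To make that last question concrete, here is one possible heuristic --
purely my speculation, not your code: concatenation normally just
shares nodes, but when the pieces involved are tiny they get flattened
into a single leaf, so thousands of one-character appends don't build a
degenerate one-node-per-character tree:

```python
SMALL = 32  # flattening threshold; the value is arbitrary

class Leaf:
    def __init__(self, s):
        self.s, self.length = s, len(s)
    def text(self):
        return self.s

class Node:
    def __init__(self, l, r):
        self.l, self.r = l, r
        self.length = l.length + r.length
    def text(self):
        return self.l.text() + self.r.text()

def concat(a, b):
    # Tiny + tiny: copy into one flat leaf instead of adding a node.
    if a.length + b.length <= SMALL:
        return Leaf(a.text() + b.text())
    # Append-heavy pattern: merge a tiny right operand into the
    # rightmost leaf while that leaf is still small.
    if (isinstance(a, Node) and isinstance(a.r, Leaf)
            and a.r.length + b.length <= SMALL):
        return Node(a.l, Leaf(a.r.text() + b.text()))
    return Node(a, b)  # otherwise share both operands, copy nothing

r = Leaf("")
for ch in "abcdefgh" * 100:
    r = concat(r, Leaf(ch))
print(r.text()[:8], r.length)  # abcdefgh 800
```

With the threshold at 32, the 800 single-character appends above yield
a tree roughly 800/32 nodes deep instead of 800, at the cost of some
copying inside each small leaf.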
> "Andy Goth" <unu...@ai...> a écrit :
> > Argument expansion has proven to be very valuable to Tcl, so of
> > course it stays. But I don't extend it to allow multiple levels of
> > expansion.
>
> In Cloverfield it is only extended to be an accepted markup syntax in
> list string reps.

set a (0 1 2)
set b ({*}$a 3 4 5)
set c {{*}$a 3 4 5}
puts [lindex $b 0]
puts [lindex $c 0]

What will this print? Pay careful attention to the use of parentheses
versus braces.

> I don't think multiple levels of expansion is useful
> (not sure what you mean by that).

set a (((0 1) (2 3)) ((4 5) (6 7)))
set b {*}$a
set c {*}{*}$a

would set b to ((0 1) (2 3) (4 5) (6 7)) and c to (0 1 2 3 4 5 6 7).
But I don't think this is worth having. The Tcl approach is nice and
simple. Just recognize it once.

> #{ }# should nest so that the following works as expected:
>
> #{ if {$cond1} { }#
> if {$cond2} {
>     ...
> }
>
> This means that sole braces don't match comment openings, inline
> comments are delimited by complete digrams.

That's a better idea; I hadn't considered the need to put unbalanced
braces inside inline comments. Which is silly of me, since I know we
support unbalanced braces inside whole-line comments, even when those
comments are inside brace-quoted script bodies.

While on the subject of comments, there's something I forgot to
mention in my previous email. Like in Cloverfield, I want to allow
ordinary whole-line comments to begin anywhere a word can. In Tcl they
can only begin where a command can. This is confusing to many people.
Although, to be perfectly honest, restricting them to begin only at
word starts is confusing too, but only because this is contrary to
many other popular languages that allow them everywhere except inside
quotes. But I think this restriction is perfectly valid, because
quoting is also only allowed to begin at word starts, and I like the
symmetry. (Shell-style languages allow both comments and quotes to
begin anywhere, so they're symmetric as well.)
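Python's splicing operator is a handy comparison here (an analogue
only, not Cloverfield semantics): it applies exactly one level of
expansion, exactly where written, never recursively:

```python
a = [[0, 1], [2, 3]]

# One level of splicing: a's elements become elements of b,
# but the nested pairs are not themselves flattened.
b = [*a, 4, 5]
print(b)  # [[0, 1], [2, 3], 4, 5]

# A second level of flattening has to be requested explicitly:
flat = [x for sub in a for x in sub]
print(flat)  # [0, 1, 2, 3]
```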
> > One kind, which somewhat corresponds to C pointers, requires
> > explicit dereferencing. The string representation of such a
> > reference is "object 123" or similar. The other kind, which
> > somewhat corresponds to C++ references, is dereferenced
> > automatically. The string representation is the instantaneous
> > value of the referent variable.
>
> Your references seem close to Cloverfield's. Your pointers seem to be
> opaque handles like Jim's references.

Yes, the pointers are opaque handles. The language directly supports
"dereferencing" them by mixing @ into a $ substitution. But I have to
be careful with my terms here, since "dereferencing" a pointer will
yield a (transparent, implicit) reference to the variable, which
"decays" to the instantaneous value of the pointed-to variable. (A
pointer is an opaque, explicit reference.) So is "dereferencing" the
right word to use for the @? I probably should find a different word,
like "traverse" or "follow". You "follow" a pointer to obtain a
(transparent) reference to the variable it points to.

> I don't get all of it but it sounds interesting. Example code
> welcome!

I wrote this last night, but didn't save it. I'll type it from memory.
It demonstrates references (not pointers) and the funky expanded proc
notation.

proc &whatever (a? (b default_value) c d* e?) {
    if {[info exists &a]} {
        puts a=$a
    }
    puts b=$b
    puts c=$c
    puts d=$d
    if {[info exists &e]} {
        puts e=$e
    }
}

whatever c
whatever a c
whatever a b c
whatever a b c e
whatever a b c d1 e
whatever a b c d1 d2 e

All the object names are written either as a reference or a
substitution, so that the interpreter knows that they are names. I
believe this is important for indexing. Note: proc's second argument
does not do this, because otherwise proc would receive a list of
references to global variables. I hope this seeming inconsistency is
not too confusing. :^(

The proc dispatch code first binds all required arguments to
variables.
Next it binds optional arguments, from left to right. Last it binds
variadic arguments. This approach is backward-compatible with the
traditional Tcl/C++/Python style of requiring required arguments,
optional arguments, and variadic arguments to appear in that order in
the proc or function definition, but it maps more closely to code that
takes a required argument at the end and supports options at the
beginning.

Here's more.

set &a 1
set &mydict (alpha &a bravo b charlie &c)
puts $mydict           -> error: reference to nonexistent variable "c"
puts $mydict(alpha)    -> 1
puts $mydict(bravo)    -> b
puts $mydict(charlie)  -> error: reference to nonexistent variable "c"
puts $c                -> error: no such variable "c"
set &c 0
puts $mydict           -> alpha 1 bravo b charlie 0
set &mydict(alpha) 2
set &mydict(bravo) 3
set &mydict(charlie) 4
puts $mydict(alpha)    -> 2
puts $mydict(bravo)    -> 3
puts $mydict(charlie)  -> 4
puts $a                -> 2
puts $c                -> 4

Clearly, it's problematic to store references to nonexistent
variables. Especially bad is the interactive case, since the shell
will try to print the string representation of the return value of
every command, but this is an error for the second [set]. The user
will think that the error was because [set] failed, but in fact [set]
succeeded and the shell failed. I'm tempted to call the whole thing
off, but references to nonexistent variables are essential, or else it
would be impossible to ever create a variable in the first place! So I
think that shells and interactive debuggers need an alternative to
taking the string representation, something that breaks down the
structure a little bit more. I'm thinking of how Tkcon has "hot"
errors that can be clicked to get the backtrace. If the above was
entered into a program like Tkcon, the displayed return value for the
second line would be "alpha 1 bravo b charlie _&c_", where _&c_ is
underlined, in red, and can be clicked for more information. Tricky
stuff!
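In sketch form, the binding order described above (required first,
then optionals left to right, then the variadic tail) works out to
something like the following. Python analogue; the tuple-based
signature format is purely for illustration, not the real proc syntax:

```python
# Each formal is (name, kind) with kind in {"req", "opt", "var"}.
def bind(formals, args, defaults=None):
    """Bind actuals to formals: required first, then optionals left to
    right, then the variadic formal soaks up whatever remains."""
    defaults = defaults or {}
    n_req = sum(1 for _, k in formals if k == "req")
    n_opt = sum(1 for _, k in formals if k == "opt")
    spare = len(args) - n_req
    if spare < 0:
        raise TypeError("not enough arguments")
    opt_fill = min(spare, n_opt)  # optionals claimed, left to right
    var_len = spare - opt_fill    # leftovers go to the variadic

    bound, pos = {}, 0
    # Walk the formals in declared order, consuming actuals positionally.
    for name, kind in formals:
        if kind == "req":
            bound[name] = args[pos]; pos += 1
        elif kind == "opt":
            if opt_fill > 0:
                bound[name] = args[pos]; pos += 1; opt_fill -= 1
            elif name in defaults:
                bound[name] = defaults[name]
        else:  # variadic
            bound[name] = list(args[pos:pos + var_len]); pos += var_len
    return bound

# Mirrors "proc &whatever (a? (b default_value) c d* e?)":
sig = [("a", "opt"), ("b", "opt"), ("c", "req"), ("d", "var"), ("e", "opt")]
print(bind(sig, ["c"], {"b": "default_value"}))
print(bind(sig, ["a", "b", "c", "d1", "d2", "e"]))
```

Note how the second call gives d1 and d2 to the variadic d while e, an
optional declared after d, still gets filled first, matching the
"whatever a b c d1 d2 e" case above.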
I prefer EIAS, but I don't know how to do it in this specific case.
There can be an [info] command to get the reference status of a word,
returning a tagged list giving the partial string representation and
information about the references. I should note that most programs
won't have to bother with this, only interactive consoles that take it
upon themselves to print all return values, even when the user doesn't
ask for them.

Now for pointer trickery. Thankfully, pointers always have string
representations, even when they point to variables that don't exist
yet. ;^)

set &mydict(alpha) @mydict(charlie)
# (object 1) is now a pointer to &mydict(charlie)
puts $mydict             -> alpha (object 1) bravo 3 charlie 4
puts $mydict(alpha)      -> object 1
puts $mydict(alpha)@     -> 4
puts $mydict(alpha)@@    -> error: invalid pointer "4"
set $mydict(alpha)@ @d
# (object 2) is now a pointer to &d
puts $mydict             -> alpha (object 1) bravo 3 charlie (object 2)
puts $mydict(alpha)      -> object 1
puts $mydict(alpha)@     -> object 2
puts $mydict(alpha)@@    -> error: pointer to nonexistent variable "d"
puts $d                  -> error: no such variable "d"
set $mydict(alpha)@@ @a
# (object 3) is now a pointer to &a
puts $mydict             -> alpha (object 1) bravo 3 charlie (object 2)
puts $d                  -> object 3
puts $mydict(alpha)      -> object 1
puts $mydict(alpha)@     -> object 2
puts $mydict(alpha)@@    -> object 3
puts $mydict(alpha)@@@   -> object 1
puts $mydict(alpha)@@@@  -> object 2
puts $mydict(alpha)@@@@@ -> object 3

Let's analyze the line "set $mydict(alpha)@@ @a". "$mydict(alpha)"
substitutes to "object 1", which is followed (first @) to obtain
"object 2", which is followed again (second @) to obtain a reference
to d. So set's first argument is &d. "@a" substitutes to "object 3",
and as a side effect this pointer is added to the interpreter's
pointer lookup table. set creates d ('cuz it doesn't already exist)
and stores into it the string "object 3" with internal representation
"pointer to a".
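The opaque-handle machinery above can be modeled as an
interpreter-wide table mapping "object N" strings to variable names,
with @ as a one-step lookup. Hypothetical Python model, names mine:

```python
# Model: pointers are strings like "object 1" registered in a lookup
# table; following a pointer (one @) resolves it to the variable it
# names, whether or not that variable exists yet.
class Interp:
    def __init__(self):
        self.vars = {}     # variable name -> value
        self.ptrs = {}     # "object N"    -> variable name
        self.next_id = 1

    def make_pointer(self, varname):   # like writing @varname
        handle = "object %d" % self.next_id
        self.next_id += 1
        self.ptrs[handle] = varname
        return handle

    def follow(self, handle):          # one @ step
        if handle not in self.ptrs:
            raise ValueError('invalid pointer "%s"' % handle)
        return self.ptrs[handle]

interp = Interp()
interp.vars["charlie"] = 4
p = interp.make_pointer("charlie")     # first handle issued
print(p)                               # object 1
print(interp.vars[interp.follow(p)])   # 4
```

Following a second @ on the value 4 would fail here too, since "4" was
never registered as a handle, which is exactly the 'invalid pointer
"4"' error shown above.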
Remember that $mydict(alpha) is &a, so "object 3" points to the same
data as $mydict(alpha). This explains the circular reference.

Another:

set &mydict (alpha @a bravo @b)
set &a (1 2 3 4)
set &b (5 6 7 8)
puts $mydict            -> alpha (object 1) bravo (object 2)
puts $mydict(alpha)     -> object 1
puts $mydict(alpha)@    -> 1 2 3 4
puts $mydict(alpha)@{0} -> 1
puts $mydict(alpha)@(1) -> 2

See how easily @ can be intermixed with indexing? Also it's possible
to create multiple pointers or references to the same variable before
actually creating the variable. Pointers to a and b are obtained on
line 1, then references to those same variables are obtained on lines
2 and 3.

Moreover, see how much more sane this last example is? The only place
it ever puts references to nonexistent variables is into the first
argument of set, which never tries to take the string representation
of its first argument.

Now let me try out the "anonymous references" feature. Which is
probably not a good name for it.

proc &whatever () {return (5 4 3 2 1 0)}
set &data [whatever]
puts $data{0}           -> 5
puts $&[whatever]{0}    -> 5
puts $&(5 4 3 2 1 0){0} -> 5

Look ma, no [lindex]!

> > - Numbers can be expressed in sexagesimal notation: 12'34'56.78 .
>
> Interesting. Consider also Eiffel-style grouping using underscore,
> e.g. 1_000_000 is one million. This improves readability a lot.

Are the underscores simply ignored, or are there rules about where
they can be placed?

> > - Variables, procs, commands, channels, etc. are all in a single
> > object store. Objects can be local to a stack frame.
>
> Would this allow RAII style programming?

Yes, the objects will have suitable destructors which will be executed
when the garbage collector reaps the final reference. For example, a
channel will be closed. Having the garbage collector work on more than
just data objects will be a major boon to functional programming.
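On the underscore question, one reasonable rule set (my guess at what
Eiffel-style grouping implies, not a statement about Eiffel itself):
the underscores contribute nothing to the value, but must appear
singly and strictly between digits. A small validating reader:

```python
import re

# Accept digits grouped by single interior underscores, e.g. 1_000_000;
# reject leading, trailing, or doubled underscores. (Guessed rule set.)
GROUPED = re.compile(r'^\d+(_\d+)*$')

def parse_grouped(s):
    if not GROUPED.match(s):
        raise ValueError('bad grouped literal: %r' % s)
    return int(s.replace('_', ''))

print(parse_grouped("1_000_000"))  # 1000000
for bad in ("_1", "1_", "1__0"):
    try:
        parse_grouped(bad)
    except ValueError:
        print("rejected", bad)
```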
Something similar to TclOO should be exposed at the script level,
except that the user-defined destructor is automatically invoked by
the garbage collector when the last reference goes away due to unset
or falling off the stack.

> > - An object can have multiple representations, not just two.
>
> I plan the same feature for Cloverfield. I'd suggest using simple
> arrays or linked lists instead of hash tables, since the number of
> internal reps would be typically small. A hash table would be overkill
> for that IMHO.

Yeah, a short array would probably be a better choice. There will only
ever be a handful of valid representations at a time, typically just
one or two. Would it be worthwhile to keep the string separate from
the array of internal representations, or can it be lumped with the
rest as "just another representation"? I think I favor the latter.
What is the benefit of special-casing the string representation?

> > - Word origins are tracked and accessed with [info origin]. This
> > facilitates syntax error messages.
>
> This would serve the same purpose of Cloverfield's {metadata}.
> Typically a debug-only feature.

We will have to investigate the run-time overhead of supporting this.
I had hoped that this feature could always be available, since it will
make it possible to put filenames and line numbers in backtraces. When
code or data comes from something other than a file, things get really
interesting. If it's read from a network socket, what are the filename
and line number? Perhaps the IP address and timestamp. If it's read
from the GUI, now what is the line number? Heh.

> > And that's what I have so far...
>
> Can I steal a couple of ideas? ;-)

No, you can't have them! They're MINE!! ;^) Well, maybe just a few. If
you promise to share some of yours. :^)

I'm a couple hours late for work. Time to go!! Thanks for taking the
time to read my super-long emails.

-- 
Andy Goth | http://andy.junkdrome.org/
unununium@{aircanopy.net,openverse.com}