Re: [Tcl9-cloverfield] [TCLCORE] Variable access

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Hi Frédéric,

I think a reference implementation (pun intended!), or some semi- 
formal description, would clarify this discussion immensely. Until  
then, some not-so-brief comments are below. I've tried to trim down  
to just the key points, so I've trimmed much of the rest of the  
discussion.

On 14 May 2008, at 17:20, Frédéric Bonnet wrote:
[...]
> (I'm going to digress a bit from the original topic)
>
> For example, let's consider the following asymetric directed graph:
>
> A --> B
> |   / |
> |  /  |
> V L   V
> C --> D
>
> Obviously this cannot be represented by a tree, so pure Tcl lists are
> useless in this case. Now if we use nodes with references, we get the
> following :
>
> Node  Children
> A     B C
> B     C D
> C     D
> D     {}

My immediate thought is that graphs are best represented by  
relations, rather than references, so for any serious graph work, I'd  
use a relational database like SQLite, or a relational language like  
Prolog. For a small example like this, I'd probably use my own  
digraph package (http://wiki.tcl.tk/18129 - based on dicts), which  
results in a similar solution to your first array solution:

>
> We may store each node in an array, using node names as keys:
>
>      array set graph {
>          A {B C}
>          B {C D}
>          C {D}
>          D {}
>      }
>
> But here each child is represented by its key, not its variable name,
> which is implicitly the array. So in a real world scenario, we get:
>
>      array set graph {
>          A {graph(B) graph(C)}
>          B {graph(C) graph(D)}
>          C {graph(D)}
>          D {}
>      }

Why would you do this? As I mentioned before, references always need  
some mapping (interpretation) which defines what they refer to.  
Trading a reference which is defined relative to the graph itself  
(i.e. simple keys) for one which is defined relative to the current  
variable namespace, doesn't seem to gain much (and makes it harder to  
e.g. clone the graph).

> Now, if we want to manage the lifetime of each node, we have to  
> maintain
> bookkeeping info somewhere. IOW, we have to write an ad-hoc garbage
> collector.

Sure. If you use a new system of reference (e.g. a dict, or an array  
etc), then you need to manage the lifetimes of the objects contained  
in it. It's not clear in this example what particular strategy of  
memory management is desired: do you want a node to disappear when no  
edge leads to it? Or when the only reference to it is itself (via a  
circular reference)? Or do you want edges to be removed when the node  
they target is removed? It's not obvious to me that a general-purpose  
reference GC does the right thing for the application, without  
knowing what the application is.

[... snip graph representation ...]

> This looks slightly cryptic, but note that this is the serialized  
> value.
> In practice one hardly uses this notation but rather builds structures
> programmatically:
>
>      # Create each node with an initially empty child node
>      set a (A ())
>      set b (B ())
>      set c (C ())
>      set d (D ())
>
>      # Now add all children (element 1 of each node)
>      set a{1} ($&b $&c)
>      set b{1} ($&c $&d)
>      set c{1} ($&d)
>
>      # Put all node in one single variable
>      set graph ($&a $&b $&c $&d)
>
> The idea beind Cloverfield's serialization format is that any  
> reference
> can be recreated using any of its string representation. The technique
> consists in serializing both the referenced value and a suitable
> identifier. [...]

Right, but this still leaves lots of questions. For instance, it is  
quite easy to manufacture references in this system:

set a {ref A}(1 2 3) ;# $a is now a reference with id "A" and val (1  
2 3)
set b {ref A}(a b c)

Now what? Does setting b override the reference in $a? What is used  
to determine which of these conflicting reference definitions is  
current? Is it the time at which they are converted to reference  
type? (And when does that happen?) There is also possible problems of  
accidental (or deliberate) name clashes, and associated security  
issues (e.g. if an attacker can manufacture any reference as a  
string, then he can have a field day feeding you values that contain  
manufactured references, either to capture information or to clobber  
it). You could mitigate these problems to some extent by making the  
$&var syntax return a reference whose ID is some cryptographically  
unforgeable string, but it seems you couldn't get rid of the  
situation entirely. (Being able to pass mutable references where only  
values were previously allowed in general would need a careful  
security audit).

Another scenario:
set a {ref A}(1 2 3)
set b $&a

As I understand it, both $a and $b now refer to the reference ("A")  
that points at the value (1 2 3). Now, let's say that for some reason  
that reference "A" loses its internal rep (say somebody performs an  
operation with $a that converts it to some other type). Furthermore,  
lets say that by some further operation causes a copy-on-write so  
that $a and $b now are distinct Tcl_Objs, but that they still have an  
identical string rep. IIRC, the following might cause that situation:

   append a " "
   set a [string range $a 0 end-1]

So now we have two distinct Tcl_Objs $a and $b that used to be the  
same reference but now aren't and neither has a reference internal  
rep. Now, say $a is stashed away in some data structure and forgotten  
about. Meanwhile, $b is treated as a reference again and re- 
establishes the reference, recreating it if necessary from the string  
rep. Now the user happily updates $b, mutating the reference, to say  
(a b c). A little later, $b is unset and the reference disappears  
again. Now, $a is brought out of hiding, and brushed down, and then  
an attempt is made to read the reference: what does it return?  
Presumably, as the reference has now gone, it is recreated again from  
the string rep of $a, reverting to (1 2 3) -- all the updates to $b  
have been lost, even though these references were supposed to be the  
same.

On the flip side of that coin, imagine that $b isn't destroyed before  
$a is retrieved. Here the connection would be reestablished and the  
updates saved. But this itself might be confusing. If the programmer  
decides to print $a before retrieving the reference value he will see  
that it apparently has the value (1 2 3) still, but then when the  
reference is resolved it will get the updates written to $b  
(presumably) and thus actually contain the value (a b c).

Now it may be that you intend that the [puts $a] would already do the  
conversion and retrieval of the appropriate reference. However, I'm  
not sure how this would work -- does [puts] (and every other command)  
have to scan its argument to see if it looks like a reference before  
using it?

This is the problem with strings as references, they're just too easy  
to bugger about with. You could try something like Feather's opaque  
internal reps (which can't be shimmered), but this then definitely  
moves you outside of EIAS.

>  To handle circular references, a value is only serialized
> when it is first encountered in the current serialization process. In
> the above example, A's children is (B C D), so when serializing the
> whole graph, A is first serialized, then its first child B, then B's
> children and so on, so that when nodes need to be serialized again
> later, we only need to output the ref id, and the whole value can be
> omitted.

But then, presumably if this entire structure gets shimmered (e.g. by  
an append), all of these embedded references lose their internal reps  
(possibly causing the actual reference to be GC'd). Now if we extract  
one of these later nodes that only have the reference ID then we no  
longer have enough information to recreate that reference, as we have  
lost both the original reference and any string rep that could be  
used to recreate it. The integrity of each individual reference is  
only preserved as long as the integrity of the entire structure is  
preserved.

[...]
>
>> I don't really understand what benefits these references have over
>> just using unique globally-scoped variable names.
>
> Lifetime management, mutable vs. immutable, copy-on write come to  
> mind.

Lifetime management I get, but surely references are both mutable and  
not copy-on-write (which would make them immutable), just like  
variables.

> The challenge is to create a system that preserves the existing Tcl
> philosophy while allowing strong references. The lack of such  
> references
> is the reason why DOM bindings such as TclDOM or tdom require special
> handling of the DOM's childNodes method. For example, tdom's  
> childNodes
> require a plain immutable Tcl list, whereas childNodesLive returns a
> "live", opaque, mutable list object. In other languages the structures
> are simply mutable and the DOM semantics is preserved.

My point is that real "strong" references are entirely incompatible  
with EIAS.

>> [...]

>
>> set d [dict create foo 12 bar $&jim]
>> puts $sock $d
>> ... on other end ...
>> set d [read $sock]
>> set $&d(bar) ... ;# is that the right syntax?
>>
>> Does that work? I assume for that to work that the string rep of a
>> reference must identify itself as a reference (e.g. the "{ref foo}
>> bar" syntax), but then how do I get just the current value from a
>> reference if $myref returns the reference?
>
> $myref works as expected, whether it holds a reference or not.
>
> For example:
>
>      % set jim (1 2 3)
>      % set d [dict create foo 12 bar $&jim baz $&jim]
>      foo 12 bar {ref <id>}(1 2 3) baz {ref <id>}{}
>      % set jim (a b c)
>      % set d
>      foo 12 bar {ref <id>}(a b c) baz {ref <id>}{}
>      % set d(baz)
>      a b c

Why does the item denoted by $d(baz) have a different string rep when  
serialised inside the dictionary to when it is used on its own? I.e.,  
why isn't the output of that last command:

     % set d(baz)
     {ref <id>}{}

or (more sophisticated):

     % set d(baz)
     {ref <id>}(a b c)

?

Presumably, [set] is looking at the value at $d(baz) and seeing (by  
inspection of the internal rep, or by string matching if the rep has  
been lost) that it looks like a reference and then doing another de- 
reference to get the actual value? Whereas the $&foo syntax just  
grabs the reference itself and returns it's string rep? Except that  
in this case 'jim' doesn't contain a reference to begin with, but is  
a normal variable, so $&jim creates the reference, presumably. Then,  
do subsequent $&jim's guarantee to return the same reference, or a  
fresh one? Or are variables themselves implemented as references?

> [...]
>
> With the new system, commands will have to decide whether they expect
> variable names, or values. Commands expecting variable names would  
> come
> in small number, as they would be limited to declarative commands such
> as global or variable, or introspection. They are the kind that would
> benefit from proc enhancements, such as auto-upvar'ing with & prefix,
> but wouldn't from references. So the vast majority of commands would
> expect values and not variable names. Of these commands, some would  
> work
> in a purely functional way with immutable values, others could accept
> both values and references. At the script level everything would be
> handled transparently: modify an argument variable, and the change  
> would
> propagate downwards if the passed argument was a reference. Weak
> references to non-existing variables can be detected by checking  
> for the
> existence of the argument variable.

Again, I worry about the security problems this introduces, and more  
generally the difficulties it introduces. For example, consider some  
code like the following:

proc execute-query {query} {
     lappend query -user foo -password sekret
     db eval $query
}
set i [interp create -safe]
$i alias query execute-query
$i eval $untrustedcode

In current Tcl that is safe: the username and password remain in  
trusted code. With mutable references there is now a brand new line  
of attack where the caller simply passes a reference instead of a  
value and can now read back the user name and password once the query  
completes. Mutable references should not be transparently  
interchangeable with values. The security aspects are just the most  
extreme example of the problems this can cause. Much code is written  
now with the assumption that changes to local variables aren't  
visible outside of the proc unless a specific broadening of scope is  
performed (via upvar, variable, global etc).

(Incidentally, while testing this I discovered that a safe slave can  
introspect the aliases defined in it, including retrieving the target  
command that would be executed in the master interpreter, which seems  
like a security flaw of its own).

-- Neil
This message has been checked for viruses but the contents of an attachment
may still contain software viruses, which could damage your computer system:
you are advised to perform your own checks. Email communications with the
University of Nottingham may be monitored as permitted by UK legislation.