Re: [Tcl9-cloverfield] [TCLCORE] Variable access
Status: Alpha
Brought to you by:
fbonnet
From: Neil M. <ne...@Cs...> - 2008-05-14 20:18:31
|
Hi Frédéric, I think a reference implementation (pun intended!), or some semi- formal description, would clarify this discussion immensely. Until then, some not-so-brief comments are below. I've tried to trim down to just the key points, so I've trimmed much of the rest of the discussion. On 14 May 2008, at 17:20, Frédéric Bonnet wrote: [...] > (I'm going to digress a bit from the original topic) > > For example, let's consider the following asymetric directed graph: > > A --> B > | / | > | / | > V L V > C --> D > > Obviously this cannot be represented by a tree, so pure Tcl lists are > useless in this case. Now if we use nodes with references, we get the > following : > > Node Children > A B C > B C D > C D > D {} My immediate thought is that graphs are best represented by relations, rather than references, so for any serious graph work, I'd use a relational database like SQLite, or a relational language like Prolog. For a small example like this, I'd probably use my own digraph package (http://wiki.tcl.tk/18129 - based on dicts), which results in a similar solution to your first array solution: > > We may store each node in an array, using node names as keys: > > array set graph { > A {B C} > B {C D} > C {D} > D {} > } > > But here each child is represented by its key, not its variable name, > which is implicitly the array. So in a real world scenario, we get: > > array set graph { > A {graph(B) graph(C)} > B {graph(C) graph(D)} > C {graph(D)} > D {} > } Why would you do this? As I mentioned before, references always need some mapping (interpretation) which defines what they refer to. Trading a reference which is defined relative to the graph itself (i.e. simple keys) for one which is defined relative to the current variable namespace, doesn't seem to gain much (and makes it harder to e.g. clone the graph). > Now, if we want to manage the lifetime of each node, we have to > maintain > bookkeeping info somewhere. IOW, we have to write an ad-hoc garbage > collector. Sure. If you use a new system of reference (e.g. a dict, or an array etc), then you need to manage the lifetimes of the objects contained in it. It's not clear in this example what particular strategy of memory management is desired: do you want a node to disappear when no edge leads to it? Or when the only reference to it is itself (via a circular reference)? Or do you want edges to be removed when the node they target is removed? It's not obvious to me that a general-purpose reference GC does the right thing for the application, without knowing what the application is. [... snip graph representation ...] > This looks slightly cryptic, but note that this is the serialized > value. > In practice one hardly uses this notation but rather builds structures > programmatically: > > # Create each node with an initially empty child node > set a (A ()) > set b (B ()) > set c (C ()) > set d (D ()) > > # Now add all children (element 1 of each node) > set a{1} ($&b $&c) > set b{1} ($&c $&d) > set c{1} ($&d) > > # Put all node in one single variable > set graph ($&a $&b $&c $&d) > > The idea beind Cloverfield's serialization format is that any > reference > can be recreated using any of its string representation. The technique > consists in serializing both the referenced value and a suitable > identifier. [...] Right, but this still leaves lots of questions. For instance, it is quite easy to manufacture references in this system: set a {ref A}(1 2 3) ;# $a is now a reference with id "A" and val (1 2 3) set b {ref A}(a b c) Now what? Does setting b override the reference in $a? What is used to determine which of these conflicting reference definitions is current? Is it the time at which they are converted to reference type? (And when does that happen?) There is also possible problems of accidental (or deliberate) name clashes, and associated security issues (e.g. if an attacker can manufacture any reference as a string, then he can have a field day feeding you values that contain manufactured references, either to capture information or to clobber it). You could mitigate these problems to some extent by making the $&var syntax return a reference whose ID is some cryptographically unforgeable string, but it seems you couldn't get rid of the situation entirely. (Being able to pass mutable references where only values were previously allowed in general would need a careful security audit). Another scenario: set a {ref A}(1 2 3) set b $&a As I understand it, both $a and $b now refer to the reference ("A") that points at the value (1 2 3). Now, let's say that for some reason that reference "A" loses its internal rep (say somebody performs an operation with $a that converts it to some other type). Furthermore, lets say that by some further operation causes a copy-on-write so that $a and $b now are distinct Tcl_Objs, but that they still have an identical string rep. IIRC, the following might cause that situation: append a " " set a [string range $a 0 end-1] So now we have two distinct Tcl_Objs $a and $b that used to be the same reference but now aren't and neither has a reference internal rep. Now, say $a is stashed away in some data structure and forgotten about. Meanwhile, $b is treated as a reference again and re- establishes the reference, recreating it if necessary from the string rep. Now the user happily updates $b, mutating the reference, to say (a b c). A little later, $b is unset and the reference disappears again. Now, $a is brought out of hiding, and brushed down, and then an attempt is made to read the reference: what does it return? Presumably, as the reference has now gone, it is recreated again from the string rep of $a, reverting to (1 2 3) -- all the updates to $b have been lost, even though these references were supposed to be the same. On the flip side of that coin, imagine that $b isn't destroyed before $a is retrieved. Here the connection would be reestablished and the updates saved. But this itself might be confusing. If the programmer decides to print $a before retrieving the reference value he will see that it apparently has the value (1 2 3) still, but then when the reference is resolved it will get the updates written to $b (presumably) and thus actually contain the value (a b c). Now it may be that you intend that the [puts $a] would already do the conversion and retrieval of the appropriate reference. However, I'm not sure how this would work -- does [puts] (and every other command) have to scan its argument to see if it looks like a reference before using it? This is the problem with strings as references, they're just too easy to bugger about with. You could try something like Feather's opaque internal reps (which can't be shimmered), but this then definitely moves you outside of EIAS. > To handle circular references, a value is only serialized > when it is first encountered in the current serialization process. In > the above example, A's children is (B C D), so when serializing the > whole graph, A is first serialized, then its first child B, then B's > children and so on, so that when nodes need to be serialized again > later, we only need to output the ref id, and the whole value can be > omitted. But then, presumably if this entire structure gets shimmered (e.g. by an append), all of these embedded references lose their internal reps (possibly causing the actual reference to be GC'd). Now if we extract one of these later nodes that only have the reference ID then we no longer have enough information to recreate that reference, as we have lost both the original reference and any string rep that could be used to recreate it. The integrity of each individual reference is only preserved as long as the integrity of the entire structure is preserved. [...] > >> I don't really understand what benefits these references have over >> just using unique globally-scoped variable names. > > Lifetime management, mutable vs. immutable, copy-on write come to > mind. Lifetime management I get, but surely references are both mutable and not copy-on-write (which would make them immutable), just like variables. > The challenge is to create a system that preserves the existing Tcl > philosophy while allowing strong references. The lack of such > references > is the reason why DOM bindings such as TclDOM or tdom require special > handling of the DOM's childNodes method. For example, tdom's > childNodes > require a plain immutable Tcl list, whereas childNodesLive returns a > "live", opaque, mutable list object. In other languages the structures > are simply mutable and the DOM semantics is preserved. My point is that real "strong" references are entirely incompatible with EIAS. >> [...] > >> set d [dict create foo 12 bar $&jim] >> puts $sock $d >> ... on other end ... >> set d [read $sock] >> set $&d(bar) ... ;# is that the right syntax? >> >> Does that work? I assume for that to work that the string rep of a >> reference must identify itself as a reference (e.g. the "{ref foo} >> bar" syntax), but then how do I get just the current value from a >> reference if $myref returns the reference? > > $myref works as expected, whether it holds a reference or not. > > For example: > > % set jim (1 2 3) > % set d [dict create foo 12 bar $&jim baz $&jim] > foo 12 bar {ref <id>}(1 2 3) baz {ref <id>}{} > % set jim (a b c) > % set d > foo 12 bar {ref <id>}(a b c) baz {ref <id>}{} > % set d(baz) > a b c Why does the item denoted by $d(baz) have a different string rep when serialised inside the dictionary to when it is used on its own? I.e., why isn't the output of that last command: % set d(baz) {ref <id>}{} or (more sophisticated): % set d(baz) {ref <id>}(a b c) ? Presumably, [set] is looking at the value at $d(baz) and seeing (by inspection of the internal rep, or by string matching if the rep has been lost) that it looks like a reference and then doing another de- reference to get the actual value? Whereas the $&foo syntax just grabs the reference itself and returns it's string rep? Except that in this case 'jim' doesn't contain a reference to begin with, but is a normal variable, so $&jim creates the reference, presumably. Then, do subsequent $&jim's guarantee to return the same reference, or a fresh one? Or are variables themselves implemented as references? > [...] > > With the new system, commands will have to decide whether they expect > variable names, or values. Commands expecting variable names would > come > in small number, as they would be limited to declarative commands such > as global or variable, or introspection. They are the kind that would > benefit from proc enhancements, such as auto-upvar'ing with & prefix, > but wouldn't from references. So the vast majority of commands would > expect values and not variable names. Of these commands, some would > work > in a purely functional way with immutable values, others could accept > both values and references. At the script level everything would be > handled transparently: modify an argument variable, and the change > would > propagate downwards if the passed argument was a reference. Weak > references to non-existing variables can be detected by checking > for the > existence of the argument variable. Again, I worry about the security problems this introduces, and more generally the difficulties it introduces. For example, consider some code like the following: proc execute-query {query} { lappend query -user foo -password sekret db eval $query } set i [interp create -safe] $i alias query execute-query $i eval $untrustedcode In current Tcl that is safe: the username and password remain in trusted code. With mutable references there is now a brand new line of attack where the caller simply passes a reference instead of a value and can now read back the user name and password once the query completes. Mutable references should not be transparently interchangeable with values. The security aspects are just the most extreme example of the problems this can cause. Much code is written now with the assumption that changes to local variables aren't visible outside of the proc unless a specific broadening of scope is performed (via upvar, variable, global etc). (Incidentally, while testing this I discovered that a safe slave can introspect the aliases defined in it, including retrieving the target command that would be executed in the master interpreter, which seems like a security flaw of its own). -- Neil This message has been checked for viruses but the contents of an attachment may still contain software viruses, which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation. |