tcl9-cloverfield Mailing List for Tcl9 (Page 3)
Status: Alpha
Brought to you by:
fbonnet
You can subscribe to this list here.
2008 |
Jan
|
Feb
(3) |
Mar
(9) |
Apr
(22) |
May
(64) |
Jun
(13) |
Jul
(1) |
Aug
|
Sep
|
Oct
|
Nov
|
Dec
|
---|---|---|---|---|---|---|---|---|---|---|---|---|
2009 |
Jan
(12) |
Feb
(1) |
Mar
|
Apr
(1) |
May
|
Jun
|
Jul
|
Aug
|
Sep
(1) |
Oct
|
Nov
|
Dec
|
2010 |
Jan
|
Feb
|
Mar
|
Apr
|
May
|
Jun
|
Jul
|
Aug
(1) |
Sep
|
Oct
(2) |
Nov
|
Dec
(1) |
2011 |
Jan
|
Feb
(1) |
Mar
(1) |
Apr
(1) |
May
|
Jun
(1) |
Jul
|
Aug
|
Sep
|
Oct
|
Nov
(5) |
Dec
(1) |
From: Frédéric B. <fb...@en...> - 2008-05-28 09:19:26
|
[For unknown reasons I didn't get all the messages on my main email account, so it took some time for me to respond. Sorry about that] Lars Hellström wrote: > At 2008-05-15 13.55, Frédéric Bonnet wrote: >> Hi Neil, >> >> Neil Madden wrote: > [snip] > >> My point is that real "strong" references are entirely > >> incompatible with EIAS. > > > > Hmmm. I'm not sure what you mean by that. More precisely, > > what does EIAS means exactly? To me, it's the fact that > > everything can be expressed as strings, and that internal > > structures can be recreated from their string representation. > > The Cloverfield reference serialization format meets > > these criteria IMHO. > > This ignorance, if I may say so, seems central to the conception of > Cloverfield; I wonder if it is genuine of merely tactical (pretending > not to know better in order to get away with misrepresenting the > issues). Assuming the former however, I can say the key point is that > the arguments of commands are in Cloverfield not known to be strings, > and since command arguments are certainly in the traditional > "everything" of EIAS, it follows that Cloverfield is not EIAS. So I'm either genuinely ignorant, or tactically dishonest. Sweet. I tend to find this statement slightly insulting but since I'm on a good mood I'll give you the benefit of the doubt. Arguments of commands in Cloverfield ARE strings, as with Tcl. I don't know what makes you think otherwise. So your conclusion falls flat. > To elaborate on the "Cloverfield command arguments need not be strings" > issue: The set of strings over a particular alphabet (which in current > Tcl is the Unicode BMP) is a mathematically well-defined concept, and > presumably the intent is not that Cloverfield should start interpreting > every string containing unfortunate character combinations as commands > to do something -- this would be the way of macro processors such as m4, > and that way lies a quoting swamp -- but continue to handle strings as > Tcl do. Nonetheless, Cloverfield is expected to handle values (such as > references) that when operated upon produce results (e.g. side-effects) > that are different from everything that could result from applying the > same operation on a string. Hence references are not strings. I don't know what you mean by that exactly. Using the same arguments, one could argue that variables or commands are not strings, but I fail to understand how it relates to EIAS as we know it in current Tcl. > Now, what you in "the tridekalogue" invest a considerable effort in is > ensuring that any command call can be written straight off as a string. > This is certainly an important part of The Tcl Way, but that alone does > not give you EIAS. The command may be a string before parsing and > substitution, but after that it would rather have to be a list of > "things" from some proper superset of that of all strings. What you > _might_ do is work some parsing rules in reverse and recover a string > representation for the _list_ of those things; this would bring you > "every list is a string", which is reminiscent of, but a weaker property > than, EIAS. How is that different from the way Tcl handles lists? EIAS ensures that any given value has a string representation at any stage of its transformations, which forms a bijection between the string and internal representations. > To illustrate, > > {ref a}((1 {ref a}{}) (2 {ref a}{})) > > is a perfectly good string, and it might also be the string > representation for a Cloverfield list. Since this is one word, the list > will have one element. That element is not a string, but it may be > entered as a Cloverfield command argument, so that in order to generate > the above, one can evaluate the command > > list {ref a}((1 {ref a}{}) (2 {ref a}{})) I see your point, but it has nothing to do with EIAS. This example just shows that the proposed serialization format doesn't allow an element to reference its root container, because prefixing the string with a {ref} would form a single word and not a list. That's a problem that I totally overlooked, thanks for pointing that out. However the fix is quite easy: let's use the empty string as the reference ID for the root. So: % set v {ref a}((1 {ref a}{}) (2 {ref a}{})) (1 {ref {}}{}) (2 {ref {}}{}) The elements that reference the root container no longer needs the latter to be explicitly prefixed by a {ref id}, which would generate an invalid string rep. So the following gives the same result as the above: % set v ((1 {ref {}}{}) (2 {ref {}}{})) (1 {ref {}}{}) (2 {ref {}}{}) Here v holds a 2-element list whose elements contain a reference to itself. No more inconsistency. > Concretely, if one was to do > > set u ({ref a}(1 2) {ref a}{}) > set v ({ref a}(1 2) {ref a}{}) > > then presumably the values of u and v would have equal string > representation, but be distinct in memory. If one then does > > set w $&v{0} > > will the string representations of u and v still be equal? No they won't, because getting a strong reference from v's element 0 is a mutable operation. > If they are, > then the results of the commands > > linsert $u 1 $&w > linsert $v 1 $&w > > will have to be equal as well, but I doubt this is the intended > behaviour. Rather, you would want the result of the latter to be a list > where all three elements carry the same reference label, whereas in the > result of the former, the middle element should carry a reference label > different from that of the other two elements. Exactly. Moreover in the former the middle element will carry an external reference, other references being internal. In the latter all references will be external. > For that to happen, while > preserving Every List Is A String, the very act of obtaining a reference > to a list element (substitution of $&v{0}) must change the string > representation of that list as a whole. True. By analogy with the set/$ duality in Tcl, getting a reference using $& is sugar for e.g. [reference get], which is a mutable operation. And there's nothing in the current semantics of either Tcl or Cloverfield that prevents commands from having side effects. You seem to favor an interpretation of EIAS that is purely functional. > From there, one might go on to question the supposed locality of > Cloverfield references -- although there might be syntax for local > references that can be used to create data structures, doing anything > with that data structure will force the references to be global -- but > I'll stop here. That's wrong. Global (or external) references are only created when aliasing occurs, e.g. when creating a strong reference between separate values. Good practices will favor purely functional operations or weak references, which are free of any side effects as they are just glorified function names. From a graph theory perspective, strong references are undirected (bidirectional) edges, whereas weak references are directed edges. So creating a strong reference between separate values has the same effect as creating an edge between unconnected graphs, which obviously has side effects on the overall connectivity of both graphs. The proposed reference notation is just a way to label entry points: if a graph has several entry points then it cannot be expressed using only internal references. > PS: To be frank: I believe Cloverfield is/will be an utter waste of good > effort -- an exercise that seeks to afflict the very heart of Tcl with > severe featuritis !-( -- and the sooner such projects die the better. ;-] Thanks for your support, I'll do my best. |
From: Alexandre F. <ale...@gm...> - 2008-05-22 22:30:26
|
On 5/23/08, Frédéric Bonnet <fre...@fr...> wrote: > > Very interesting discussion (and a bit heated at times). Oddly enough, > it took place at the same time I was designing references in > Cloverfield. There must have been some kind of mental link between us at > this moment ;-) :-) > [...] In short there seems to be a general consensus > that references and EIAS don't mix well. Yes, that's a truth I am as ready to emphasize today as I was eager to challenge a few months ago. See, I can learn ;-) > > - late rope-to-string (Cloverfield): no mutability overhead, but no > > cacheing of the linearized string of a rope. > > > > - string rep caching (Tcl): variable mutability overhead (a,b), but > > full cacheing of string reps until invalidated. > Regarding the string representation, I must clarify my point of view. I > put EIAS (and hence the concept of string representation) on a more > conceptual level than the current implementation does. I don't think > that generating a string representation in the traditional sense is a > strong requirement. Well, unless it costs you too much to regenerate the same string thousands of times per second ! > Rather, I think that we should provide a means to > obtain (notice the choice of word) this string representation on demand, > for example using C++-style iterators, or using rope structures. > Besides, there is a strong duality between iterators and ropes: they are > two facets of the same concept, the former on the algorithmic level, the > latter on the structural level. Yeah of course, but caches are the oil that allows many gears to roll smoothly. Without them, much more than Tcl would come to a grinding halt ! The ->bytes field of Tcl_Obj can be seen as just that: a cache of the on-demand computation embodied by UpdateStringOf* functions. Then there is a delicate trade-off between the relative costs of forward string computation and cache consistency enforcement. Addressing this trade-off correctly is the result of the superior insight behind Tcl 8.0 (within the realm of immutable values). Addressing it with immutables is another story, but I doubt the solution will be "no cache". > Moreover I think that implementing references or partially mutable > values is simply impossible in the existing implementation, given the > current semantics of Tcl (COW, pass-by-value, immutability...), and that > is the cause of the (relative) failure of your patch. You're roughly right, but the actual reason is a bit more subtle. COW is rather easily handled by proper perturbation of Tcl_IsShared (see the code). The show-stopper in my case was the interaction between mutables and uncontrolled iterators. Indeed, while it is easy to notice that iterating over a mutable poses a problem, it is impossible to prevent an extension defining its own iterators from taking (say) a snaphot of the "elements" array of a (secretly mutable) list. It's more a question of contract (in a years-old API) between extensions and the core than anything else. The API was born and designed in an immutable one, and it keeps the stigmata :-( > That is also the > reason why Cloverfield needs to be a global approach, because any > partial approach would eventually clash with some other part of the core. Yes, I have to agree with this. > My proposal implies a complete redesign of the core in term of graphs. > Data structures are collections of nodes. Nodes are bound to each other > by directed edges. Nodes hold string values, and edges are references. > Variables are entry points to a graph. A node's string rep can be > obtained by iterating over its adjacent nodes. At present Tcl data > structures are trees, which are acyclic directed graphs, and whose > string rep is generated by a depth-first traversal. Cloverfield proposes > to extend the syntax and the serialization format, so as to introduce > references and allow the representation of arbitrary directed graphs > (potentially cyclic). There are a lot of implications regarding the > implementation (backrefs and friends), but I think it's doable (I've > already done that kind of things in other projects, details upon > request). The first casualties will be Tcl_Objs and variables as we know > them, so that means a lot of incompatibilities. Yes. Simple refcounting GC gone. Go to mark-and-sweep. Worry about GC-triggering situations, and ugly worst cases, etc... You're rebuilding the Empire State Building, and you're right to start with the foundation, but what's not clear is whether you'll be reusing more than a few bricks from Tcl ... but maybe you want only the dwellers, not the bricks ;-) -Alex |
From: Frédéric B. <fre...@fr...> - 2008-05-22 22:00:56
|
Alexandre Ferrieux wrote: > On 5/21/08, Larry McVoy <lm...@bi...> wrote: >>> My goal with Cloverfield is to keep the existing Tcl semantics >>> (copy-on-write, pass-by-value) while adding references. This implies >>> that the string rep of a structure holding references may change over >>> time. This would be grossly inefficient if the whole string rep had to >>> be recreated over and over again, especially for large structures, so we >>> need an efficient string representation that allows for selective >>> invalidation of substrings. >> Is this really true? It seems to me it is only true for things wanting >> to get at the string rep. And yes, you want to be able to puts($obj) >> but if all the code you are calling is compiled and knows how to get >> at the internals isn't that a have-your-cake-and-eat-it-too? > > I concur with Larry. See my attempts with the Mutability patch earlier > this year, and also a (long) thread on clt exploring different options > for this: > > http://groups.google.com/group/comp.lang.tcl/tree/browse_frm/thread/b6c26a33f5ebd3e0/b9e4201c7c7c6c6c?hl=en&rnum=101&q=fredderic+ferrieux&_done=%2Fgroup%2Fcomp.lang.tcl%2Fbrowse_frm%2Fthread%2Fb6c26a33f5ebd3e0%2Fb94505c1f6b335db%3Fhl%3Den%26lnk%3Dgst%26q%3Dfredderic%2Bferrieux%26#doc_b9e4201c7c7c6c6c Very interesting discussion (and a bit heated at times). Oddly enough, it took place at the same time I was designing references in Cloverfield. There must have been some kind of mental link between us at this moment ;-) I took some time to read the whole discussion. A lot of concepts were developed that are quite similar to what I propose in Cloverfield. Also, there seems to be a demand from one part of the community, and a rejection from an opposite part, either from plain dismissal of the concept, or out of frustration that it may even be implementable in the current state of Tcl. In short there seems to be a general consensus that references and EIAS don't mix well. > Basically, two approaches were imagined: > > (a) keep track of back references (from containee to container) and > invalidate string reps "up the container hierarchy" when a mutable > object has been updated. > > (b) keep a list of mutable objects whose string rep has been > computed, and mass-invalidate them when any time a mutable is poked > into. > > Only (b) was implemented in my stillborn patch (you can imagine why). > But, as Larry suggest, unless you stuff your code with [puts], the > performance hit of mass invalidation is fairly small in practice. > > Notice that this was done within true Tcl, where string reps are true > strings, not ropes. Of course the efficiency trade-off is between: > > - late rope-to-string (Cloverfield): no mutability overhead, but no > cacheing of the linearized string of a rope. > > - string rep caching (Tcl): variable mutability overhead (a,b), but > full cacheing of string reps until invalidated. > > Do you agree with this summary ? Do you see alternatives ? Regarding the string representation, I must clarify my point of view. I put EIAS (and hence the concept of string representation) on a more conceptual level than the current implementation does. I don't think that generating a string representation in the traditional sense is a strong requirement. Rather, I think that we should provide a means to obtain (notice the choice of word) this string representation on demand, for example using C++-style iterators, or using rope structures. Besides, there is a strong duality between iterators and ropes: they are two facets of the same concept, the former on the algorithmic level, the latter on the structural level. So a string representation only needs to be enforced at the script level. At the internal level, the iterator abstraction would be used instead of a plain string. This means that this concept cannot be used efficiently in the current state of Tcl. IOW, a rope object type won't solve our problems. Rather, Tcl_Objs should use ropes instead of plain strings for string representation. (*) Moreover I think that implementing references or partially mutable values is simply impossible in the existing implementation, given the current semantics of Tcl (COW, pass-by-value, immutability...), and that is the cause of the (relative) failure of your patch. That is also the reason why Cloverfield needs to be a global approach, because any partial approach would eventually clash with some other part of the core. My proposal implies a complete redesign of the core in term of graphs. Data structures are collections of nodes. Nodes are bound to each other by directed edges. Nodes hold string values, and edges are references. Variables are entry points to a graph. A node's string rep can be obtained by iterating over its adjacent nodes. At present Tcl data structures are trees, which are acyclic directed graphs, and whose string rep is generated by a depth-first traversal. Cloverfield proposes to extend the syntax and the serialization format, so as to introduce references and allow the representation of arbitrary directed graphs (potentially cyclic). There are a lot of implications regarding the implementation (backrefs and friends), but I think it's doable (I've already done that kind of things in other projects, details upon request). The first casualties will be Tcl_Objs and variables as we know them, so that means a lot of incompatibilities. This explains why Cloverfield targets Tcl 9. (*) There are ways to plug ropes in place of C strings even in the current implementation. See for example the wiki page on ropes (http://wiki.tcl.tk/rope) and take a look at the original paper on "C Cords". There is also some valuable info on the page about SGI implementation details. |
From: David W. <dav...@gm...> - 2008-05-22 17:32:09
|
> I've had a look at Hecl a while back and I think it works remarkably well. > > Contrary to Cloverfield, which keeps Tcl's copy-on-write semantics, Hecl > uses references exclusively, as does Java, so it shifts away from Tcl > semantics (for example, [incr $i] increments variable i, and [list $a $b > $c] is a "live" list), but this totally makes sense since it's closely > tied to Java. However I remember that an early version allowed both > values and references, the latter using an ampersand instead of a > dollar. Can you confirm this? Yes, that was part of my experimentation. We ended up not using it, but that's part of the fun of doing something that's not tied to the past:-) > My goal with Cloverfield is to keep the existing Tcl semantics > (copy-on-write, pass-by-value) while adding references. This implies > that the string rep of a structure holding references may change over > time. This would be grossly inefficient if the whole string rep had to > be recreated over and over again, especially for large structures, so we > need an efficient string representation that allows for selective > invalidation of substrings. Unfortunately Java doesn't seem to be > suitable for this task because of its everything-is-an-object, > all-reference approach: locality of reference is poor and would kill the > performances in this critical part of the implementation. That is why I > plan to build the reference implementation from scratch using either C > or (sanitized) C++, the primary runtime target being the LLVM > infrastructure. However Java is a very likely secondary target because > of its ubiquity. The string representation I have in mind is based on > immutable, constant strings, so this is a good match with the Java model. > > I'll take a look at the Hecl code base to see whether it can make a > suitable experimental testbed. Having a prototype implementation of > Cloverfield references would certainly help people grasp the concept and > eliminate the design errors. If Hecl's not your cup of tea, other small Tcl implementations like Jim, or Jacl might be good targets as well. -- David N. Welton http://www.welton.it/davidw/ http://www.dedasys.com/ |
From: Alexandre F. <ale...@gm...> - 2008-05-22 11:44:55
|
On 5/21/08, Larry McVoy <lm...@bi...> wrote: > > My goal with Cloverfield is to keep the existing Tcl semantics > > (copy-on-write, pass-by-value) while adding references. This implies > > that the string rep of a structure holding references may change over > > time. This would be grossly inefficient if the whole string rep had to > > be recreated over and over again, especially for large structures, so we > > need an efficient string representation that allows for selective > > invalidation of substrings. > > Is this really true? It seems to me it is only true for things wanting > to get at the string rep. And yes, you want to be able to puts($obj) > but if all the code you are calling is compiled and knows how to get > at the internals isn't that a have-your-cake-and-eat-it-too? I concur with Larry. See my attempts with the Mutability patch earlier this year, and also a (long) thread on clt exploring different options for this: http://groups.google.com/group/comp.lang.tcl/tree/browse_frm/thread/b6c26a33f5ebd3e0/b9e4201c7c7c6c6c?hl=en&rnum=101&q=fredderic+ferrieux&_done=%2Fgroup%2Fcomp.lang.tcl%2Fbrowse_frm%2Fthread%2Fb6c26a33f5ebd3e0%2Fb94505c1f6b335db%3Fhl%3Den%26lnk%3Dgst%26q%3Dfredderic%2Bferrieux%26#doc_b9e4201c7c7c6c6c Basically, two approaches were imagined: (a) keep track of back references (from containee to container) and invalidate string reps "up the container hierarchy" when a mutable object has been updated. (b) keep a list of mutable objects whose string rep has been computed, and mass-invalidate them when any time a mutable is poked into. Only (b) was implemented in my stillborn patch (you can imagine why). But, as Larry suggest, unless you stuff your code with [puts], the performance hit of mass invalidation is fairly small in practice. Notice that this was done within true Tcl, where string reps are true strings, not ropes. Of course the efficiency trade-off is between: - late rope-to-string (Cloverfield): no mutability overhead, but no cacheing of the linearized string of a rope. - string rep caching (Tcl): variable mutability overhead (a,b), but full cacheing of string reps until invalidated. Do you agree with this summary ? Do you see alternatives ? -Alex |
From: <lm...@bi...> - 2008-05-21 15:50:56
|
> My goal with Cloverfield is to keep the existing Tcl semantics > (copy-on-write, pass-by-value) while adding references. This implies > that the string rep of a structure holding references may change over > time. This would be grossly inefficient if the whole string rep had to > be recreated over and over again, especially for large structures, so we > need an efficient string representation that allows for selective > invalidation of substrings. Is this really true? It seems to me it is only true for things wanting to get at the string rep. And yes, you want to be able to puts($obj) but if all the code you are calling is compiled and knows how to get at the internals isn't that a have-your-cake-and-eat-it-too? -- --- Larry McVoy lm at bitmover.com http://www.bitkeeper.com |
From: Frédéric B. <fb...@en...> - 2008-05-21 14:13:35
|
Hi Andy, Andy Goth wrote: > I am still a very busy boy, basically working multiple jobs in two different > states. There's ordinary paywork in Texas, there's a house rental racket I do > in Oklahoma, and now I also help work on a flight simulator in OK. So that's > why I'm so ridiculously behind on answering emails! That's OK, I'm busy too ;-) (preparing Daughter v3.0 by the end of June). > On Thu, 24 Apr 2008 16:57:33 +0200, Frédéric Bonnet wrote >> Compared to the previous version, I've rewritten it a bit so as to >> use a single getToken, versus one per rule previously. This >> simplifies things a lot. > > I haven't read it in great detail, but it looks to me like your new code is > somewhere halfway between my code and your old code, in terms of balancing > between the extremes of one-huge-state-machine and one-proc-per-state. Is > that correct? You seem to have concluded that having a smaller number of > slightly more complex state machines is a simplification, that trivial state > machines are more trouble than they're worth. Indeed. My first parser was just a collection of sub-parsers for each quoting rule. This was fine for experimentation purpose, but once things stabilize a bit, it's better to do some factorization. More specifically, the previous implementation had specific tokenizers for each rule, whereas the latest has only one. The main advantage is that the tokenizer is no longer context-sensitive (*). This means that the tokenizer can run on the whole source data. Useful for e.g. syntax colorers. It also allows for a plug-in C implementation (which is MUCH faster according to my tests). Last, keeping sub-parsers for each rule makes the code more maintainable IMHO. (*) actually it still is, but in a minor fashion: char grouping rules differ slightly depending on the context, for optimization purpose. In practice this means that a context-free tokenizer will generate separate sequences that would end up being grouped together by the parser in a given context. For example with the string "a b", the context-free tokenizer generates 3 sequences (NAME/WHITESPACE/NAME) whereas in the context of rule [4] it would generate 1 sequence (TEXT). In both cases a LITERAL syllable is generated, so the parser output is the same. >> Moreover I've plugged in a C implementations using CriTcl when available. > > Interesting approach. Do you think this makes the code more readable or > faster? Or are you primarily interested in limited experimentation with C > and/or learning how to use Critcl to interface C with Tcl, or perhaps about > language mixing in general? It wasn't clear to me whether the cprocs are the > only implementation present in the code or if all the C stuff was also > implemented in Tcl. First, I like the experimental and incremental techniques that this approach allows. When code stabilizes it can be migrated to C easily. Moreover the extra Tcl <-> C layer helps clarifies the internal structure. Learning how to use CriTcl is secondary to me, because it's rather straightforward (which is quite impressive BTW). Concerning readability, it depends ;) Some things are easier in C than in Tcl, for other things it's the contrary. So a mixed approach is interesting. Regarding procs vs. cprocs, I provided both versions, so that the code runs everywhere. That way people can test it without having to install a full-blown dev env (should they want to test it with a regular GCC-backed CriTcl). One last point: if we build the interp core over the LLVM framework, it's interesting to know that the latter also provides a dynamic C/C++ compilation environment based on GCC that emits LLVM bitcode. So providing a CriTcl-like extension for a LLVM-based Cloverfield would give us a portable scripted compiler with JIT and runtime optimizations. That's one of the hidden goals of Cloverfield (and a potential killer app), but shhhh, it's a secret ;-) >> Regarding CriTcl, I'm using the TCC-based Odyce extension bundled >> with the eTcl distrib. > > Have the sources for that been made available, or can it only be had in binary > form? I totally agree that it's a good thing that it be self-contained, which > is why I'm interested to know. AFAIK it's unavailable outside of eTcl, unfortunately. But eTcl works quite well (it's a plain Tcl8.5 distrib), and can even run on mobile platforms. And from us developers' point of view, it frees us from portability issues. |
From: Frédéric B. <fb...@en...> - 2008-05-21 09:58:17
|
Hi David, David Welton wrote: > For those wishing to play around with something, I think that Hecl is > in some ways a bit simpler and easier to hack than Tcl, and might be a > fun sandbox for people. Hecl currently has references, as we wanted > to be able to handle references to Java objects without using a hash > table to look them up. I've had a look at Hecl a while back and I think it works remarkably well. Contrary to Cloverfield, which keeps Tcl's copy-on-write semantics, Hecl uses references exclusively, as does Java, so it shifts away from Tcl semantics (for example, [incr $i] increments variable i, and [list $a $b $c] is a "live" list), but this totally makes sense since it's closely tied to Java. However I remember that an early version allowed both values and references, the latter using an ampersand instead of a dollar. Can you confirm this? My goal with Cloverfield is to keep the existing Tcl semantics (copy-on-write, pass-by-value) while adding references. This implies that the string rep of a structure holding references may change over time. This would be grossly inefficient if the whole string rep had to be recreated over and over again, especially for large structures, so we need an efficient string representation that allows for selective invalidation of substrings. Unfortunately Java doesn't seem to be suitable for this task because of its everything-is-an-object, all-reference approach: locality of reference is poor and would kill the performances in this critical part of the implementation. That is why I plan to build the reference implementation from scratch using either C or (sanitized) C++, the primary runtime target being the LLVM infrastructure. However Java is a very likely secondary target because of its ubiquity. The string representation I have in mind is based on immutable, constant strings, so this is a good match with the Java model. I'll take a look at the Hecl code base to see whether it can make a suitable experimental testbed. Having a prototype implementation of Cloverfield references would certainly help people grasp the concept and eliminate the design errors. Cheers, Fred. |
From: Frédéric B. <fb...@en...> - 2008-05-20 14:53:40
|
Andreas Leitgeb wrote: >> set first 1st >> foreach next {2nd 3rd 4th} { >> lappend l3 ($&first $next) >> } >> puts $l3; # ({ref <id>}1st 2nd) ({ref <id>}{} 3rd) ({ref <id>}{} 4th) > > Ah, ok, so now this is with external ref, then. > > I'm however somewhat struck by odd, as to the auto-internalizing of refs, > and what I'd deduce as logical consequence: > > set l3_copy $l3 ;# still with extref to "first" > # mod on $&l3_copy{0 0} reflects in $first and in $l3, right? Yes. Both l3 and l3_copy hold a reference to the same value at {0 0}, so modifying this value reflects everywhere. > # Out of curiosity, is there a way to see that "first"'s > # value is currently ref-shared? I think references will need a whole set of dedicated commands for introspection and the like. This is conform to the Tcl philosophy, where the core language is very small and what would be keywords in other languages is provided by commands. For example, the ensemble command "reference" or "ref" could provide a "set" subcommand to change a referenced value, versus the toplevel "set" command that changes the binding of a given variable (or part thereof) to a new value. Likewise, "reference shared" could return whether a referenced value is shared across contexts (and hence is externalized). Another command could be used to "prune" external refs so that it can be safely used without side effects (for example when passing strings around interps). As a sidenote, I think that the whole concept of references and values, and their associated semantics, could be expressed more formally using graph theory. This would clarifies things a lot. > unset first; # assuming no previous further spreading of $&first > > set l3_2ndcopy $l3 > # mod on $&l3_2ndcopy{0 0} does *not* reflect in $l3, right? Right. Since the reference is no longer external (as only one variable refers to the value), then it becomes internal, so copy-on-write semantics fully apply. >> <id> being some identifier. But there's a subtlety in the sense that the >> reference is no longer intern but extern, ie it refers to an outer >> variable. So copying the value preserves the external reference. > > And what if I did it the other way round: > set l4 (({ref a}1st 2nd) ({ref a}{} 3rd)) > set ext $&l4{0 0} > # does this make the reference "a" external, such > # that a copy of $l4 made afterward would then still > # share the ref? Yes (as long as ext refers to it). So it's fully symmetric. >> A structure with references can be used in an immutable way. > > A structure with external refs can be used in an immutable way, > but still behaves like partially mutable, as you explained in > your answer to my first question. But that's just a sidenote, > not the point of this question. True. But a structure with external refs is partially mutable only if a mutable op is performed on a subpart of it. If only immutable ops are performed, then the whole structure remains untouched. >> (assuming that Tcl's lreplace is kept as such, and converted to accept >> references as arguments). > If lreplace was thusly converted, why not lrange? That's a good question. Slicing (like lrange does) is an immutable operation, as it doesn't alter the original structure. However Cloverfield provides a syntax for slicing: set l (0 1 2 3 4) set l2 $&l{1..3} ; # l2 points to l's [1-3] subrange lreplace $&l2 1 1 foo puts $l2 ; # => 1 foo 3 puts $l ; => 0 1 foo 3 4 When used on values, this syntax is equivalent to lrange. So for consistency, lrange, when given a reference as argument, would return a reference to a sublist of its argument (but wouldn't modify the argument, so it's still an immutable operation). >> # Cloverfield >> set i 1 >> set ref $&i # ref is 1 >> setref $&ref 2 # i is 3 > > 2? 3? I guess, that's a typo :-) You're right :*/ That should be "setref $&ref 3". > Anyway, this would mean, that: >> proc strip {l} { >> setref $&l [lrange $l 1 end-1] >> } > would have answered my third question. Indeed. >> Unless we decide that set takes references, but this is a >> significant departure from Tcl > > It would obviously not be the first one in cloverfields history :-) ;-) > If "append $&i ..." were an acceptable departure, then > "set $&l ..." wouldn't be less so in my eyes. To be fair, I think that the current "set" syntax is part of the overall look & feel of Tcl. E.g. set i 1 set j $i I wouldn't want Cloverfield to look like: set $@i 1 set $&j $i Much less readable IMHO. Besides, if "set" accepts references instead of variable names, it means that we would need another command that does the reverse in order to support both semantics (i.e. alter value vs. point to value). This is the same dichotomy in C (*p = i vs. p = &i). So we would recreate "set" with a different name. >> And now for fun, enter inline expansion {*}: >> set v {ref root}(a {*}{ref root}{}) > > That beats me. Is this still just discussion of "what would be" > or is there already a demo-implementation that accepts this? This is guaranteed 100% vaporware ;-) Actually, my goal regarding the implementation is to allow such a construct. > I probably wouldn't allow that particular construction, since > I believe that this behaviour is really impossible to implement > generally. > e.g.: > proc b {} {puts "b called"} > set v {ref root}(a {*}{ref root}{} [b]) > The second element (thus the whole structure) is immediately > inlined, before the third element has even be parsed. That > can't go. Why? The above code only builds a list with 3 nodes: first is value "a", second is an inline-expanded node that points to its container, and third is [b] (here, an empty string). The only difference between refs with and without {*} is the inline expansion flag. Of course foreach'ing such a list would cause an endless loop, but the syntax is valid and serializable (thanks to the serialize-first-only rule). So creating "infinite lists" (AKA circular lists) is trivial. >>> Fifth question: >>> Flattening: >>> % set f1 [list {*}$v];# stringrep=?? >> Each of v's elements are passed as value to [list], so: >> (1 2 3) (1 2 3) (1 2 3) > > % set f1 [list {*}$&v];# stringrep=?? > % set f2 [list $&v{0} $&v{1} $&v{2}]; # ... > refs retained or lost? Both are equivalent, ie they build a list of references to the same value, so the references are retained (and output in the string rep). > (and a new) Sixth Question: > Does this {...} "lindex"-syntax have a pendant for dicts? Yes, using parentheses (like with Tcl's arrays): set v (a (b 1 c 2) d (3 4 5)) puts $v(a) ; # => (b 1 c 2) puts $v(a b) ; # => 1 puts $v(d){1} ; # => 4 puts $v{1}(c) ; # => 2 > How does treating an even sized list as a dict work with ref's > both for keys and values? > e.g. puts [dict get ({ref a}2 {ref a}{} 3 {ref a}{})] > I really wouldn't believe refs to work as keys. Dicts use values as keys. The ref is just metadata. So value-wise, the above code is equivalent to: puts [dict get (2 2 3 2)] (all references point to the same value 2). |
From: Lars H. <Lar...@re...> - 2008-05-20 14:28:17
|
At 2008-05-15 13.55, Frédéric Bonnet wrote: > Hi Neil, > > Neil Madden wrote: [snip] >> My point is that real "strong" references are entirely >> incompatible with EIAS. > > Hmmm. I'm not sure what you mean by that. More precisely, > what does EIAS means exactly? To me, it's the fact that > everything can be expressed as strings, and that internal > structures can be recreated from their string representation. > The Cloverfield reference serialization format meets > these criteria IMHO. This ignorance, if I may say so, seems central to the conception of Cloverfield; I wonder if it is genuine of merely tactical (pretending not to know better in order to get away with misrepresenting the issues). Assuming the former however, I can say the key point is that the arguments of commands are in Cloverfield not known to be strings, and since command arguments are certainly in the traditional "everything" of EIAS, it follows that Cloverfield is not EIAS. To elaborate on the "Cloverfield command arguments need not be strings" issue: The set of strings over a particular alphabet (which in current Tcl is the Unicode BMP) is a mathematically well-defined concept, and presumably the intent is not that Cloverfield should start interpreting every string containing unfortunate character combinations as commands to do something -- this would be the way of macro processors such as m4, and that way lies a quoting swamp -- but continue to handle strings as Tcl do. Nonetheless, Cloverfield is expected to handle values (such as references) that when operated upon produce results (e.g. side-effects) that are different from everything that could result from applying the same operation on a string. Hence references are not strings. Now, what you in "the tridekalogue" invest a considerable effort in is ensuring that any command call can be written straight off as a string. This is certainly an important part of The Tcl Way, but that alone does not give you EIAS. The command may be a string before parsing and substitution, but after that it would rather have to be a list of "things" from some proper superset of that of all strings. What you _might_ do is work some parsing rules in reverse and recover a string representation for the _list_ of those things; this would bring you "every list is a string", which is reminiscent of, but a weaker property than, EIAS. To illustrate, {ref a}((1 {ref a}{}) (2 {ref a}{})) is a perfectly good string, and it might also be the string representation for a Cloverfield list. Since this is one word, the list will have one element. That element is not a string, but it may be entered as a Cloverfield command argument, so that in order to generate the above, one can evaluate the command list {ref a}((1 {ref a}{}) (2 {ref a}{})) (If it seems to you that I've got the nesting depth off by one, then check how it works for lists without references. Confusing the word one writes in a command for the actual string representation of the value of that word is a common mistake.) It's futhermore possible that you don't even get Every List Is A String, depending on details of how references are reflected in the serialised form. The issue here is that one aspect of "Is A String" is that the string representation must tell you _everything_ about how that value behaves when operated upon, and for that it matters whether an element is local (only referenced to from within the list) or not (something outside the list holds a reference to the element). Concretely, if one was to do set u ({ref a}(1 2) {ref a}{}) set v ({ref a}(1 2) {ref a}{}) then presumably the values of u and v would have equal string representation, but be distinct in memory. If one then does set w $&v{0} will the string representations of u and v still be equal? If they are, then the results of the commands linsert $u 1 $&w linsert $v 1 $&w will have to be equal as well, but I doubt this is the intended behaviour. Rather, you would want the result of the latter to be a list where all three elements carry the same reference label, whereas in the result of the former, the middle element should carry a reference label different from that of the other two elements. For that to happen, while preserving Every List Is A String, the very act of obtaining a reference to a list element (substitution of $&v{0}) must change the string representation of that list as a whole. From there, one might go on to question the supposed locality of Cloverfield references -- although there might be syntax for local references that can be used to create data structures, doing anything with that data structure will force the references to be global -- but I'll stop here. Lars Hellström PS: To be frank: I believe Cloverfield is/will be an utter waste of good effort -- an exercise that seeks to afflict the very heart of Tcl with severe featuritis !-( -- and the sooner such projects die the better. ;-] |
From: Frédéric B. <fb...@en...> - 2008-05-19 15:45:12
|
Hi, Andreas Leitgeb wrote: > A few outsider-comments: > >> Tridekalogue on the Wiki here: > While endeka and dodeka are greek words, trideka isn't. Greek 13 is dekatria. > And were you to add more rules, it would become dekatessera, dekapente,... Aside from my wiki response, I'd point out the fact that, while 13 is dekatria in (modern) Greek, the numerical *prefix* for 13 is triskaideka (or its shorter form trideka). Hence, the tridekagon or triskaidekagon is a 13-sided polygon: http://en.wikipedia.org/wiki/Numerical_prefix Note #1: "Sometimes the prefixes are cited as though they were the original words themselves. The prefixes derived from Greek are not only in the wrong alphabet, but also differ from the actual corresponding Greek words. See the individual word etymologies for the actual number words." > > Now about the technical questions: Very interesting questions indeed! I hope my answers will clarify things a bit. > > First question: > > If I wanted to incrementally build up a list, > where each element is a sublist, and each first > sub-element were references to the same value, > how would I do it? > e.g.: > set l (({ref a}1st 2nd)) > lappend l ($&l{0 0} 3rd) ;# right? > puts $l;# is it now: (({ref a}1st 2nd) ({ref a}{} 3rd)) ? Yep. > And what, if I wanted to start from an empty list, and add > these sublists in a loop? How would I get it started? How > do I add a new element to be refer-able ? set l2 (({ref a}1st 2nd)) foreach next {3rd 4th} { lappend l2 ($&l2{0 0} $next) } puts $l2; # ({ref a}1st 2nd) ({ref a}{} 3rd) ({ref a}{} 4th) Another way to do it would be: set first 1st foreach next {2nd 3rd 4th} { lappend l3 ($&first $next) } puts $l3; # ({ref <id>}1st 2nd) ({ref <id>}{} 3rd) ({ref <id>}{} 4th) <id> being some identifier. But there's a subtlety in the sense that the reference is no longer intern but extern, ie it refers to an outer variable. So copying the value preserves the external reference. Consider this: set l1_copy $l1 append $&l1{0 0} foo append $&l1{0 1} bar puts $l1{0} ; # => 1stfoo 2ndbar puts $l1_copy{0} ; # => 1st 2nd set l2_copy $l2 append $&l2{0 0} foo append $&l2{0 1} bar puts $l2{0} ; # => {ref <id>}1stfoo 2ndbar puts $l2_copy{0} ; # => {ref <id>}1stfoo 2nd That's because l2 contains an external reference (the value is still referenced by variable "first"), and so does l2_copy even if it's a copy, as the global ID is copied as is. However when this reference goes out of scope, e.g. if it's a local variable and the proc returns, the external reference gets converted to internal, and work as expected: proc test {} { set first 1st foreach next {2nd 3rd 4th} { lappend l ($&first $next) } return $l } set l [test] ; # ({ref <id>}1st 2nd) ({ref <id>}{} 3rd) ({ref <id>}{} 4th) set l_copy $l append $&l1{0 0} foo append $&l1{0 1} bar puts $l1{0} ; # => 1stfoo 2ndbar puts $l1_copy{0} ; # => 1st > Second question: > > how would I obtain a thoroughly ref-free snapshot of $l ? > (One, whose stringrep would then be: {{1st 2nd} {1st 3rd}}) As a general rule, you can't, because structures are potentially self-referential. However you can flatten it by hand: foreach elt $l { lappend l2 ($l{0} $l{1}) } The real question is, why and in which case would you do such a thing? The best way to avoid building self-referential structures is to use values instead of references (as with Tcl). > Third question: > > If I wanted to write a procedure that works correct with such > mutables, how would I do it? Let's say, I want a procedure that > takes such a list, and is supposed to remove first and last > element: First, don't confuse mutability and references, as they are orthogonal (albeit related) concepts. Mutability occur when commands properly documented as such treat their arguments differently depending on whether they are passed by value or reference. So mutability is a choice made on the caller side. A structure with references can be used in an immutable way. > proc strip {l} { > #set l [lrange $l 1 end-1] ;# probably won't work, because > # it doesn't mutate, but replace the value of local "l". > # Maybe cloverfield offers a mutant variant of lrange, but > # that would not be the point of the question. How do I > # perform a mutation that replaces the value entirely? > } Indeed, lrange isn't a mutable op. You'll want lreplace for that, one call for each head and tail element: proc strip {l} { lreplace $&l 0 0 lreplace $&l end end } (assuming that Tcl's lreplace is kept as such, and converted to accept references as arguments). Moreover, the set in your code would have no effect other than pointing the local variable l to a new value, leaving the original intact. This uses the same semantics as pointers in C. This means that set is immutable, so to get the intended behavior we need a version of set that takes references, let's call it setref: /* C code int i=1, j=2, *p; p = i; /* *p is 1 */ *p = 3; /* i is 3 */ p = j; /* *p is 2 */ *p = 4; /* j is 4 */ # Cloverfield set i 1 set j 2 set ref $&i # ref is 1 setref $&ref 2 # i is 3 set ref $&j # ref is 2 setref $&ref 4 # j is 4 The reason is that set takes variable names, so passing in a reference actually passes the referenced value as a variable name. Unless we decide that set takes references, but this is a significant departure from Tcl that will require the $ prefix on all variable usage (except for declarative commands). This perlifies/pythonifies the syntax a bit (and REQUIRES a weak reference syntax), however it's a big improvement from the compiler's POV. This needs to be discussed. > Fourth question: > > % set v ({ref a}(1 2 3) {ref a}{} {ref a}{}) > % llength $v ;# -> 3 > % set v1 [lrange $v 1 end];# stringrep now: {ref a}(1 2 3) {ref a}{} ? Yes. > In words: after stripping off the one element whose stringrep > previously contained the value, will the new stringrep write > out the value to the first then-still-available occurrance of > the ref? > > % set v2 [lrange $v1 1 end];# "{ref a}(1 2 3)" or just "1 2 3" now? 1 2 3. The {ref} syntax is just a serialization method for self-referential structures. A value that is referered only once doesn't need a ref ID. Building circular structures: set v {ref root}(a {ref root}{}) llength $v ; # => 2 lindex $v 0; # => a lindex $v 1; # {ref root}(a {ref root}{}), same as $v And now for fun, enter inline expansion {*}: set v {ref root}(a {*}{ref root}{}) llength $v ; # => infinity! lindex $v $i ; # for every value of i, gives a lreplace $v 0 0; # => {ref root}(a {*}{ref root}{}), same as $v > Fifth question: > > Flattening: > % set f1 [list {*}$v];# stringrep=?? Each of v's elements are passed as value to [list], so: (1 2 3) (1 2 3) (1 2 3) This answers your second question above better than I did. Nice flattening technique ;-) > % set f2 [concat $v $v];# stringrep=?? $v is passed as value in both cases. Moreover each has its own reference context. So this gives: {ref <id1>}(1 2 3) {ref <id1>}{} {ref <id1>}{} {ref <id2>}(1 2 3) {ref <id2>}{} {ref <id2>}{} Where id1 and id2 are distinct IDs. Also note that the ref ids given by the user are purely advisory. They are kept as is whenever possible, but the actual value used for serialization may differ, like in the above code: since reference ID "a" is used in two distinct contexts, it can't be used in our case so it gets replaced by a distinct reference in each context. However the following would work as expected: % concat $&v $&v {ref a}(1 2 3) {ref a}{} {ref a}{} {ref a}{} {ref a}{} {ref a}{} That's because v is passed as reference so the inner references are resolved in the same context. > End of questions for now. > > These questions are primarily for understanding, but also meant > to touch sore spots, in the hope that they turn out not to be > sore at all :-) > > I expect rather short and simple answers: either a yes/no, > or some stringreps or small snippets of sample code. I tried > to make the questions principially easily answerable. This is fine for me. It also helps me express my mind concerning references. I try to design them in the most intuitive, unsurprising and useful way, so any question that points out corny cases is welcome. |
From: Andy G. <unu...@ai...> - 2008-05-17 22:20:40
|
I take it from your discussion that grouping metacharacters control word boundaries whereas quoting metacharacters determine which other metacharacters are stripped of their special meaning. I didn't see these as separate because in Tcl they're not. Both braces and double quotes control both quoting and grouping. The two overlap mightily. Consider that a backslash (a quoting metacharacter) can be used to disable the "anti-grouping" behavior of whitespace, so by extension backslash can be used to control grouping. The only place in Tcl where you might say that the two are separate is for backslashes and braces embedded in a braced word. Within a braced word, braces preceded by an even number (usually zero) of backslashes count toward identifying which close brace terminates the word. That can be said to be grouping, since the close brace must be the last character of the word (modulo {*}, of course). Or you can just do like the Dodekalogue and not even mention grouping, since that concept is not essential for understanding brace matching. Since the embedded backslashes and braces have special meaning, they can be considered metacharacters, yet they are retained in the output word because the word is brace-quoted. This retention is not surprising at all, and (if you consider these to be grouping but not quoting metacharacters) it shows that Tcl "always" retains grouping metacharacters. (Of course, this is the only place where Tcl can be said to have dedicated grouping metacharacters, so "always" doesn't mean much.) It's not obvious that braces and backslashes embedded in braced words are metacharacters, and when teaching someone the language, it probably is not helpful to describe them as such. So it may be a surprise to learn that they are metacharacters, especially considering the fact that all other metacharacters are discarded. It is certainly a surprise to learn that they control grouping, a feature that the Dodekalogue doesn't even name. I was surprised just now to discover this, and I'm pretty sure this discovery won't cause me to start thinking in terms of grouping when I write Tcl code. (You can argue that I've been thinking this way all along, as have all Tcl programmers, but my point is that the word "grouping" wasn't part of my Tcl vocabulary.) An alternative way of thinking about grouping metacharacters (in your proposal, not embedded in braces) is as quoting metacharacters that disable the "anti-grouping" behavior of whitespace, like double quotes or the backslash I mentioned earlier. I don't know if this is overall an easier or harder way of thinking about it, but I think it has some merit. In this email I discuss variable access as well as grouping, since it seems to me that your grouping proposal is primarily targeted at making variable access work the way we want it to. Making code work the same inside braced words as at the top level is a secondary benefit of your proposal, not the original impetus. There are other ways to achieve that if it truly is a goal rather than a pleasant side effect of something else. On Wed, 07 May 2008 15:04:41 +0200, Frédéric Bonnet wrote > So it makes sense that characters that are implied in substitution or > quoting get replaced or suppressed, whereas other characters are kept. I agree that it makes sense in Tcl to say that non-substitution, non-quoting characters are retained, but only because substitution and quoting are the only such special concepts given in the Dodekalogue. When extending the language to explicitly have separate grouping rules (by explicit, I mean that you name them as such, rather than allowing the reader to derive the concept), one would expect the metacharacters associated with the new rules to be removed, like all other (obvious) metacharacters. The way I see it, we're basically discovering a new feature in Tcl and extending it to take place even outside of braced words. This is as surprising as the empty-string array. You can definitely make a strong case for it, but I don't think it's likely to be well received. I know I'm having a hard time with it, only gradually catching on to the notion that a precedent may exist. (Disclaimer: I have not had a chance to read any of the recent discussion. I am only now catching up. Maybe this was all accepted with open arms; I don't know yet.) > I think you're confusing quoting with grouping and substitution. If you meant to say simply "confusing quoting with grouping", then yes I am. Tcl confuses the two as well, but it gets away with it because in practice they're the same. (In Tcl.) > Quoting less than a word makes no sense, as it always involves the whole > word. This changes things. % set a (1 _ 3 _ 5 _); set b 5 % set a{$b} [set b 1]; puts $a 1 _ 3 _ 5 1 % set a (1 x 3 y 5 z); set b 5 % set a($b) [set b 1]; puts $a 1 _ 3 _ 5 1 Now both work the same, since $b is substituted by the parser in both cases. This is because the braces in a{$b} are not quoting metacharacters, only grouping metacharacters. (And in this example, it doesn't even matter that they are grouping metacharacters, just that they are not quoting metacharacters.) This brings some serious problems with it, though. One is a reduction in performance, since the parse tree describing the variable name a{$b} will merely say that it is the concatenation of "a{", the value of b, and "}". Therefore the variable lookup code must secondarily parse it every time, and there is no place to cache the extended parse tree. By extended parse tree, I mean an object that says "get the value of b and use it as the vectored index into the value of a". The other is a security problem. Since neither the braces nor the parentheses are quoting rules, the substituted results are concatenated into the output word without any special treatment. Information about how the word was generated is lost in the process. Here's code that exploits this: % set b {[exit]} % set a{$b} 0 The second [set]'s first argument will be "a{[exit]}", which [set] passes to the variable lookup mechanism. The variable lookup mechanism performs more substitution on it in order to figure out the right index. In the process, the program is terminated. In general, the security hole is double evaluation. Both of these problems (performance and security) are also experienced when not bracing the argument to [expr]. Here's another case, using parentheses instead of braces. % set b "first)(second" % set a($b) 0 [set]'s first argument will be "a(first)(second)", which will result in the variable lookup mechanism seeing two keyed indexes. The programmer may have intended to have one key with embedded backwards parentheses. Heck, the programmer might not have even supplied the index name; maybe it's somebody's nickname read in from IRC! Safe code: % set a{\$b} 0; set a(\$b) 0 % set {a{$b}} 0; set {a($b)} 0 If all variable names containing embedded substitutions must be quoted by the programmer in this manner, what is the benefit of your proposal? -- Andy Goth | <unu...@ai...> | http://andy.junkdrome.org/ |
From: Andy G. <unu...@ai...> - 2008-05-17 16:57:41
|
On Tue, 29 Apr 2008 16:04:52 +0200, Frédéric Bonnet wrote > In the end, everything ends up being a list of characters. EIAS, yay! :^) > Richard Stallman once said that Tcl sucked because it lacked linked > lists. If I remember correctly, RMS went on to highlight Guile as a language that does have linked lists. This should tell you what he means when he says linked list. He must be talking about the ability to have an object refer to another object, and have dereferencing take place either easily or automatically. He's probably doesn't care whether or not the underlying implementation uses node structs containing pointers to node structs; he's concerned with whether or not the power to refer is exposed to the programmer. Tcl, like all languages, can be used to make objects refer to other objects, but unlike the languages favored by RMS, in Tcl it falls to the programmer to invent the mechanism for reference. > But they are themselves far from the ideal data structure from > the implementation's POV, because algorithmic complexity takes a hit > everywhere except insertion or removal. Again, RMS probably wasn't talking about implementation, the same way you weren't in the beginning of your email (the one to which I am replying). By the way, I snipped most out most of that high-level discussion because on those points I agree completely. I delete much of the later discussion as well, again because of my complete agreement. > That's why most sensible implementations use unrolling techniques. My implementation proposal is an unrolling technique, as are ropes containing multiple-character string nodes. I simply beefed up ropes by adding efficient* multibyte characters. But I didn't stop at characters; I made it so that the rope can be used to store lists of objects of any kind, without requiring all objects in the list be the same size. I didn't imbue the hierarchical structure of the rope with meaning; I just let it be a balancing mechanism. I did all this with skiplists, too. (* efficient in terms of memory storage and access time, hopefully at a minimum cost to locality of reference) > Fairly large strings can also be memory-mapped, so using such a model > effectively provides a means of handling very large datasets. This of course only works with an unrolled structure; otherwise you'd have to map only one character per node! :^) Instead you'll wind up with a iovec-type list, such as is used for scatter/gather I/O. (Or is that scatter/gather O/I or gather/scatter I/O? Heh.) > Functions or operators such as repetition could also be applied to given > nodes. Concatenation is the default function, correct? You're suggesting having other kinds of functions, which sounds reasonable. The difference here is arity; concatenation takes any number of rope inputs (two or more makes the most sense), but repetition takes one rope input plus a parameter giving the count. When it comes time for implementation, fields could be added to the rope node structure to give the function. Maybe concatenation could be implied and the additional function would be done on the result of the concatenation. I dunno. With skiplists, there is no obvious place to store these extra fields. > Now back to data structure unification. My idea was to somewhat > coerce the skiplist promotion algorithm into selecting the "right" > nodes, so that the final rope tree structure matches the logical > structure. With my subdivision proposal of long ago, this happens automatically without any real effort on our part. The entire script is a single rope. (If it is too long to fit in a single buffer, the rope is the concatenation of the buffers; otherwise, the rope is only one node.) Parsing it causes it to be subdivided into commands, which are further subdivided into words and syllables. Syllables can get split even more if they're command substitutions or variables names. > That way we could represent strings and lists with the > same set of data structures: a list would simply be some kind of > annotated string. The problem I have with this is that skiplists have virtually no room to store data for internal nodes, only leaf nodes. This is fine for ropes, because all the internal nodes are the same, and it's the traversal that is important. But when there are multiple possible functions (not just concatenation) that describe the semantic relationship between leaf nodes, skiplists fail. Or at least I don't see how they could be used. > But I feel that there's a great deal of similarities between a rope, a list > and a parse tree. I know that I am satisfied with the unification of ropes and lists that I proposed. I'm not sure it's necessary to reuse the same data structure for parse trees. I do think it would be beneficial if we could pull it off, assuming it doesn't add complexity to the core list structure that doesn't also benefit strings or lists. I think it could be done if the various substitution methods, etc., are recast as rope functions (combinators?), but I don't know where they would be stored when using a skiplist. Again, skiplists are nice because of the improved locality of reference, so I don't really want to go back to a classical linked tree. But a linked tree provides storage for internal node data. Hmm! Is it possible to interleave function nodes and data nodes, all stored at the leaf level? A function node would say something like "repeat the concatenation of the next two nodes five times". That sounds too complex to me, but you might have a better idea. > So the base level is a list of strings, whereas above levels serve two > purposes: indexing (as with a regular skiplist), and structure (thanks to > annotations). You keep talking about annotation. Please give a concrete example, preferably with a possible C struct definition to illustrate a possible implementation. -- Andy Goth | <unu...@ai...> | http://andy.junkdrome.org/ |
From: Andy G. <unu...@ai...> - 2008-05-17 16:03:45
|
On Thu, 24 Apr 2008 18:12:48 +0200, Frédéric Bonnet wrote > Tcl is a language for craftsmen, and I expect Cloverfield to > follow the same path (away from the toy, ad-hoc languages). Thank you for clarifying. This is the answer I was hoping for. :^) I just wanted to hear it from you. This new language is an outreach to qualified programmers who had trouble grokking or living with a couple surprising consequences of Tcl's dodekalogue. This new language is not an outreach to people who only know how to paste together examples found in cookbooks. The latter audience is too frequently targeted, I think... And that has put pressure on the "industry" to reduce its standards. :^( > I remember the discussion around {*}. Of course, now that it's part > of Tcl8.5, everybody starts to use it and thinks it's a great > addition, and no one would think in their right mind to go back to > quoting hacks with eval and list. I hated having to use those hacks. And since I originally learned (or thought I learned) Tcl from analyzing a poorly written codebase, I didn't know about how to use [list] to make [eval] safe. I also didn't comprehend the reasons why things sometimes broke due to space characters, etc. So I instead tried to make my code safe by having it explicitly disallow spaces and such. What a horrible approach that was... It's basically the same deal as antivirus software. Instead of fixing the exploitable holes, let's disallow programs known to take advantage of them. (I later relearned by carefully studying the endekalogue and reading a few books.) > But back in the days, a significant part of the community felt that it would > deface the language. This argument is still made. On 2007-07-18 19:46:32, about {*}, Tom Poindexter wrote "Wart is both a visual clue and a commentary of the Perlification of the Tcl syntax caused by the operator." > We miss the "it's broken so let's fix it" philosophy. More like, "Tcl isn't broken; it's your understanding of Tcl that is broken." > We need to be more pragmatic (but of course not too pragmatic else we'll end > up like PHP). Please explain what it means for PHP to be pragmatic. -- Andy Goth | <unu...@ai...> | http://andy.junkdrome.org/ |
From: Andy G. <unu...@ai...> - 2008-05-17 15:45:43
|
I am still a very busy boy, basically working multiple jobs in two different states. There's ordinary paywork in Texas, there's a house rental racket I do in Oklahoma, and now I also help work on a flight simulator in OK. So that's why I'm so ridiculously behind on answering emails! On Thu, 24 Apr 2008 16:57:33 +0200, Frédéric Bonnet wrote > Compared to the previous version, I've rewritten it a bit so as to > use a single getToken, versus one per rule previously. This > simplifies things a lot. I haven't read it in great detail, but it looks to me like your new code is somewhere halfway between my code and your old code, in terms of balancing between the extremes of one-huge-state-machine and one-proc-per-state. Is that correct? You seem to have concluded that having a smaller number of slightly more complex state machines is a simplification, that trivial state machines are more trouble than they're worth. > Moreover I've plugged in a C implementations using CriTcl when available. Interesting approach. Do you think this makes the code more readable or faster? Or are you primarily interested in limited experimentation with C and/or learning how to use Critcl to interface C with Tcl, or perhaps about language mixing in general? It wasn't clear to me whether the cprocs are the only implementation present in the code or if all the C stuff was also implemented in Tcl. > Regarding CriTcl, I'm using the TCC-based Odyce extension bundled > with the eTcl distrib. Have the sources for that been made available, or can it only be had in binary form? I totally agree that it's a good thing that it be self-contained, which is why I'm interested to know. -- Andy Goth | <unu...@ai...> | http://andy.junkdrome.org/ |
From: David W. <dav...@gm...> - 2008-05-16 15:26:48
|
> Thank you! I'll try to do my best to come up with a (probably Tcl-based) > mock implementation for experimentation purpose. For those wishing to play around with something, I think that Hecl is in some ways a bit simpler and easier to hack than Tcl, and might be a fun sandbox for people. Hecl currently has references, as we wanted to be able to handle references to Java objects without using a hash table to look them up. -- David N. Welton http://www.welton.it/davidw/ http://www.dedasys.com/ |
From: Andreas L. <av...@lo...> - 2008-05-16 08:33:03
|
A few outsider-comments: > Tridekalogue on the Wiki here: While endeka and dodeka are greek words, trideka isn't. Greek 13 is dekatria. And were you to add more rules, it would become dekatessera, dekapente,... Now about the technical questions: First question: If I wanted to incrementally build up a list, where each element is a sublist, and each first sub-element were references to the same value, how would I do it? e.g.: set l (({ref a}1st 2nd)) lappend l ($&l{0 0} 3rd) ;# right? puts $l;# is it now: (({ref a}1st 2nd) ({ref a}{} 3rd)) ? And what, if I wanted to start from an empty list, and add these sublists in a loop? How would I get it started? How do I add a new element to be refer-able ? Second question: how would I obtain a thoroughly ref-free snapshot of $l ? (One, whose stringrep would then be: {{1st 2nd} {1st 3rd}}) Third question: If I wanted to write a procedure that works correct with such mutables, how would I do it? Let's say, I want a procedure that takes such a list, and is supposed to remove first and last element: proc strip {l} { #set l [lrange $l 1 end-1] ;# probably won't work, because # it doesn't mutate, but replace the value of local "l". # Maybe cloverfield offers a mutant variant of lrange, but # that would not be the point of the question. How do I # perform a mutation that replaces the value entirely? } Fourth question: % set v ({ref a}(1 2 3) {ref a}{} {ref a}{}) % llength $v ;# -> 3 % set v1 [lrange $v 1 end];# stringrep now: {ref a}(1 2 3) {ref a}{} ? In words: after stripping off the one element whose stringrep previously contained the value, will the new stringrep write out the value to the first then-still-available occurrance of the ref? % set v2 [lrange $v1 1 end];# "{ref a}(1 2 3)" or just "1 2 3" now? Fifth question: Flattening: % set f1 [list {*}$v];# stringrep=?? % set f2 [concat $v $v];# stringrep=?? End of questions for now. These questions are primarily for understanding, but also meant to touch sore spots, in the hope that they turn out not to be sore at all :-) I expect rather short and simple answers: either a yes/no, or some stringreps or small snippets of sample code. I tried to make the questions principially easily answerable. PS: I'm not on the cloverfield list, so, if an answer is not also mailed on tcl-core, please cc me directly. |
From: Neil M. <ne...@Cs...> - 2008-05-16 06:56:06
|
On 15 May 2008, at 22:26, Frédéric Bonnet wrote: > Neil Madden wrote: >> Hi, >> >> I was going to reply at length again, but I'd just be repeating the >> same points. My feeling is that a more fruitful discussion could be >> had with a concrete reference implementation to play with. I realise >> that may not be your immediate focus, so until then, I wish you the >> best with the effort. > > Thank you! I'll try to do my best to come up with a (probably Tcl- > based) > mock implementation for experimentation purpose. > > BTW, are you happy with the proposed syntax, besides agreeing or > not on > the whole concept of references? I.e.: > > $var : regular value substitution > $&var : strong reference (early bound) > $@var : weak reference (late bound on access), semantically > identical > to var name. > > After all, that was the point of the original message ;-) I have no particular problem with the proposed syntax. Whatever you choose, somebody will dislike it, but I don't think it's that important (it is only syntax). I managed to get used to {expand} (while it existed), despite finding it hideous at first -- because it was just so useful. People even manage to get used to Perl, legend has it ;^) Neil This message has been checked for viruses but the contents of an attachment may still contain software viruses, which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation. |
From: Donal K. F. <don...@ma...> - 2008-05-16 03:18:34
|
Frédéric Bonnet wrote: > BTW, are you happy with the proposed syntax, besides agreeing or not on > the whole concept of references? I.e.: > > $var : regular value substitution > $&var : strong reference (early bound) > $@var : weak reference (late bound on access), semantically identical > to var name. > > After all, that was the point of the original message ;-) Since you ask, personally, no. :-) Neither type of reference has anything to do with reading the variable, and so should not use a $ character. After all, traditionally in Tcl variable names are not decorated, and $ is just syntactic sugar. But for your experimentation, you can ignore my prejudices; if I'm not contributing effort, I'm in the peanut gallery. This is quite apart from the whole question of references at all, which are a can of worms. I'm assuming you won't *really* believe me on that until you try it for yourself. :-) Donal. |
From: Frédéric B. <fre...@fr...> - 2008-05-15 21:26:33
|
Neil Madden wrote: > Hi, > > I was going to reply at length again, but I'd just be repeating the > same points. My feeling is that a more fruitful discussion could be > had with a concrete reference implementation to play with. I realise > that may not be your immediate focus, so until then, I wish you the > best with the effort. Thank you! I'll try to do my best to come up with a (probably Tcl-based) mock implementation for experimentation purpose. BTW, are you happy with the proposed syntax, besides agreeing or not on the whole concept of references? I.e.: $var : regular value substitution $&var : strong reference (early bound) $@var : weak reference (late bound on access), semantically identical to var name. After all, that was the point of the original message ;-) |
From: Neil M. <ne...@Cs...> - 2008-05-15 20:49:20
|
Hi, I was going to reply at length again, but I'd just be repeating the same points. My feeling is that a more fruitful discussion could be had with a concrete reference implementation to play with. I realise that may not be your immediate focus, so until then, I wish you the best with the effort. Neil This message has been checked for viruses but the contents of an attachment may still contain software viruses, which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation. |
From: Frédéric B. <fb...@en...> - 2008-05-15 11:50:40
|
Hi Neil, Neil Madden wrote: > I think a reference implementation (pun intended!), or some semi-formal > description, would clarify this discussion immensely. Until then, some > not-so-brief comments are below. I've tried to trim down to just the key > points, so I've trimmed much of the rest of the discussion. OK, so I'm going to give a little more information about Cloverfield references here, but first, I suggest you read the Cloverfield Tridekalogue on the Wiki here: http://wiki.tcl.tk/20643 Especially the parts on references. Also read the section on references on the main Cloverfield page: http://wiki.tcl.tk/20638 (near the end of the page). The reason why I haven't given much details before is that the original discussion revolved on syntactic rather than semantic issues. But now I understand that a little background info on what is meant by "reference" is certainly useful to get the whole concept and its potential usage. So the present discussion is no longer about the proposed syntax enhancement, but about the semantics of references in Cloverfield. Both are orthogonal issues. I've changed the subject line accordingly. First, references as they are meant in Cloverfield are distinct from C pointers or Java references. The {ref <id>} syntax is closer in principle to XML ids, in the sense that it can be used to tag data within a structure, so that potentially circular relations can be expressed. And the distinction between $var and $&var has more to do with immutable vs mutable than with value vs reference. Also note that Cloverfield is a work in progress, so the syntax is not frozen. % set v ({ref a}(1 2 3) {ref a}{}) % puts $v {ref a}(1 2 3) {ref a}{} % puts $v{0} 1 2 3 % puts $v{1} 1 2 3 COW still applies to structures with references. Mutations don't propagate through simple values ($) but only through references, explicitly ($&) or implicity (mutable operations). % set v2 $v % puts $v2 {ref a}(1 2 3) {ref a}{} % set v3 $&v % puts $v2 {ref a}(1 2 3) {ref a}{} % set v{0 2} foo % puts $v{0} 1 2 foo % puts $v{1} 1 2 foo % puts $v {ref a}(1 2 foo) {ref a}{} % puts $v2 {ref a}(1 2 3) {ref a}{} % puts $v3 {ref a}(1 2 foo) {ref a}{} Moreover, the {ref <id>} tag is only output for values that are referenced several times in a given structure (=> self-referential or circular structures). So simple structures are serialized simply, even if references are internally used everywhere. % set v{1} bar % puts $v (1 2 3) bar % set v{1} $&v{0} % puts $v {ref a}(1 2 3) {ref a}{} Another important point is that the ID space is word-local, i.e. IDs are resolved in the current substitution context. So references with the same name don't point to the same value unless they are in the same context: # Two structures using the same ID but in different contexts. % set a {ref a}(1 2 3) % set b {ref a}(4 5 6) % puts $a 1 2 3 % puts $b 4 5 6 # One structure using the same ID twice in the same context. # (Using parentheses). % set c (({ref a}1 2) ({ref a}3 4)) % puts $c ({ref a}1 2) ({ref a}{} 4) % puts $c{0} 1 2 % puts $c{1} 1 4 % set c{0 0} foo % puts $c{0} foo 2 % puts $c{1} foo 4 % puts $c ({ref a}foo 2) ({ref a}{} 4) % One structure using the same ID twice but in distinct contexts. # (Using braces). % set d {{{ref a}1 2} {{ref a}3 4}} % puts $d {{ref a}1 2} {{ref a}3 4} % puts $d{0} {ref a}1 2 % puts $d{1} {ref a}3 4 % set d{0 0} foo % puts $d{0} foo 2 % puts $d{1} {ref a}3 4 % puts $d {foo 2} {{ref a}3 4} Finally, Cloverfield allows globally scoped IDs for complex cases (values shared by otherwise distinct structures). They are the only potential source of security problems. Also, global IDs are per interpreter so propagation is limited. So safe interpreters could for example manage separate pools of global IDs depending on the source, or let them resolve to totally distinct references (which would break the bindings but preserve the data). But I don't think it's a big problem with regular applications, as applications having security issues should use safe interps anyway. I don't know if I make myself clear, however it's quite clear in my mind ;-) >> A --> B >> | / | >> | / | >> V L V >> C --> D >> >> Obviously this cannot be represented by a tree, so pure Tcl lists are >> useless in this case. Now if we use nodes with references, we get the >> following : >> >> Node Children >> A B C >> B C D >> C D >> D {} > > My immediate thought is that graphs are best represented by relations, > rather than references, so for any serious graph work, I'd use a > relational database like SQLite, or a relational language like Prolog. > For a small example like this, I'd probably use my own digraph package > (http://wiki.tcl.tk/18129 - based on dicts), which results in a similar > solution to your first array solution: Keep in mind that this was a very simple example, you have to put it in perspective for a more generic usage. You can't mandate a RDB for every application that needs even mildly complex structures. The typical example I have in mind is DOM trees. >> array set graph { >> A {graph(B) graph(C)} >> B {graph(C) graph(D)} >> C {graph(D)} >> D {} >> } > > Why would you do this? As I mentioned before, references always need > some mapping (interpretation) which defines what they refer to. Trading > a reference which is defined relative to the graph itself (i.e. simple > keys) for one which is defined relative to the current variable > namespace, doesn't seem to gain much (and makes it harder to e.g. clone > the graph). We put ourselves in the general case where references must point to globally scoped variables. For example you may want to use several graphs and move some nodes between them. If you prefer I can express the above graph as a set of variables: set a {A {b c}} set b {B {c d}} set c {C {d}} set d {D {}} But this doesn't change the point. >> Now, if we want to manage the lifetime of each node, we have to maintain >> bookkeeping info somewhere. IOW, we have to write an ad-hoc garbage >> collector. > > Sure. If you use a new system of reference (e.g. a dict, or an array > etc), then you need to manage the lifetimes of the objects contained in > it. It's not clear in this example what particular strategy of memory > management is desired: do you want a node to disappear when no edge > leads to it? Or when the only reference to it is itself (via a circular > reference)? Or do you want edges to be removed when the node they target > is removed? It's not obvious to me that a general-purpose reference GC > does the right thing for the application, without knowing what the > application is. We're in the classical case where structures can be arbitrarily shared, and get GC'd when they are no longer reachable. That's the kind of application where Tcl doesn't do well when compared to competing solutions such as Java or Python. >> The idea beind Cloverfield's serialization format is that any reference >> can be recreated using any of its string representation. The technique >> consists in serializing both the referenced value and a suitable >> identifier. [...] > > Right, but this still leaves lots of questions. For instance, it is > quite easy to manufacture references in this system: > > set a {ref A}(1 2 3) ;# $a is now a reference with id "A" and val (1 2 3) > set b {ref A}(a b c) > > Now what? Does setting b override the reference in $a? What is used to > determine which of these conflicting reference definitions is current? [...] See my previous explanation. Also, reference IDs are designed to be manufactured easily, but their scope is limited to the context in which they are declared, so they are pretty useless for potential attacks. > (Being able to > pass mutable references where only values were previously allowed in > general would need a careful security audit). With the proposed syntax, one cannot pass a mutable value other than explicity. For example (let's suppose that incr works on values or references and not var names): % set v ({ref a}1 {ref a}{}) % set v{1} 1 % incr $v{0} 2 % set v{1} 1 % incr $&v{0} 2 % set v{1} 2 Other than with global IDs, mutable ops behind your back is impossible. > Another scenario: > set a {ref A}(1 2 3) > set b $&a > > As I understand it, both $a and $b now refer to the reference ("A") that > points at the value (1 2 3). Now, let's say that for some reason that > reference "A" loses its internal rep (say somebody performs an operation > with $a that converts it to some other type). First, {ref A} only tags data within a. So in the first line, {ref A} is essentially a no-op. The second line, though, binds variable b to a's current value. If the value of variable a changes then b is unaffected. However if the content of a's value is altered through a mutable operation, then this change affects the content of b as well. % set a (1 2 3) % set b $&a % set a{0} foo ; # Mutable op on a's value, also ref'd by b. % set b foo 2 3 % set a (4 5 6) ; # Variable now refers to another value. % set b foo 2 3 > Furthermore, lets say that > by some further operation causes a copy-on-write so that $a and $b now > are distinct Tcl_Objs, but that they still have an identical string rep. > IIRC, the following might cause that situation: > > append a " " > set a [string range $a 0 end-1] > > So now we have two distinct Tcl_Objs $a and $b that used to be the same > reference but now aren't and neither has a reference internal rep. Now, > say $a is stashed away in some data structure and forgotten about. > Meanwhile, $b is treated as a reference again and re-establishes the > reference, recreating it if necessary from the string rep. Now the user > happily updates $b, mutating the reference, to say (a b c). A little > later, $b is unset and the reference disappears again. Now, $a is > brought out of hiding, and brushed down, and then an attempt is made to > read the reference: what does it return? Presumably, as the reference > has now gone, it is recreated again from the string rep of $a, reverting > to (1 2 3) -- all the updates to $b have been lost, even though these > references were supposed to be the same. Since the internal link is broken when a's value is set again, this won't happen hopefully. % set a foo % set b $&a % append a bar ; # Mutable op on a's value, also ref'd by b. % set b foobar % set a [string range $a 0 end-1] ; # a now refers to another value. fooba % set b foobar I hope this answers all your concerns. >> To handle circular references, a value is only serialized >> when it is first encountered in the current serialization process. In >> the above example, A's children is (B C D), so when serializing the >> whole graph, A is first serialized, then its first child B, then B's >> children and so on, so that when nodes need to be serialized again >> later, we only need to output the ref id, and the whole value can be >> omitted. > > But then, presumably if this entire structure gets shimmered (e.g. by an > append), all of these embedded references lose their internal reps > (possibly causing the actual reference to be GC'd). Now if we extract > one of these later nodes that only have the reference ID then we no > longer have enough information to recreate that reference, as we have > lost both the original reference and any string rep that could be used > to recreate it. The integrity of each individual reference is only > preserved as long as the integrity of the entire structure is preserved. No, references can be recreated because both their ID and value are serialized. Oh, you mean extracting the value-less nodes from their string rep? Sure the value would be lost, but I don't think this makes much sense in a real-world scenario, as it involves slicing the string rep manually without using access commands. That would be the same as using [string range] to extract an element from a list's string rep (and it wouldn't work anyway because of quoting rules). > My point is that real "strong" references are entirely incompatible with > EIAS. Hmmm. I'm not sure what you mean by that. More precisely, what does EIAS means exactly? To me, it's the fact that everything can be expressed as strings, and that internal structures can be recreated from their string representation. The Cloverfield reference serialization format meets these criteria IMHO. >> % set jim (1 2 3) >> % set d [dict create foo 12 bar $&jim baz $&jim] >> foo 12 bar {ref <id>}(1 2 3) baz {ref <id>}{} >> % set jim (a b c) >> % set d >> foo 12 bar {ref <id>}(a b c) baz {ref <id>}{} >> % set d(baz) >> a b c > > Why does the item denoted by $d(baz) have a different string rep when > serialised inside the dictionary to when it is used on its own? I.e., > why isn't the output of that last command: > > % set d(baz) > {ref <id>}{} String reps are generated on demand and depending on the context. So d(baz) refers to the list (a b c). The fact that this value is used elsewhere is out of context. So the string rep is simply that of the refered value. > Presumably, [set] is looking at the value at $d(baz) and seeing (by > inspection of the internal rep, or by string matching if the rep has > been lost) that it looks like a reference and then doing another > de-reference to get the actual value? Whereas the $&foo syntax just > grabs the reference itself and returns it's string rep? Except that in > this case 'jim' doesn't contain a reference to begin with, but is a > normal variable, so $&jim creates the reference, presumably. Then, do > subsequent $&jim's guarantee to return the same reference, or a fresh > one? Or are variables themselves implemented as references? Variables are just entry points to references. IOW, variables are names in the current calling context that hold references to values. So $&var returns a reference to var's value at the time of substitution. $var returns the value itself. $@var would return a weak reference to var (semantically equivalent to its variable name), meaning that it would dereference var at each access. If var's value changes, then $@var will reflect this change, contrary to $&var which will continue to point to the previous value. % set var foo % set v1 $&var % set v2 $v1 % set v3 $@var % set var bar % set v1 foo % set v2 foo % set v3 bar % set v1 baz % set v2 foo > Again, I worry about the security problems this introduces, and more > generally the difficulties it introduces. For example, consider some > code like the following: > > proc execute-query {query} { > lappend query -user foo -password sekret > db eval $query > } > set i [interp create -safe] > $i alias query execute-query > $i eval $untrustedcode > > In current Tcl that is safe: the username and password remain in trusted > code. With mutable references there is now a brand new line of attack > where the caller simply passes a reference instead of a value and can > now read back the user name and password once the query completes. > Mutable references should not be transparently interchangeable with > values. The security aspects are just the most extreme example of the > problems this can cause. Much code is written now with the assumption > that changes to local variables aren't visible outside of the proc > unless a specific broadening of scope is performed (via upvar, variable, > global etc). > > (Incidentally, while testing this I discovered that a safe slave can > introspect the aliases defined in it, including retrieving the target > command that would be executed in the master interpreter, which seems > like a security flaw of its own). I see your point. In the code above, execute-query should avoid modifying the query variable, as it can hold a reference. Hopefully with the proposed changes, lappend would become [list append] or something and work in a purely functional way: proc execute-query {query} { db eval [list append $query -user foo -password sekret] } But this would require a careful audit of aliased procs (which is needed anyway). A more general solution would be to disallow reference passing from safe interpreters and downgrade them to plain values, which would do the trick. Anyway, I think my proposal won't break the existing pass-by-value semantics, and rest assured that I'll work hard to ensure that the implementation lives up to this promise. |
From: Alexandre F. <ale...@gm...> - 2008-05-14 20:54:59
|
Hi Neil, On 5/14/08, Neil Madden <ne...@cs...> wrote: > > Mutable references should not be transparently > interchangeable with values. If you permit the tangent, I can only violently agree with this, as a lesson from my stillborn Mutability patch (try taming iterators with that...). -Alex |
From: Neil M. <ne...@Cs...> - 2008-05-14 20:28:33
|
Hi Gustaf, On 14 May 2008, at 19:32, Gustaf Neumann wrote: > Neil Madden schrieb: >> What would perhaps be easier to remember would be if variables >> were constant mappings and there was a separate reference type -- >> then you could use $foo syntax everywhere. You can do this right >> now in Tcl (modulo garbage collection): >> > Just 2 cents from the xotcl front, concerning variable references. > > XOTcl provides so-called filters, which are functions called around > every invocation of every method of an object. These filters can be > registered dynamicaly, they intercept method calls and can call the > shadowed methods via "next". This way, one can implement in a few > lines e.g. a trace program, which simply intercepts all method calls > and prints e.g. the call and exit of the functions. > > For this situation, relative refernces (the variable one level higher > in the callstack) are a problem. If XOTcl would not care about this, > programs performing an upvar would be broken, as soon a filter is > registered, since the upvar will point to the filter and not to the > same method as in the case without the filter. For this reason, XOTcl > implements its own upvar (and uplevel) methods to achieve transparency > for filters (and mixins as well). This works in general quite well, > but people have to write e.g. "my upvar ..." instead of plain "upvar" > to get this transparency. The ref scheme I outlined above could also work, at the cost of storing all state in fully-qualified global/namespace variables. > > When implementing these methods, i considered as well > an alternate approach: > > - The main problem for this kind of applications > are relative upvars (similar holds for uplevels). > > - Therefore, an option would be to provide an operator > (say &) that returns the 'absolute reference' > of a variable. > > - This 'absolute reference' is a symbolic name for the hash table > where the variable exists (or should be created) and could either > denote a variable on the callstack (say "#12 xxxx") or a > namespaced variable (say "::namespace::yyyyy"). This distinction > has to be determined by the new operator. > > - In case, the variable does not exist, one could provide > default assumptions (e.g. if called from a proc, it could > default to a variable on the callstack). > > - The dereferencing of the 'absolute reference' could be > done either in C for commands like lappend, or it could > be done via an addition Tcl command "alias" > > proc foo {varname} { > alias $varname x > set x 1 > } > > proc bar {} { > set y 0 > foo [& y] > } > > Such absolute references could be easily passed around > and are less harmful than the relative upvars, used > predominantly in Tcl today. Something like (arrays left as an exercise): proc & varName { if {$varName in [uplevel 1 [list info locals]]} { # Local var return [list #[expr {[info level]-1}] $varName] } else { return [list #0 [uplevel 1 [list namespace which -variable $varName]]] } } proc myincr {ref {amount 1}} { upvar {*}$ref var incr var $amount } proc foo ref { myincr $ref } proc bar a { foo [& a]; puts "a = $a" } bar 12 ;# a = 13 set b 10 foo [& b] ;# => 11 -- Neil This message has been checked for viruses but the contents of an attachment may still contain software viruses, which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation. |
From: Neil M. <ne...@Cs...> - 2008-05-14 20:18:31
|
Hi Frédéric, I think a reference implementation (pun intended!), or some semi- formal description, would clarify this discussion immensely. Until then, some not-so-brief comments are below. I've tried to trim down to just the key points, so I've trimmed much of the rest of the discussion. On 14 May 2008, at 17:20, Frédéric Bonnet wrote: [...] > (I'm going to digress a bit from the original topic) > > For example, let's consider the following asymetric directed graph: > > A --> B > | / | > | / | > V L V > C --> D > > Obviously this cannot be represented by a tree, so pure Tcl lists are > useless in this case. Now if we use nodes with references, we get the > following : > > Node Children > A B C > B C D > C D > D {} My immediate thought is that graphs are best represented by relations, rather than references, so for any serious graph work, I'd use a relational database like SQLite, or a relational language like Prolog. For a small example like this, I'd probably use my own digraph package (http://wiki.tcl.tk/18129 - based on dicts), which results in a similar solution to your first array solution: > > We may store each node in an array, using node names as keys: > > array set graph { > A {B C} > B {C D} > C {D} > D {} > } > > But here each child is represented by its key, not its variable name, > which is implicitly the array. So in a real world scenario, we get: > > array set graph { > A {graph(B) graph(C)} > B {graph(C) graph(D)} > C {graph(D)} > D {} > } Why would you do this? As I mentioned before, references always need some mapping (interpretation) which defines what they refer to. Trading a reference which is defined relative to the graph itself (i.e. simple keys) for one which is defined relative to the current variable namespace, doesn't seem to gain much (and makes it harder to e.g. clone the graph). > Now, if we want to manage the lifetime of each node, we have to > maintain > bookkeeping info somewhere. IOW, we have to write an ad-hoc garbage > collector. Sure. If you use a new system of reference (e.g. a dict, or an array etc), then you need to manage the lifetimes of the objects contained in it. It's not clear in this example what particular strategy of memory management is desired: do you want a node to disappear when no edge leads to it? Or when the only reference to it is itself (via a circular reference)? Or do you want edges to be removed when the node they target is removed? It's not obvious to me that a general-purpose reference GC does the right thing for the application, without knowing what the application is. [... snip graph representation ...] > This looks slightly cryptic, but note that this is the serialized > value. > In practice one hardly uses this notation but rather builds structures > programmatically: > > # Create each node with an initially empty child node > set a (A ()) > set b (B ()) > set c (C ()) > set d (D ()) > > # Now add all children (element 1 of each node) > set a{1} ($&b $&c) > set b{1} ($&c $&d) > set c{1} ($&d) > > # Put all node in one single variable > set graph ($&a $&b $&c $&d) > > The idea beind Cloverfield's serialization format is that any > reference > can be recreated using any of its string representation. The technique > consists in serializing both the referenced value and a suitable > identifier. [...] Right, but this still leaves lots of questions. For instance, it is quite easy to manufacture references in this system: set a {ref A}(1 2 3) ;# $a is now a reference with id "A" and val (1 2 3) set b {ref A}(a b c) Now what? Does setting b override the reference in $a? What is used to determine which of these conflicting reference definitions is current? Is it the time at which they are converted to reference type? (And when does that happen?) There is also possible problems of accidental (or deliberate) name clashes, and associated security issues (e.g. if an attacker can manufacture any reference as a string, then he can have a field day feeding you values that contain manufactured references, either to capture information or to clobber it). You could mitigate these problems to some extent by making the $&var syntax return a reference whose ID is some cryptographically unforgeable string, but it seems you couldn't get rid of the situation entirely. (Being able to pass mutable references where only values were previously allowed in general would need a careful security audit). Another scenario: set a {ref A}(1 2 3) set b $&a As I understand it, both $a and $b now refer to the reference ("A") that points at the value (1 2 3). Now, let's say that for some reason that reference "A" loses its internal rep (say somebody performs an operation with $a that converts it to some other type). Furthermore, lets say that by some further operation causes a copy-on-write so that $a and $b now are distinct Tcl_Objs, but that they still have an identical string rep. IIRC, the following might cause that situation: append a " " set a [string range $a 0 end-1] So now we have two distinct Tcl_Objs $a and $b that used to be the same reference but now aren't and neither has a reference internal rep. Now, say $a is stashed away in some data structure and forgotten about. Meanwhile, $b is treated as a reference again and re- establishes the reference, recreating it if necessary from the string rep. Now the user happily updates $b, mutating the reference, to say (a b c). A little later, $b is unset and the reference disappears again. Now, $a is brought out of hiding, and brushed down, and then an attempt is made to read the reference: what does it return? Presumably, as the reference has now gone, it is recreated again from the string rep of $a, reverting to (1 2 3) -- all the updates to $b have been lost, even though these references were supposed to be the same. On the flip side of that coin, imagine that $b isn't destroyed before $a is retrieved. Here the connection would be reestablished and the updates saved. But this itself might be confusing. If the programmer decides to print $a before retrieving the reference value he will see that it apparently has the value (1 2 3) still, but then when the reference is resolved it will get the updates written to $b (presumably) and thus actually contain the value (a b c). Now it may be that you intend that the [puts $a] would already do the conversion and retrieval of the appropriate reference. However, I'm not sure how this would work -- does [puts] (and every other command) have to scan its argument to see if it looks like a reference before using it? This is the problem with strings as references, they're just too easy to bugger about with. You could try something like Feather's opaque internal reps (which can't be shimmered), but this then definitely moves you outside of EIAS. > To handle circular references, a value is only serialized > when it is first encountered in the current serialization process. In > the above example, A's children is (B C D), so when serializing the > whole graph, A is first serialized, then its first child B, then B's > children and so on, so that when nodes need to be serialized again > later, we only need to output the ref id, and the whole value can be > omitted. But then, presumably if this entire structure gets shimmered (e.g. by an append), all of these embedded references lose their internal reps (possibly causing the actual reference to be GC'd). Now if we extract one of these later nodes that only have the reference ID then we no longer have enough information to recreate that reference, as we have lost both the original reference and any string rep that could be used to recreate it. The integrity of each individual reference is only preserved as long as the integrity of the entire structure is preserved. [...] > >> I don't really understand what benefits these references have over >> just using unique globally-scoped variable names. > > Lifetime management, mutable vs. immutable, copy-on write come to > mind. Lifetime management I get, but surely references are both mutable and not copy-on-write (which would make them immutable), just like variables. > The challenge is to create a system that preserves the existing Tcl > philosophy while allowing strong references. The lack of such > references > is the reason why DOM bindings such as TclDOM or tdom require special > handling of the DOM's childNodes method. For example, tdom's > childNodes > require a plain immutable Tcl list, whereas childNodesLive returns a > "live", opaque, mutable list object. In other languages the structures > are simply mutable and the DOM semantics is preserved. My point is that real "strong" references are entirely incompatible with EIAS. >> [...] > >> set d [dict create foo 12 bar $&jim] >> puts $sock $d >> ... on other end ... >> set d [read $sock] >> set $&d(bar) ... ;# is that the right syntax? >> >> Does that work? I assume for that to work that the string rep of a >> reference must identify itself as a reference (e.g. the "{ref foo} >> bar" syntax), but then how do I get just the current value from a >> reference if $myref returns the reference? > > $myref works as expected, whether it holds a reference or not. > > For example: > > % set jim (1 2 3) > % set d [dict create foo 12 bar $&jim baz $&jim] > foo 12 bar {ref <id>}(1 2 3) baz {ref <id>}{} > % set jim (a b c) > % set d > foo 12 bar {ref <id>}(a b c) baz {ref <id>}{} > % set d(baz) > a b c Why does the item denoted by $d(baz) have a different string rep when serialised inside the dictionary to when it is used on its own? I.e., why isn't the output of that last command: % set d(baz) {ref <id>}{} or (more sophisticated): % set d(baz) {ref <id>}(a b c) ? Presumably, [set] is looking at the value at $d(baz) and seeing (by inspection of the internal rep, or by string matching if the rep has been lost) that it looks like a reference and then doing another de- reference to get the actual value? Whereas the $&foo syntax just grabs the reference itself and returns it's string rep? Except that in this case 'jim' doesn't contain a reference to begin with, but is a normal variable, so $&jim creates the reference, presumably. Then, do subsequent $&jim's guarantee to return the same reference, or a fresh one? Or are variables themselves implemented as references? > [...] > > With the new system, commands will have to decide whether they expect > variable names, or values. Commands expecting variable names would > come > in small number, as they would be limited to declarative commands such > as global or variable, or introspection. They are the kind that would > benefit from proc enhancements, such as auto-upvar'ing with & prefix, > but wouldn't from references. So the vast majority of commands would > expect values and not variable names. Of these commands, some would > work > in a purely functional way with immutable values, others could accept > both values and references. At the script level everything would be > handled transparently: modify an argument variable, and the change > would > propagate downwards if the passed argument was a reference. Weak > references to non-existing variables can be detected by checking > for the > existence of the argument variable. Again, I worry about the security problems this introduces, and more generally the difficulties it introduces. For example, consider some code like the following: proc execute-query {query} { lappend query -user foo -password sekret db eval $query } set i [interp create -safe] $i alias query execute-query $i eval $untrustedcode In current Tcl that is safe: the username and password remain in trusted code. With mutable references there is now a brand new line of attack where the caller simply passes a reference instead of a value and can now read back the user name and password once the query completes. Mutable references should not be transparently interchangeable with values. The security aspects are just the most extreme example of the problems this can cause. Much code is written now with the assumption that changes to local variables aren't visible outside of the proc unless a specific broadening of scope is performed (via upvar, variable, global etc). (Incidentally, while testing this I discovered that a safe slave can introspect the aliases defined in it, including retrieving the target command that would be executed in the master interpreter, which seems like a security flaw of its own). -- Neil This message has been checked for viruses but the contents of an attachment may still contain software viruses, which could damage your computer system: you are advised to perform your own checks. Email communications with the University of Nottingham may be monitored as permitted by UK legislation. |