tcl9-cloverfield Mailing List for Tcl9 (Page 6)
Status: Alpha
Brought to you by: fbonnet
From: Andy G. <unu...@ai...> - 2008-04-05 16:40:49
On Sat, 5 Apr 2008 11:42:35 +0200, Alexandre Ferrieux wrote:

> On 4/5/08, Andy Goth <unu...@ai...> wrote:
> > [...] but still, many people can't handle precise language in any
> > form. So it doesn't matter if we accurately document Cloverfield;
> > it still won't be used effectively
>
> If you allow a distant observer to chime in,

Please, by all means! This is a public mailing list for a reason. A REASON IS YOU! Sorry, I had a little Engrish moment there. ;^)

> I find it a bit dangerous to target explicitly "people who can't
> handle precise language in any form".

I agree. I wrote my email partly to bring about this very discussion we're having right now, in the hope that we will moderate our goals some. I also wanted to record the salient points of my conversation with my friend, so that we can come back to them later and see if we addressed any of them in the way we designed our language and (more importantly!) wrote our documentation.

> My personal take on this is that this category of people shouldn't be
> programming anyway.

Yeah, and seeming numerical majority doesn't automatically mean that Tcl/Cloverfield is wrong for excluding them. It could also mean that other languages are wrong for including them, that they (the languages) were driven by marketing concerns instead of a desire to enhance the state of computing. This floods the field with people who don't belong, which of course severely degrades the state of computing. Or maybe I'm wrong; maybe I just have Ivory Tower syndrome. Maybe all of Tcl does. And maybe we have earned it. ;^)

However, proper education can help some of these people to become qualified. Cloverfield might be able to help with that. A novice programmer can go far if given good introductions, good tutorials, good reference documentation, a genuinely simple yet powerful language, and a wicked-awesome GUI toolkit that needs no "visual" toolchain.
The hard part is reforming people who think themselves to be experts but in fact are little more than cargo cultists.

> And even if they insist, there are already plenty of half-witted
> languages filling that niche (VB, PHP, you name them).

I heartily agree that VB is half-witted, but I don't know about PHP. Seriously, I don't know about it; I haven't used it in so long. Please tell me a little bit about why you think it is so. Sure, I agree that many of the people who use it have the same problems we've been discussing, but I want to know what it is about the language that is inviting to them.

> By contrast, but tell me if I'm wrong, my perception of Cloverfield
> was that of (1) a syntax "tabula rasa" to free (y)ourselves from
> Tcl's historical, too-late-to-be-fixed blemishes,

I guess that's a right way to put it. We're changing the behavior of a few things to be less surprising. We're also adding inline comments, whose nonexistence is the surprise. We're removing the need for the [list] command, because the idiom [list a $b c] is too much work for something so fundamental, and because people are likely to be lazy and say "a $b c", even though that is incorrect; instead do (a $b c)! We have variable indexing that I hope removes the need for arrays, which in my opinion are a horrible hack. ;^) And there's miscellaneous other stuff like null and references and delays and metadata that I haven't quite found uses for just yet, but I have confidence that I will when I turn my attention to them.

There's also a third goal, by the way, but it doesn't need to be discussed in this thread. I'll just state it anyway. Fred and I want to experiment with language design and implementation! For instance, see our discussion of ropes, on which I have much to say in an upcoming email.

> (2) an ambition to target the very same minds Tcl appeals to.

We agree that Tcl is a simple language that is simple to program in.
But from talking to outsiders--- people who are mired in the mindsets of Python, Java, C++, and even Ada--- I get the impression that using Tcl is simple for only those who genuinely know Tcl. Learning Tcl is surprisingly difficult, until you develop the mental discipline required for keeping it separate in your mind from other languages with deceptively similar syntax. In particular I'm thinking about {matched braces} and #comments. Everything is surprising until you learn that Tcl truly is simple and doesn't attempt to think for you; after that point, it's not surprising at all. The next time you will be surprised is when you encounter a command that does try to think for you, for example tcltest; see Tcl bug #1528500. I'm told that tcltest has a DWIMmy attitude in general. Tcl's not DWIMmy at all, at least it won't be until TIP #131 is finished.

> In short, I believe it is not possible (nor desirable) to swallow the
> whole spectrum of "programmers at large" in one language.

Cloverfield is a bit of an outreach, with the changes I listed above that are designed to be less surprising. But it ain't BASIC.

> Do I misinterpret your message ? If I do, which end of the spectrum is
> your primary target ?

No, you interpret my message correctly, and you wrote the response I was hoping someone would write. You wrote it in good time too. ;^) I want for this thread to result in a better definition of our target audience.

-- Andy Goth | <unu...@ai...> | http://andy.junkdrome.org/
From: Alexandre F. <ale...@gm...> - 2008-04-05 09:42:29
On 4/5/08, Andy Goth <unu...@ai...> wrote:

> [...] but still, many people can't handle precise language in any
> form. So it doesn't matter if we accurately document Cloverfield; it
> still won't be used effectively

If you allow a distant observer to chime in, I find it a bit dangerous to target explicitly "people who can't handle precise language in any form". My personal take on this is that this category of people shouldn't be programming anyway. And even if they insist, there are already plenty of half-witted languages filling that niche (VB, PHP, you name them).

By contrast, but tell me if I'm wrong, my perception of Cloverfield was that of (1) a syntax "tabula rasa" to free (y)ourselves from Tcl's historical, too-late-to-be-fixed blemishes, and (2) an ambition to target the very same minds Tcl appeals to. In short, I believe it is not possible (nor desirable) to swallow the whole spectrum of "programmers at large" in one language.

Do I misinterpret your message ? If I do, which end of the spectrum is your primary target ?

-Alex
From: Andy G. <unu...@ai...> - 2008-04-05 01:19:26
On Fri, 4 Apr 2008 18:56:24 -0600, Andy Goth wrote:

> I think the best thing about an [expr] shorthand is that it would
> encourage programmers to avoid the unsafe usafe of [expr]--- instead
> they will use the shorthand, which the interpreter already knows not
> to substitute.

Make that "the unsafe use"; heh.

-- Andy Goth | <unu...@ai...> | http://andy.junkdrome.org/
From: Andy G. <unu...@ai...> - 2008-04-05 00:56:36
One goal of Cloverfield is to make it less surprising for people who haven't taken every word of its Tridekalogue to heart. That would be a majority of its programmers, unfortunately. I had a long conversation about Tcl with a friend who tells me that he finds the Tcl Dodekalogue to be as impenetrable and obtuse as the legalese he sometimes encounters at work. I think you'll agree with me that it's very clear, but still, many people can't handle precise language in any form. So it doesn't matter if we accurately document Cloverfield; it still won't be used effectively, so long as it is too different from other languages, languages which were learned by reading examples and inferring the syntax rules. This is a sad state of affairs, to be sure, but it is what I perceive the reality of programming to be. Truly logical thinking is rare, and superstition rules in its place.

And so, even though Tcl has a rule [6], people are still surprised that they can type a{b} but they can't type {a}b . Either works as expected when surrounded in braces or double quotes or preceded by a backslash, but one who is surprised by this behavior has already stopped being rational. :^(

This friend said that he suggests requiring that all literals be surrounded in quotes. I asked, does that mean I should write code like the following?

    "set" "var" "1"

He said, no, of course not: set is a keyword and not a literal, var is a variable name and not a literal, and 1 is a number and not a string. You know there is nothing in the Dodekalogue to suggest any of this, but he read it in anyway because of his experience with other languages.

He found $a$b to be a bizarre way to concatenate two strings, saying that concatenation should always be explicit, like in PHP which has a dot (.) operator for that purpose. I pointed out that PHP also supports surrounding that same expression in quotes ("$a$b") to get the same result.
I also pointed out that all strings are the concatenation of single-character substrings. Tcl, of course, has no operators, so $a$b is the only available syntax, or maybe a command [string concat $a $b], which would be clumsy but occasionally useful.

He said that while Tcl may claim to be simple, it is in fact very complex to use in practice because of its simplemindedness. He expects a language that is smart enough to understand, for example, that the third argument to proc is a script, that scripts have comments in them, and that braces inside comments should not affect the brace count. We address this last issue, but it applies everywhere, not just in places that will eventually be interpreted as a script. This can be a problem in some cases, such as one I will document in a later email. (For now, use your imagination.) But one who has actually taken the Tridekalogue to heart knows how to handle each case without trouble--- just use proper quoting to be sure you correctly communicate your intentions to the interpreter.

My friend and I also talked about C and a few other language and OS topics (C++ exceptions, garbage collection, Linux, and the Unununium OS to which I am not related). Regarding C, he said he really likes it, and he thinks he fell in love with it because it's the first language he really learned. Before C he used Pascal and assembly, but he didn't really learn them; he just combined parts of examples until he got what he wanted. But C he learned in about two days' time by reading code I had written for him. The only thing that gave him pause was pointers. He said he read many long explanations and tutorials on pointers, and none of it made sense to him. He didn't understand pointers until he asked me, and my response was "a pointer is a memory address". And then he understood perfectly! That's the part that was apparently missing from all the explanations and tutorials he'd read.
He said that every time he tries to program in Tcl he gets stuck spending hours reading the documentation and getting frustrated that he can't make enough sense of it to pull it together into a program. So instead he does his programming in C or bash, maybe PHP if that's what he's being paid to do. I think it's curious that he prefers bash to Tcl, since bash is a much less regular language that is totally riddled with gotchas and security problems. I pointed out this one:

    a=" 1 2 3 4 "
    echo $a

This will print out "1 2 3 4", losing all the spaces. Yet the spaces are correctly stored in the variable. What gives? When I asked him how to fix it, he said to replace the double quotes with single quotes, which of course does nothing in this case. So I explained that after doing a substitution, bash automatically expands it to multiple words. That's the default behavior, which has to be explicitly turned off by putting $a in double quotes. When I pointed this out, he said he didn't really care about corner cases like these. When the very essence of security is relegated to the status of corner case, we have a problem.

And yet this same guy is very serious about SQL injection attacks; he makes sure all data received from the user is adequately quoted. I'm not sure he has yet grasped the reason why SQLite (when properly used) is safe from injection attacks, the reason why expr (when improperly used) is vulnerable, and the reason why the safe way is also the most efficient way.

Oh, that brings me to another point. He thinks it's very important to have math directly in the language. (Of course, the truth is that he just wants C.) This would make Cloverfield a different type of language, one with syntax and operators and precedence and a BNF. I showed him that the following works:

    [+ 5 [- 6 1]]

And he was satisfied. But for myself, I think I might like an [expr] shorthand syntax, if we can work it in without bloat or other trouble.
I think the best thing about an [expr] shorthand is that it would encourage programmers to avoid the unsafe usafe of [expr]--- instead they will use the shorthand, which the interpreter already knows not to substitute.

From all this I take it that maybe a concise Tridekalogue is not the best way to document our language. It's a nice selling point to say that the language is so simple that it is completely described by thirteen rules. But to reach out to people like my friend, we need to include an introduction with the distribution, and the Tridekalogue must either be annotated or have an annotated version available, with examples that one could learn from. I'd rather we be the ones to write the examples that define the language in the minds of people who can't handle precise language. If we don't, others will, and most introductory programming books and curricula I have encountered are wrong on many points, exacerbating the problem we face.

I'm still working on the reply to your other email because I've been busy and I have been thinking very hard about your quoting proposal. I also have taken a shot at unifying strings and lists, and I think you'll be pleased with the results. Stay tuned!

-- Andy Goth | <unu...@ai...> | http://andy.junkdrome.org/
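The SQLite point above can be made concrete with a short sketch. Python's sqlite3 module stands in for Tcl here, and the table, data, and queries are invented purely for illustration:

```python
import sqlite3

# Hypothetical table and data, for illustration only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (name TEXT)")
con.execute("INSERT INTO users VALUES ('alice')")

evil = "x' OR '1'='1"

# Interpolating user data into the statement text: the data gets
# parsed as SQL, so the injected OR clause matches every row.
bad = "SELECT count(*) FROM users WHERE name = '%s'" % evil
assert con.execute(bad).fetchone()[0] == 1

# Bound parameters: the data is never parsed as SQL, and the constant
# statement text can be prepared once by the engine.
good = "SELECT count(*) FROM users WHERE name = ?"
assert con.execute(good, (evil,)).fetchone()[0] == 0
```

With bound parameters the statement text is constant, so the engine can prepare it once and the user data can never change the parse --- the same reason the safe way of using [expr] (a braced, unsubstituted expression) is also the most efficient way.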
From: Frédéric B. <fb...@en...> - 2008-03-26 15:22:17
Andy Goth wrote:

> On Wed, 19 Mar 2008 18:27:52 +0100, Frédéric Bonnet wrote
>> Andy Goth wrote:
>> > That sounds pretty interesting, actually! How does this square
>> > with your vision of skiplists in action?
>>
>> That's exactly how it's supposed to work. The lowest level would be
>> a list of "atoms", i.e. literals (pointing to the original source),
>> and backslashed data. E.g. "foob\x61r" is {LITERAL foob} {ESCAPE a}
>> {LITERAL r}, or {LITERAL 0 4} {ESCAPE a} {LITERAL 8 1} when using
>> index and length.
>
> For performance reasons, I would have merged the whole thing into a
> single syllable. [...]
> While on the subject of performance, I have some thoughts about
> making ropes faster for scripts, which have lots of single-character
> segments.

A syllable is of a conceptually higher level than a string. Syllables need not be represented as individual strings but rather point to a portion of a string; and several syllables can point to the same string. So fragmentation doesn't occur on the string level but on the syllable level (and rightly so). So in the above example the two outer syllables point to the original string, whereas the center one points to a dynamically alloc'd node. In most cases, nearly all syllables will point to the same unmodified source script, which would be a single contiguous string and not a rope made of many small strings.

>> But syllables, atomic or not, could also be dynamic (e.g.
>> functions).
>
> Functions? Could you give an example of where this will help? Are you
> thinking of references and delayed expansion?

Think of functions as simple string operations. The simplest and most natural function is concatenation: a node made of several subnodes applies an implied concatenation on its children. This is the raison d'être of ropes. But other functions may be used, such as repetition. For example, [string repeat a 1000] currently allocates a new 1000-char buffer and copies the string "a" 1000 times.
With a "repeat" function, [string repeat a 1000] would simply return a node with a "repeat 1000" function and the source string "a". Using such a rope is not a problem given the right data structure, as long as you know the length of each node. For example, splitting [string repeat a 1000] at index 100 would create the two nodes [string repeat a 100] and [string repeat a 900].

> If I read you right, you are suggesting further splitting up the rope
> at the boundaries between one-byte, two-byte, and whatever-byte
> characters. I guess another field could be added to the rope to
> indicate the character type it contains: one-byte, two-byte, etc., or
> a mixture. Leaf nodes would be constrained to not be a mixture. Would
> len be the length in bytes or characters? Length in characters would
> be more useful, and length in bytes could be determined by
> multiplying the lengths by the character widths.

Indeed. Indexing is done at the character level.

> A long syllable that alternates between one-byte and two-byte
> characters will be the worst case for this scheme.

To avoid degenerate cases, we could use heuristics for choosing the character representation, so as to avoid building excessively fragmented ropes. For example, strings having a sufficiently high proportion of Unicode chars will be represented as one single chunk of wide chars.

> (I'm not sure "balancing" is really the right word here, since that
> implies rotating nodes around, which I don't think we want.)

You've got the right intuition there. Yes, balancing is the right word, and is critical to delivering good performance. The key to a good rope structure is good balancing because it gives optimal indexing. I suggest you read the following paper for a good intro to ropes:

http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol25/issue12/spe986.pdf

Let's consider the big picture where strings and lists use the same underlying structure.
From an abstract point of view, a string is simply a list of characters. And a list can be represented as a tree with operators at each node, such as concatenation, repetition or expansion. (Speaking of ropes, a rope is simply an internal structure for strings, so let's speak about strings for now.) For example the list (a b {*}(c d) e f) is a 6-element list made of chars a to f. But the underlying structure is actually a tree: the first level is a 5-element list whose 3rd element is a 2-element list with an expansion operator. From an abstract point of view one could also see this list as the string "abcdef". This abstract data type could be the result of the following operations:

    # As a list:
    set l1 (c d)
    set l (a b {*}*l1 e f)

    # As a string:
    set s1 "cd"
    set s "ab${s1}ef"

My goal is to experiment with an internal structure that could be used for both strings and lists. This structure would be a tree with operators on each node. A string is a list of characters, so the tree structure is implicitly flattened; this puts no constraint on the actual implementation of the tree. However, lists (and parse trees) are trees of strings, so the actual structure must be controllable as it holds significant information; the actual implementation must thus provide this property.

Ropes (i.e. flattened trees of strings) are usually implemented using B-trees or other self-balancing structures. This gives good overall performance as the global balancing of the tree optimizes the indexing. However the resulting tree structure is uncontrollable. OTOH skiplists' balancing algorithm is much simpler and is totally controllable when needed: it is based on the promotion of nodes to upper levels, which serve as "fast lanes" by skipping lower elements (hence the name). While the statistical (i.e. random) balancing algorithm gives theoretically lower performance when compared to self-balancing structures, average performance shouldn't be far off.
Moreover the promotion algorithm can be totally controlled when needed and thus can be used to hold structural information. Finally, every level is a simple flat linked list, which is especially nice for the lowest level.

>> The only reason not to use ropes is C runtime compatibility, which
>> requires 8-bit, NUL-terminated flat strings.
>
> How important is C runtime compatibility? The libc string functions
> don't take ropes anyway, so we already have to write our own suite of
> rope-enabled string functions.

This is only important in the current context of Tcl8; that is the reason why its string representation cannot be changed. With Tcl9, we are free to break compatibility at the internal level, and supply our own set of libc-like APIs (which Tcl already does to circumvent bugs in some implementations; see the compat dir in the Tcl sources). IMHO libc compatibility is no longer a requirement as Tcl is already a portable C library itself, so there is no more compelling reason NOT to use ropes.

>> > You wish to unify ropes, parse trees, and Cloverfield objects,
>> > correct? I'm not certain the unification is necessary, so long as
>> > these different structures are capable of pointing to each other,
>> > similar to containment.
>>
>> In Tcl, everything is a string and can be represented as a rope,
>> which is conceptually a tree, however most values are also lists or
>> trees, so why not try to unify both?
>
> I suspect it might be prohibitive to always encode the internal
> (non-string) representation directly into the structure of the rope.
> For example, this interferes with having the parser directly perform
> backslash substitutions.

My goal is to use the same set of structures and algorithms, but not necessarily to store all information into one single object. Let's take the example of a list of strings.
If the list and its strings had to share structures, then another list could not store the same strings, and we would lose the copy-on-write feature of Tcl. So each object is of course distinct, but different data types may use the same set of structures and algorithms. At present strings and lists are implemented using distinct sets of structures and algorithms, whereas they are very similar data types.

> Or what about dictionaries? Where would the hash table go?

The unification of internal structures only concerns "immediate" values, such as strings and lists, i.e. the external representation. Object types such as dicts still require a specific internal representation à la Tcl_Obj. But thanks to the interface mechanism that comes with rule [8] there is still the possibility to avoid shimmering whenever possible.

> In my mind, the meaning of semicolon is well defined. Semicolons are
> command separators, the same as newlines. Command separators are also
> word separators. Therefore, the first nonspace character following a
> semicolon is the first character of a word. Think of the ;# idiom in
> Tcl.

But semicolons are not word separators in lists. For example, [lindex {a b;c d} 2] gives "d", not "c", and list element #1 is "b;c". The same would be true with Cloverfield rule [6]. Add to that rule [8] and variable index parsing, and things get really complicated. Hence my proposal below.

>> Moreover, Tcl and the current Cloverfield spec recognize open braces
>> everywhere, even when they wouldn't at toplevel.
>
> When they wouldn't what? Wouldn't matter? This sentence is a bit
> clumsy.

What I meant was: at the toplevel, braces are only matched at word start (or with variables, but whatever). So an open brace in the middle of a word doesn't need a matching close brace. However, between braces (Tcl rule [6] and Cloverfield rule [5]), all open braces must have a matching close brace.
Clearly this is inconsistent, because a script that would be valid at the toplevel isn't valid between braces. BTW the same kind of problem occurs with hashes. More below.

>> To do so, I proposed above that comments and {data} be recognized
>> everywhere, as with braces.
>
> I don't see where you said that.

That's right, I thought I had. But this is no longer required with my new proposal, as this would make it easier to recognize the places where quotes, comments and {data} may occur: after whitespace, open braces/parentheses/brackets, and semicolons.

>> But it's far from satisfying, as the new rules would work the same
>> as the existing ones that are inconsistent.
>
> Again, I don't see your example as inconsistent, at least not in a
> way that matters or is solvable.

My goal is to make code that works at toplevel still work between braces. So using braces within words should always require escaping. For symmetry, matching delimiters between braces should also match at toplevel. So if we have:

    if {true} {
        # Valid
        set a{1 2} 3
        set a\{ b
        # Invalid
        set a{1 2
    }

then we should also have:

    # Valid
    set a{1 2} 3
    set a\{ b
    # Invalid
    set a{1 2

>> A last alternative would be to allow braces and parentheses
>> everywhere in the word, and not only at the beginning. So they would
>> always be balanced.
>
> This is a huge change, and I believe it to be of questionable merit.
> It would make Cloverfield more like the Bourne shell, which supports
> bizarre things like 'Andy said, "My dog'"'"'s fur is white and
> brown."' .

Fortunately I don't want to go that far ;-) But allowing brace and parenthesis matching everywhere solves many problems and improves the overall consistency. Currently brace matching rules differ between rule [5] and at toplevel. What I want is to improve overall consistency without creating too many new rules or breaking existing ones.
> My solution to this problem (and some related problems) is for
> variable names to always be given in the form of a substitution or a
> reference, never a bare literal. Also, since this would make variable
> references much more common, I would shorten the syntax from $& to @.
> Lastly, in your "3 Feb 2008 22:24:01 +0100" email you said that
> references only work as references if they're the only syllable in
> the word, so I'll only have @ be recognized as a variable reference
> if it's the first character of a word. To the parser, it would be
> similar to a quoting style. (If you don't like @, we can pick
> something else that's unlikely to appear at the start of a word.)
>
> % set @a{1 2} 3
> % puts $a{1 2}

This is a bit too Perlish IMHO. With @ you're basically creating a new quoting rule that activates braces/parentheses matching. OTOH requiring references in place of names is a major departure from Tcl: it isn't consistent with other commands such as global, variable or upvar (unless you want to change them in a similar way), or with introspection, as we still want the possibility to pass variable names around. And semantically speaking it requires "weak" references (which variable names are) whereas the current references are "strong", in the sense that referenced variables must pre-exist (as with variable substitutions).

>> So if an open brace or parenthesis occurs at the beginning of a
>> word, then the word is totally enclosed between the pair, which are
>> not retained as part of the word. If they occur in the middle of a
>> word, then they must be balanced and are retained as part of the
>> word (and don't terminate it).
>
> How does this play with word modifier syntax? Are you disallowing
> having only the first part of the word being quoted with braces or
> parentheses?

It only activates brace or parenthesis matching in the middle of a word, the same way brackets are already matched.
Including an open brace or parenthesis within a word would require a backslash, as is already the case with braced code anyway.

> I expect this will be a great source of confusion; users will have to
> remember when the quotes are kept in the word and when they're
> dropped.

The only rule to remember is whether the delimiter starts the word or not. If it does, then the matching delimiter must end the word and they strictly enclose the word (i.e. they aren't part of it). If it doesn't, then the word continues after the matching delimiter and they are retained. This is not a quoting rule, but rather a collating rule.

The best I could do now is implement these features in my parser and write a test suite, to see how it renders. IMHO if the code looks good and consistent, then the rule is good. You can also write test code on your side to see how it would behave with both parsers.
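The "repeat" function discussed earlier in this message ([string repeat a 1000] as a node rather than a 1000-char buffer) can be sketched as follows. This is an illustrative Python sketch; the class and method names are invented, not Cloverfield's actual structures:

```python
class Repeat:
    """Rope node applying a 'repeat n' function to a source string."""

    def __init__(self, source, count):
        self.source = source
        self.count = count

    def length(self):
        # Knowing each node's length is what makes splitting cheap.
        return len(self.source) * self.count

    def flatten(self):
        # Only materialize the repeated string when actually needed.
        return self.source * self.count

    def split(self, index):
        # Splitting on a repetition boundary yields two smaller repeat
        # nodes and copies no character data at all.
        if index % len(self.source) != 0:
            raise ValueError("split not on a repetition boundary")
        k = index // len(self.source)
        return Repeat(self.source, k), Repeat(self.source, self.count - k)

# [string repeat a 1000] split at index 100, as in the example above:
left, right = Repeat("a", 1000).split(100)
```

Splitting produces [string repeat a 100] and [string repeat a 900] as two small nodes; the thousand characters are never allocated unless some caller flattens the rope.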
From: Andy G. <unu...@ai...> - 2008-03-23 03:56:27
On Wed, 19 Mar 2008 18:27:52 +0100, Frédéric Bonnet wrote:

> Andy Goth wrote:
> > On Fri, 10 Mar 2008 17:23:06 +0100, Frédéric Bonnet wrote
> >> The global state machine model is interesting at first but given
> >> the peculiarities of Tcl it makes things less readable.
> >
> > I'm not sure Tcl is responsible for the readability problem.
>
> What I meant was that Tcl rules are not global contrary to other
> languages such as C, so one needs a dedicated parser for each quoting
> rule.

Except that there is much overlap between different quoting rules. However, the intersection (common code) of any two quoting rules is liable to be different depending on which two quoting rules are chosen. "A" is the case for quoting rules "1" and "2" but not "3"; "B" is the case for rules "1" and "3"; "C" is the case for "2" and "3"; etc. This is what will make it difficult for me to properly segment my unified state machine.

> > That sounds pretty interesting, actually! How does this square with
> > your vision of skiplists in action?
>
> That's exactly how it's supposed to work. The lowest level would be a
> list of "atoms", i.e. literals (pointing to the original source), and
> backslashed data. E.g. "foob\x61r" is {LITERAL foob} {ESCAPE a}
> {LITERAL r}, or {LITERAL 0 4} {ESCAPE a} {LITERAL 8 1} when using
> index and length.

For performance reasons, I would have merged the whole thing into a single syllable. But that optimization, while important for the common case, will make it more difficult to use the parser's output as input to an analyzer. I'm not sure it's worthwhile to sacrifice performance to make this uncommon case easier; I'd rather have the parser take an option to determine which output mode to use. Another possibility is to have a "low-level" parser which includes an analysis of each character in its output; a higher level wrapper will then collect this into the target format.
If the target is the execution engine, backslashed syllables will be merged, and no-longer-necessary control characters (spaces, brackets, etc.) will be dropped. The downside is that this approach will segment some character sequences that otherwise would not be segmented. Example:

    set_a_(1_2_[foo_3_4]_$bar::baz(5){6})

would be segmented to:

    set _ a _ ( 1 _ 2 _ [ foo _ 3 _ 4 ] _ $ bar::baz ( 5 ) { 6 } )

instead of:

    set _ a _( 1 _ 2 _ [ foo _ 3 _ 4 ] _ $ bar::baz ( 5 ){ 6 })

But I don't think this is worth worrying about, as it will only slow down the rare case of concatenating the script back together again. When targeted at the execution engine, the parser wrapper will drop the spuriously segmented character sequences anyway, resulting in:

    set _ a _ ( 1 _ 2 _ [ foo _ 3 _ 4 ] _ $ bar::baz ( 5 ) { 6 } )
    set a 1 _ 2 _ foo 3 4 _ bar::baz 5 6

While on the subject of performance, I have some thoughts about making ropes faster for scripts, which have lots of single-character segments.

    struct rope {
        int len;
        char *text;                    /* Payload. */
        struct rope *child, *sibling;  /* Tree structure. */
    };

This is the same structure as before, but now the text pointer is allowed to be non-NULL for internal nodes when the child leaf nodes all point to strings that are contiguous in memory. This makes it unnecessary to descend all the way to the leaf nodes to get the concatenated string back.

When a script is read into memory, it is stored in a rope. If the script is stored into a single malloc()'ed region, the rope initially will consist of only a leaf node, so its child and sibling pointers will be NULL. If the script is divided across several discontiguous buffers, the rope will have two levels, and the text pointer of the root will have to be NULL since the memory isn't contiguous. During parsing/analysis, the rope will be subdivided at syllable boundaries.
This process does not actually change the rope, only break it up so that it is possible to create a parse tree (or whatever) that points at substrings. One level will be added to the rope; all the new segments will be siblings, except in the case of the original rope being split across multiple buffers. All the existing text pointers will remain valid; none of them are changed to NULL. In the case of the script having been read into a single buffer, concatenation is trivial; just use the text pointer of the root node. In the case of the script having been read into multiple buffers, concatenate the strings indicated by the text pointers of the root node's immediate children. This is exactly the same as before subdivision. What happens if a single syllable spans two buffers? One possibility is to reshuffle the tree so that the root node has three children. The first and third children are the first and second buffers, minus the split syllable. The second child is the split syllable, its text pointer is NULL, and it has two children. The two children are the intersections of the split syllable against the first and second buffers. Hmm, there's a problem with that. Changing the first and third children of the root node will screw up anything that is using them. Reference counting can fix this, though. If the children are not shared, they can simply be changed. If they are shared, they are left untouched, and the root node is changed to point at new child nodes. (I can't imagine why anything but the parser would be using the child nodes at this point, but nevertheless this case should be considered.) I've mentioned this before, but the rope is further subdivided when any syllables are treated as scripts or lists or anything else that is a conglomerate of strings. These subsequent subdivisions form additional levels in the tree. No text pointer invalidation will take place, so the performance of concatenation will not suffer. 
> But syllables, atomic or not, could also be dynamic (e.g. functions).

Functions? Could you give an example of where this will help? Are you
thinking of references and delayed expansion?

> With a rope structure we could build strings made of chunks of fixed-width
> encodings, with O(1) char access within each chunk. Using ropes of
> fixed-width chars would actually reduce the number of iteration steps in
> string access and IMHO greatly improve string handling performance.

If I read you right, you are suggesting further splitting up the rope at the
boundaries between one-byte, two-byte, and whatever-byte characters. I guess
another field could be added to the rope to indicate the character type it
contains: one-byte, two-byte, etc., or a mixture. Leaf nodes would be
constrained not to be a mixture.

Would len be the length in bytes or characters? Length in characters would be
more useful, and length in bytes could be determined by multiplying the
lengths by the character widths.

To find a given character, if the node contains the desired character but has
a mixture of character widths, descend to the child. Then, using the len
field to keep track of position, advance through the linked list of siblings
until the right node is found. Finally, multiply the desired index (minus the
starting index of the node, which is the sum of the lengths of the siblings
already traversed) by the character width of the node, and add that value to
the node's text pointer.

A long syllable that alternates between one-byte and two-byte characters will
be the worst case for this scheme. To improve this worst case, the tree
should be kept fairly balanced so as to avoid having one very long linked
list of siblings. So the tree structure is determined not only by subdivision
from parsing but also by a balancing algorithm. (I'm not sure "balancing" is
really the right word here, since that implies rotating nodes around, which I
don't think we want.)
> The only reason not to use ropes is C runtime compatibility, which > requires 8-bit, NUL-terminated flat strings. How important is C runtime compatibility? The libc string functions don't take ropes anyway, so we already have to write our own suite of rope-enabled string functions. > Of course you can trade memory for performance by using Unicode byte > arrays, but in this case you discard the internal rep and get shimmering. Isn't this what the Tcl regexp code does internally? Bringing regexp support into Cloverfield will certainly be interesting. :^) > > You wish to unify ropes, parse trees, and Cloverfield objects, correct? > > I'm not certain the unification is necessary, so long as these different > > structures are capable of pointing to each other, similar to > > containment. > > In Tcl, everything is a string and can be represented as a rope, which is > conceptually a tree, however most values are also lists or trees, so why > not try to unify both? I suspect it might be prohibitive to always encode the internal (non-string) representation directly into the structure of the rope. For example, this interferes with having the parser directly perform backslash substitutions. Or what about dictionaries? Where would the hash table go? How about the hybrid parse tree-skiplist I described in my previous email? I suggest modifying it to not point to ropes but instead to objects. An object points to a rope, an internal representation (or interface; you'll have to explain that to me), or both. The entire script is represented as an object pointing to (1) a rope containing the text of the script, and (2) the parse tree as the internal representation. > But what happens if we "help" the promotion algorithm so that it chooses > the "right" leaves? ;-) With my approach, the structure of the rope will be determined both by the semantics of the script and by a balancing algorithm. 
However, I won't require that it be possible to determine the semantics of
the script given the structure of the rope.

> OK, so the parser must recognize word starts properly. As I said above,
> this is harder to do *robustly* and *consistently* than I thought. As
> braces and parentheses may occur at the middle of a word, the parser must
> know whether they obey rules [5] and [6].

Rule [6] doesn't have any impact on finding matching braces for rule [5].

> But then you have to recognize variable substitutions,

Yeah, $" and $( need to be matched with closing " and ). Tricky! But
possible.

proc x } {return $}}    # not valid
proc x } {return $\}}   # should already work, but... *
proc x } {return $"}"}  # $"..." support needs to be added
proc x } {return ${}}}  # not valid
proc x } {return $(})}  # $(...) support needs to be added

Yes, x is a proc that takes one argument called "}", and it returns it.

(*) but the variable will be named "", and a close brace will be appended to
its value. Or not; Tcl treats "$\x" (for any x) like "\$\x".

> and semicolon handling is a big question mark.

In my mind, the meaning of semicolon is well defined. Semicolons are command
separators, the same as newlines. Command separators are also word
separators. Therefore, the first nonspace character following a semicolon is
the first character of a word. Think of the ;# idiom in Tcl.

I already support semicolons, since they're in $wordsep in my parser. When I
put "{1 2;#3}\n}" (which is two lines; replace \n with a newline) in my
parser, it yields a single literal, because it doesn't count the first close
brace toward the brace count. Delete the semicolon, and two literals are
reported, because the # is no longer the first character of a word, so it is
not the start of a comment, and the first close brace is counted. The second
literal is the final close brace character.

> Moreover, Tcl and the current Cloverfield spec recognize open braces
> everywhere, even when they wouldn't at toplevel.
When they wouldn't what? Wouldn't matter?

> For example:
> set a{ 1              # Works.
> puts $a{              # Fails.
> if {true} {set a{ 1}  # Fails.
> But this works differently with parentheses:
> set a( 1              # Works.
> puts $a(              # Fails.
> if {true} {set a( 1}  # Works.
> Or with quotes:
> set "a{" 1              # Works.
> puts $"a{"              # Works.
> if {true} {set "a{" 1}  # Works.
> That's exactly the kind of inconsistency I want to suppress.

I don't see this as a problem. We want to make it possible to do weird stuff
like put { in a variable name, even though { is also used for brace counting.
That's what we have quoting for. There are many options:

if {true} {set "a{" 1}
if {true} {set a\{ 1}
if {true} "set a{ 1"
if {true} (set a{ 1)
if {true} set\ a{\ 1

Directly writing "set a{ 1" is only an option at toplevel because no brace
counting is taking place.

> To do so, I proposed above that comments and {data} be recognized
> everywhere, as with braces.

I don't see where you said that. A couple emails ago you proposed a change to
{data} that would allow it to be used inline. You also corrected inline
comments so as to not clash with [5].

> But it's far from satisfying, as the new rules would work the same as the
> existing ones that are inconsistent.

Again, I don't see your example as inconsistent, at least not in a way that
matters or is solvable.

> A last alternative would be to allow braces and parentheses everywhere in
> the word, and not only at the beginning. So they would always be balanced.

This is a huge change, and I believe it to be of questionable merit. It would
make Cloverfield more like the Bourne shell, which supports bizarre things
like 'Andy said, "My dog'"'"'s fur is white and brown."' .

> Besides, this has the nice side effect of unifying $ and set:
> % set a{1 2} 3
> % puts $a{1 2}
> 3

Not really. set's first argument would be "a1 2", which is not what we want.
We need for it to get the braces. Oh wait, you address this below.
My solution to this problem (and some related problems) is for variable names
to always be given in the form of a substitution or a reference, never a bare
literal. Also, since this would make variable references much more common, I
would shorten the syntax from $& to @. Lastly, in your "3 Feb 2008 22:24:01
+0100" email you said that references only work as references if they're the
only syllable in the word, so I'll only have @ be recognized as a variable
reference if it's the first character of a word. To the parser, it would be
similar to a quoting style. (If you don't like @, we can pick something else
that's unlikely to appear at the start of a word.)

% set @a{1 2} 3
% puts $a{1 2}

> So if an open brace or parenthesis occurs at the beginning of a word, then
> the word is totally enclosed between the pair, which are not retained as
> part of the word. If they occur in the middle of a word, then they must be
> balanced and are retained as part of the word (and don't terminate it).

How does this play with word modifier syntax? Are you disallowing having only
the first part of the word being quoted with braces or parentheses? I expect
this will be a great source of confusion; users will have to remember when
the quotes are kept in the word and when they're dropped.

--
Andy Goth | <unu...@ai...> | http://andy.junkdrome.org/
|
From: Frédéric B. <fb...@en...> - 2008-03-19 17:23:55
|
Andy Goth wrote:
> On Fri, 10 Mar 2008 17:23:06 +0100, Frédéric Bonnet wrote
>> Finally I have my 'net access at home, so things should go faster by now.
>
> That's good news. :^)
>
>> > I'm considering refactoring it [so that] each state and quoting style is
>> > handled by a separate proc.
>>
>> The global state machine model is interesting at first but given the
>> peculiarities of Tcl it makes things less readable.
>
> I'm not sure Tcl is responsible for the readability problem.

Well, I didn't mean that; I should have said "specificities" instead of
"peculiarities". English is not my native language and I didn't realize the
negative connotation of "peculiarities" vs. "particularités" :*) What I meant
was that Tcl rules are not global, contrary to other languages such as C, so
one needs a dedicated parser for each quoting rule.

> Any complex
> state machine can get that way, unless it's segmented like I propose. The
> trick is finding the right way to slice it. Should each combination of
> quoting* be handled by a separate proc? No, that would result in too many
> procs, making it hard to find the right code. Should everything be handled
> by a single proc? No, that would give one huge (possibly nested) switch
> statement, again making it hard to find the right code. The correct path
> lies somewhere in the middle. Probably. :^)

I agree, and that's the way I've written my parser. Each quoting rule gets
its own parser, but they can call each other.

>> I've written parsing code for most rules except braces (rule [5]).
>
> I believe the code I have on the wiki completely handles rule [5], except
> for the bit about [11.4]. It correctly handles the above "combination of
> quoting" example, except for the outer two parentheses, which aren't
> supported yet. Comments (and backslashes inside comments) are handled as
> well, for the purpose of brace counting. I encourage you to play with it.
> PUT MY CODE IN YOUR TCLSH!
I've run your code on a bunch of test scripts and it works correctly. However it doesn't handle variables for now, and that's where the real fun begins ;-) More on rule [5] below. >> Back to my parser. As I said, I try to use skiplists instead of simple >> trees. Parsers usually generate the AST in the form of a tree (ie >> recursive lists). Skiplists use multi-level flat lists: >> >> - each level collates one or several elements of the below level. >> - the bottom-most level is the linear list of all substrings whose >> concatenation gives back the whole string, so iteration is trivial. > > Iteration is trivial? Could you please give me an example of where and how > this will actually take place? For instance, during serialization. Just concatenate all level-0 syllables and you get the string rep. And if the interpreter handles ropes natively, concatenation need not be physically performed (ie no need to allocate a new string). > I'll try to come up with one of my own, but > I'd like to know your views as well. But first let me describe a tree. > > I think a tree is a good fit for the actual workload of the execution > engine. At the top level is a list of commands over which it iterates. For > each command, there is a list of words. For each word there is a list of > syllables. Some syllables can be command substitutions, so the tree depth > is unbounded (at least in theory). To construct the arguments to the > command, the execution engine iterates over the list of words. Each word's > substitutions and modifiers are applied, eventually yielding the list of > arguments. Recursion may occur. This is a fairly traditional approach. > > With a skiplist, I suppose the execution engine can operate upside down, > starting with the lowest level and working upwards, collecting arguments as > it goes. 
> The lowest level would be a list of syllables not requiring
> substitution, which aren't directly useful without knowing how they're
> combined into words, so maybe it would start at the second lowest level.
> (Or start at the lowest level anyway so that the parse tree skiplist
> doubles as a rope over the original string.) Basically the substitution
> specifiers, modifiers, and implied concatenation all serve as operators,
> and the lowest level would be the literal operands.
>
> That sounds pretty interesting, actually! How does this square with your
> vision of skiplists in action?

That's exactly how it's supposed to work. The lowest level would be a list of
"atoms", i.e. literals (pointing to the original source), and backslashed
data. E.g. "foob\x61r" is {LITERAL foob} {ESCAPE a} {LITERAL r}, or {LITERAL
0 4} {ESCAPE a} {LITERAL 8 1} when using index and length. But syllables,
atomic or not, could also be dynamic (e.g. functions).

Adding several levels builds an implicit tree structure. I say implicit
because there is no direct relation between a parent and its children outside
of the structure as a whole.

BTW don't pay too much attention to my current implementation of skiplists as
it totally sucks (Tcl is simply not suited for this kind of data structure).
However it's fine for validation purposes. In C, I would use the same kind of
structure as you gave:

> struct rope {
>     int len;              /* strlen(text) or sum of child lens, as appropriate. */
>     char* text;           /* pointer to first character, or NULL if not leaf. */
>     struct rope* child;   /* pointer to leftmost child, or NULL if leaf. */
>     struct rope* sibling; /* NULL if there is no right-hand sibling. */
> }
>
> Hmm, from looking at your sample skiplist, it seems I'm backwards from you.
> Your lowest level may be a list of all syllables, but your second lowest
> level is (more or less) a list of all arguments to the command, which is
> what I would call the highest level.
> (Or third highest; my highest level is the entire script, followed by a
> list of all commands.)

For implementation purposes, level 0 is the lowest level, i.e. it should be
level(+INF). So level 1 is the highest level.

> What does "AST" stand for?

Abstract Syntax Tree, i.e. the result of the parsing.

> in the case of backslash substitutions, the parse tree points to ropes
> that aren't even present in the original script. Backslash substitutions
> are performed by the parser itself, and the syllables containing them are
> replaced with new ropes allocated on the heap. The new ropes have all
> backslash substitutions already done.

Exactly.

> At least, this is possible in a traditional tree. Skiplists present some
> problems, since the data after parsing is a patchwork of the original
> script and new strings derived from the original. This isn't a problem
> that can't be solved, but it does make me want to consider an alternative
> which provides benefits of both trees and skiplists.

That's where the actual implementation becomes critical. Conceptually, each
lowest-level syllable would be a separate chunk, but for performance
consecutive substrings should be kept in one single structure. So the
lowest-level syllables wouldn't be as fragmented.

In a sense this is no different than the current Tcl string handling: as the
native encoding is UTF-8, which is a variable-width encoding, you get O(n)
char access (*). So conceptually Tcl strings are lists of characters. With a
rope structure we could build strings made of chunks of fixed-width
encodings, with O(1) char access within each chunk. Using ropes of
fixed-width chars would actually reduce the number of iteration steps in
string access and IMHO greatly improve string handling performance.

The only reason not to use ropes is C runtime compatibility, which requires
8-bit, NUL-terminated flat strings.
(*) Of course you can trade memory for performance by using Unicode byte
arrays, but in this case you discard the internal rep and get shimmering.

For example, "abcdéfghi" is a 9-char string which needs 10 bytes in Tcl using
UTF-8 but 18 in Unicode wide chars. Indexing the last character in UTF-8
takes 10 steps, one for each byte. A rope would be a concatenation of the
ASCII string "abcd", the Unicode wide string "é", and the ASCII string
"fghi". Indexing the last character only takes 3 steps to locate the right
substring.

> You wish to unify ropes, parse trees, and Cloverfield objects, correct?
> I'm not certain the unification is necessary, so long as these different
> structures are capable of pointing to each other, similar to containment.

That's why I'm only experimenting for now. But I got this idea of unification
when I began studying rope data structures. In Tcl, everything is a string
and can be represented as a rope, which is conceptually a tree, however most
values are also lists or trees, so why not try to unify both?

Ropes are typically implemented using balanced trees, for performance
reasons; most implementations use B-trees, however the balancing breaks the
structure and there is no way to coerce this mechanism so that it respects a
given structure. Skiplist-based ropes don't use trees but several levels of
linked lists (but there is a bijection between a tree and a skiplist) with
vertical links between them: some leaves are randomly "promoted" to higher
levels and thus the structure is statistically balanced. But what happens if
we "help" the promotion algorithm so that it chooses the "right" leaves? ;-)
After all there is no strict constraint on the resulting structure of the
skiplist, so there is no difference between a randomized skiplist and a
directed one, other than in the way they are balanced.
On the other hand, we have Tcl where strings are internally converted to lists, holding strings that are themselves substrings of the original string. Moreover, since Cloverfield adds expansion of dynamic data (references, delays...), some list elements are themselves lists that must be expanded inline. So lists are implemented as trees that must be conceptually flattened. Now replace list with string and list element with character (or substring) and you get a rope! In conclusion: every data structure can be defined as a list, every list can be represented as a string, lists and strings can be implemented as trees, so why not use a single internal data structure? Of course only time will tell whether this approach is viable. > I see from your > choice of word type names that the structs and enums I already wrote have > been helpful, although I would have said CMDSUB instead of COMMAND. Of > course, either name will do. I chose CMDSUB in my parser ;-) >> #{This is a block comment}# > > I like this best, since it's easier to see where the comment ends. Is it > alright with you that this breaks comments that begin with an open brace? Yes. Users will have to add a space in between. A small price to pay IMHO. > >> Regarding rule [5], I think that hash characters should be recognized >> everywhere in braced sequences, and not exclusively at word start as with >> the current rule [10]. The reason is that brace matching becomes overly >> complicated otherwise, as the parser has to identify word starts within >> braced blocks. > > I disagree. I like rule [5] the way it is. And I have successfully > implemented it, so it can't be too complicated. > >> This is harder to do robustly than I first thought, given that the >> sequence doesn't necessarily follow the regular parsing rules (typically, >> expr-like syntax doesn't use the same word delimiters). 
> > I don't think it's worth supporting the following (replace \n with
> > newline):
> >
> > expr {5+# a full line comment\n6}
> > expr {5+#{an inline comment}#6}
> >
> > It's simple enough to require the user to do:
> >
> > expr {5+ # a full line comment\n6}
> > expr {5+ #{an inline comment}# 6}

OK, so the parser must recognize word starts properly. As I said above, this
is harder to do *robustly* and *consistently* than I thought. As braces and
parentheses may occur at the middle of a word, the parser must know whether
they obey rules [5] and [6]. But then you have to recognize variable
substitutions, and semicolon handling is a big question mark.

Moreover, Tcl and the current Cloverfield spec recognize open braces
everywhere, even when they wouldn't at toplevel. For example:

# Works.
set a{ 1
# Fails.
puts $a{
# Fails.
if {true} {
    set a{ 1
}

But this works differently with parentheses:

# Works.
set a( 1
# Fails.
puts $a(
# Works.
if {true} {
    set a( 1
}

Or with quotes:

# Works.
set "a{" 1
# Works.
puts $"a{"
# Works.
if {true} {
    set "a{" 1
}

That's exactly the kind of inconsistency I want to suppress. To do so, I
proposed above that comments and {data} be recognized everywhere, as with
braces. But it's far from satisfying, as the new rules would work the same as
the existing ones that are inconsistent.

An alternative would be to parse braced data as if it were a regular script,
but this totally breaks the Tcl philosophy (and it doesn't play nice with
expr or if).

A last alternative would be to allow braces and parentheses everywhere in the
word, and not only at the beginning. So they would always be balanced.
Besides, this has the nice side effect of unifying $ and set:

% set a{1 2} 3
% puts $a{1 2}
3

So if an open brace or parenthesis occurs at the beginning of a word, then
the word is totally enclosed between the pair, which are not retained as part
of the word.
If they occur in the middle of a word, then they must be balanced and are
retained as part of the word (and don't terminate it). So braces and
parentheses would work in a similar way to brackets (except that brackets are
not retained as part of the word). For example:

% set foo "bar baz"
bar baz
% set a {$foo 3}
$foo 3
% set a 12{$foo 3}45
12{$foo 3}45
% set a ($foo 3)
"bar baz" 3
% set a 12($foo 3)45
12("bar baz" 3)45

So:

- easier to parse and match
- more readable IMHO
- no more inconsistency between $ and set
- word starts are easier to recognize, so comments in braces are easier to
  parse

>> As a sidenote I'd also like to extend the {data} modifier syntax in a
>> similar way, e.g.:
>>
>> # New forms allow delimiters around the sequence.
>> set var {data}TAG{foo bar baz}TAG
>> set var {data}TAG"foo bar baz"TAG
>>
>> That way raw data need not span multiple lines but can be inline. Useful
>> for regexps.
>
> An inline heredoc. Looks cool. But few regexps will benefit, since most
> have matching braces (usually zero). Still I say: go for it.

To be more consistent with the above proposal, I also thought about dropping
heredoc support as the {data} word modifier and instead using a prefix on
quotes, e.g.:

set var TAG"foo bar baz"TAG

(tag being [0-9a-zA-Z_]*) But I'm not quite sure about that, as the original
{data} is more powerful. So why not add the above as an additional syntax?

> On Fri, 14 Mar 2008 18:08:04 +0100, Frédéric Bonnet wrote
>> My parser is close to completion! You can find it attached.
>
> How do I make it work? I tried [test $script 0 script] but only got
> "unexpected token UNKNOWN" from script::parse line 12.

Hmm, strange, I don't get the same error. The script parser needs an extra
argument "SCRIPT" (alternatives are LIST and CMDSUB):

test $script 0 script SCRIPT
|
From: Andy G. <unu...@ai...> - 2008-03-15 19:43:57
|
On Fri, 10 Mar 2008 17:23:06 +0100, Frédéric Bonnet wrote
> Finally I have my 'net access at home, so things should go faster by now.

That's good news. :^)

> > I'm considering refactoring it [so that] each state and quoting style is
> > handled by a separate proc.
>
> The global state machine model is interesting at first but given the
> peculiarities of Tcl it makes things less readable.

I'm not sure Tcl is responsible for the readability problem. Any complex
state machine can get that way, unless it's segmented like I propose. The
trick is finding the right way to slice it. Should each combination of
quoting* be handled by a separate proc? No, that would result in too many
procs, making it hard to find the right code. Should everything be handled by
a single proc? No, that would give one huge (possibly nested) switch
statement, again making it hard to find the right code. The correct path lies
somewhere in the middle. Probably. :^)

(*) Combination of quoting. I'll just give an example: ({"\"})"}) . This is a
single-word list. (Is there a difference between a single-word list and that
single word? Does the answer depend on the presence or absence of whitespace
within the word?)

> I've written parsing code for most rules except braces (rule [5]).

I believe the code I have on the wiki completely handles rule [5], except for
the bit about [11.4]. It correctly handles the above "combination of quoting"
example, except for the outer two parentheses, which aren't supported yet.
Comments (and backslashes inside comments) are handled as well, for the
purpose of brace counting. I encourage you to play with it. PUT MY CODE IN
YOUR TCLSH!

> For the latter, I'm trying to find simple rules so as not to parse the
> braced expression as a recursive script.

Feel free to examine and borrow from my code. If you dare. :^) My state
handlers are all intricately woven together, I'm afraid.
Pulling apart my state machine may be as difficult as pulling apart an atom, and it might yield the same result. > Back to my parser. As I said, I try to use skiplists instead of simple > trees. Parsers usually generate the AST in the form of a tree (ie > recursive lists). Skiplists use multi-level flat lists: > > - each level collates one or several elements of the below level. > - the bottom-most level is the linear list of all substrings whose > concatenation gives back the whole string, so iteration is trivial. Iteration is trivial? Could you please give me an example of where and how this will actually take place? I'll try to come up with one of my own, but I'd like to know your views as well. But first let me describe a tree. I think a tree is a good fit for the actual workload of the execution engine. At the top level is a list of commands over which it iterates. For each command, there is a list of words. For each word there is a list of syllables. Some syllables can be command substitutions, so the tree depth is unbounded (at least in theory). To construct the arguments to the command, the execution engine iterates over the list of words. Each word's substitutions and modifiers are applied, eventually yielding the list of arguments. Recursion may occur. This is a fairly traditional approach. With a skiplist, I suppose the execution engine can operate upside down, starting with the lowest level and working upwards, collecting arguments as it goes. The lowest level would be a list of syllables not requiring substitution, which aren't directly useful without knowing how they're combined into words, so maybe it would start at the second lowest level. (Or start at the lowest level anyway so that the parse tree skiplist doubles as a rope over the original string.) Basically the substitution specifiers, modifiers, and implied concatenation all serve as operators, and the lowest level would be the literal operands. That sounds pretty interesting, actually! 
How does this square with your vision of skiplists in action?

Hmm, from looking at your sample skiplist, it seems I'm backwards from you.
Your lowest level may be a list of all syllables, but your second lowest
level is (more or less) a list of all arguments to the command, which is what
I would call the highest level. (Or third highest; my highest level is the
entire script, followed by a list of all commands.)

> set a (1 2 [foo 3 4] $bar::baz(5){6})

Here's how I'd handle this using trees and ropes. First, the script in its
entirety is read into a simple character array, which is pointed to by a data
structure with at least these fields:

struct rope {
    int len;              /* strlen(text) or sum of child lens, as appropriate. */
    char* text;           /* pointer to first character, or NULL if not leaf. */
    struct rope* child;   /* pointer to leftmost child, or NULL if leaf. */
    struct rope* sibling; /* NULL if there is no right-hand sibling. */
}

Before any analysis*, the whole script is just a single string. After
tokenization is complete, the script has been subdivided into a bunch of
substrings linked to each other in a list. The rope node originally pointing
to the string will now point to a new rope node which acts as the head of the
linked list. Anything that happened to be pointing to the original rope node
still has access to the string, but it has to do a little bit more work to
get it all back together.

(*) Analysis is a really good word for this process. In Greek it means
"breaking up", as in dividing something complex into simpler constituents.
Antonym: synthesis.

The substrings are (or could be):

set _ a _( 1 _ 2 _ [ foo _ 3 _ 4 ] _ $ bar::baz ( 5 ){ 6 })

(For clarity, I replace spaces with underscores.) A separate data structure
is created that describes the semantics of the string when interpreted as a
Cloverfield script. The data structure points back into the rope nodes. The
data structure looks much like the tree you give in your email. (What does
"AST" stand for?)
With two important exceptions, it does not point to any of the whitespace characters, parentheses, braces, brackets, or dollar signs. The exceptions are the three spaces in the parenthesized list, since they are part of the list's string representation. They are marked as type SPACE (or DELIM, which I like better), instead of LITERAL. Also I group _( ){ and }) together because there's no need for them to be separate. In each of these three cases, no two parse tree nodes point to the separate characters, so no rope subdivision would have taken place. In fact, no parse tree nodes point to them at all, but that's beside the point.

The second line below shows which substrings are actually pointed to by the parse tree:

set _ a _( 1 _ 2 _ [ foo _ 3 _ 4 ] _ $ bar::baz ( 5 ){ 6 })
set a 1 _ 2 _ foo 3 4 _ bar::baz 5 6

And in the case of backslash substitutions, the parse tree points to ropes that aren't even present in the original script. Backslash substitutions are performed by the parser itself, and the syllables containing them are replaced with new ropes allocated on the heap. The new ropes have all backslash substitutions already done. It's also possible to have each backslash substitution form a new syllable, which will be substituted by the execution engine instead of the parser. But I'd rather have the parser itself canonicalize things as much as is practical, since parsing happens at most once and execution happens any number of times.

At least, this is possible in a traditional tree. Skiplists present some problems, since the data after parsing is a patchwork of the original script and new strings derived from the original. This isn't a problem that can't be solved, but it does make me want to consider an alternative which provides the benefits of both trees and skiplists. How about a hybrid approach, something that attacks the problems from two directions at once? Start with a list of all commands in the script. 
Each element of this list is a list of all literals in the command. Each literal is annotated with information on how it is combined with adjacent literals to form words.

> set a (1 2 [foo 3 4] $bar::baz(5){6})

Let me try to reprocess this using the hybrid structure I describe. There's only one command in the script, so I'll skip ahead to the list of literals in the command. Spaces are replaced with underscores.

set a 1 _ 2 _ foo 3 4 _ bar::baz 5 6

Now I individually examine each element.

0. "set" Single-syllable word. Maybe has the FLATTEN modifier, unless we decide that this operation is better done by the name resolver than the execution engine.
1. "a" Single-syllable word.
2. "1" Beginning of eleven-syllable parenthesized list. Single-syllable list subword.
3. "_" List delimiter. (These are significant within parenthesized lists and must be retained. Quoting characters must also be kept, even though they will also be treated as list delimiters. Adjacent list delimiter characters are grouped into a single syllable, except they are split across the boundaries of a nested list.)
4. "2" Single-syllable list subword.
5. "_" List delimiter.
6. "foo" Beginning of three-syllable list subword. Beginning of script substitution consisting of three syllables. Beginning of command consisting of three syllables. Maybe has the FLATTEN modifier.
7. "3" Single-syllable word.
8. "4" Single-syllable word. (This is the last word of the command, the script substitution, and the list subword. But this information doesn't need to be stored since it's implied by the lengths given in the "foo" syllable.)
9. "_" List delimiter.
10. "bar::baz" Beginning of three-syllable variable substitution. Single-syllable variable name.
11. "5" Single-syllable keyed index.
12. "6" Single-syllable vectored index. (End of variable substitution and parenthesized list.)

The execution engine needs a stack (not necessarily recursive invocation of itself) to process this kind of data structure. 
This stack keeps track of the engine's position in nested parenthesized lists, script substitutions, variable substitutions, and multi-syllable words, so that it knows when each ends. Alternately, rather than tagging the beginning syllables of each with the number of syllables in the construct, this information can be put in the *ending* syllables. When the engine encounters the final syllable of any of these constructs, it can pop a number of elements from its already-completed syllable list, combine them as appropriate, and push the result as a single element. (How's that for backwards!?) Each syllable structure needs to be able to support being the beginning (or ending, with my alternate design) of any number of grouped constructs. So there's a variadic list inside the syllable. Cloverfield objects (like Tcl_Objs) should be cached somewhere inside the parse "tree". The single-syllable words can easily be replaced with links to objects containing the single-syllable words. The parenthesized list can't be a cached object because it potentially has a different value every time the command is executed. Unless... It could be a cached object if all the constituent substitutions are given an implicit DELAY modifier. No, that's still not right, since the implied DELAY modifier must not be passed to "set" when the command is actually executed. A new, internal modifier perhaps? Or rather, support storing immediate substitution in an object; when the execution engine encounters an object with immediate substitution, it makes a copy of the object with all immediate substitutions performed, and it passes that copy to the command as an argument. However, if we go down that route, we're back to the traditional parse tree. 
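The "backwards" ending-marker design is essentially postfix evaluation. Here is a toy C sketch, with made-up token and operator names, showing how a flat stream of literals plus end-of-construct markers can be reduced with an explicit stack rather than recursion:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative only: literals are pushed as they stream by, and an
 * end-of-construct marker pops the last 'count' results and joins them. */
enum op { PUSH, JOIN };

struct tok {
    enum op op;
    const char *text;  /* for PUSH */
    int count;         /* for JOIN: how many stack entries to combine */
    const char *sep;   /* for JOIN: "" for word concat, " " for a list */
};

static char *dupstr(const char *s) {
    char *d = malloc(strlen(s) + 1);
    strcpy(d, s);
    return d;
}

char *eval(const struct tok *toks, int ntoks) {
    char *stack[64];
    int sp = 0;
    for (int i = 0; i < ntoks; i++) {
        if (toks[i].op == PUSH) {
            stack[sp++] = dupstr(toks[i].text);
        } else {  /* JOIN: combine the top 'count' entries into one */
            int n = toks[i].count;
            size_t len = (size_t)(n - 1) * strlen(toks[i].sep);
            for (int j = sp - n; j < sp; j++)
                len += strlen(stack[j]);
            char *s = malloc(len + 1);
            s[0] = '\0';
            for (int j = sp - n; j < sp; j++) {
                if (j > sp - n)
                    strcat(s, toks[i].sep);
                strcat(s, stack[j]);
                free(stack[j]);
            }
            sp -= n;
            stack[sp++] = s;
        }
    }
    return stack[0];
}
```

A word's syllables would JOIN with an empty separator, a parenthesized list with its delimiters; script and variable substitutions would be further JOIN-like operators. This is a sketch under those assumptions, not a claim about how the real engine must look.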
> 0: set a ( 1 2 [ foo 3 4 ] $ bar::baz ( 5 ) { 6 } )
> 1: *** * <---------------------------------------->
> 2: * * <---------> <-------------------->
> 3: *** * * <------> <---> <--->
> 4: ******** <---> <--->
> 5: * *
>
> which is structurally identical to the above tree. The difference is
> that the skiplist stores string and structural information in the
> same place, in such a way that replacing a given substring is a
> simple operation. So a basic interpreter only has to perform string
> operations on the source script to get the resulting string.

You wish to unify ropes, parse trees, and Cloverfield objects, correct? I'm not certain the unification is necessary, so long as these different structures are capable of pointing to each other, similar to containment. Let's take the script [set a \n]. A Cloverfield object for this would point to the rope {set _ a _ \n} and the parse tree {{set} {a} {newline}}, where newline is an actual newline character. I don't see the inadequacy here, the need for more unification.

> Of course this is very experimental, and the inclusion of such a
> structure into production code will depend on its overall performance.

Even though I may be coding in Tcl, I am simultaneously contemplating a C implementation. I might even write some more structs. (I see from your choice of word type names that the structs and enums I already wrote have been helpful, although I would have said CMDSUB instead of COMMAND. Of course, either name will do.)

On Thu, 13 Mar 2008 15:54:43 +0100, Frédéric Bonnet wrote

> I'm not very fond of [11.2], as it clashes with rule [5],

You're right. I hadn't thought of that. This must be fixed.

> but I still want a block (or word) comment syntax. 
> So I'd like to merge both forms into one simple syntax:
>
> # This is a line comment
> #{This is a
> block comment}
>
> Alternatively, the comment could end with the symmetric }# sequence:
>
> #{This is a block comment}#

I like this best, since it's easier to see where the comment ends. Is it alright with you that this breaks comments that begin with an open brace?

Another solution is to replace {#}{...} with {comment}{...}. I don't like this at all, but it's possible. The word "comment" can be replaced with anything other than #... the one character that makes sense. Shrug!

> Regarding rule [5], I think that hash characters should be recognized
> everywhere in braced sequences, and not exclusively at word start as with
> the current rule [10]. The reason is that brace matching becomes overly
> complicated otherwise, as the parser has to identify word starts within
> braced blocks.

I disagree. I like rule [5] the way it is. And I have successfully implemented it, so it can't be too complicated.

> This is harder to do robustly than I first thought, given that the
> sequence doesn't necessarily follow the regular parsing rules (typically,
> expr-like syntax doesn't use the same word delimiters).

I don't think it's worth supporting the following (replace \n with newline):

expr {5+# a full line comment\n6}
expr {5+#{an inline comment}#6}

It's simple enough to require the user to do:

expr {5+ # a full line comment\n6}
expr {5+ #{an inline comment}# 6}

> As a sidenote I'd also like to extend the {data} modifier syntax in a
> similar way, e.g.:
>
> # New forms allow delimiters around the sequence.
> set var {data}TAG{foo bar baz}TAG
> set var {data}TAG"foo bar baz"TAG
>
> That way raw data need not span multiple lines but can be inline. Useful
> for regexps.

An inline heredoc. Looks cool. But few regexps will benefit, since most have matching braces (usually zero). Still I say: go for it. 
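For what it's worth, the }#-terminated form is also easy to scan. Here is a hedged C sketch (skip_comment is my own helper; it ignores nesting and braced-sequence context) that returns how many characters the comment occupies, or -1 for an unterminated block comment:

```c
#include <assert.h>
#include <string.h>

/* If s starts a comment, return the number of characters it spans:
 * "#{...}#" block comments end at the symmetric "}#"; plain "#" comments
 * run to the newline (or end of string). Returns 0 if s is not a comment,
 * -1 if a block comment is never closed. */
int skip_comment(const char *s) {
    if (s[0] != '#')
        return 0;
    if (s[1] == '{') {
        const char *end = strstr(s + 2, "}#");
        return end ? (int)(end - s) + 2 : -1;
    }
    const char *nl = strchr(s, '\n');
    return nl ? (int)(nl - s) : (int)strlen(s);
}
```

The block form needs only one strstr, which is part of why it's easy to see (for humans and parsers alike) where the comment ends.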
On Fri, 14 Mar 2008 18:08:04 +0100, Frédéric Bonnet wrote > My parser is close to completion! You can find it attached. How do I make it work? I tried [test $script 0 script] but only got "unexpected token UNKNOWN" from script::parse line 12. -- Andy Goth | <unu...@ai...> | http://andy.junkdrome.org/ |
From: Frédéric B. <fb...@en...> - 2008-03-14 17:04:16
|
Andy Goth wrote:
> The last couple of days I've been stealing free moments to compose a reply
> to your two recent emails. I have more work to do before it's ready to
> send. Today I have another long car ride to look forward to, so I might be
> done tonight. (Of course, I'm in the US Central time zone, so my "tonight"
> is probably your "early morning.")

Ok no problem ;-) BTW I've seen the photos of your new "house". It reminds me of my previous flat before remodeling (all done by myself). Hopefully it's sold now, just in time before the market starts to shrink (not as badly as the US market, though, although we also feel the impact of the subprime crisis on this side of the pond). Anyway, good luck with your project!

Back to Cloverfield. My parser is close to completion! You can find it attached. It only lacks the {data} parsing in scripts (which is trivial anyway, but I don't have the time to do it now). Now all it needs is a good test suite.

I tried to make it as structured and readable as possible, possibly at the expense of performance. So no regexps. The goal is to ease experimentation and allow a future port to C.

The generated structure is as described in a previous message. It is a bit obscure for now because it will eventually be implemented with linked lists and shared strings, and I didn't want to implement an object system or a rope structure in Tcl. So for now level 0 syllables only store the start index of each sequence, ending implicitly before the start index of the next sequence. Also, indices for level 1 and above refer to level 0 syllables, not characters (mimicking a pointer in a skiplist). But it would be trivial to turn into something more readable, e.g. ranges or substrings. I'll try to write a bunch of utility procs for debugging.

Cheers, Fred. |
From: Andy G. <unu...@ai...> - 2008-03-14 16:23:18
|
The last couple of days I've been stealing free moments to compose a reply to your two recent emails. I have more work to do before it's ready to send. Today I have another long car ride to look forward to, so I might be done tonight. (Of course, I'm in the US Central time zone, so my "tonight" is probably your "early morning.") -- Andy Goth | <unu...@ai...> | http://andy.junkdrome.org/ |
From: Frédéric B. <fb...@en...> - 2008-03-13 14:50:40
|
I'm considering some changes to the comment syntax and need some advice. At present comments can take two forms:

--------------------------------
[10] Line comments. If a hash character (“#”) appears at a point where <Cloverfield> is expecting the first character of a word, then the hash character and the characters that follow it, up through the next (unescaped) newline, are treated as a comment and ignored.
--------------------------------
[11.2] Word comments. If a word is preceded by the modifier “{#}”, then the word is parsed as any other word, but not substituted. The word is finally ignored and removed from the command being substituted. For instance, “cmd a {#}{b c} d” is equivalent to “cmd a d”.
--------------------------------

I'm not very fond of [11.2], as it clashes with rule [5], but I still want a block (or word) comment syntax. So I'd like to merge both forms into one simple syntax:

# This is a line comment
#{This is a
block comment}

If the hash is immediately followed by a brace then the comment stops at the first matching brace, else it stops at the first newline as with Tcl rule [10]. So this syntax allows both line and block comments.

Alternatively, the comment could end with the symmetric }# sequence:

#{This is a block comment}#

That way comments can embed close braces and inline comments are more readable. And it plays nicely with rule [5].

Regarding rule [5], I think that hash characters should be recognized everywhere in braced sequences, and not exclusively at word start as with the current rule [10]. The reason is that brace matching becomes overly complicated otherwise, as the parser has to identify word starts within braced blocks. This is harder to do robustly than I first thought, given that the sequence doesn't necessarily follow the regular parsing rules (typically, expr-like syntax doesn't use the same word delimiters). So allowing hash characters everywhere should simplify rule [5]. 
Of course this implies that hash chars should be properly escaped, but this is a small price to pay IMHO. (BTW the same remark applies to quotes and the {data} modifier.)

As a sidenote I'd also like to extend the {data} modifier syntax in a similar way, e.g.:

# All the following forms give the same result "foo bar baz".

# Currently allowed form.
set var {data}TAG remainder of line is ignored
foo bar baz
beginning of line is also ignored TAG # but this is not.

# New forms allow delimiters around the sequence.
set var {data}TAG{foo bar baz}TAG
set var {data}TAG"foo bar baz"TAG

That way raw data need not span multiple lines but can be inline. Useful for regexps. |
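Scanning the proposed inline form above also looks cheap: the raw data simply ends at the first closing delimiter followed by the tag. A hedged C sketch (data_end is my own name, and real code would also handle the quote-delimited variant):

```c
#include <assert.h>
#include <string.h>

/* Given the text following {data}TAG{ and the tag itself, return the length
 * of the raw data before the closing "}TAG", or -1 if unterminated. */
int data_end(const char *text, const char *tag) {
    size_t taglen = strlen(tag);
    for (const char *p = text; *p; p++)
        if (*p == '}' && strncmp(p + 1, tag, taglen) == 0)
            return (int)(p - text);
    return -1;
}
```

Because the tag is user-chosen, the data need not escape anything at all; the user just picks a tag that does not occur in the payload, as with shell heredocs.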
From: Frédéric B. <fb...@en...> - 2008-03-10 16:19:23
|
[reposting on tcl9-cloverfield as I forgot to CC my message there]

Andy Goth wrote:
> Yay, some activity! :^) Thanks for updating the Tridekalogue page.

Hi, I was about to send you a mail ;-) Finally I have my 'net access at home, so things should go faster by now.

> Have you looked at the partial parser I wrote? Do you have any suggestions?
> I'm considering refactoring it to be more like Lars Hellström's code, in which
> each state and quoting style is handled by a separate proc. But this means
> nailing down the state variables and keeping them as globals or passing them
> around. It also will result in some code duplication. What do you think?

That's the path I'm following myself. The global state machine model is interesting at first, but given the peculiarities of Tcl it makes things less readable. So I've written my parser with separate procs and tokenizers. The price to pay is some code duplication (and hence larger code), but the readability is greatly improved, and it is more easily modifiable. Given the experimental status of Cloverfield, this is a must. So I think you should follow your instinct and refactor your parser.

Speaking of mine, I've been experimenting with skiplists lately, with interesting results. I've written parsing code for most rules except braces (rule [5]). For the latter, I'm trying to find simple rules so as not to parse the braced expression as a recursive script. Tcl rules were fairly simple (open/close brace counting), and I'm trying to find a similar, small set of character-based rules. This will eventually have an influence on the expression of Cloverfield's rule [5], for the better, as the current rule is a bit vague.

Back to my parser. As I said, I try to use skiplists instead of simple trees. Parsers usually generate the AST in the form of a tree (i.e. recursive lists). Skiplists use multi-level flat lists:

- each level collates one or several elements of the level below. 
- the bottom-most level is the linear list of all substrings whose concatenation gives back the whole string, so iteration is trivial.

Let's consider the following script:

set a (1 2 [foo 3 4] $bar::baz(5){6})

Using a simple tree, the AST is:

{LITERAL set}
{LITERAL a}
{LIST {
    {LITERAL 1}
    {LITERAL 2}
    {COMMAND {
        {LITERAL foo}
        {LITERAL 3}
        {LITERAL 4}
    }}
    {VARSUB {
        {NAME {
            {LITERAL bar::baz}
        }}
        {KEYS {
            {LIST {
                {LITERAL 5}
            }}
        }}
        {INDICES {
            {BLOCK {
                {LITERAL 6}
            }}
        }}
    }}
}}

(Note: for simplicity, whitespace and delimiters are not preserved.)

Using a skiplist, the AST is (level 0 is bottom-most, level 1 and up are in descending order and point to ranges of level 0 elements):

0 { {LITERAL set} {LITERAL a} {LITERAL \(} {LITERAL 1} {LITERAL 2} {LITERAL \[} {LITERAL foo} {LITERAL 3} {LITERAL 4} {LITERAL \]} {LITERAL $} {LITERAL bar::baz} {LITERAL \(} {LITERAL 5} {LITERAL \)} {LITERAL \{} {LITERAL 6} {LITERAL \}} {LITERAL \)} }
1 { {LITERAL 0 0} {LITERAL 1 1} {LIST 2 18} }
2 { {LITERAL 3 3} {LITERAL 4 4} {COMMAND 5 9} {VARSUB 10 17} }
3 { {LITERAL 6 6} {LITERAL 7 7} {LITERAL 8 8} {NAME 11 11} {KEYS 12 14} {INDICES 15 17} }
4 { {LITERAL 11 11} {LIST 12 14} {BLOCK 15 17} }
5 { {LITERAL 13 13} {LITERAL 16 16} }

So the skiplist structure looks like this (* = literal, <-> = recursive structure):

0: set a ( 1 2 [ foo 3 4 ] $ bar::baz ( 5 ) { 6 } )
1: *** * <---------------------------------------->
2: * * <---------> <-------------------->
3: *** * * <------> <---> <--->
4: ******** <---> <--->
5: * *

which is structurally identical to the above tree. The difference is that the skiplist stores string and structural information in the same place, in such a way that replacing a given substring is a simple operation. So a basic interpreter only has to perform string operations on the source script to get the resulting string.

Of course this is very experimental, and the inclusion of such a structure into production code will depend on its overall performance.

Cheers, Fred. |
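The claim that a basic interpreter only needs string operations can be illustrated with a toy model of level 0 (my own simplification: a flat array of substrings rather than a real skiplist). Replacing one bottom-level element and re-concatenating regenerates the script:

```c
#include <assert.h>
#include <string.h>

/* Toy model of the skiplist's bottom level: a flat array of substrings
 * whose concatenation restores the whole script. */
#define MAXPIECES 32

struct level0 {
    const char *piece[MAXPIECES];
    int n;
};

/* Concatenate all pieces into out (caller supplies a large enough buffer). */
void regen(const struct level0 *l, char *out) {
    out[0] = '\0';
    for (int i = 0; i < l->n; i++)
        strcat(out, l->piece[i]);
}
```

Substituting one piece is then just a pointer assignment followed by regen — no tree surgery — which is the property being described above.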
From: Andy G. <unu...@ai...> - 2008-03-10 11:34:02
|
Yay, some activity! :^) Thanks for updating the Tridekalogue page. Have you looked at the partial parser I wrote? Do you have any suggestions? I'm considering refactoring it to be more like Lars Hellström's code, in which each state and quoting style is handled by a separate proc. But this means nailing down the state variables and keeping them as globals or passing them around. It also will result in some code duplication. What do you think? -- Andy Goth | <unu...@ai...> | http://andy.junkdrome.org/ |
From: Andy G. <unu...@ai...> - 2008-02-28 15:58:11
|
Any news? -- Andy Goth | <unu...@ai...> | http://andy.junkdrome.org/ |
From: Andy G. <unu...@ai...> - 2008-02-13 02:27:56
|
Actually, out of the country! I'll be gone to Canada (I live in Texas) for a week. That doesn't mean I'll be away from email or my computer, but I will be busier (and colder!) than normal. -- Andy Goth | <unu...@ai...> | http://andy.junkdrome.org/ |
From: Andy G. <unu...@ai...> - 2008-02-10 23:15:35
|
Hehehe. :^) -- Andy Goth | <unu...@ai...> | http://andy.junkdrome.org/ |