Re: [Tcl9-cloverfield] Parser
From: Frédéric B. <fb...@en...> - 2008-03-19 17:23:55
Andy Goth wrote:
> On Fri, 10 Mar 2008 17:23:06 +0100, Frédéric Bonnet wrote
>> Finally I have my 'net access at home, so things should go faster by now.
>
> That's good news. :^)
>
>>> I'm considering refactoring it [so that] each state and quoting style is
>>> handled by a separate proc.
>>
>> The global state machine model is interesting at first but given the
>> peculiarities of Tcl it makes things less readable.
>
> I'm not sure Tcl is responsible for the readability problem.

Well, I didn't mean that; I should have said "specificities" instead of
"peculiarities". English is not my native language and I didn't realize the
negative connotation of "peculiarities" vs. "particularités" :*) What I meant
was that Tcl rules are not global, contrary to other languages such as C, so
one needs a dedicated parser for each quoting rule.

> Any complex state machine can get that way, unless it's segmented like I
> propose. The trick is finding the right way to slice it. Should each
> combination of quoting* be handled by a separate proc? No, that would
> result in too many procs, making it hard to find the right code. Should
> everything be handled by a single proc? No, that would give one huge
> (possibly nested) switch statement, again making it hard to find the right
> code. The correct path lies somewhere in the middle. Probably. :^)

I agree, and that's the way I've written my parser. Each quoting rule gets
its own parser, but they can call each other.

>> I've written parsing code for most rules except braces (rule [5]).
>
> I believe the code I have on the wiki completely handles rule [5], except
> for the bit about [11.4]. It correctly handles the above "combination of
> quoting" example, except for the outer two parentheses, which aren't
> supported yet. Comments (and backslashes inside comments) are handled as
> well, for the purpose of brace counting. I encourage you to play with it.
> PUT MY CODE IN YOUR TCLSH!
I've run your code on a bunch of test scripts and it works correctly. However
it doesn't handle variables for now, and that's where the real fun begins ;-)
More on rule [5] below.

>> Back to my parser. As I said, I try to use skiplists instead of simple
>> trees. Parsers usually generate the AST in the form of a tree (ie
>> recursive lists). Skiplists use multi-level flat lists:
>>
>> - each level collates one or several elements of the below level.
>> - the bottom-most level is the linear list of all substrings whose
>>   concatenation gives back the whole string, so iteration is trivial.
>
> Iteration is trivial? Could you please give me an example of where and how
> this will actually take place?

For instance, during serialization. Just concatenate all level-0 syllables
and you get the string rep. And if the interpreter handles ropes natively,
concatenation need not be physically performed (ie no need to allocate a new
string).

> I'll try to come up with one of my own, but I'd like to know your views as
> well. But first let me describe a tree.
>
> I think a tree is a good fit for the actual workload of the execution
> engine. At the top level is a list of commands over which it iterates. For
> each command, there is a list of words. For each word there is a list of
> syllables. Some syllables can be command substitutions, so the tree depth
> is unbounded (at least in theory). To construct the arguments to the
> command, the execution engine iterates over the list of words. Each word's
> substitutions and modifiers are applied, eventually yielding the list of
> arguments. Recursion may occur. This is a fairly traditional approach.
>
> With a skiplist, I suppose the execution engine can operate upside down,
> starting with the lowest level and working upwards, collecting arguments
> as it goes.
> The lowest level would be a list of syllables not requiring substitution,
> which aren't directly useful without knowing how they're combined into
> words, so maybe it would start at the second lowest level. (Or start at
> the lowest level anyway so that the parse tree skiplist doubles as a rope
> over the original string.) Basically the substitution specifiers,
> modifiers, and implied concatenation all serve as operators, and the
> lowest level would be the literal operands.
>
> That sounds pretty interesting, actually! How does this square with your
> vision of skiplists in action?

That's exactly how it's supposed to work. The lowest level would be a list
of "atoms", i.e. literals (pointing to the original source) and backslashed
data. E.g. "foob\x61r" is {LITERAL foob} {ESCAPE a} {LITERAL r}, or
{LITERAL 0 4} {ESCAPE a} {LITERAL 8 1} when using index and length. But
syllables, atomic or not, could also be dynamic (e.g. functions). Adding
several levels builds an implicit tree structure. I say implicit because
there is no direct relation between a parent and its children outside of the
structure as a whole.

BTW don't pay too much attention to my current implementation of skiplists
as it totally sucks (Tcl is simply not suited for this kind of data
structure). However it's fine for validation purposes. In C, I would use the
same kind of structure as you gave:

> struct rope {
>     int len;              /* strlen(text) or sum of child lens, as appropriate. */
>     char* text;           /* pointer to first character, or NULL if not leaf. */
>     struct rope* child;   /* pointer to leftmost child, or NULL if leaf. */
>     struct rope* sibling; /* NULL if there is no right-hand sibling. */
> }
>
> Hmm, from looking at your sample skiplist, it seems I'm backwards from
> you. Your lowest level may be a list of all syllables, but your second
> lowest level is (more or less) a list of all arguments to the command,
> which is what I would call the highest level.
> (Or third highest; my highest level is the entire script, followed by a
> list of all commands.)

For implementation purposes, level 0 is the lowest level, i.e. it should be
level(+INF). So level 1 is the highest level.

> What does "AST" stand for?

Abstract Syntax Tree, i.e. the result of the parsing.

> in the case of backslash substitutions, the parse tree points to ropes
> that aren't even present in the original script. Backslash substitutions
> are performed by the parser itself, and the syllables containing them are
> replaced with new ropes allocated on the heap. The new ropes have all
> backslash substitutions already done.

Exactly.

> At least, this is possible in a traditional tree. Skiplists present some
> problems, since the data after parsing is a patchwork of the original
> script and new strings derived from the original. This isn't a problem
> that can't be solved, but it does make me want to consider an alternative
> which provides benefits of both trees and skiplists.

That's where the actual implementation becomes critical. Conceptually, each
lowest-level syllable would be a separate chunk, but for performance
consecutive substrings should be kept in one single structure. So the
lowest-level syllables wouldn't be as fragmented.

In a sense this is no different from the current Tcl string handling: as the
native encoding is UTF-8, which is a variable-width encoding, you get O(n)
char access (*). So conceptually Tcl strings are lists of characters. With a
rope structure we could build strings made of chunks of fixed-width
encodings, with O(1) char access within each chunk. Using ropes of
fixed-width chars would actually reduce the number of iteration steps in
string access and IMHO greatly improve string handling performance. The only
reason not to use ropes is C runtime compatibility, which requires 8-bit,
NUL-terminated flat strings.
(*) Of course you can trade memory for performance by using Unicode byte
arrays, but in this case you discard the internal rep and get shimmering.
For example, "abcdéfghi" is a 9-char string which needs 10 bytes in Tcl
using UTF-8 but 18 in Unicode wide chars. Indexing the last character in
UTF-8 takes 10 steps, one for each byte. A rope would be a concatenation of
the ASCII string "abcd", the Unicode wide string "é", and the ASCII string
"fghi". Indexing the last character only takes 3 steps to locate the right
substring.

> You wish to unify ropes, parse trees, and Cloverfield objects, correct?
> I'm not certain the unification is necessary, so long as these different
> structures are capable of pointing to each other, similar to containment.

That's why I'm only experimenting for now. But I got this idea of
unification when I began studying rope data structures. In Tcl, everything
is a string and can be represented as a rope, which is conceptually a tree;
however most values are also lists or trees, so why not try to unify both?

Ropes are typically implemented using balanced trees, for performance
reasons; most implementations use B-trees. However the balancing breaks the
structure and there is no way to coerce this mechanism so that it respects a
given structure. Skiplist-based ropes don't use trees but several levels of
linked lists (there is a bijection between a tree and a skiplist) with
vertical links between them: some leaves are randomly "promoted" to higher
levels and thus the structure is statistically balanced. But what happens if
we "help" the promotion algorithm so that it chooses the "right" leaves? ;-)
After all there is no strict constraint on the resulting structure of the
skiplist, so there is no difference between a randomized skiplist and a
directed one, other than in the way they are balanced.
On the other hand, we have Tcl where strings are internally converted to
lists, holding strings that are themselves substrings of the original
string. Moreover, since Cloverfield adds expansion of dynamic data
(references, delays...), some list elements are themselves lists that must
be expanded inline. So lists are implemented as trees that must be
conceptually flattened. Now replace list with string and list element with
character (or substring) and you get a rope!

In conclusion: every data structure can be defined as a list, every list can
be represented as a string, lists and strings can be implemented as trees,
so why not use a single internal data structure? Of course only time will
tell whether this approach is viable.

> I see from your choice of word type names that the structs and enums I
> already wrote have been helpful, although I would have said CMDSUB instead
> of COMMAND. Of course, either name will do.

I chose CMDSUB in my parser ;-)

>> #{This is a block comment}#
>
> I like this best, since it's easier to see where the comment ends. Is it
> alright with you that this breaks comments that begin with an open brace?

Yes. Users will have to add a space in between. A small price to pay IMHO.

>> Regarding rule [5], I think that hash characters should be recognized
>> everywhere in braced sequences, and not exclusively at word start as with
>> the current rule [10]. The reason is that brace matching becomes overly
>> complicated otherwise, as the parser has to identify word starts within
>> braced blocks.
>
> I disagree. I like rule [5] the way it is. And I have successfully
> implemented it, so it can't be too complicated.
>
>> This is harder to do robustly than I first thought, given that the
>> sequence doesn't necessarily follow the regular parsing rules (typically,
>> expr-like syntax doesn't use the same word delimiters).
> I don't think it's worth supporting the following (replace \n with
> newline):
>
> expr {5+# a full line comment\n6}
> expr {5+#{an inline comment}#6}
>
> It's simple enough to require the user to do:
>
> expr {5+ # a full line comment\n6}
> expr {5+ #{an inline comment}# 6}

OK, so the parser must recognize word starts properly. As I said above, this
is harder to do *robustly* and *consistently* than I thought. As braces and
parentheses may occur in the middle of a word, the parser must know whether
they obey rules [5] and [6]. But then you have to recognize variable
substitutions, and semicolon handling is a big question mark. Moreover, Tcl
and the current Cloverfield spec recognize open braces everywhere, even when
they wouldn't at toplevel. For example:

# Works.
set a{ 1

# Fails.
puts $a{

# Fails.
if {true} {
    set a{ 1
}

But this works differently with parentheses:

# Works.
set a( 1

# Fails.
puts $a(

# Works.
if {true} {
    set a( 1
}

Or with quotes:

# Works.
set "a{" 1

# Works.
puts $"a{"

# Works.
if {true} {
    set "a{" 1
}

That's exactly the kind of inconsistency I want to suppress. To do so, I
proposed above that comments and {data} be recognized everywhere, as with
braces. But it's far from satisfying, as the new rules would be inconsistent
in the same way as the existing ones. An alternative would be to parse
braced data as if it were a regular script, but this totally breaks the Tcl
philosophy (and it doesn't play nice with expr or if). A last alternative
would be to allow braces and parentheses everywhere in the word, and not
only at the beginning. So they would always be balanced. Besides, this has
the nice side effect of unifying $ and set:

% set a{1 2} 3
% puts $a{1 2}
3

So if an open brace or parenthesis occurs at the beginning of a word, then
the word is totally enclosed between the pair, which are not retained as
part of the word.
If they occur in the middle of a word, then they must be balanced and are
retained as part of the word (and don't terminate it). So braces and
parentheses would work in a similar way to brackets (except that brackets
are not retained as part of the word). For example:

% set foo "bar baz"
bar baz
% set a {$foo 3}
$foo 3
% set a 12{$foo 3}45
12{$foo 3}45
% set a ($foo 3)
"bar baz" 3
% set a 12($foo 3)45
12("bar baz" 3)45

So:

- easier to parse and match
- more readable IMHO
- no more inconsistency between $ and set
- word starts are easier to recognize, so comments in braces are easier to
  parse

>> As a sidenote I'd also like to extend the {data} modifier syntax in a
>> similar way, e.g.:
>>
>> # New forms allow delimiters around the sequence.
>> set var {data}TAG{foo bar baz}TAG
>> set var {data}TAG"foo bar baz"TAG
>>
>> That way raw data need not span multiple lines but can be inline. Useful
>> for regexps.
>
> An inline heredoc. Looks cool. But few regexps will benefit, since most
> have matching braces (usually zero). Still I say: go for it.

To be more consistent with the above proposal, I also thought about dropping
heredoc support as the {data} word modifier and using a prefix to quotes
instead, e.g.:

set var TAG"foo bar baz"TAG

(TAG being [0-9a-zA-Z_]*.) But I'm not quite sure about that, as the
original {data} is more powerful. So why not add the above as an additional
syntax?

> On Fri, 14 Mar 2008 18:08:04 +0100, Frédéric Bonnet wrote
>> My parser is close to completion! You can find it attached.
>
> How do I make it work? I tried [test $script 0 script] but only got
> "unexpected token UNKNOWN" from script::parse line 12.

Hmm, strange, I don't get the same error. The script parser needs an extra
argument "SCRIPT" (alternatives are LIST and CMDSUB):

test $script 0 script SCRIPT