From: Lars H. <Lar...@re...> - 2011-05-08 10:34:25
Colin McCormack wrote on 2011-05-08 02.56:
> I haven't been following closely, but I noticed 'tcl parsing' and wanted
> to point out http://wiki.tcl.tk/9620 and also and especially
> http://wiki.tcl.tk/9649 (which I use and find very good.)

Then I should point out that http://abel.math.umu.se/~lars/tcl/parsetcl.pdf contains more documentation than the wiki page, and has also been updated to support the {*} feature of Tcl 8.5.

Regarding Arnulf's worry about whitespace: Since character indices are kept track of, it is straightforward to record whitespace in a post-processing phase. parsetcl::reinsert_indentation shows how to do that for indentation, and the same technique can be applied to interword whitespace. Parsing is tricky in itself, so there is no need to further complicate it with whitespace when that is not needed. KISS.

However, I suspect these Tcl-oriented approaches may be suboptimal for the Netbeans project; if a parser for context-free languages is available more natively, then using that is probably easier than operating a Tcl parser remotely. My reasoning is basically the following.

1. First distinguish the phases of lexing and parsing, for this discussion. (It is generally possible to unify them, but the resulting grammars don't tend to be something for human consumption.)

2. "Most" languages (well, C, Java, Pascal, and the like) tend to be regular at the lexing phase -- you could write a regexp for "the next token" -- but roughly context-free at the parsing phase. The latter is why people write BNFs when describing their syntax.

3. Tcl, on the other hand, is non-regular context-free at the lexing phase, and roughly regular[*] at the parsing phase. In fact, I think Tcl might be LR(0) at the lexing phase (which is probably why it was feasible to write parsetcl as an ad-hoc parser in the first place). Most of the Dodekalogue (Tcl(n) manpage) is about the lexing grammar, whereas the parsing grammar is presented on a per-command basis.
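
To make point 3 concrete, here is a minimal sketch in Python of why finding the end of a brace-quoted Tcl word is not a job for a regexp: the lexer has to count nesting depth, and that depth is unbounded. This is not code from parsetcl; the function name braced_word_end is made up for the illustration, and the sketch only handles the two rules that matter here (nested braces are counted, backslash-quoted characters are not).

```python
def braced_word_end(script: str, start: int) -> int:
    """Return the index of the close brace matching the open brace at `start`.

    Hypothetical helper for illustration only. It covers brace counting and
    backslash quoting, nothing else of Tcl's word syntax.
    """
    if start >= len(script) or script[start] != "{":
        raise ValueError("expected an open brace at the start index")
    depth = 0
    i = start
    while i < len(script):
        ch = script[i]
        if ch == "\\":
            # A backslash-quoted character never counts as a brace.
            i += 2
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i
        i += 1
    raise ValueError("missing close-brace")


if __name__ == "__main__":
    script = "while {$i < 3} {puts {a {b} c}; incr i}"
    cond_end = braced_word_end(script, 6)
    body_start = cond_end + 2
    body_end = braced_word_end(script, body_start)
    print(script[6:cond_end + 1])
    print(script[body_start:body_end + 1])
```

The depth counter is what pushes the lexing phase beyond regular languages. Once the words have been found this way, the command itself has a flat shape (here: a command name, an expression, and a script), which is the "roughly regular" parsing phase.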

[*] Since the set of "tokens" is infinite, some care is needed when defining what it means to be "regular" in this case. I think one could still have a requirement that there are only finitely many "token classes" for the grammar to distinguish. Of course, some of those token classes are things like "Tcl script" and "Tcl [expr]ession", so there is a recursion which makes things complicated. Whether it is a problem depends on what you want to do.

Anyway, I think the basic point of "context-free lexer, regular parser" might provide some insight into the peculiarities of parsing Tcl.

Lars Hellström