[Tcl-bugs] [ tcl-Bugs-2251175 ] Expand ({*}), constants, and list syntax

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Bugs item #2251175, was opened at 2008-11-09 16:14
Message generated for change (Comment added) made by dgp
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=2251175&group_id=10894

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: 45. Parsing and Eval
Group: current: 8.6a3
>Status: Open
Resolution: Fixed
Priority: 9
Private: No
Submitted By: Lars Hellström (lars_h)
Assigned to: Don Porter (dgp)
Summary: Expand ({*}), constants, and list syntax

Initial Comment:
The syntax {*} uses when reparsing a literate constant appears to deviate from the usual list syntax in that it does not perform backslash substitution.

Short example:
% set L1 [list \{] ; # List intrep
\{
% set L2 {\{} ; # No intrep
\{
% set L3 [list {*}$L1 {*}$L2 {*}{\{} {*}"\\\{"]
\{ \{ {\{} \{

In all of these cases, the string being reparsed by {*} has a backslash as first character and a left brace as second character (i.e., the canonical string representation for the length 1 list whose only element is the left brace), but when that string is brace-delimited then something goes horribly wrong; it appears the {*} list parser thinks the backslash is just an ordinary character:

% set L4 {{a b} {\\} \\ "c \n d"}
{a b} {\\} \\ "c \n d"
% list {*}$L4 {*}{{a b} {\\} \\ "c \n d"}
{a b} {\\} \\ {c 
 d} {a b} {\\} {\\} {c \n d}

This may have been supported by the following language in the dodekalogue:
[5] Argument expansion. If a word starts with the string {*} followed by a non-whitespace character, then the leading {*} is removed and the rest of the word is parsed and substituted as any other word. After substitution, the word is parsed again without substitutions, and its words are added to the command being substituted.

That is not consistent with TIP#157 however, since this clearly specifies the {*}-reparsing as "After substitution, the word is then parsed as a _list_ (as if with Tcl_SplitList() or Tcl_GetListFromObj)". Backslash substitution is a necessary part of list syntax, since the string representation for list elements which are not balanced with respect to braces requires backslash substitution. 

I also note that there doesn't seem to be any tests for {*} which involves applying it to lists with a string reperesentation that contains backslashes.

----------------------------------------------------------------------

>Comment By: Don Porter (dgp)
Date: 2008-12-01 17:31

Message:

Thanks for the work on this.  I see that
tclCompile.c is now back to the same code
as before patches here; back to Revision 1.160,
and the fix for this bug is now completely
changes to tclParse.c.

Committing a few code standards changes to HEAD.

The remaining improvement I would make (low priority)
would be to break up literal lists into simple
parts and complex parts so the simple parts can
be expanded in the parser, leaving TCL_TOKEN_EXPAND_WORD
only for the elements that need it.  This might
involve an improved variant of TclFindElement() that
reports back the data we need so we don't have to
make a second pass scanning for backslashes.

Can you confirm my reading of the logs that so far
there's been no attempt to backport a fix to the
core-8-5-branch ?

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-28 10:35

Message:
Fixing status finger-slip ;-)

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-28 10:31

Message:
Committed.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2008-11-27 01:41

Message:

It will be at least several days
before I can review.  Since you've
got tests and I trust you've run them,
go ahead and commit.  I'll review
from there (and others will too).

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-25 19:18

Message:
Attaching expand2.patch implementing Solution2, i.e. detecting naked
backslashes in to-expand literals and disabling parse-time expansion in
that case. Test suite updated to acknowledge the fact.
Also, your example now yields:
% proc \{ {} {puts [info frame 1]}
% {*}[list \{]
type eval line -1 cmd {{*}[list \{]} level 1
% {*}{\{}
type eval line -1 cmd {{*}{\{}} level 1

Notice the patch is much smaller than its apparent size due to the large
number of '-', since it includes the reversion of expand1c.patch.

Please review.
File Added: expand2.patch

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-24 17:44

Message:
OK so you're voting for Solution 2 after all... Why not just say so ;-)
Thanks for the journey in the parser though. No regrets...

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2008-11-24 17:40

Message:

What I would like is a solution that
keeps TCL_TOKEN_EXPAND_WORD out of the
picture for the overwhelmingly common
case where breaking into equivalent
TCL_TOKEN_SIMPLE_WORDs can be done without
error.

For the cases like those pointed out in
the bug report, I would just punt the whole
problem to runtime, just the way the whole
"not a valid list" error reporting is handled.

I think that combination can be done with no
new Tcl_Token types.

Schedule is pulling me away again for a week or
more, but I will contribute something here
besides griping, I promise.

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-18 19:06

Message:
Just committed the simplifications induced by further analyzing dead
branches. tclCompCmds.c is pristine again. tclParse.c's secondary cases
eliminated as well (were not WORD tokens). tclCompile.c pruned as you
suggested. In total only 3 instances of collapsing remain:
TclCompileTokens, TclWordKnownAtCompileTime, TclSubstTokens. Of these, I am
only able to produce a realistic use-case of the first. However, I thing
they should be kept in case of future evolutions (like new compiled
commands using them).
As you can see, the residual size of the patch is getting small. However
I'm still interested in your opinion re Solution 2.

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-18 18:10

Message:
You are right about 1469-1506: they are under the SIMPLE_WORD case, while
we carefully demoted from SIMPLE_WORD to WORD the containers of UNCOLLAPSED
tokens (to avoid ending up collapsing in all CompCmds). Removing them.

Now about your suggestion: this is what I called Solution2 earlier in the
thread. Indeed, what makes {*}$x work from the beginning is the fact that
there is no parse-time expansion, it is runtime expansion. So Solution2
would consist of removing that optimization that is parse-time literal
expansion, and always generate EXPAND tokens even for literals. We'll be
losing some performance for large expanded constants. Dunno how important
they are in real life though... Are you advocating Solution 2 ?

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2008-11-18 16:10

Message:

On a different track, when I expressed
reservations about complexity, I was
referring to the addition of a new
Tcl_Token type.  I still need to find
time for a more thorough review, but
my impression is that such an addition
should not be necessary to fix the bug.

This example has always worked:

% set l {a \n b}
a \n b
% list {*}$l
a {
} b

and there was no need for
the additional token type
to enable that, so I would think
the same capabilities could make

list {*}{a \n b}

work as well without the new type.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2008-11-18 16:03

Message:

It appears that running the current
test suite does not reach line 1469
or line 1506 of tclCompile.c.  Do you
have any cases that would do so?  If
these branches aren't really necessary,
it would be nice to remove them.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2008-11-18 14:26

Message:

Do the new tests cover all the
branches added by the patch ?

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-17 17:30

Message:
Committed as per Donald's suggestion.
Leaving open for review.

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-16 06:18

Message:
OK. The reason I did this was to limit the range of changes. Assuming
checks for TCL_TOKEN_TEXT were spread all over the code, I had the
hope that some of them could work "transparently" for uncollapsed text
(assuming they are done by bit-masking).

Now in hindsight it appears that all clients of text tokens must, by
definition, handle differently normal and to-collapse test.
So your argument wins. Updating the patch.
File Added: expand1c.patch

----------------------------------------------------------------------

Comment By: Jan Nijtmans (nijtmans)
Date: 2008-11-16 03:05

Message:
My only remark is that things like:
    case TCL_TOKEN_TEXT|TCL_TOKEN_MUSTCOLLAPSE:
look confusing: it tests for the new token type
(a variant of TCL_TOKEN_TEXT). I would not introduce
the concept of 'modifiers' for tokens, especially not if
this will be backported to Tcl 8.5. Why not just use
a new token type, like TCL_TOKEN_UNCOLLAPSED_TEXT
(bikeshed warning: I don't mind the exact name)

Then in tcl.h we get:
     #define TCL_TOKEN_UNCOLLAPSED_TEXT 512

The above
    case TCL_TOKEN_TEXT|TCL_TOKEN_MUSTCOLLAPSE:
will simplify to
    case TCL_TOKEN_UNCOLLAPSED_TEXT:

Expressions like originally:
    if (bodyToken[i]->type == TCL_TOKEN_TEXT) {
which you changed to:
    if (bodyToken[i]->type & TCL_TOKEN_TEXT) {
would become:
    if (bodyToken[i]->type & (TCL_TOKEN_TEXT|TCL_TOKEN_UNCOLLAPSED_TEXT))
{

This would make the intention of the patch more clear. It makes the
remark you put in tcl.h unneccessary, and the case statement
less confusing.

Apart form that, it looks fine to me.

My $.02

Regards,
          Jan Nijtmans

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-14 19:28

Message:
Fixed. Attaching the completed patch + tests (including parser tests).

File Added: expand1b.patch

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-13 13:30

Message:
Oops just found one extra failing case:

 % proc h {} {return [list {*}{a \n b}]}
 % h
 a {\n} b

I know what's happening, I remember I forgot to update that part
(CompileTokens) because it was a bit late :-)
Will fix it shortly.

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-13 12:55

Message:
I have to admit I too knew nothing of it 4 days ago...
I know Miguel is under awful load, but really I'd appreciate a Dirac of
review activity here :-}
(if one of you can ^Z him...)
In the meantime I'll write a few extra tests. 

----------------------------------------------------------------------

Comment By: Andreas Kupries (andreas_kupries)
Date: 2008-11-13 12:04

Message:
Looking over the patch I see no reason to object.
I should qualify that this looks to touch areas (parser) I do not know
much about.
Giving back to dgp.

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-12 12:19

Message:
Sorry if I have given the impression of a high complexity; indeed the patch
is very small.
My long paragraph was there only to explain how I had wedged a bit in
Tcl_Tokens without extending the sizeof(), and while staying fast (unlike a
bitfield).
The short story is:

   - set this bit on expanded TCL_TOKEN_TEXT's with backslashes

   - in both clients (compile and eval), detect this bit *after* TIP280
data extraction, and collapse

As for the line number "1" I do not know the details, but I'm not really
surprised because the patch tries hard to stay on rock-solid "ROM source",
meaning storing real offsets in the original string. So line "1" is at
least consistent with that ;-)

Andreas, your green light ?

-Alex

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2008-11-12 11:54

Message:

andreas should have look too

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2008-11-12 11:50

Message:

I've not looked at the code of
expand1.patch, but the description
sounds more complex than I would
prefer, at least as my first impulse.

Applying and testing it I see:

% proc \{ {} {puts [info frame 1]}
% {*}[list \{]
type eval line -1 cmd {{*}[list \{]} level 1
% {*}{\{}
type eval line 1 cmd {{*}{\{}} level 1

which has what appears to be the proper
behavior.  There's no proper literals in the
original script to point to, so we fall
back to the "eval" reporting.  I don't
understand why one form reports line 1
and the other reports line -1, but since
I don't understand, I don't want to make
a claim of incorrectness either.

If there's a reasonably obvious simpler
way to achieve this, I think I'd prefer it,
but if not, or if not with reasonable
short term effort, then committing 
expand1.patch is fine with me so that
we have *some* fix in place and we
can debate whether or not it is the
best fix at a more leisurely pace.

Some additional tests are also in order.

Thanks for tackling this Alexandre.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2008-11-11 21:53

Message:

No evening session tonight.

Confirmed the bug was introduced
by the first TIP 280 commit on
2006-11-28.

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-11 19:01

Message:
Please find attached a patch implementing Solution (1): delayed collapse.

It passes the test suite, including [info frame], since it respects the
tying between TIP280-data and the original (ROM) source, while doing the
necessary collapse just before use, which is (a) before literal creation in
TclCompileScript and (b) before string appendage in TclEvalEx.

One word about complexity: I did my best to minimize the impact for the
"rest of us" never expanding a literal ;-)
For this, I kept the size of the Tcl_Token struct unchanged. Only I
exploited the fact (documented in comments in tcl.h) that TCL_TOKEN_*
values were actually bits so that class tests can be done easily, that is,
by bitwise AND. The funny fact is that nowhere in the core is this actually
done, all tests are done with "==". For this patch I had to define a
modifier bit in this field, defined only for type TCL_TOKEN_TEXT. So I did
change all existing checks "==TCL_TOKEN_TEXT" into "&TCL_TOKEN_TEXT", which
allows for graceful ignorance of the modifier everywhere it needs be. I
believe the perf impact of this change should be between zero and epsilon
;-)

Please review.
File Added: expand1.patch

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2008-11-11 18:17

Message:

Confirmed bug introduce sometime
between 8.5a5 and 8.5a6:

% info patch
8.5a5
% {expand}{\{}
invalid command name "{"

% info patch
8.5a6
% {expand}{\{}
invalid command name "\{"

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-11 17:04

Message:
I have not checked, but I found this in tclParse.c's history

revision 1.53
date: 2007/05/30 18:12:59;  author: dgp;  state: Exp;  lines: +138 -53
        * generic/tclBasic.c:   Removed code that dealt with
        * generic/tclCompile.c: TCL_TOKEN_EXPAND_WORD tokens representing
        * generic/tclCompile.h: expanded literal words.  These sections
were
        mostly in place to enable [info frame] to discover line
information
        in expanded literals.  Since the parser now generates a token for
        each post-expansion word referring to the right location in the
        original script string, [info frame] gets all the data it needs.

        * generic/tclInt.h:     Revised the parser so that it never
produces
        * generic/tclParse.c:   TCL_TOKEN_EXPAND_WORD tokens when parsing
an
        * tests/parse.test:     expanded literal word; that is, something
like
        {*}{x y z}.  Instead, generate the series of
TCL_TOKEN_SIMPLE_WORD
        tokens to represent the words that expansion of the literal
string
        produces.  [RFE 1725186]

So, it looks like prior to this date, things were going as per what I
called "Solution (2)". Since that implies runtime expansion, which involves
Tcl_GetListFromObj, I'd risk a guess that the bug appeared at that time
:-{

So the dilemma becomes:

   (1) complexify Tcl_Token handling a bit to flag must-collapse-backslash
tokens, to be collapsed by the client.

   (2) go back to situation as of revision 1.53 of tclParse.c and bury
[RFE 1725186]

What do you think ?

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2008-11-11 16:41

Message:

Thanks for reporting this.

I acknowledge this is important
to resolve one way or another
before another release.
I've raised priority to signal that.

In return please acknowledge that
this will require care and attention,
and that will have to wait until I
return from the conference I am
currently attending.

The next step I will/would take is
to look at the original implementation
of {expand} before TIP 280 was an
issue.  Did it have the same troubles?

----------------------------------------------------------------------

Comment By: Peter Spjuth (pspjuth)
Date: 2008-11-10 11:48

Message:
In patch 1870327 I also encountered the problem of
compiling dynamic source.  I did some ugly workaround.
I'm interested in any solution to this problem.  

https://sourceforge.net/tracker/?func=detail&aid=1870327&group_id=10894&atid=310894

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-10 09:43

Message:

Progress report: I now realize that this approach has problems.
Maintaining proper line info while violating the "source in ROM" principle
is tricky.
I can think of two alternatives:
(1) Delay the TclCopyAndCollapse by adding to tokens a flag bit
TCL_TOKEN_COLLAPSE, so that only the clients of the parse tree (eval and
compile) call TCAC at the latest possible moment (i.e. just when building
the objv[]) and still operate on source-ROM for line info.
(2) Consider compile-time expansion of literals as a rather useless
optimization, and generate a runtime-expand instead.

Don, Miguel, please advise !!!

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-10 06:51

Message:
Attaching a patch fixing the issue, using the latter (lazier) merthod: a
linked list of individual ckallocs.

However, this patch *does* break tests info-33.[13], (info frame).
I suspect the reason is that it introduces text pointer values that are
outside the source, and that get recorded in the bytecode.
This can also be seen by ::tcl::unsupported::disassemble on the generated
proc of these tests.
I will look into these details now, but I'd appreciate Miguel to take over
;-)

File Added: expand.patch

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-10 04:42

Message:
Hmm, maybe the reason is that it is not easy to tie any new allocations to
the result of the parsing. Indeed the rest of tclParse.c arranges for token
cells to be allocated/reallocated in a contiguous array, with their member
strings pointing into the original source. So, if we (correctly) decide to
call TclCopyAndCollapse for each "found element", we'll need to spit out
the resulting backslash-subst'ed strings in some space that will die at
Tcl_FreeParse time. Maybe we need to complexify the Tcl_Parse struct a bit
for that (eg by keeping a single growing malloc'ed block similar to the
token array but for massaged strings). Or maybe this is premature
optimization again, and a linked list of individual mallocs will do...

Miguel, do you confirm/infirm ?

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2008-11-10 03:30

Message:
Ouch :-)

First, a simpler way of summing up the bug is to notice the violation of
the invariant

   list {*}{SOMETHING}  ==  [lrange {SOMETHING} 0 end]

Second, we witness this violation only for expanded literals, that is,
those that are expanded directly within the token stream of the parser.

Third, this indeed is due to the fact that the relevant part of the code
in tclParse.c only calls TclFindElement instead of Tcl_GetListFromObj or
equivalent.
This makes a whole lot of a difference of course, since
Tcl_GetListFromObj..SetListFromAny calls both TclFindElement *and*
TclCopyAndCollapse, which does the necessary backslash subst.

Miguel, why this ? Premature optimization ;-) ?

-Alex

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=2251175&group_id=10894

[Tcl-bugs] [ tcl-Bugs-2251175 ] Expand ({*}), constants, and list syntax

The Tool Command Language implementation

[Tcl-bugs] [ tcl-Bugs-2251175 ] Expand ({*}), constants, and list syntax