From: SourceForge.net <no...@so...> - 2008-11-15 00:28:30
|
Bugs item #2251175, was opened at 2008-11-09 22:14 Message generated for change (Comment added) made by ferrieux You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=2251175&group_id=10894 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: 45. Parsing and Eval Group: current: 8.6a3 Status: Open Resolution: None Priority: 9 Private: No Submitted By: Lars Hellström (lars_h) Assigned to: Don Porter (dgp) Summary: Expand ({*}), constants, and list syntax Initial Comment: The syntax {*} uses when reparsing a literate constant appears to deviate from the usual list syntax in that it does not perform backslash substitution. Short example: % set L1 [list \{] ; # List intrep \{ % set L2 {\{} ; # No intrep \{ % set L3 [list {*}$L1 {*}$L2 {*}{\{} {*}"\\\{"] \{ \{ {\{} \{ In all of these cases, the string being reparsed by {*} has a backslash as first character and a left brace as second character (i.e., the canonical string representation for the length 1 list whose only element is the left brace), but when that string is brace-delimited then something goes horribly wrong; it appears the {*} list parser thinks the backslash is just an ordinary character: % set L4 {{a b} {\\} \\ "c \n d"} {a b} {\\} \\ "c \n d" % list {*}$L4 {*}{{a b} {\\} \\ "c \n d"} {a b} {\\} \\ {c d} {a b} {\\} {\\} {c \n d} This may have been supported by the following language in the dodekalogue: [5] Argument expansion. If a word starts with the string {*} followed by a non-whitespace character, then the leading {*} is removed and the rest of the word is parsed and substituted as any other word. After substitution, the word is parsed again without substitutions, and its words are added to the command being substituted. That is not consistent with TIP#157 however, since this clearly specifies the {*}-reparsing as "After substitution, the word is then parsed as a _list_ (as if with Tcl_SplitList() or Tcl_GetListFromObj)". Backslash substitution is a necessary part of list syntax, since the string representation for list elements which are not balanced with respect to braces requires backslash substitution. I also note that there doesn't seem to be any tests for {*} which involves applying it to lists with a string reperesentation that contains backslashes. ---------------------------------------------------------------------- >Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-15 01:28 Message: Fixed. Attaching the completed patch + tests (including parser tests). File Added: expand1b.patch ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-13 19:30 Message: Oops just found one extra failing case: % proc h {} {return [list {*}{a \n b}]} % h a {\n} b I know what's happening, I remember I forgot to update that part (CompileTokens) because it was a bit late :-) Will fix it shortly. ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-13 18:55 Message: I have to admit I too knew nothing of it 4 days ago... I know Miguel is under awful load, but really I'd appreciate a Dirac of review activity here :-} (if one of you can ^Z him...) In the meantime I'll write a few extra tests. ---------------------------------------------------------------------- Comment By: Andreas Kupries (andreas_kupries) Date: 2008-11-13 18:04 Message: Looking over the patch I see no reason to object. I should qualify that this looks to touch areas (parser) I do not know much about. Giving back to dgp. ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-12 18:19 Message: Sorry if I have given the impression of a high complexity; indeed the patch is very small. My long paragraph was there only to explain how I had wedged a bit in Tcl_Tokens without extending the sizeof(), and while staying fast (unlike a bitfield). The short story is: - set this bit on expanded TCL_TOKEN_TEXT's with backslashes - in both clients (compile and eval), detect this bit *after* TIP280 data extraction, and collapse As for the line number "1" I do not know the details, but I'm not really surprised because the patch tries hard to stay on rock-solid "ROM source", meaning storing real offsets in the original string. So line "1" is at least consistent with that ;-) Andreas, your green light ? -Alex ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2008-11-12 17:54 Message: andreas should have look too ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2008-11-12 17:50 Message: I've not looked at the code of expand1.patch, but the description sounds more complex than I would prefer, at least as my first impulse. Applying and testing it I see: % proc \{ {} {puts [info frame 1]} % {*}[list \{] type eval line -1 cmd {{*}[list \{]} level 1 % {*}{\{} type eval line 1 cmd {{*}{\{}} level 1 which has what appears to be the proper behavior. There's no proper literals in the original script to point to, so we fall back to the "eval" reporting. I don't understand why one form reports line 1 and the other reports line -1, but since I don't understand, I don't want to make a claim of incorrectness either. If there's a reasonably obvious simpler way to achieve this, I think I'd prefer it, but if not, or if not with reasonable short term effort, then committing expand1.patch is fine with me so that we have *some* fix in place and we can debate whether or not it is the best fix at a more leisurely pace. Some additional tests are also in order. Thanks for tackling this Alexandre. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2008-11-12 03:53 Message: No evening session tonight. Confirmed the bug was introduced by the first TIP 280 commit on 2006-11-28. ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-12 01:01 Message: Please find attached a patch implementing Solution (1): delayed collapse. It passes the test suite, including [info frame], since it respects the tying between TIP280-data and the original (ROM) source, while doing the necessary collapse just before use, which is (a) before literal creation in TclCompileScript and (b) before string appendage in TclEvalEx. One word about complexity: I did my best to minimize the impact for the "rest of us" never expanding a literal ;-) For this, I kept the size of the Tcl_Token struct unchanged. Only I exploited the fact (documented in comments in tcl.h) that TCL_TOKEN_* values were actually bits so that class tests can be done easily, that is, by bitwise AND. The funny fact is that nowhere in the core is this actually done, all tests are done with "==". For this patch I had to define a modifier bit in this field, defined only for type TCL_TOKEN_TEXT. So I did change all existing checks "==TCL_TOKEN_TEXT" into "&TCL_TOKEN_TEXT", which allows for graceful ignorance of the modifier everywhere it needs be. I believe the perf impact of this change should be between zero and epsilon ;-) Please review. File Added: expand1.patch ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2008-11-12 00:17 Message: Confirmed bug introduce sometime between 8.5a5 and 8.5a6: % info patch 8.5a5 % {expand}{\{} invalid command name "{" % info patch 8.5a6 % {expand}{\{} invalid command name "\{" ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-11 23:04 Message: I have not checked, but I found this in tclParse.c's history revision 1.53 date: 2007/05/30 18:12:59; author: dgp; state: Exp; lines: +138 -53 * generic/tclBasic.c: Removed code that dealt with * generic/tclCompile.c: TCL_TOKEN_EXPAND_WORD tokens representing * generic/tclCompile.h: expanded literal words. These sections were mostly in place to enable [info frame] to discover line information in expanded literals. Since the parser now generates a token for each post-expansion word referring to the right location in the original script string, [info frame] gets all the data it needs. * generic/tclInt.h: Revised the parser so that it never produces * generic/tclParse.c: TCL_TOKEN_EXPAND_WORD tokens when parsing an * tests/parse.test: expanded literal word; that is, something like {*}{x y z}. Instead, generate the series of TCL_TOKEN_SIMPLE_WORD tokens to represent the words that expansion of the literal string produces. [RFE 1725186] So, it looks like prior to this date, things were going as per what I called "Solution (2)". Since that implies runtime expansion, which involves Tcl_GetListFromObj, I'd risk a guess that the bug appeared at that time :-{ So the dilemma becomes: (1) complexify Tcl_Token handling a bit to flag must-collapse-backslash tokens, to be collapsed by the client. (2) go back to situation as of revision 1.53 of tclParse.c and bury [RFE 1725186] What do you think ? ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2008-11-11 22:41 Message: Thanks for reporting this. I acknowledge this is important to resolve one way or another before another release. I've raised priority to signal that. In return please acknowledge that this will require care and attention, and that will have to wait until I return from the conference I am currently attending. The next step I will/would take is to look at the original implementation of {expand} before TIP 280 was an issue. Did it have the same troubles? ---------------------------------------------------------------------- Comment By: Peter Spjuth (pspjuth) Date: 2008-11-10 17:48 Message: In patch 1870327 I also encountered the problem of compiling dynamic source. I did some ugly workaround. I'm interested in any solution to this problem. https://sourceforge.net/tracker/?func=detail&aid=1870327&group_id=10894&atid=310894 ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-10 15:43 Message: Progress report: I now realize that this approach has problems. Maintaining proper line info while violating the "source in ROM" principle is tricky. I can think of two alternatives: (1) Delay the TclCopyAndCollapse by adding to tokens a flag bit TCL_TOKEN_COLLAPSE, so that only the clients of the parse tree (eval and compile) call TCAC at the latest possible moment (i.e. just when building the objv[]) and still operate on source-ROM for line info. (2) Consider compile-time expansion of literals as a rather useless optimization, and generate a runtime-expand instead. Don, Miguel, please advise !!! ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-10 12:51 Message: Attaching a patch fixing the issue, using the latter (lazier) merthod: a linked list of individual ckallocs. However, this patch *does* break tests info-33.[13], (info frame). I suspect the reason is that it introduces text pointer values that are outside the source, and that get recorded in the bytecode. This can also be seen by ::tcl::unsupported::disassemble on the generated proc of these tests. I will look into these details now, but I'd appreciate Miguel to take over ;-) File Added: expand.patch ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-10 10:42 Message: Hmm, maybe the reason is that it is not easy to tie any new allocations to the result of the parsing. Indeed the rest of tclParse.c arranges for token cells to be allocated/reallocated in a contiguous array, with their member strings pointing into the original source. So, if we (correctly) decide to call TclCopyAndCollapse for each "found element", we'll need to spit out the resulting backslash-subst'ed strings in some space that will die at Tcl_FreeParse time. Maybe we need to complexify the Tcl_Parse struct a bit for that (eg by keeping a single growing malloc'ed block similar to the token array but for massaged strings). Or maybe this is premature optimization again, and a linked list of individual mallocs will do... Miguel, do you confirm/infirm ? ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2008-11-10 09:30 Message: Ouch :-) First, a simpler way of summing up the bug is to notice the violation of the invariant list {*}{SOMETHING} == [lrange {SOMETHING} 0 end] Second, we witness this violation only for expanded literals, that is, those that are expanded directly within the token stream of the parser. Third, this indeed is due to the fact that the relevant part of the code in tclParse.c only calls TclFindElement instead of Tcl_GetListFromObj or equivalent. This makes a whole lot of a difference of course, since Tcl_GetListFromObj..SetListFromAny calls both TclFindElement *and* TclCopyAndCollapse, which does the necessary backslash subst. Miguel, why this ? Premature optimization ;-) ? -Alex ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=2251175&group_id=10894 |