From: Kevin K. <kev...@gm...> - 2019-02-18 05:58:30
|
I'm continuing to make some progress on refactoring callframe operations. The code in the 'kbk-callframe-refactor' branch is extremely slow, but works on all the test cases except for ones that are explicitly patched out in demos/perftest/tester.tcl. The patched-out ones include:

+ everything that emits direct* operations. These need to be revised to carry the callframe explicitly, as we've discussed in earlier messages.

+ magicreturn. This falls down trying to construct a FAIL NUMERIC. We've had an email thread about this already.

and the tests that are patched out on trunk. (Some of these, like errortest7 and expandtest::test[56], are related to how we don't handle loop exception ranges.)

As I said earlier, the way that this works is to make the front end insert 'moveFromCallFrame' for every variable reference, and 'moveToCallFrame' for every variable assignment. The plan is then to have passes that do the escape analysis and figure out which moves are safe to delete.

(Unless Donal or someone screams again, I'm planning to move forward on the assumption that traces for repeated reads and traces for repeated writes can be coalesced, as long as at least one trace of the given type fires. Essentially, it will consider code between escape points to be monolithic, and require only the first read and the last write to fire traces. It will be permissible to optimize less aggressively and fire other traces corresponding to accesses in the program - but only the first read and the last write will be guaranteed.)

I've started a writeup on how I propose to do the optimization, which can be found at http://core.tcl.tk/tclquadcode/doc/kbk-refactor-callframe/doc/20190216callframe/callframe.md - none of what is written there is implemented yet! The sections on loop-invariant code motion have yet to be written, although I have several pages of scribblings on the subject in a paper notebook.

Essentially, upward motion will follow the same general plan as anticipability analysis in 'quadcode/pre.tcl': insert 'moveFromCallFrame' at exit from blocks where it is fully anticipable and partially available on entry to the successor but not available on exit in the predecessor. (This is a generalization of loop-invariant code motion.) Downward motion is very nearly the inverse: instead of looking for SSA values that are already partly available, it looks for Tcl values that are live in a successor and partly anticipated on entry to the successor but not fully anticipated in the successor, and inserts the 'moveToCallFrame' on entry to the successor. (I need to work out some details before I can write this one up in full.)

I think that these changes will actually exceed the performance of the current code, because they can be more aggressive about eliminating data motion. They're also fairly flexible, since the data flows are fairly abstract. We can tweak the effects of individual instructions if we don't like the escape analysis.

Anyway, I'd appreciate it if Donal at least could have a look at what I've got so far and let me know if I'm totally barking up the wrong tree.

http://core.tcl.tk/tclquadcode/doc/kbk-refactor-callframe/doc/20190216callframe/callframe.md

Kevin |
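The loop-invariant case of the upward motion discussed in this message can be illustrated with a deliberately tiny model. This is a sketch, not the pre.tcl dataflow: the block representation and the helper name `hoistable_loads` are invented here, and only the op names `moveFromCallFrame`, `moveToCallFrame`, and `invoke` come from the discussion above. It captures the simplest safe condition: a load may be hoisted to the loop preheader when the loop body neither stores the variable nor reaches an escape point that could change it behind our backs.

```python
# Toy sketch of loop-invariant hoisting of callframe loads. A drastic
# simplification of the anticipability/availability analysis described
# in the message; names other than the quadcode ops are invented.

def hoistable_loads(loop_body):
    """loop_body: list of (op, var) tuples.
    Returns the set of variables whose loads can be hoisted out."""
    loaded = {var for op, var in loop_body if op == 'moveFromCallFrame'}
    clobbered = {var for op, var in loop_body if op == 'moveToCallFrame'}
    # An invoke is an escape point: it may modify any aliased variable,
    # so nothing is safely invariant across it in this simple model.
    if any(op == 'invoke' for op, var in loop_body):
        return set()
    return loaded - clobbered
```

In the real analysis the escape points and per-instruction effects would come from the quadcode effect annotations rather than a blanket "invoke clobbers everything" rule.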
From: Kevin K. <kev...@gm...> - 2019-02-05 18:53:01
|
One more thing with returns. For [return -level N -code C $value] with N>1, there's some jiggery-pokery at the [return] site, to set iPtr->returnLevel to N, iPtr->returnCode to C, and then return TCL_RETURN. That means that there needs to be no check for the level at the call site until we've detected that we're in the TCL_RETURN path.

For optimization, I'd like to keep track of which exception codes (ERROR, RETURN, BREAK, CONTINUE, other) a proc might return, so as to be able to remove excess checks from after the invoke. I think that the base case before optimization will have to look something like:

    result ← invoke(callframe, command, arg1, ...)
    callframe ← extractCallframe(result)
    result ← retrieveResult(result)
    jumpMaybe exceptionPath, result
    jump normalPath
    normalPath:
        result ← extractMaybe(result)
        ... execution continues ...
    exceptionPath:
        # NOTE: This is where the hooks for escape to interpreted code,
        # or for dynamic recompilation, would go.
        # QUESTION: Do we need to check iPtr->execEnvPtr->rewind?
        result ← extractFail(result)
        # If the innermost exception range is a loop (track this in translate.tcl):
        r1 ← resultIsBreak(result)      # new test for TCL_BREAK
        jumpTrue breakPC, r1
        jump e1
    e1:
        r1 ← resultIsContinue(result)   # new test for TCL_CONTINUE
        jumpTrue continuePC, r1
        jump e2
        # end of code handling the loop exception range
    e2:
        # QUESTION: Code to append to the backtrace (analogous to TclLogCommandInfo) needed?
        # QUESTION: Do we need to check TclCanceled(interp)? TclLimitExceeded(iPtr->limit)?
        # If there's an enclosing catch, jump to it, and otherwise fall out to the error pathway.
        jump enclosingCatchOrError

In addition, once we're on the error path of a proc, we need to check for abnormal returns. In bytecode, this goes through TEOEx_ByteCodeCallback. It checks whether -level has reached 0 - if it hasn't, we'll be passing the exception through to the proc's caller.

When -level has reached zero, if the return code is TCL_RETURN, it unboxes the options to replace with whatever they were hiding. Otherwise, if the result is not TCL_OK or TCL_ERROR, it formats 'invoked "break" outside a loop' or 'invoked "continue" outside a loop' or 'command returned bad code: C' and throws that as an error. I think we can detect this case statically, because we have the exception range at hand when compiling for the invoke: if there's neither a loop nor a catch exception range and a command might return BREAK, CONTINUE or [other], have the test for the result code jump to an 'initException / jump error' sequence.

So we have:

-- on return: we need to handle -level by updating iPtr->returnLevel and iPtr->returnCode.

-- after an [invoke] etc. returns: we need to check for BREAK/CONTINUE (vs the innermost exception range, with different handling for loop and catch), have the special handling for BREAK/CONTINUE/[other] to turn into errors if nothing catches them, and have all the logic to unpack -level when the result code is TCL_RETURN.

It would be desirable to have FAIL divide into multiple subtypes: 'could be an error', 'could be a break', 'could be a continue', 'could be a return', 'could be an unusual return code', so as to be able to optimize out the checks for the impossible cases. These subtypes wouldn't need any separate internal rep (and I could have the typename be FAIL for all of them), since they would all be 'integer Tcl result code, plus whatever baggage that carries in the interpreter internals'. |
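The -level bookkeeping described in this message can be modeled in a few lines. This is a toy model, not Tcl's C implementation: only the returnLevel/returnCode field names mirror iPtr, and the function names are invented for illustration. The point it demonstrates is the one made above: the [return] site stashes the level and code and hands TCL_RETURN upward, and each proc exit decrements the level, unboxing the carried -code only when the level reaches zero.

```python
# Toy model of [return -level N -code C] unwinding. Field names mirror
# iPtr->returnLevel / iPtr->returnCode from the message; everything
# else here is invented for illustration.

TCL_OK, TCL_ERROR, TCL_RETURN, TCL_BREAK, TCL_CONTINUE = range(5)

class Interp:
    def __init__(self):
        self.returnLevel = 0
        self.returnCode = TCL_OK

def do_return(interp, level, code):
    """The [return] site: stash -level and -code, hand TCL_RETURN upward."""
    if level == 0:
        return code                   # -level 0: the code takes effect here
    interp.returnLevel = level
    interp.returnCode = code
    return TCL_RETURN

def proc_exit(interp, result_code):
    """What each enclosing proc's exit path does with the passing result."""
    if result_code == TCL_RETURN:
        interp.returnLevel -= 1
        if interp.returnLevel <= 0:
            return interp.returnCode  # level exhausted: unbox the -code
    return result_code
```

For example, [return -level 2 -code break] travels as TCL_RETURN through one proc exit and only becomes TCL_BREAK at the second, which is why no level check is needed at the call site until the TCL_RETURN path has been detected.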
From: Kevin K. <kev...@gm...> - 2019-02-04 05:00:28
|
Following up a little farther: One related issue is that I'm also currently sort of blocked on implementing further procedure inlining, largely because I don't know how I might go about reporting an error in an inner context. Right now, it appears that there's a hidden pathway by which /@debug-line/ and /@debug-script/ information gets into /initException/. Moreover, I have no idea how I might create the correct internal values in the interpreter for an inner context when throwing an error. (To say nothing of what to do with [return -code break] and [return -code continue], but I already discussed those.)

On Sun, Feb 3, 2019 at 9:29 PM Kevin Kenny <kev...@gm...> wrote:
> On 2/3/19 9:24 AM, Donal K. Fellows wrote:
> > I've been trying to comprehend what is going on (and going wrong) with a number of our test cases: the problems usually seem to be closely related to the results of procedures. I'm not sure that we're modelling results correctly: they're actually fairly annoyingly complex, so if we're getting stuff wrong then that might be the cause of at least some of the trouble that we've been experiencing.
>
> I don't think it's as bad as you imagine.
>
> The fundamental problem that I'm aware of is that we are NOT handling the 'return', 'break', 'continue' and 'other' exception codes correctly when they're coming from a procedure.
>
> All the agonizing over what the fields in the interpreter are is most likely not necessary. A FAIL is simply 'a nonzero return code, together with whatever it's left in the interp' (objResult, errorInfo, errorCode, errorStack, options dictionary). (Note that [return -code] is capable of creating unusual combinations of these, but meaningless fields are generally ignored.) A FAIL SOMETHINGELSE is simply the union of that return code and the SOMETHINGELSE data type, whatever that is.
>
> So what we should have for the internal representation is:
>
> FAIL - As long as we can get the return value from the interp, then FAIL can adequately be represented with just an integer.
>
> FAIL SOMETHINGELSE - An integer, plus a SOMETHINGELSE whose value is meaningful only if the integer is TCL_OK.
>
> Anything else - Of course, whatever the type's native representation is.
>
> The reason that we can get away with that is that the bytecode->quadcode translation is coded so that there ought only ever to be one FAIL in flight at a given time. All the stuff that's in the interp doesn't need to be copied around in FAIL objects, because it can just be retrieved from the interp again when needed.
>
> Where I *know* things are going badly wrong is in the compilation of all /invoke/ sequences. We don't examine loop exception ranges at all in quadcode/translate.tcl. Instead, any nonzero status winds up going to the enclosing [catch] if there is one, and to the error return block for the procedure otherwise. In TEBC, the action will be complicated:
>
> First, examine -level, and if it's still nonzero, jump to the enclosing [catch] or return.
>
> If the -level is zero, then the return code determines the correct action:
>
> TCL_OK - extract the return value and continue.
>
> TCL_ERROR - go to the enclosing [catch] or return.
>
> TCL_RETURN - replace the return code with the -code, and go to the enclosing [catch] or return.
>
> TCL_BREAK - go to the 'break' in the innermost exception range if it's a loop, or to the enclosing [catch] otherwise. If there's no enclosing [catch], throw a 'break but not in a loop' error.
>
> TCL_CONTINUE - go to the 'continue' in the innermost loop exception range, or to the enclosing [catch]. If there's no enclosing [catch], throw 'continue but not in a loop.'
>
> Anything else - go to the enclosing [catch] or return.
>
> (This is approximate, from memory, so don't believe me, believe the code in TEBC. It's close, though.)
>
> All of this is done by TEBC implicitly, just as a result of a return from /invoke/ or /invokeExpanded/. Quadcode doesn't seem to have any particular plan for handling most of these cases - only a zero -level and an OK or ERROR return code appear to be addressed.
>
> When we first started implementing /invoke/, I recall expressing some concern about exception ranges, and hearing an answer where the gist was either "don't worry about it," or "don't worry about it *yet*." I don't recall which, but apparently the latter ought to have been the answer in any case. Has the time come to worry about it?
|
From: Kevin K. <kev...@gm...> - 2019-02-04 02:30:01
|
On 2/3/19 9:24 AM, Donal K. Fellows wrote:
> I've been trying to comprehend what is going on (and going wrong) with
> a number of our test cases: the problems usually seem to be closely
> related to the results of procedures. I'm not sure that we're modelling
> results correctly: they're actually fairly annoyingly complex, so if
> we're getting stuff wrong then that might be the cause of at least some
> of the trouble that we've been experiencing.

I don't think it's as bad as you imagine.

The fundamental problem that I'm aware of is that we are NOT handling the 'return', 'break', 'continue' and 'other' exception codes correctly when they're coming from a procedure.

All the agonizing over what the fields in the interpreter are is most likely not necessary. A FAIL is simply 'a nonzero return code, together with whatever it's left in the interp' (objResult, errorInfo, errorCode, errorStack, options dictionary). (Note that [return -code] is capable of creating unusual combinations of these, but meaningless fields are generally ignored.) A FAIL SOMETHINGELSE is simply the union of that return code and the SOMETHINGELSE data type, whatever that is.

So what we should have for the internal representation is:

FAIL - As long as we can get the return value from the interp, then FAIL can adequately be represented with just an integer.

FAIL SOMETHINGELSE - An integer, plus a SOMETHINGELSE whose value is meaningful only if the integer is TCL_OK.

Anything else - Of course, whatever the type's native representation is.

The reason that we can get away with that is that the bytecode->quadcode translation is coded so that there ought only ever to be one FAIL in flight at a given time. All the stuff that's in the interp doesn't need to be copied around in FAIL objects, because it can just be retrieved from the interp again when needed.

Where I *know* things are going badly wrong is in the compilation of all /invoke/ sequences. We don't examine loop exception ranges at all in quadcode/translate.tcl. Instead, any nonzero status winds up going to the enclosing [catch] if there is one, and to the error return block for the procedure otherwise. In TEBC, the action will be complicated:

First, examine -level, and if it's still nonzero, jump to the enclosing [catch] or return.

If the -level is zero, then the return code determines the correct action:

TCL_OK - extract the return value and continue.

TCL_ERROR - go to the enclosing [catch] or return.

TCL_RETURN - replace the return code with the -code, and go to the enclosing [catch] or return.

TCL_BREAK - go to the 'break' in the innermost exception range if it's a loop, or to the enclosing [catch] otherwise. If there's no enclosing [catch], throw a 'break but not in a loop' error.

TCL_CONTINUE - go to the 'continue' in the innermost loop exception range, or to the enclosing [catch]. If there's no enclosing [catch], throw 'continue but not in a loop.'

Anything else - go to the enclosing [catch] or return.

(This is approximate, from memory, so don't believe me, believe the code in TEBC. It's close, though.)

All of this is done by TEBC implicitly, just as a result of a return from /invoke/ or /invokeExpanded/. Quadcode doesn't seem to have any particular plan for handling most of these cases - only a zero -level and an OK or ERROR return code appear to be addressed.

When we first started implementing /invoke/, I recall expressing some concern about exception ranges, and hearing an answer where the gist was either "don't worry about it," or "don't worry about it *yet*." I don't recall which, but apparently the latter ought to have been the answer in any case. Has the time come to worry about it? |
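The TEBC dispatch described in the message above (examine -level first, then branch on the return code against the innermost exception range) can be sketched as a small decision function. This is an illustrative model, not TEBC or the quadcode translator: the function name, the symbolic jump targets, and the `innermost_range`/`has_catch` parameters are invented here; only the TCL_* result codes are real Tcl constants.

```python
# Hedged sketch of the per-result-code dispatch at an invoke site, as
# described in the message. Jump targets are symbolic strings; in real
# code generation they would be basic-block labels.

TCL_OK, TCL_ERROR, TCL_RETURN, TCL_BREAK, TCL_CONTINUE = range(5)

def dispatch(code, innermost_range, has_catch):
    """innermost_range: 'loop', 'catch', or None.
    has_catch: whether any enclosing [catch] exists."""
    if code == TCL_OK:
        return 'continueExecution'        # extract the return value, go on
    if code == TCL_RETURN:
        # Whether -level is still nonzero (pass the exception through) or
        # has reached zero (replace the code with the carried -code), the
        # jump target is the same; only the interp state differs.
        return 'catch' if has_catch else 'procReturn'
    if code == TCL_BREAK:
        return ('breakTarget' if innermost_range == 'loop'
                else 'catch' if has_catch else 'error:break-outside-loop')
    if code == TCL_CONTINUE:
        return ('continueTarget' if innermost_range == 'loop'
                else 'catch' if has_catch else 'error:continue-outside-loop')
    # TCL_ERROR and any unusual return code
    return 'catch' if has_catch else 'procReturn'
```

As the message notes, the real behavior lives in TEBC and is more subtle than this; the sketch only shows the branch structure that quadcode currently collapses into "any nonzero status goes to the enclosing [catch] or the error block".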
From: Donal K. F. <don...@ma...> - 2019-02-03 14:24:29
|
I've been trying to comprehend what is going on (and going wrong) with a number of our test cases: the problems usually seem to be closely related to the results of procedures. I'm not sure that we're modelling results correctly: they're actually fairly annoyingly complex, so if we're getting stuff wrong then that might be the cause of at least some of the trouble that we've been experiencing. Warning: what follows isn't very organized. :-)

Results in Tcl consist of an immediate int value (the result code) and a number of values in the interpreter. Not all parts are always present. (I hope my summary below is correct: note that LIST and DICT are STRING on most branches in our codebase, but the type restrictions there are obvious.)

Code: 0 (TCL_OK)
    interp->objResultPtr defined (wraps a result of type α)

Code: 1 (TCL_ERROR)
    interp->objResultPtr defined (wraps a message of type STRING)
    interp->errorInfo defined (STRING)
    interp->errorCode defined (LIST[*])
    interp->returnOpts defined (DICT)

Code: 2 (TCL_RETURN)
    interp->objResultPtr defined (wraps a result of type βⁿ)
        βⁿ need not be consistent with α or δ
        βⁿ need not be unified with other βⁿ (depending on returnLevel)
        βⁿ should be unified with α for correct level when returnLevel
            matches and returnCode is 0, or with δ for correct level when
            returnLevel matches and returnCode is 'other'.
    interp->returnCode defined (int)
    interp->returnOpts defined (DICT)
    interp->returnLevel defined (int)

Code: 3 (TCL_BREAK)
    no extra fields defined (interp->objResultPtr is EMPTY)
    RARE: most [break] jumps end up directly routed

Code: 4 (TCL_CONTINUE)
    no extra fields defined (interp->objResultPtr is EMPTY)
    RARE: most [continue] jumps end up directly routed

Code: other (≤ -1 or ≥ 5)
    interp->objResultPtr defined (wraps a result of type δ)
        δ need not be consistent with α or βⁿ
        δ should be unified with other δ from the same function
        δ might need to be required to be STRING
    interp->returnOpts defined (DICT)

[I didn't use a gamma because it looked too like a “v”.]

This is ugly stuff. The current implementation is that a FAIL is a non-0 return code (with everything else implicit in the interpreter) and that a FOO FAIL is a tuple of an int and a FOO (the FOO is defined only for the 0 return code case). That's clearly not correct. Non-zero returns from a procedure should go to the exception-handling coda that forms the right error state for the caller (and builds the error trace and so on) and produces a FAIL; that's the main objective of the ‘procLeave’ opcode. I suspect that there's something going wrong.

Donal.

[* Formally not actually true, except random things break if it isn't. I ought to propose enforcing this in a TIP.] |
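The taxonomy in the message above can be restated as a small lookup table, which is handy for checking invariants mechanically. Nothing here is new information: the field names are the real Tcl interp internals from the message, and the table is just that summary as data (the type variables α/βⁿ/δ are omitted since they describe unification constraints, not field presence).

```python
# The per-result-code "which interp fields are defined" summary from the
# message, restated as data. Purely illustrative; the authoritative
# description is Tcl's own source.

FIELDS_BY_CODE = {
    0: {'objResultPtr'},                                          # TCL_OK
    1: {'objResultPtr', 'errorInfo', 'errorCode', 'returnOpts'},  # TCL_ERROR
    2: {'objResultPtr', 'returnCode', 'returnOpts', 'returnLevel'},  # TCL_RETURN
    3: set(),   # TCL_BREAK: objResultPtr is EMPTY, no extra fields
    4: set(),   # TCL_CONTINUE: likewise
}

def defined_fields(code):
    """Fields meaningful for a given result code ('other' = <= -1 or >= 5)."""
    if code in FIELDS_BY_CODE:
        return FIELDS_BY_CODE[code]
    return {'objResultPtr', 'returnOpts'}
```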
From: Kevin K. <kev...@gm...> - 2019-01-29 19:42:37
|
Donal,

As you know, I'm working on refactoring the moveToCallFrame and moveFromCallFrame stuff to be amenable to better analysis. The first step is to get everything running correctly, if atrociously slowly, by making all references to Tcl variables access the copies in the callframe. (Then I can put the optimizations back in, and have a better theory of safety, one that can even possibly be amenable to some traces.)

I'm still struggling with getting everything working. I've managed to reduce the number of test failures in the 'kbk-refactor-callframe' branch substantially, but there is still a handful of things that I'm grappling with.

====

1. magicreturn

The 'magicreturn' test is crashing code generation. I think that the issue is that the case:

    set l 0
    set z [return -level $l [expr {$y + $y}]]

has never tested what it purported to test. By the time we implemented [return -level], the quadcode front end already had constant folding, so it generated code similar to the first case. The return options reduced to {literal {-level 0}}. Now that $l is being retrieved from the callframe (and the retrieval is intentionally not yet being optimized), we are generating a code burst that looks something like:

    moveFromCallFrame {var l} {temp callframe} {literal l}
    list {temp 0} {literal -level} {var l}
    ... argument checking for 'add' ...
    add {temp 1} {temp y} {temp y}
    initException {temp exception} {temp 1} {temp 0}

so that the type signature of the initException will be NUMERIC -> STRING -> FAIL NUMERIC. This runs through 'compile' and the Big Switch passes it off to InitException in compile.tcl. It doesn't hit the pretty '-level 0' special case, and now (for the first time, perhaps?) gets into the code that looks like

    my StoreResult $tgt [$b $opcode {*}$vals $value $maintype \
        $errorCode $name]

and it's not finding an appropriately typed initException. Is there supposed to be some sort of promotion of NUMERIC here, or is there something missing that's supposed to allow it to promote a NUMERIC to a FAIL NUMERIC?

====

2. lsetest

lsetest still fails mysteriously, and I haven't yet found an explanation. I'd appreciate a second set of eyeballs. It's almost certainly something I'm doing in code gen, but I'm just not seeing it!

====

3. direct*

Because of aliases in the callframe, all the direct* operations at least notionally need to appear to modify the callframe for data flow analysis to work. If you could give me a worked example of modifying one of them (directAppend or directIncr, perhaps), or at least a sketch of how to proceed, to change from

    directAppend listOrErrorOut listIn value...
    jumpMaybe catchBlock, listOrErrorOut

to

    directAppend combinedOut callframeIn listIn value...
    extractCallFrame callframeOut combinedOut
    retrieveResult listOrErrorOut listIn value...
    jumpMaybe catchBlock, listOrErrorOut

that would be a huge help. Mostly, I need to learn how to construct and manage reference counting for the combined result. In the case of directAppend, the result type will be a CALLFRAME FAIL STRING, and it isn't clear to me what all the correct accessors are. I think that http://core.tcl.tk/tclquadcode/tktview?name=58e3e71962 may relate to this issue as well.

(There is also a raft of TODO comments around the direct ops in translate.tcl, having to do with direct operations on non-fully-qualified names. Was this something you were leaving for me to chase? Can we at least have tickets for these, if that's the case?)

====

I'd really like to have the tests stable again before I start plugging away on rudimentary escape analysis, which would then allow us to attempt all four combinations of elimination of operations:

+ load-load (the same variable pulled from the callframe or a namespace twice without anything intervening that might have changed it),

+ load-store (saving a value to the callframe that is known to be identical to the value that's already there, and for which any necessary traces have fired),

+ store-load (retrieving a value from the callframe that cannot have changed since we put it there), and

+ store-store (overwriting a value in the callframe, that nothing can have examined, with a same or different value).

This should be more stable and flexible than the current approach of updating the callframe only at escape points (mostly invokes). It should also allow some amount of code motion; a load operation inside a loop that does not modify the variable can be hoisted out of the loop, for instance.

Part of the reason I did pre.tcl is that it provides a good partial framework in which to implement this sort of thing. The concepts of 'available value' and 'anticipable value' that it uses generalize readily to slots in the callframe. The fact that it's working well now gives me some hope that we can get to intelligent load/store management. But I'm really bogged down with this last handful of tests, and really reluctant to move farther forward until they're stable.

Help? |
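The four eliminations listed in the message above can be demonstrated with a small peephole pass over a straight-line region. This is a hedged sketch, not the quadcode implementation: the region representation, the symbolic value tokens, and the helper name `eliminate` are all invented for illustration, and it assumes no escape points (invokes) occur inside the region, since an escape could both observe stores and invalidate loads.

```python
# Illustrative peephole over a straight-line, escape-free region.
# Instructions are ('load', var, value) / ('store', var, value) tuples,
# with values tracked symbolically so load-store (writing back a value
# that is already in the slot) can be caught too.

def eliminate(region):
    """Return the instructions that must remain after applying the
    load-load, load-store, store-load, and store-store eliminations."""
    kept = []
    known = {}        # var -> symbolic value currently in the slot
    last_store = {}   # var -> index in `kept` of its most recent store
    for instr in region:
        op, var, val = instr
        if op == 'load':
            if var in known:
                continue                      # load-load / store-load
            known[var] = val
            kept.append(instr)
        elif op == 'store':
            if known.get(var) == val:
                continue                      # load-store: already there
            if var in last_store:
                kept[last_store[var]] = None  # store-store: earlier one dead
            last_store[var] = len(kept)
            known[var] = val
            kept.append(instr)
    return [i for i in kept if i is not None]
```

Note that dropping a store also drops the write trace it would have fired, which is exactly why the message qualifies load-store with "for which any necessary traces have fired".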
From: Donal K. F. <don...@ma...> - 2019-01-06 08:58:37
|
Optimization is hard. Even small bugs can result in code that can be exploited to do arbitrary code execution. https://abiondo.me/2019/01/02/exploiting-math-expm1-v8/ Fortunately for us, I don't think Tcl can possibly be vulnerable this particular way as we don't conflate integers (and indices) with floating point numbers (and so can't have the strangenesses in the IEEE spec cross over into the critical domain to catch us unawares). But it is still a good cautionary tale. Donal. |
From: Donal K. F. <don...@ma...> - 2019-01-04 08:59:20
|
On 03/01/2019 20:23, Kevin Kenny wrote:
> (a) Read traces (if we support them at all) will fire at least once
> for any quadcode sequence that reads a value from a namespace
> variable. Subsequent reads of a value that is known not to have been
> modified since the previous read need not be traced.

The major issue with read traces is linked variables (as created with Tcl_LinkVar), since not all user code is good about notifying us when it has updated the underlying C variable. Sometimes that's for good reason (such as it being a state variable of a non-Tcl system) but the upshot is the same: for variables with a read trace, it isn't safe to assume anything about their value other than by reading it.

Donal. |
From: Kevin K. <kev...@gm...> - 2019-01-03 20:23:44
|
I had a few further thoughts about traces.

When we're talking about traces as a feature that MUST be supported for quadcode to represent an acceptable subset of the language, we're mostly talking about, "what is needed for Tk to work?" [1] That's a lot less general than "support all use cases of traces". Tk uses write traces only, and generally accumulates data about what has happened and defers acting upon it to the idle loop. Because it works this way, trace elision actually sounds to me like a fairly good idea. The concept here would be:

(a) Read traces (if we support them at all) will fire at least once for any quadcode sequence that reads a value from a namespace variable. Subsequent reads of a value that is known not to have been modified since the previous read need not be traced.

(b) Write traces will fire when the value of a program variable is stored back in the Var data structure. It is not required that every place that a variable is set in a quadcode sequence must correspond with a write trace. Rather, all namespace variables must be updated prior to invoking code that might change them or cede control, and prior to return to a call thunk. (In other words, require that the variables actually be in sync only when giving control to something that isn't quadcode.) In addition, variables in the callframe must be in sync when invoking any code that may access them. There are a fair number of Tcl commands that access variables in a controlled, predictable fashion (for example, [gets], [read], [regexp], [regsub], [scan] and [binary scan]), and it is permissible to defer synchronization when invoking a command that does not access the variable in question.

(c) Unset traces follow the same rule as write traces - they require synchronization prior to returning to a call thunk or invoking non-quadcode that might try to access the variable.
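The write-back discipline in points (b) and (c) can be sketched as a dirty-set that is flushed only at escape points. This is an invented illustration (the class and method names are not quadcode APIs): writes accumulate without firing traces, and synchronizing before control leaves quadcode fires exactly one write trace per modified variable, which is the "at least one trace fires" guarantee proposed above.

```python
# Hedged sketch of deferred write-back with trace firing at escape
# points. All names here are invented for illustration; only the
# discipline (sync before non-quadcode gets control) is from the text.

class FrameSync:
    def __init__(self):
        self.callframe = {}   # the authoritative Var slots
        self.dirty = {}       # pending writes not yet visible to traces
        self.trace_log = []   # records which write traces fired

    def set_var(self, name, value):
        """A quadcode-internal assignment: no trace fires yet."""
        self.dirty[name] = value

    def sync(self):
        """Called before any non-quadcode subsystem gets control
        (invoke of C code, return to a call thunk, coroutine transfer):
        flush pending writes, firing one write trace per variable."""
        for name, value in self.dirty.items():
            self.callframe[name] = value
            self.trace_log.append(('write', name))
        self.dirty.clear()
```

So three successive sets of the same variable cost one trace at the next escape point, which is precisely the saving claimed for Tk-style write traces.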
Essentially, this makes the tracing rule for quadcode be, "before any non-quadcode subsystem gets control, whether by call, return, or coroutine transfer, at least one write or unset trace has fired presenting the current value of each traced variable that has been modified." That keeps Tk (and other subsystems depending on linked variables) working, and in fact saves them work if the same variable is hit multiple times. It also allows us considerable latitude for moving reads and writes around, for instance by pushing them above or below loops in which they appear. It's surely a change in language semantics between interpreted Tcl and quadcode, but I think it's most likely a change for the better, given what traces appear to be used for 'in the wild'.[2]

----

[1] There's also ::env, but I think a bunch of us concluded quite a long time ago that the read trace on ::env is NOT desirable. I really ought to work up the TIP for per-interp env, initialized from the process environment at initial interp creation time, copied from the parent interp when creating a child interp, and copied to the process environment (under mutual exclusion) during [exec] and other C logic that needs the environment in sync. The fact that neither we nor POSIX defines any sort of synchronization requirement for environment variable changes is crazy. End of digression.

[2] I've also seen traces used for debugging instrumentation, in which case they *are* interested in each individual access or change to a variable. I'd imagine, though, that we'd want either to have a debugging option to the code generation to force such traces to be left in, or else run this sort of instrumentation only on interpreted code, rather than burden every use of quadcode with the cost of the additional hash table lookups and assignments.

On Wed, Jan 2, 2019 at 9:58 AM Kevin Kenny <kev...@gm...> wrote:
>
> On Wed, Jan 2, 2019 at 6:35 AM Donal K.
Fellows
> <don...@ma...> wrote:
> > On 02/01/2019 02:15, Kevin Kenny wrote:
> > In theory, there's no problem with having multiple CALLFRAMEs within a single function once they're distinct entities (due to inlining). I've no idea if this is actually practical, of course!
>
> That's something else that I wanted to discuss. Right now, I'm doing only the very simplest inlining. In effect, the inlined function has to be trivial enough that I don't need its callframe. That rules out having the inlined code do anything like [regexp] because without multiple callframes, there's no way for the C implementation of [regsub] to find the match variables. It also rules out throwing errors until and unless I figure out what to do with 'procLeave'. But that's an issue for another day.
>
> > > The new algorithms will be fairly aggressive about moving code around - so that, for instance, a read-only use of a namespace variable within a loop can be hoisted out of the loop if nothing reads or writes the variable or any alias. In designing this, I realize that the logic is about to founder on directSet, directGet, directArraySet, directArrayGet, etc. - fourteen operations in all. All of these at least notionally modify or access the state of variables, and so need to have the callframe among their parameters and results.
> >
> > Those operations are essentially designed for handling the cases where variables are accessed by qualified names, and CAN TRIGGER TRACES. We must handle them correctly, alas; lots of Tcl code is strongly dependent on traces. However, in the case that it is possible to prove that a variable name does not contain a namespace separator, something more efficient can be done; the side effects are much more contained (as we explicitly state that we won't support local variable traces).
>
> We do need to constrain scope somewhat! Traces, as they are currently defined, are a feature about which I say, "if you need the interpreter, you know where to find it!" Consider the following:
>
> proc tracer {n1 n2 op} {
>     uplevel 1 {set b 2}
> }
> trace add variable ::a write tracer
> proc x {} {
>     set b 1
>     set ::a 1
>     return $b
> }
> puts [x]
>
> In the presence of traces, no assumptions about the semantics of code are safe. 'tracer', instead of setting an irrelevant variable, could have redefined 'set' or something equally nasty.
>
> What I suspect will save us is that nobody sane does radical semantic alterations in performance-sensitive code. There will be some core set of algorithms where a program spends its time that don't need the full functionality of the interpreter and can be aggressively optimized.
>
> > You're correct that they should be returning CALLFRAME FAIL STRING in the general case.
>
> Right now, on trunk, none of these functions takes the callframe. It's just 'directSet varName value' and so on. That's the immediate modification I'm proposing.
>
> > > Beyond this, the next thing to think about will be breaking up direct* into smaller pieces. If 'directLappend result callframe listvar element' could turn into a sequence like
> > >
> > > directRef listRef callframe var
> > > refGet val listRef # don't know if I might
> > > lappend val2 val element
> > > refPut temp callframe listvar element
> > > extractCallframe callframe2 temp
> > > retrieveResult result temp
> > >
> > > (plus whatever error checking goes there)
> > >
> > > then I could start looking at aggregating all the directRefs to the same variable, hoisting them out of loops, and so on. Many of the instructions in straight-line sequences like this are optimized to zero-cost operations.
> >
> > Doing this is all about getting the semantics right. That's surprisingly difficult. For example, the code for the equivalent of:
> >
> > lappend ::foo::bar::var $a
> > lappend ::foo::bar::var $b
> > lappend ::foo::bar::var $c
> >
> > and
> >
> > lappend ::foo::bar::var $a $b $c
> >
> > needs to be able to avoid making extra references to the list in the variable if possible, or we risk transforming O(N) algorithms into O(N²), and nobody will thank us for that. (Early versions of 8.6 had exactly this bug due to my getting things slightly wrong in the bytecode.)
>
> Indeed. I was talking about making references to the variable, not the value. Already, for a local variable, we're able to retain the variable ref - it's in the callframe, after all - without trouble. As far as retaining the value ref goes, that's also likely not to be an issue because of the elimination of common subexpressions.
>
> I imagine a pattern like the following (all the error checking stuff suppressed):
>
> getVarRef aRef callframe "::foo::bar::var"
> lappendRef result aRef a
> unset result
> lappendRef result aRef b
> unset result
> lappendRef result aRef c
> unset aRef
>
> The only reference to the list is still in the variable, but we avoid the cost of resolution on all references except the first. Essentially, aRef is holding a Var*.
>
> But that's if we think that ::foo::bar::var must be in sync at all times, so read on...
>
> > Similarly, we *must* get the trace semantics mostly right on global variables, at least to the point of having at least one trace call in the right places so that access to variables synched into C code in extensions will work correctly. Also, I vaguely recall that [append] is strange in this regard. Or perhaps that was arguably a bug that is now fixed (or ought to be, for consistency's sake)?
>
> I think we're in agreement here, for sufficiently small values of 'mostly right'. Rather than 'all variables in sync at all times', I already have logic that assesses known effects. There are already some commands that are identified as 'this might change the value of any variable, but otherwise have known effects', plus some commands like [regexp] that have constrained effects: '[regexp] does not read any variable and writes exactly those variables whose names are passed as match variables'.
>
> So one definition of 'mostly right' could be that all variables are in sync before and after calls to C commands that might examine or alter them, and all namespace variables are in sync on procedure exit. Dropping traces that would have fired between those points might be fair game.
>
> > I suspect that these will reduce your scope for improvement a lot. Otherwise, I'm generally happy to do this. I'm guessing that it might be easiest to split things so that we have operations that produce a VAR type (not a subclass of STRING) that we can then do the other changes on. That'd at least let us factor out the (potentially expensive!) variable lookup operations. The actual reads and writes are usually much cheaper (other than that the values must always be boxed and traces are a true issue).
>
> Given the problem that traces have arbitrary and unknowable side effects, the best we can hope for is that some benign ones work.
>
> In such a world, there's also an opportunity for quadcode that we've not yet properly addressed. If we allow variables to be out of sync momentarily (provided that we've restored consistency any time we get to external code that might use them), then we could mostly eliminate the need for the K combinator. If we allow 'x' to have a transient state where its value is lost, then
>
> set x [lreplace $x $a $b {*}$newelements]
>
> would not incur a performance penalty with respect to the mysterious
>
> set x [lreplace $x[set x {}] $a $b {*}$newelements]
>
> because we could detect that the new value of x is always a newly-created value and avoid the extra ref.
(This would also involve > proving either that the [lreplace] cannot throw, or that the value of > $x will never again be used if it does.) > > Anyway, the immediate issue is that direct ops don't reference the > CALLFRAME formally, and need to. :) > > Kevin |
From: Kevin K. <kev...@gm...> - 2019-01-02 14:58:20
|
On Wed, Jan 2, 2019 at 6:35 AM Donal K. Fellows <don...@ma...> wrote:
> On 02/01/2019 02:15, Kevin Kenny wrote:
> In theory, there's no problem with having multiple CALLFRAMEs within a
> single function once they're distinct entities (due to inlining). I've
> no idea if this is actually practical, of course!

That's something else that I wanted to discuss. Right now, I'm doing only the very simplest inlining. In effect, the inlined function has to be trivial enough that I don't need its callframe. That rules out having the inlined code do anything like [regexp] because without multiple callframes, there's no way for the C implementation of [regsub] to find the match variables. It also rules out throwing errors until and unless I figure out what to do with 'procLeave'. But that's an issue for another day.

> > The new algorithms will be fairly aggressive about moving code around
> > - so that, for instance, a read-only use of a namespace variable
> > within a loop can be hoisted out of the loop if nothing reads or
> > writes the variable or any alias. In designing this, I realize that
> > the logic is about to founder on directSet, directGet, directArraySet,
> > directArrayGet, etc. - fourteen operations in all. All of these at
> > least notionally modify or access the state of variables, and so need
> > to have the callframe among their parameters and results.
>
> Those operations are essentially designed for handling the cases where
> variables are accessed by qualified names, and CAN TRIGGER TRACES. We
> must handle them correctly, alas; lots of Tcl code is strongly dependent
> on traces. However, in the case that it is possible to prove that a
> variable name does not contain a namespace separator, something more
> efficient can be done; the side effects are much more contained (as we
> explicitly state that we won't support local variable traces).

We do need to constrain scope somewhat!
Traces, as they are currently defined, are a feature about which I say, "if you need the interpreter, you know where to find it!" Consider the following:

proc tracer {n1 n2 op} {
    uplevel 1 {set b 2}
}
trace add variable ::a write tracer
proc x {} {
    set b 1
    set ::a 1
    return $b
}
puts [x]

In the presence of traces, no assumptions about the semantics of code are safe. 'tracer', instead of setting an irrelevant variable, could have redefined 'set' or something equally nasty.

What I suspect will save us is that nobody sane does radical semantic alterations in performance-sensitive code. There will be some core set of algorithms where a program spends its time that don't need the full functionality of the interpreter and can be aggressively optimized.

> You're correct that they should be returning CALLFRAME FAIL STRING in
> the general case.

Right now, on trunk, none of these functions takes the callframe. It's just 'directSet varName value' and so on. That's the immediate modification I'm proposing.

> > Beyond this, the next thing to think about will be breaking up direct*
> > into smaller pieces. If 'directLappend result callframe listvar
> > element' could turn into a sequence like
> >
> > directRef listRef callframe var
> > refGet val listRef # don't know if I might
> > lappend val2 val element
> > refPut temp callframe listvar element
> > extractCallframe callframe2 temp
> > retrieveResult result temp
> >
> > (plus whatever error checking goes there)
> >
> > then I could start looking at aggregating all the directRef's to the
> > same variable, hoisting them out of loops, and so on. Many of the
> > instructions in straight-line sequences like this are optimized to
> > zero-cost operations.
>
> Doing this is all about getting the semantics right. That's surprisingly
> difficult.
> For example, the code for the equivalent of:
>
> lappend ::foo::bar::var $a
> lappend ::foo::bar::var $b
> lappend ::foo::bar::var $c
>
> and
>
> lappend ::foo::bar::var $a $b $c
>
> needs to be able to avoid making extra references to the list in the
> variable if possible or we risk transforming O(N) algorithms into O(N²),
> and nobody will thank us for that. (Early versions of 8.6 had exactly
> this bug due to my getting things slightly wrong in the bytecode.)

Indeed. I was talking about making references to the variable, not the value. Already, for a local variable, we're able to retain the variable ref - it's in the callframe after all - without trouble. As far as retaining the value ref goes, that's also likely not to be an issue because of the elimination of common subexpressions.

I imagine a pattern like the following (all the error checking stuff suppressed):

getVarRef aRef callframe "::foo::bar::var"
lappendRef result aRef a
unset result
lappendRef result aRef b
unset result
lappendRef result aRef c
unset aRef

The only reference to the list is still in the variable, but we avoid the cost of resolution on all references except the first. Essentially, aRef is holding a Var*.

But that's if we think that ::foo::bar::var must be in sync at all times, so read on...

> Similarly, we *must* get the trace semantics mostly right on global
> variables, at least to the point of having at least one trace call in
> the right places so that access to variables synched into C code in
> extensions will work correctly. Also, I vaguely recall that [append] is
> strange in this regard. Or perhaps that was arguably a bug that is now
> fixed (or ought to be, for consistency's sake)?

I think we're in agreement here, for sufficiently small values of 'mostly right'. Rather than 'all variables in sync at all times', I already have logic that assesses known effects.
There are already some commands that are identified as, 'this might change the value of any variable, but otherwise have known effects', plus some commands like [regexp] that have constrained effects: '[regexp] does not read any variable and writes exactly those variables whose names are passed as match variables'.

So one definition of 'mostly right' could be that all variables are in sync before and after calls to C commands that might examine or alter them, and all namespace variables are in sync on procedure exit. Dropping traces that would have fired between those points might be fair game.

> I suspect that these will reduce your scope for improvement a lot.
> Otherwise, I'm generally happy to do this. I'm guessing that it might be
> easiest to split things so that we have operations that produce a VAR
> type (not a subclass of STRING) that we can then do the other changes
> on. That'd at least let us factor out the (potentially expensive!)
> variable lookup operations. The actual reads and writes are usually much
> cheaper (other than that the values must always be boxed and traces are
> a true issue).

Given the problem that traces have arbitrary and unknowable side effects, the best we can hope for is that some benign ones work.

In such a world, there's also an opportunity for quadcode that we've not yet properly addressed. If we allow variables to be out of sync momentarily (provided that we've restored consistency any time we get to external code that might use them), then we could mostly eliminate the need for the K combinator. If we allow 'x' to have a transient state where its value is lost, then

set x [lreplace $x $a $b {*}$newelements]

would not incur a performance penalty with respect to the mysterious

set x [lreplace $x[set x {}] $a $b {*}$newelements]

because we could detect that the new value of x is always a newly-created value and avoid the extra ref.
(This would also involve proving either that the [lreplace] cannot throw, or that the value of $x will never again be used if it does.) Anyway, the immediate issue is that direct ops don't reference the CALLFRAME formally, and need to. :) Kevin |
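The coalescing rule sketched in the message above - variables need only be in sync at escape points, so between escapes only the first read and the last write of a variable must really touch the callframe - can be modelled in a few lines. This is a hypothetical Python sketch with invented instruction markers ('read'/'write'/'escape'), not the quadcode implementation:

```python
# Hypothetical model of the "in sync only at escape points" rule.
# Instruction stream: ('read', v), ('write', v), or ('escape',) for a
# point where external code might observe the callframe.  Between escape
# points, only the first read and the last write of each variable must go
# via the callframe; the rest can use the SSA value directly.

def required_frame_accesses(instrs):
    """Return indices of instructions that must really touch the callframe."""
    required = set()
    seen_read = set()        # vars already loaded in this region
    last_write = {}          # var -> index of most recent pending store

    def flush():
        # Escape point: the latest store of each variable must have reached
        # the frame; reads start afresh afterwards.
        required.update(last_write.values())
        seen_read.clear()
        last_write.clear()

    for i, instr in enumerate(instrs):
        if instr[0] == 'escape':
            flush()
        elif instr[0] == 'read':
            # Only the first read needs a real load, and none is needed if
            # a pending store already gives us the value.
            if instr[1] not in seen_read and instr[1] not in last_write:
                required.add(i)
            seen_read.add(instr[1])
        elif instr[0] == 'write':
            last_write[instr[1]] = i   # only the last store must land
    flush()                            # procedure exit is an escape point
    return sorted(required)
```

Under this model, two back-to-back writes with no intervening escape fire only one store, which is exactly the "first read, last write" guarantee discussed above.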
From: Donal K. F. <don...@ma...> - 2019-01-02 11:36:09
|
On 02/01/2019 02:15, Kevin Kenny wrote:
> I imagine that it's a pretty mechanical change because at present the
> CALLFRAME doesn't actually mean anything. It's a notional thing that
> keeps me from reordering quadcode in a way that would hoist loads
> above the corresponding stores, that sort of thing, but there is only
> one CALLFRAME per proc, ever, so there doesn't necessarily need to be
> anything in the CALLFRAME value to designate the callframe. (On the
> other hand, I think I see that you have given an internal
> representation to the CALLFRAME data type, so I'm not sure what's
> going on there.)

I need to pass actual call frames to some functions that manipulate them, especially the ones for setting the frame up and writing to and reading from it for external calls. IIRC, it's a 'struct CallFrame*' in all but name.

In theory, there's no problem with having multiple CALLFRAMEs within a single function once they're distinct entities (due to inlining). I've no idea if this is actually practical, of course!

> The new algorithms will be fairly aggressive about moving code around
> - so that, for instance, a read-only use of a namespace variable
> within a loop can be hoisted out of the loop if nothing reads or
> writes the variable or any alias. In designing this, I realize that
> the logic is about to founder on directSet, directGet, directArraySet,
> directArrayGet, etc. - fourteen operations in all. All of these at
> least notionally modify or access the state of variables, and so need
> to have the callframe among their parameters and results.

Those operations are essentially designed for handling the cases where variables are accessed by qualified names, and CAN TRIGGER TRACES. We must handle them correctly, alas; lots of Tcl code is strongly dependent on traces.
However, in the case that it is possible to prove that a variable name does not contain a namespace separator, something more efficient can be done; the side effects are much more contained (as we explicitly state that we won't support local variable traces).

You're correct that they should be returning CALLFRAME FAIL STRING in the general case.

> Beyond this, the next thing to think about will be breaking up direct*
> into smaller pieces. If 'directLappend result callframe listvar
> element' could turn into a sequence like
>
> directRef listRef callframe var
> refGet val listRef # don't know if I might
> lappend val2 val element
> refPut temp callframe listvar element
> extractCallframe callframe2 temp
> retrieveResult result temp
>
> (plus whatever error checking goes there)
>
> then I could start looking at aggregating all the directRef's to the
> same variable, hoisting them out of loops, and so on. Many of the
> instructions in straight-line sequences like this are optimized to
> zero-cost operations.

Doing this is all about getting the semantics right. That's surprisingly difficult. For example, the code for the equivalent of:

lappend ::foo::bar::var $a
lappend ::foo::bar::var $b
lappend ::foo::bar::var $c

and

lappend ::foo::bar::var $a $b $c

needs to be able to avoid making extra references to the list in the variable if possible or we risk transforming O(N) algorithms into O(N²), and nobody will thank us for that. (Early versions of 8.6 had exactly this bug due to my getting things slightly wrong in the bytecode.)

Similarly, we *must* get the trace semantics mostly right on global variables, at least to the point of having at least one trace call in the right places so that access to variables synched into C code in extensions will work correctly. Also, I vaguely recall that [append] is strange in this regard. Or perhaps that was arguably a bug that is now fixed (or ought to be, for consistency's sake)?
I suspect that these will reduce your scope for improvement a lot. Otherwise, I'm generally happy to do this. I'm guessing that it might be easiest to split things so that we have operations that produce a VAR type (not a subclass of STRING) that we can then do the other changes on. That'd at least let us factor out the (potentially expensive!) variable lookup operations. The actual reads and writes are usually much cheaper (other than that the values must always be boxed and traces are a true issue). Donal. |
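The O(N) → O(N²) hazard discussed in this exchange comes from copy-on-write semantics: a stray extra reference to the object holding the list forces each append to duplicate the whole list before mutating it. A toy Python model (the class and instrumentation counters are invented for illustration; real Tcl_Obj sharing is more involved) makes the cost visible:

```python
# Toy model (not Tcl's actual implementation) of why an extra reference
# to the list in a variable turns repeated [lappend] from O(N) into
# O(N^2): a shared value must be copied before it can be mutated.

class SharedList:
    def __init__(self, items=()):
        self.items = list(items)
        self.refcount = 1

def lappend(obj, value, stats):
    if obj.refcount > 1:                     # shared: copy-on-write
        stats['copied_elements'] += len(obj.items)
        new = SharedList(obj.items)          # full O(len) duplication
        obj.refcount -= 1                    # old copy stays with other holder
        obj = new
    obj.items.append(value)                  # unshared: cheap in-place append
    return obj

def run(n, hold_extra_ref):
    """Append n elements; optionally leak an extra ref before each append."""
    var = SharedList()
    stats = {'copied_elements': 0}
    for i in range(n):
        if hold_extra_ref:
            var.refcount += 1                # e.g. a stray ref left on a stack
        var = lappend(var, i, stats)
    return stats['copied_elements']
```

With no extra reference the loop copies nothing; with one, every iteration copies the whole list so far, summing to N(N-1)/2 element copies - the quadratic blow-up nobody will thank us for.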
From: Kevin K. <kev...@gm...> - 2019-01-02 02:15:42
|
Donal,

Happy 2019! Hope you're well.

I've not been around on the mailing lists or the chat the last couple of weeks - this has been a time for spending with family and friends. But in my odd moments, I have been looking at quadcode some. (In particular, jump threading has a shiny new implementation that is many times faster than the old, and in addition, generates a much smaller explosion of the state space.) I've also done a new implementation of Global Value Numbering/Partial Redundancy Elimination (GVN-PRE) that subsumes common subexpression elimination and loop-invariant code motion, and could (not yet implemented, but possible!) also include constant folding (rather than having that in a separate pass), expression simplification (things like 0+a=a) and reduction of operator strength. Some of that I want to leave to LLVM, but I doubt that it can do a very good job simplifying dictionary or list operations!

The choice of these was guided by the issues surrounding poly1305. The old jump threading was a huge drag on compile speed, and now that particular issue is removed. The next big issue in poly1305 is callframe management, so that's what I've jumped into next. (I took a long false path into trying to refactor varargs, and intend to get back to that, but it will get easier once the callframe stuff is refactored.)

It's become fairly clear to me that the 'save to the callframe before invoke, restore after' approach is the wrong one. What's right is to do what LLVM itself does: allocate slots in the callframe (analogous to the 'alloca' operations that reserve variables in LLVM), and initially generate code that loads ('moveFromCallFrame' in quadcode notation) and stores ('moveToCallFrame') on every read and write of a variable. That will start with all variables in sync, and I'm working on the algorithms (very, very closely related to GVN-PRE) for determining when it's safe to eliminate loads and stores.
The new algorithms will be fairly aggressive about moving code around - so that, for instance, a read-only use of a namespace variable within a loop can be hoisted out of the loop if nothing reads or writes the variable or any alias. In designing this, I realize that the logic is about to founder on directSet, directGet, directArraySet, directArrayGet, etc. - fourteen operations in all. All of these at least notionally modify or access the state of variables, and so need to have the callframe among their parameters and results. All of them need to get the callframe as input, and the ones that modify variables need to return {CALLFRAME FAIL STRING} - which will immediately fall into the conventional sequence of extractCallFrame, retrieveResult, jumpMaybe.

In the kbk-refactor-callframe branch, I've changed translate.tcl to generate these operations (not really tested, I'm still working on the rest of the code!) but I've not yet even started on making the corresponding change on the codegen/ side. If you could possibly help with that - I imagine that it's a pretty mechanical change - that'd be a time-saver for me.

I imagine that it's a pretty mechanical change because at present the CALLFRAME doesn't actually mean anything. It's a notional thing that keeps me from reordering quadcode in a way that would hoist loads above the corresponding stores, that sort of thing, but there is only one CALLFRAME per proc, ever, so there doesn't necessarily need to be anything in the CALLFRAME value to designate the callframe. (On the other hand, I think I see that you have given an internal representation to the CALLFRAME data type, so I'm not sure what's going on there.)

There's no particular hurry with all of this. It'll take me a little while to get my own stuff in order. If you'd rather wait until I have something that you could test against, I'm fine with that, too.
As a side effect, when this subproject is done, we should get a correct implementation of [set $a] and [set $a $b] - and a possibly substantial performance gain in access to [upvar] and namespace variables.

Beyond this, the next thing to think about will be breaking up direct* into smaller pieces. If 'directLappend result callframe listvar element' could turn into a sequence like

directRef listRef callframe var
refGet val listRef # don't know if I might
lappend val2 val element
refPut temp callframe listvar element
extractCallframe callframe2 temp
retrieveResult result temp

(plus whatever error checking goes there)

then I could start looking at aggregating all the directRef's to the same variable, hoisting them out of loops, and so on. Many of the instructions in straight-line sequences like this are optimized to zero-cost operations.

Kevin |
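The hoisting criterion implied in the messages above - a read of a namespace variable may leave a loop only if nothing in the loop can write the variable and no arbitrary code runs - can be sketched compactly. This is a hypothetical Python model ('load', 'store', and 'invoke' are invented stand-ins for the quadcode operations), not the actual optimizer:

```python
# Sketch of a loop-invariant hoisting check for variable loads: a load may
# move to the loop preheader only if the loop body contains no store to
# the same variable and no escape point ('invoke' of arbitrary code,
# which could write or trace anything).

def hoistable_loads(loop_body):
    """loop_body: list of ('load', var), ('store', var), or ('invoke',).
    Return the set of variables whose loads can be hoisted."""
    if any(instr[0] == 'invoke' for instr in loop_body):
        return set()          # arbitrary code defeats the analysis entirely
    written = {instr[1] for instr in loop_body if instr[0] == 'store'}
    loaded = {instr[1] for instr in loop_body if instr[0] == 'load'}
    return loaded - written   # read-only variables: safe to hoist
```

This deliberately ignores aliasing ([upvar] links, traces firing on other variables); a real pass would need the known-effects information discussed elsewhere in the thread to treat some invokes as harmless.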
From: Kevin K. <kev...@gm...> - 2018-12-17 23:34:18
|
I just committed and synced a new jump-threading pass to tcl-quadcode. This replaces the old 'nodesplit' pass, which could cause the entire optimization pipeline to rerun a great many times for a large procedure, since it would thread only one basic block at a time. (Moreover, it often faced a combinatorial explosion of threading opportunities - many more than were justified by the actual data types, because it did not detect converging pathways in the splits.)

The new threading pass is the first one that I've done to require SSA deconstruction and reconstruction. For the large number of changes that the pass typically makes, it's faster and simpler simply to tear things down and put them back together again.

With the new module in place, the 'renameTemps' pass, which was a thorn in the side of the compiler on large procedures (owing to a performance bug in Tcllib's 'struct::disjointset' module), can be removed. In addition, of course, the old 'nodesplit.tcl' is also no longer relevant.

Donal, feel free to ignore my earlier message about addReference on values combined with NEXIST IMPURE, and returnCode on values that are FAIL+some other type. The revised jump threading separates those combined types before they get to the offending operations. (I now realize that the jump threading is *required* to compile [catch] because the failure and success paths *must* be separated until the FAIL is dismissed.)

Ad astra!
Kevin |
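The core move in jump threading - when the facts established along an incoming edge decide a conditional branch, redirect that edge straight to the taken successor, which is what keeps the failure and success paths of [catch] separate - might be sketched like this. A simplified Python model with invented data structures, not the new pass itself:

```python
# Minimal illustration of jump threading: if every path from a given
# predecessor fixes the value that a block's conditional branch tests,
# the predecessor can jump directly to the taken successor, skipping
# (and eventually deleting) the test.

def thread_jumps(cfg, edge_facts):
    """cfg: block -> ('cond', var, then_block, else_block) or ('goto', tgt).
    edge_facts: (pred, block) -> {var: constant} known along that edge.
    Return {(pred, block): new_target} for every threadable edge."""
    redirected = {}
    for (pred, block), facts in edge_facts.items():
        term = cfg.get(block)
        if term and term[0] == 'cond' and term[1] in facts:
            _, var, then_block, else_block = term
            # The branch is decided on this edge: thread past the test.
            redirected[(pred, block)] = then_block if facts[var] else else_block
    return redirected
```

In the [catch] case, the edge where a FAIL was produced carries the fact that the failure flag is true, so it threads to the handler, while the success edge threads past it - the two combined types never meet at the offending operations.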
From: Donal K. F. <don...@ma...> - 2018-10-22 05:19:47
|
Ah, the benefits of an intercontinental flight! NRE now compiles and works with expansion (once I sync my fossil repo). It was fiddly, but was essentially just a blend of what existed under the *assumption* that by the time we hit this code, we're basically giving up on compiling deep inside. Theoretically, we might be able to do better with the case where the initial word is not expanded… but I don't check that. It assumes that arguments needing expansion are always well-formed lists. I can't remember if we check that elsewhere. :-) Donal. |
From: Donal K. F. <don...@ma...> - 2018-10-18 11:11:13
|
I've just published an update to llvmtcl to support LLVM 7; llvmtcl 3.9.2 is a very small change, but LLVM 7 is a slightly larger change that's forced me to do a small update to the code generator. The issue is that the type signatures of two intrinsics (for memcpy and memset) have changed; the new code is flexible between LLVM versions. People using branches other than trunk will want to merge trunk before using LLVM 7. (llvmtcl 3.9.2 also removes one API call, but we weren't making any use of it.) Donal. |
From: Kevin K. <kev...@gm...> - 2018-05-01 18:21:19
|
In the last few messages, I mentioned working on a mystery bug in calling procedures that returned CALLFRAME COROHANDLE. I now realize that the call-and-return mechanisms were innocent. Instead, the true culprit was the entry of a procedure that has variables that must be kept in the Tcl callframe. The local variable table was being allocated with '$builder allocArray', which turns into an 'alloca' instruction. That instruction was not in the entry block of the function, and for whatever reason, LLVM's optimizer decided it couldn't move it there. The result was that the LVT was allocated on the stack, and promptly went away at the first coroutine suspension, and nasal daemons were promptly invoked.

People who work on this stuff should be aware that there's a variable $entryBlock in the compiler, that holds an unclosed block, and the builder has methods 'allocInBlock' and 'allocArrayInBlock' to make use of it. Anything brought in with a conventional 'alloca' cannot persist across suspending an LLVM coroutine. It appears that constant-size 'alloca' operations in the entry blocks of inline procedures are safely moved to the entry block of the procedure in which they appear, but other 'alloca's should all appear before the first 'call', lest the 'call' be inlined and the 'alloca' appear outside the entry block as a consequence.

With the allocation issue fixed, 'rectest3' started working, so I'm back to having all the existing tests pass. I still need to do code and tests for NRE-aware invocation of commands (rather than compiled procs) and NRE-aware invokeExpanded. Then I'll turn on the change that makes the I/O commands require NRE (because several packages yield from stacked channel handlers). Last will come the implementation of 'tailcall' and 'yield'. 'yieldto' may be a bridge too far at this point; I have Absolutely No Idea what to do in a compiled caller of a coroutine that does 'yieldto'!
So, continuing to plug ahead - along with boxing up my office, they're insisting that I have to move to a different space. :( |
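The discipline described above - every 'alloca' must come before the first 'call', so nothing can end up outside the entry block once inlining rearranges things - is mechanical enough to lint for. A toy Python checker over an invented opcode list (not the real IR or builder):

```python
# Toy checker for the alloca-placement rule: any 'alloca' that appears
# after the first 'call' risks being left outside the entry block when
# the call is inlined, and so will not survive a coroutine suspend.

def misplaced_allocas(instrs):
    """instrs: list of opcode strings ('alloca', 'call', or anything else).
    Return indices of allocas that violate the before-first-call rule."""
    bad = []
    seen_call = False
    for i, op in enumerate(instrs):
        if op == 'call':
            seen_call = True
        elif op == 'alloca' and seen_call:
            bad.append(i)        # this slot would die at the first suspend
    return bad
```

A build step could run such a check over the emitted function bodies and flag the offending allocations before they turn into the "LVT vanished at the first suspension" class of bug.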
From: Kevin K. <kev...@gm...> - 2018-04-30 19:16:06
|
On Mon, Apr 30, 2018 at 3:02 PM, Donal K. Fellows <don...@ma...> wrote:
> On 30/04/2018 18:00, Kevin Kenny wrote:
>> I'm also noticing that the code gen is starting to emit pretty loose and
>> bulky code, which is sure to cause cache pressure in the long run. I'm
>> wondering if some of the bigger sequences (setting up a callframe and
>> performing an [upvar] are two that come to mind) would actually be better
>> with 'inlinehint' as opposed to 'alwaysinline.'
>
> Right now, that's governed by the definition of the Module:local method
> (in codegen/struct.tcl) which is currently setting library functions to be
> always inlined except when explicitly told to make them 'noinline'. I've
> updated (on kbk-nre) the method to also recognise 'inlinehint' as another
> override; add it after the type signature on the functions that need the
> balanced approach (since most of the standard library depends on being
> inlined to work efficiently, I'm going to keep the default what it is right
> now).

OK, thanks!

This is bringing up another intermediate-term idea - simplifying parts of the code issuer by delegating more work up front. I am increasingly hating the INT type; why don't we just go int64 and have done with it, forcing int32 only when there's some external API that needs it?

Also, I'm seeing that the jump threading that LLVM is doing on NUMERIC operations is ... incomplete, or at best inconsistent. I've half a mind to make the type promotion explicit in quadcode, so that numeric operations will always see int64 or DOUBLE, with conditional jumps enabling both if needed. That way, my jump threading can untangle the code. (I'm even starting to think about separating many procedures that now have NUMERIC args into separate int64 and double versions - so that they can in turn be threaded with their callers. That would get rid of a significant fraction of the 'must inline' stuff. But the idea comes to me partly just because Builder:unknown frightens me.)
|
From: Donal K. F. <don...@ma...> - 2018-04-30 19:10:09
|
On 30/04/2018 18:00, Kevin Kenny wrote: > In doing so, I'm noticing a few things that are causing me a little bit > of concern. I'd like to get a second opinion on them. The original concept was that we were compiling code to replace procedures in a specific interpreter, when making everything (interpreter and Tcl_Obj literals) “global” to the compilation was a reasonable approach. I agree that if we're going to be more like building libraries, we should do it by passing the interpreter explicitly and hanging any relevant “constants” off that via the usual approaches (especially as we don't currently use the ClientData parameter in the binding layer at all). I think that can all be done via the code generator; it doesn't need the quadcode level to be aware. > While I'm in there, I'm going to add an additional argument to > 'invoke' and 'invokeExpanded', which will be a list giving the > leading arguments that were replaced by the name being invoked. This > will let us start generating correct stack traces even though the > 'middle end' is precomputing function and ensemble resolutions. That so reminds me of what I did with 'invokeReplace' in TEBC… Donal. |
From: Donal K. F. <don...@ma...> - 2018-04-30 19:02:25
|
On 30/04/2018 18:00, Kevin Kenny wrote: > I'm also noticing that the code gen is starting to emit pretty loose and > bulky code, which is sure to cause cache pressure in the long run. I'm > wondering if some of the bigger sequences (setting up a callframe and > performing an [upvar] are two that come to mind) would actually be > better with 'inlinehint' as opposed to 'alwaysinline.' Right now, that's governed by the definition of the Module:local method (in codegen/struct.tcl) which is currently setting library functions to be always inlined except when explicitly told to make them 'noinline'. I've updated (on kbk-nre) the method to also recognise 'inlinehint' as another override; add it after the type signature on the functions that need the balanced approach (since most of the standard library depends on being inlined to work efficiently, I'm going to keep the default what it is right now). Donal. |
From: Kevin K. <kev...@gm...> - 2018-04-30 17:00:08
|
I'm still plowing through debugging the interactions introduced by the CALLFRAME COROHANDLE data type. (I'm sure that I have simply overlooked something obvious, but it's another one of those bugs that lands into code that should not be reachable - so I'm having to slog through a lot of disassembly in llvm.) In doing so, I'm noticing a few things that are causing me a little bit of concern. I'd like to get a second opinion on them. It appears that the Tcl interpreter pointer is being stashed statically and then referenced throughout the library. This is clearly not going to be multi-interp safe, defeating our apartment-threaded "multi-interp safe and thread-oblivious" model. It appears actually to have a fairly straightforward fix: pass the interpreter as the first parameter to compiled commands, and stash it in the Builder immediately on generating the entry sequence. This would be an extra parameter, but my guess is that it's used frequently enough that it would wind up being kept in %rsi anyway, so be effectively zero cost. I see the same thing with the constant pool - Tcl_Obj's are not normally safe to share among threads. The trick that many Tcl extensions do - have the constant pool be one of the things hanging off the ClientData of their commands - would work for us, too. Once again, the ClientData would have to be passed as a parameter to all our command implementations, consuming another parameter slot, but once again, it's used often enough that it would probably just live in %rdi. Sometime soon would be a good time for me to tackle this. I'm looking at the possibility of refactoring the generation of calls in the code issuer in any case - because the cross product of {function, command, command with arg expansion} and {NRE, no NRE} and {uses callframe, doesn't use callframe} is turning into an awful lot of apparently redundant code. If I'm refactoring 'invoke' and 'entry' anyway, I might as well work this stuff in. 
While I'm in there, I'm going to add an additional argument to 'invoke' and 'invokeExpanded': a list giving the leading arguments that were replaced by the name being invoked. This will let us start generating correct stack traces even though the 'middle end' is precomputing function and ensemble resolutions.

I'm also noticing that the code gen is starting to emit pretty loose and bulky code, which is sure to cause cache pressure in the long run. I'm wondering if some of the bigger sequences (setting up a callframe and performing an [upvar] are two that come to mind) would actually be better with 'inlinehint' as opposed to 'alwaysinline'.

Anyway, as I said, a second opinion would be welcome! |
From: Donal K. F. <don...@ma...> - 2018-04-30 10:30:12
|
On 29/04/2018 20:24, Kevin Kenny wrote:
> Does it fail with mcjit, or only with code gen for an external module?

External module; a representative example is this (which is the CI build for the v3.9 tag): https://travis-ci.org/dkfellows/llvmtcl/builds/372739935

The overall run works (because tcltest insists on doing [exit 0] despite tests failing) but the tests external-1 and external-2 don't work. (They do pass on my local machine.) Also there are irritatingly many warnings during the compile, but they seem to be just annoyances and not a real problem.

> If it succeeds with mcjit, does the configurator come up with the same
> target triple that the external-module code gen is expecting? I've had
> a suspicion for a while that our target triple selection is Not Quite
> Right on some of the platforms.

Could be. The triples are annoyingly opaque, though I thought the approach of asking for the host triple during configuration time would work reliably (and that's easy). I'd really like to know if there's a difference between what different parts of LLVM think the host is. (If the LLVM available on Travis isn't capable of building properly for host, that's *astonishingly* nasty!)

Donal. |
From: Kevin K. <kev...@gm...> - 2018-04-29 19:24:33
|
On Sun, Apr 29, 2018 at 1:43 PM, Donal K. Fellows <don...@ma...> wrote:
> I've done a release of llvmtcl, 3.9, fixing up as much as I can and
> getting everything into a good state.
>
> https://github.com/dkfellows/llvmtcl/releases/tag/v3.9
>
> There's only one niggling issue that I know of right now; the travis build
> doesn't pass its own tests because it can't issue code for the native
> platform. It seems like something deeply weird is going on.

Does it fail with mcjit, or only with code gen for an external module? If it succeeds with mcjit, does the configurator come up with the same target triple that the external-module code gen is expecting? I've had a suspicion for a while that our target triple selection is Not Quite Right on some of the platforms. |
From: Donal K. F. <don...@ma...> - 2018-04-29 17:44:03
|
I've done a release of llvmtcl, 3.9, fixing up as much as I can and getting everything into a good state. https://github.com/dkfellows/llvmtcl/releases/tag/v3.9 There's only one niggling issue that I know of right now; the travis build doesn't pass its own tests because it can't issue code for the native platform. It seems like something deeply weird is going on. Donal. |
From: Kevin K. <kev...@gm...> - 2018-04-24 04:03:45
|
I hacked together the patches that:

- make sure that the target triple and target data layout are set in the module before any code generation takes place.
- use the LLVM DataLayout for handling 'sizeof' and 'alignof'.

With those in place, I didn't need to tell any lies about the size of the coroutine promise - CoroSplit finds the correct size and generates correct code. There is a companion change in llvmtcl that requires advancing the version to 3.10 - I've sent a pull request with the changes.

With these changes in place, the mrtest crash is gone - the code worked on the first try once I'd adapted coro.tcl to use the new 'sizeof' and 'alignof' primitives. All the tests in demos/perftest/tester.tcl pass. Performance isn't great, but I suspect it is adequate. mrtest - which is really stressing the overhead of recursive procedures - goes from 960% to only 290%. qsort - a more realistic example with actual work done in the procedure before making recursive calls - goes from 244% to 187%. I'm not positive, but I think I can eliminate one Tcl_NRAddCallback per NRE procedure invocation, which would help mrtest significantly. I'll put that one on a back burner for now.

I'm not merging to trunk yet, because I still haven't done invokeExpanded and invocation of noncompiled commands in the NRE environment. Now that LLVM is treating my coroutines sanely, that should be a Small Matter of Programming. The last five weeks have been quite a slog, but there's light at the end of the tunnel: my mystery crash, at long last, is gone!

On Mon, Apr 23, 2018 at 6:14 PM, Kevin Kenny <kev...@gm...> wrote:
> Would it be possible to commit to 'mcjit' or 'simple' before we get to
> 'optimize' and attach the execution engine then? I'm pretty sure that the
> empty string we pass as DataLayout isn't giving the same answers for
> structure offsets. The LLVM code that's generating the incorrect offset in
> the call thunk is using the data layout to do it - and I think the data
> layout is changing when we bind the execution engine, since that's the
> place where the target data layout is actually known.
|
From: Kevin K. <kev...@gm...> - 2018-04-23 22:14:24
|
It occurs to me: if CoroSplit is taking the coroutine frame as an (i8*), adding a fixed offset, and expecting that to point to a data slot beyond the coroutine promise, then it has to be getting the size of the promise statically from somewhere. This has to be the DataLayout associated with the Module. But we don't bind that correctly until [$module simple] or [$module mcjit], when we attach the execution engine.

Would it be possible to commit to 'mcjit' or 'simple' before we get to 'optimize' and attach the execution engine then? I'm pretty sure that the empty string we pass as DataLayout isn't giving the same answers for structure offsets. The LLVM code that's generating the incorrect offset in the call thunk is using the data layout to do it - and I think the data layout is changing when we bind the execution engine, since that's the place where the target data layout is actually known.

Relevant LLVM code is at http://llvm.org/doxygen/CoroFrame_8cpp_source.html#l00313 and the layout of the first few elements of the frame is accomplished (using that Padder class) at http://llvm.org/doxygen/CoroFrame_8cpp_source.html#l00384

I'm guessing that the right sequence is:

    LLVMCreateTargetMachine
    LLVMCreateTargetDataLayout
    LLVMSetDataLayout

(for separate compilation) or

    LLVMCreateMCJITCompilerForModule
    LLVMGetExecutionEngineTargetMachine
    LLVMGetExecutionEngineTargetData

so that we will have a target machine and data layout for the optimizer in both cases. In addition, we probably need to call LLVMAddAnalysisPasses on both the function and module pass managers, to get the target-dependent optimizations.

It looks as if perftester/MCJIT goes through LLVM::optimise, and that the other two places I need to worry about are LLVM::compilepackagetodll and LLVM::compiletodll. It looks as if Module:simple and Module:interpreter are both dead at this point? Am I missing anything else? They don't make it easy, do they?

I'll see about hacking up patches to struct.tcl, support.tcl and jit.tcl (and whatever might spill over elsewhere) - maybe this will clear up the issues. |