|
From: Kevin K. <kev...@gm...> - 2019-02-18 05:58:30
|
I'm continuing to make some progress on refactoring callframe operations. The code in the 'kbk-callframe-refactor' branch is extremely slow, but works on all the test cases except for the ones that are explicitly patched out in demos/perftest/tester.tcl. The patched-out ones include:

+ everything that emits direct* operations. These need to be revised to carry the callframe explicitly, as we've discussed in earlier messages.

+ magicreturn. This falls down trying to construct a FAIL NUMERIC. We've had an email thread about this already.

and the tests that are patched out on trunk. (Some of these, like errortest7 and expandtest::test[56], are related to how we don't handle loop exception ranges.)

As I said earlier, the way that this works is to make the front end insert 'moveFromCallFrame' for every variable reference, and 'moveToCallFrame' for every variable assignment. The plan is then to have passes that do the escape analysis and figure out which moves are safe to delete.

(Unless Donal or someone screams again, I'm planning to move forward on the assumption that traces for repeated reads and traces for repeated writes can be coalesced, as long as at least one trace of the given type fires. Essentially, the analysis will consider code between escape points to be monolithic, and require only the first read and the last write to fire traces. It will be permissible to optimize less aggressively and fire other traces corresponding to accesses in the program, but only the first read and the last write will be guaranteed.)

I've started a writeup on how I propose to do the optimization, which can be found at http://core.tcl.tk/tclquadcode/doc/kbk-refactor-callframe/doc/20190216callframe/callframe.md - none of what is written there is implemented yet! The sections on loop-invariant code motion have yet to be written, although I have several pages of scribblings on the subject in a paper notebook.
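The coalescing rule described above - within a region between escape points, only the first read and the last write of each variable must fire traces - can be sketched in Python. This is purely an illustration of the rule, not the planned quadcode pass; the op representation is invented for the example.

```python
def coalesce(ops):
    """ops: list of ('read'|'write', var) in program order within one
    region between escape points. Return the subset of accesses that
    must still fire traces: the first read and the last write of each
    variable. All other traces may be elided."""
    first_read = {}   # var -> index of its first read
    last_write = {}   # var -> index of its last write
    for i, (kind, var) in enumerate(ops):
        if kind == 'read':
            first_read.setdefault(var, i)
        else:
            last_write[var] = i
    keep = sorted(set(first_read.values()) | set(last_write.values()))
    return [ops[i] for i in keep]

region = [('read', 'x'), ('write', 'x'), ('read', 'x'),
          ('write', 'x'), ('read', 'y')]
print(coalesce(region))
# -> [('read', 'x'), ('write', 'x'), ('read', 'y')]
```

The middle read and the first write of x are elided; the second write of x survives because it is the last value any non-quadcode observer could see.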
Essentially, upward motion will follow the same general plan as anticipability analysis in 'quadcode/pre.tcl': insert 'moveFromCallFrame' at exit from blocks where it is fully anticipable and partially available on entry to the successor, but not available on exit in the predecessor. (This is a generalization of loop-invariant code motion.)

Downward motion is very nearly the inverse, but instead of looking for SSA values that are already partly available, it looks for Tcl values that are live in a successor and partly anticipated on entry to the successor but not fully anticipated in the successor, and inserts the 'moveToCallFrame' on entry to the successor. (I need to work out some details before I can write this one up in full.)

I think that these changes will actually exceed the performance of the current code, because they can be more aggressive about eliminating data motion. They're also fairly flexible, since the data flows are fairly abstract. We can tweak the effects of individual instructions if we don't like the escape analysis.

Anyway, I'd appreciate it if Donal at least could have a look at what I've got so far and let me know if I'm totally barking up the wrong tree. http://core.tcl.tk/tclquadcode/doc/kbk-refactor-callframe/doc/20190216callframe/callframe.md

Kevin |
|
From: Kevin K. <kev...@gm...> - 2019-02-05 18:53:01
|
One more thing with returns. For [return -level N -code C $value] with N>1, there's some jiggery-pokery at the [return] site, to set iPtr->returnLevel to N, iPtr->returnCode to C, and then return TCL_RETURN. That means that no check for the level is needed at the call site until we've detected that we're in the TCL_RETURN path.

For optimization, I'd like to keep track of which exception codes (ERROR, RETURN, BREAK, CONTINUE, other) a proc might return, so as to be able to remove excess checks from after the invoke. I think that the base case before optimization will have to look something like:

    result ← invoke(callframe, command, arg1, ...)
    callframe ← extractCallframe(result)
    result ← retrieveResult(result)
    jumpMaybe exceptionPath, result
    jump normalPath

    normalPath:
    result ← extractMaybe(result)
    ... execution continues ...

    exceptionPath:
    # NOTE: This is where the hooks for escape to interpreted code, or
    # for dynamic recompilation, would go.
    # QUESTION: Do we need to check iPtr->execEnvPtr->rewind?
    result ← extractFail(result)
    # If the innermost exception range is a loop (track this in translate.tcl):
    r1 ← resultIsBreak(result)      # new test for TCL_BREAK
    jumpTrue breakPC, r1
    jump e1

    e1:
    r1 ← resultIsContinue(result)   # new test for TCL_CONTINUE
    jumpTrue continuePC, r1
    jump e2
    # end of code handling the loop exception range

    e2:
    # QUESTION: Code to append to the backtrace (analogous to
    # TclLogCommandInfo) needed?
    # QUESTION: Do we need to check TclCanceled(interp)?
    # TclLimitExceeded(iPtr->limit)?
    # If there's an enclosing catch, jump to it, and otherwise fall out
    # to the error pathway.
    jump enclosingCatchOrError

In addition, once we're on the error path of a proc, we need to check for abnormal returns. In bytecode, this goes through TEOEx_ByteCodeCallback. It checks whether -level has reached 0 - if it hasn't, we'll be passing the exception through to the proc's caller.

When -level has reached zero, if the return code is TCL_RETURN, it unboxes the options to replace with whatever they were hiding. Otherwise, if the result is not TCL_OK or TCL_ERROR, it formats 'invoked "break" outside a loop' or 'invoked "continue" outside a loop' or 'command returned bad code: C' and throws that as an error. I think we can detect this case statically, because we have the exception range at hand when compiling for the invoke; so, if there's neither a loop nor a catch exception range and a command might return BREAK, CONTINUE or [other], have the test for the result code jump to an 'initException / jump error' sequence.

So we have:

-- on return: we need to handle -level by updating iPtr->returnLevel and iPtr->returnCode.

-- after an [invoke] etc. returns: we need to check for BREAK/CONTINUE (vs the innermost exception range, with different handling for loop and catch), have the special handling for BREAK/CONTINUE/[other] to turn them into errors if nothing catches them, and have all the logic to unpack -level when the result code is TCL_RETURN.

It would be desirable to have FAIL divide into multiple subtypes: 'could be an error', 'could be a break', 'could be a continue', 'could be a return', 'could be an unusual return code', so as to be able to optimize out the checks for the impossible cases. These subtypes wouldn't need any separate internal rep (and I could have the typename be FAIL for all of them) since they would all be 'integer Tcl result code, plus whatever baggage that carries in the interpreter internals'. |
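The check-elision idea above can be sketched in Python. This is illustrative only: the per-proc code sets and the check names are invented for the example; only the TCL_* constants mirror tcl.h.

```python
# Result codes as defined in tcl.h.
TCL_OK, TCL_ERROR, TCL_RETURN, TCL_BREAK, TCL_CONTINUE = range(5)

# Hypothetical per-proc summaries of which codes each proc might return.
proc_may_return = {
    'safe_math':   {TCL_OK, TCL_ERROR},                # arithmetic only
    'loop_helper': {TCL_OK, TCL_BREAK, TCL_CONTINUE},  # used inside loops
}

def checks_needed(proc, in_loop_range):
    """Which post-invoke checks cannot be optimized away for this proc."""
    codes = proc_may_return[proc] - {TCL_OK}
    checks = []
    if TCL_RETURN in codes:
        checks.append('unpack -level / -code')
    if {TCL_BREAK, TCL_CONTINUE} & codes:
        checks.append('loop dispatch' if in_loop_range
                      else 'convert to "invoked outside a loop" error')
    if TCL_ERROR in codes or codes - {TCL_RETURN, TCL_BREAK, TCL_CONTINUE}:
        checks.append('jump to enclosing catch or error exit')
    return checks

print(checks_needed('safe_math', in_loop_range=False))
# -> ['jump to enclosing catch or error exit']
```

A proc known to produce only OK or ERROR needs none of the BREAK/CONTINUE or -level machinery after its invoke, which is exactly the saving the FAIL subtypes would buy.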
|
From: Kevin K. <kev...@gm...> - 2019-02-04 05:00:28
|
Following up a little farther: One related issue is that I'm also currently sort of blocked on implementing further procedure inlining, largely because I don't know how I might go about reporting an error in an inner context. Right now, it appears that there's a hidden pathway by which /@debug-line/ and /@debug-script/ information gets into /initException/. Moreover, I have no idea how I might create the correct internal values in the interpreter for an inner context when throwing an error. (To say nothing of what to do with [return -code break] and [return -code continue], but I already discussed those.)

On Sun, Feb 3, 2019 at 9:29 PM Kevin Kenny <kev...@gm...> wrote:
>
> On 2/3/19 9:24 AM, Donal K. Fellows wrote:
> > I've been trying to comprehend what is going on (and going wrong) with
> > a number of our test cases: the problems usually seem to be closely
> > related to the results of procedures. I'm not sure that we're modelling
> > results correctly: they're actually fairly annoyingly complex, so if
> > we're getting stuff wrong then that might be the cause of at least some
> > of the trouble that we've been experiencing.
>
> I don't think it's as bad as you imagine.
>
> The fundamental problem that I'm aware of is that we are NOT handling
> the 'return', 'break', 'continue' and 'other' exception codes correctly
> when they're coming from a procedure.
>
> All the agonizing over what the fields in the interpreter are is most
> likely not necessary. A FAIL is simply 'a nonzero return code, together
> with whatever it's left in the interp' (objResult, errorInfo, errorCode,
> errorStack options dictionary). (Note that [return -code] is capable of
> creating unusual combinations of these, but meaningless fields are
> generally ignored.) A FAIL SOMETHINGELSE is simply the union of that
> return code and the SOMETHINGELSE data type, whatever that is.
>
> So what we should have for the internal representation is:
>
> FAIL - As long as we can get the return value from the interp, then
> FAIL can adequately be represented with just an integer.
>
> FAIL SOMETHINGELSE - An integer, plus a SOMETHINGELSE whose value
> is meaningful only if the integer is TCL_OK.
>
> Anything else - Of course, whatever the type's native
> representation is.
>
> The reason that we can get away with that is that the bytecode->quadcode
> translation is coded so that there ought only ever to be one FAIL in
> flight at a given time. All the stuff that's in the interp doesn't need
> to be copied around FAIL objects because it can just be retrieved from
> the interp again when needed.
>
> Where I *know* things are going badly wrong is in the compilation of all
> /invoke/ sequences. We don't examine loop exception ranges at all in
> quadcode/translate.tcl. Instead, any nonzero status winds up going to
> the enclosing [catch] if there is one, and to the error return block for
> the procedure otherwise. In TEBC, the action will be complicated:
>
> First, examine -level, and if it's still nonzero, jump to the enclosing
> [catch] or return.
>
> If the -level is zero, then the return code determines the correct action:
>
> TCL_OK - extract the return value and continue.
>
> TCL_ERROR - go to the enclosing [catch] or return.
>
> TCL_RETURN - replace the return code with the -code, and go to the
> enclosing [catch] or return.
>
> TCL_BREAK - go to the 'break' in the innermost exception range if
> it's a loop, or to the enclosing [catch] otherwise. If there's no
> enclosing [catch], throw a 'break but not in a loop' error.
>
> TCL_CONTINUE - go to the 'continue' in the innermost loop exception
> range, or to the enclosing [catch]. If there's no enclosing [catch],
> throw 'continue but not in a loop.'
>
> Anything else - go to the enclosing [catch] or return.
>
> (This is approximate, from memory, so don't believe me, believe the code
> in TEBC. It's close, though.)
>
> All of this is done by TEBC implicitly, just as a result of a return
> from /invoke/ or /invokeExpanded/. Quadcode doesn't seem to have any
> particular plan for handling most of these cases - only a zero -level
> and an OK or ERROR return code appear to be addressed.
>
> When we first started implementing /invoke/, I recall expressing some
> concern about exception ranges, and hearing an answer where the gist was
> either "don't worry about it," or "don't worry about it *yet*." I don't
> recall which, but apparently the latter ought to have been the answer in
> any case. Has the time come to worry about it?
>
|
|
From: Kevin K. <kev...@gm...> - 2019-02-04 02:30:01
|
On 2/3/19 9:24 AM, Donal K. Fellows wrote:
> I've been trying to comprehend what is going on (and going wrong) with
> a number of our test cases: the problems usually seem to be closely
> related to the results of procedures. I'm not sure that we're modelling
> results correctly: they're actually fairly annoyingly complex, so if
> we're getting stuff wrong then that might be the cause of at least some
> of the trouble that we've been experiencing.

I don't think it's as bad as you imagine.

The fundamental problem that I'm aware of is that we are NOT handling the 'return', 'break', 'continue' and 'other' exception codes correctly when they're coming from a procedure.

All the agonizing over what the fields in the interpreter are is most likely not necessary. A FAIL is simply 'a nonzero return code, together with whatever it's left in the interp' (objResult, errorInfo, errorCode, errorStack options dictionary). (Note that [return -code] is capable of creating unusual combinations of these, but meaningless fields are generally ignored.) A FAIL SOMETHINGELSE is simply the union of that return code and the SOMETHINGELSE data type, whatever that is.

So what we should have for the internal representation is:

FAIL - As long as we can get the return value from the interp, then FAIL can adequately be represented with just an integer.

FAIL SOMETHINGELSE - An integer, plus a SOMETHINGELSE whose value is meaningful only if the integer is TCL_OK.

Anything else - Of course, whatever the type's native representation is.

The reason that we can get away with that is that the bytecode->quadcode translation is coded so that there ought only ever to be one FAIL in flight at a given time. All the stuff that's in the interp doesn't need to be copied around FAIL objects, because it can just be retrieved from the interp again when needed.

Where I *know* things are going badly wrong is in the compilation of all /invoke/ sequences. We don't examine loop exception ranges at all in quadcode/translate.tcl. Instead, any nonzero status winds up going to the enclosing [catch] if there is one, and to the error return block for the procedure otherwise. In TEBC, the action will be complicated:

First, examine -level, and if it's still nonzero, jump to the enclosing [catch] or return.

If the -level is zero, then the return code determines the correct action:

TCL_OK - extract the return value and continue.

TCL_ERROR - go to the enclosing [catch] or return.

TCL_RETURN - replace the return code with the -code, and go to the enclosing [catch] or return.

TCL_BREAK - go to the 'break' in the innermost exception range if it's a loop, or to the enclosing [catch] otherwise. If there's no enclosing [catch], throw a 'break but not in a loop' error.

TCL_CONTINUE - go to the 'continue' in the innermost loop exception range, or to the enclosing [catch]. If there's no enclosing [catch], throw 'continue but not in a loop.'

Anything else - go to the enclosing [catch] or return.

(This is approximate, from memory, so don't believe me, believe the code in TEBC. It's close, though.)

All of this is done by TEBC implicitly, just as a result of a return from /invoke/ or /invokeExpanded/. Quadcode doesn't seem to have any particular plan for handling most of these cases - only a zero -level and an OK or ERROR return code appear to be addressed.

When we first started implementing /invoke/, I recall expressing some concern about exception ranges, and hearing an answer where the gist was either "don't worry about it," or "don't worry about it *yet*." I don't recall which, but apparently the latter ought to have been the answer in any case. Has the time come to worry about it? |
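The routing described in the message can be simulated in Python. This is a sketch reconstructed from the summary above, not from TEBC itself, and the target names ('break target', 'catch', etc.) are invented labels.

```python
# Result codes as defined in tcl.h.
TCL_OK, TCL_ERROR, TCL_RETURN, TCL_BREAK, TCL_CONTINUE = range(5)

def route(code, level, in_loop, has_catch):
    """Where control goes after an invoke returns `code`, given the
    interp's remaining returnLevel and the innermost exception range."""
    if code == TCL_OK:
        return 'continue'
    if level != 0:
        # -level still counting down: pass the exception through.
        return 'catch' if has_catch else 'return'
    if code == TCL_BREAK:
        if in_loop:
            return 'break target'
        return 'catch' if has_catch else 'error: break not in a loop'
    if code == TCL_CONTINUE:
        if in_loop:
            return 'continue target'
        return 'catch' if has_catch else 'error: continue not in a loop'
    # TCL_ERROR, TCL_RETURN (after replacing code with -code), and
    # any other code all go to the enclosing catch or the error exit.
    return 'catch' if has_catch else 'return'

print(route(TCL_BREAK, 0, in_loop=True, has_catch=False))
# -> break target
```

The point of writing it out is that quadcode currently only implements the first and last branches; the BREAK/CONTINUE and nonzero-level rows are the missing cases.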
|
From: Donal K. F. <don...@ma...> - 2019-02-03 14:24:29
|
I've been trying to comprehend what is going on (and going wrong) with a
number of our test cases: the problems usually seem to be closely
related to the results of procedures. I'm not sure that we're modelling
results correctly: they're actually fairly annoyingly complex, so if
we're getting stuff wrong then that might be the cause of at least some
of the trouble that we've been experiencing.
Warning: what follows isn't very organized. :-)
Results in Tcl consist of an immediate int value (the result code) and a
number of values in the interpreter. Not all parts are always present.
(I hope my summary below is correct: note that LIST and DICT are STRING
on most branches in our codebase, but the type restrictions there are
obvious.)
Code: 0 (TCL_OK)
interp->objResultPtr defined (wraps a result of type α)
Code: 1 (TCL_ERROR)
interp->objResultPtr defined (wraps a message of type STRING)
interp->errorInfo defined (STRING)
interp->errorCode defined (LIST[*])
interp->returnOpts defined (DICT)
Code: 2 (TCL_RETURN)
interp->objResultPtr defined (wraps a result of type βⁿ)
βⁿ need not be consistent with α or δ
βⁿ need not be unified with other βⁿ (depending on returnLevel)
βⁿ should be unified with α for correct level when returnLevel
matches and returnCode is 0, or with δ for correct level when
returnLevel matches and returnCode is ‘other’.
interp->returnCode defined (int)
interp->returnOpts defined (DICT)
interp->returnLevel defined (int)
Code: 3 (TCL_BREAK)
no extra fields defined (interp->objResultPtr is EMPTY)
RARE: most [break] jumps end up directly routed
Code: 4 (TCL_CONTINUE)
no extra fields defined (interp->objResultPtr is EMPTY)
RARE: most [continue] jumps end up directly routed
Code: other (≤ -1 or ≥ 5)
interp->objResultPtr defined (wraps a result of type δ)
δ need not be consistent with α or βⁿ
δ should be unified with other δ from same function
δ might need to be required to be STRING
interp->returnOpts defined (DICT)
[I didn't use a gamma because it looked too like a “v”.]
This is ugly stuff.
The current implementation is that a FAIL is a non-0 return code (with
everything else implicit in the interpreter) and that a FOO FAIL is a
tuple of an int and a FOO (the FOO is defined only for the 0 return code
case). That's clearly not correct. Non-zero returns from a procedure
should go to the exception handling coda that forms the right error
state for the caller (and builds the error trace and so on) and produces
a FAIL; that's the main objective of the ‘procLeave’ opcode.
I suspect that there's something going wrong
Donal.
[* Formally not actually true, except random things break if it isn't. I
ought to propose enforcing this in a TIP. ]
|
|
From: Kevin K. <kev...@gm...> - 2019-01-29 19:42:37
|
Donal,
As you know, I'm working on refactoring the moveToCallFrame and
moveFromCallFrame stuff to be amenable to better analysis.
The first step is to get everything running correctly if atrociously
slowly, by making all references to Tcl variables access the copies in
the callframe. (Then I can put the optimizations back in, and have a
better theory of safety, that can even possibly be amenable to some
traces.) I'm still struggling with getting everything working.
I've managed to reduce the number of test failures in the
'kbk-refactor-callframe' branch substantially, but there is still a
handful of things that I'm grappling with.
====
1. magicreturn.
The 'magicreturn' test is crashing code generation. I think that the
issue is that the case:
set l 0
set z [return -level $l [expr {$y + $y}]]
has never tested what it purported to test. By the time we implemented
[return -level], the quadcode front end already had constant folding,
so it generated code similar to the first case. The return options
reduced to {literal {-level 0}}.
Now that $l is being retrieved from the callframe (and the retrieval
is intentionally not yet being optimized) we are generating a code
burst that looks something like:
moveFromCallFrame {var l} {temp callframe} {literal l}
list {temp 0} {literal -level} {var l}
... argument checking for 'add' ...
add {temp 1} {temp y} {temp y}
initException {temp exception} {temp 1} {temp 0}
so that the type signature of the initException will be
NUMERIC -> STRING -> FAIL NUMERIC
This runs through 'compile' and the Big Switch passes it off to
InitException in compile.tcl. It doesn't hit the pretty '-level 0'
special case, and now (for the first time, perhaps?) gets into the
code that looks like
my StoreResult $tgt [$b $opcode {*}$vals $value $maintype \
$errorCode $name]
and it's not finding an appropriately typed initException. Is there
supposed to be some sort of promotion of NUMERIC here, or is there
something missing that's supposed to allow it to promote a NUMERIC to
a FAIL NUMERIC?
====
2. lsetest
lsetest still fails mysteriously, and I haven't yet found an
explanation. I'd appreciate a second set of eyeballs. It's almost
certainly something I'm doing in code gen, but I'm just not seeing it!
====
3. direct*
Because of aliases in the callframe, all the direct* operations at
least notionally need to appear to modify the callframe for data flow
analysis to work. If you could give me a worked example of modifying
one of them (directAppend or directIncr, perhaps), or at least a
sketch of how to proceed, to change from
directAppend listOrErrorOut listIn value...
jumpMaybe catchBlock, listOrErrorOut
to
directAppend combinedOut callframeIn listIn value...
extractCallFrame callframeOut combinedOut
retrieveResult listOrErrorOut listIn value...
jumpMaybe catchBlock, listOrErrorOut
that would be a huge help. Mostly, I need to learn how to construct
and manage reference counting for the combined result. In the case of
directAppend, the result type will be a CALLFRAME FAIL STRING, and it
isn't clear to me what all the correct accessors are.
I think that http://core.tcl.tk/tclquadcode/tktview?name=58e3e71962
may relate to this issue as well.
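As a toy Python model of the data flow through the combined result (the accessor names follow the quadcode sketch above, but the representation, error message, and refcounting here are invented; the real backend will differ):

```python
def direct_append(callframe, var, *values):
    """Hypothetical model of directAppend with an explicit callframe:
    returns (new_callframe, (code, payload)), where code 0 is TCL_OK."""
    frame = dict(callframe)           # modelling copy-on-write, not refcounts
    if var not in frame:
        return frame, (1, f'can\'t read "{var}": no such variable')
    frame[var] = frame[var] + list(values)
    return frame, (0, frame[var])

# The two accessors just project out the halves of the combined value.
def extract_callframe(combined):
    return combined[0]

def retrieve_result(combined):
    return combined[1]

combined = direct_append({'l': [1]}, 'l', 2, 3)
print(extract_callframe(combined)['l'])   # -> [1, 2, 3]
print(retrieve_result(combined)[0])       # -> 0
```

The design point is that the callframe half is always defined, even on failure, so data-flow analysis can thread it through the catch path as well as the normal path.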
(There is also a raft of TODO comments around the direct ops in
translate.tcl, having to do with direct operations on
non-fully-qualified names. Was this something you were leaving for me
to chase? Can we at least have tickets for these, if that's the case?)
====
I'd really like to have the tests stable again before I start plugging
away on doing rudimentary escape analysis, which would then allow us
to attempt all four combinations of elimination of operations:
load-load (the same variable pulled from the callframe or a namespace
twice without anything intervening that might have changed it),
load-store (saving a value to the callframe that is known to be
identical to the value that's already there, and for which any
necessary traces have fired.), store-load (retrieving a value from the
callframe that cannot have changed since we put it there), and
store-store (overwriting a value in the callframe that nothing can
have examined with a same or different value). This should be more
stable and flexible than the current approach of updating the
callframe only at escape points (mostly invokes). It should also allow
some amount of code motion; a load operation inside a loop that does
not modify the variable can be hoisted out of the loop, for instance.
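The four redundancy classes can be illustrated with a toy peephole over a straight-line region (Python; this assumes no escape points and no aliasing inside the region, and it omits load-store elimination, which additionally needs value tracking - the real pass would be dataflow-based):

```python
def eliminate(ops):
    """ops: list of ('load', var) or ('store', var) in program order.
    Drops load-load and store-load redundancies (value already known)
    and adjacent store-store redundancies (first store is dead)."""
    out = []
    known = set()   # vars whose current value is known to the compiled code
    for kind, var in ops:
        if kind == 'load':
            if var in known:
                continue            # load-load or store-load: elide
            known.add(var)
        else:
            if out and out[-1] == ('store', var):
                out.pop()           # store-store: previous store is dead
            known.add(var)
        out.append((kind, var))
    return out

ops = [('load', 'x'), ('load', 'x'), ('store', 'x'),
       ('load', 'x'), ('store', 'y'), ('store', 'y')]
print(eliminate(ops))
# -> [('load', 'x'), ('store', 'x'), ('store', 'y')]
```

Any escape point (an invoke, mostly) would clear `known` and flush pending stores, which is where the trace guarantees discussed earlier come back in.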
Part of the reason I did pre.tcl is that it provides a good partial
framework in which to implement this sort of thing. The concepts of
'available value' and 'anticipable value' that it uses generalize
readily to slots in the callframe. The fact that it's working well now
gives me some hope that we can get to intelligent load/store
management.
But I'm really bogged down with this last handful of tests, and really
reluctant to move farther forward until they're stable.
Help?
|
|
From: Donal K. F. <don...@ma...> - 2019-01-06 08:58:37
|
Optimization is hard. Even small bugs can result in code that can be
exploited to do arbitrary code execution.
https://abiondo.me/2019/01/02/exploiting-math-expm1-v8/
Fortunately for us, I don't think Tcl can possibly be vulnerable this
particular way as we don't conflate integers (and indices) with floating
point numbers (and so can't have the strangenesses in the IEEE spec
cross over into the critical domain to catch us unawares). But it is
still a good cautionary tale.
Donal.
|
|
From: Donal K. F. <don...@ma...> - 2019-01-04 08:59:20
|
On 03/01/2019 20:23, Kevin Kenny wrote:
> (a) Read traces (if we support them at all) will fire at least once
> for any quadcode sequence that reads a value from a namespace
> variable. Subsequent reads of a value that is known not to have been
> modified since the previous read need not be traced.

The major issue with read traces is linked variables (as created with Tcl_LinkVar), since not all user code is good about notifying us when it has updated the underlying C variable. Sometimes that's for good reason (such as it being a state variable of a non-Tcl system), but the upshot is the same: for variables with a read trace, it isn't safe to assume anything about their value other than by reading it.

Donal. |
|
From: Kevin K. <kev...@gm...> - 2019-01-03 20:23:44
|
I had a few further thoughts about traces.
When we're talking about traces as a feature that MUST be supported
for quadcode to represent an acceptable subset of the language, we're
mostly talking about, "what is needed for Tk to work?" [1] That's a
lot less general than "support all use cases of traces".
Tk uses write traces only, and generally accumulates data about what
has happened and defers acting upon it to the idle loop. Because it
works this way, trace elision actually sounds to me like a fairly good
idea. The concept here would be:
(a) Read traces (if we support them at all) will fire at least once
for any quadcode sequence that reads a value from a namespace
variable. Subsequent reads of a value that is known not to have been
modified since the previous read need not be traced.
(b) Write traces will fire when the value of a program variable is
stored back in the Var data structure. It is not required that every
place that a variable is set in a quadcode sequence must correspond
with a write trace. Rather, all namespace variables must be updated
prior to invoking code that might change them or cede control, and
prior to return to a call thunk. (In other words, require that the
variables actually be in sync only when giving control to something
that isn't quadcode.) In addition, variables in the callframe must be
in sync when invoking any code that may access them. There are a fair
number of Tcl commands that access variables in a controlled,
predictable fashion (for example, [gets], [read], [regexp], [regsub],
[scan] and [binary scan]), and it is permissible to defer
synchronization when invoking a command that does not access the
variable in question.
(c) Unset traces follow the same rule as write traces - they require
synchronization prior to returning to a call thunk or invoking
non-quadcode that might try to access the variable.
Essentially, this makes the tracing rule for quadcode be, "before any
non-quadcode subsystem gets control, whether by call, return, or
coroutine transfer, at least one write or unset trace has fired
presenting the current value of each traced variable that has been
modified."
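A minimal model of that rule (Python; the class and method names here are invented for illustration - the real mechanism would live in generated code, not a runtime object):

```python
class FrameSync:
    """Buffer writes to traced variables; fire at most one write trace
    per modified variable when control is about to leave quadcode."""

    def __init__(self):
        self.pending = {}   # var -> latest value, trace not yet fired
        self.fired = []     # (var, value) pairs, in the order traces fired

    def write(self, var, value):
        # Inside quadcode: just remember the newest value, no trace yet.
        self.pending[var] = value

    def escape(self):
        # Called before any invoke, return, or coroutine transfer that
        # could let non-quadcode observe the variables.
        for var, value in self.pending.items():
            self.fired.append((var, value))   # fire one write trace
        self.pending.clear()

fs = FrameSync()
fs.write('a', 1)
fs.write('a', 2)    # coalesced: only the final value is traced
fs.write('b', 9)
fs.escape()
print(fs.fired)     # -> [('a', 2), ('b', 9)]
```

This is exactly the behaviour Tk-style consumers can live with: they see the current value of every modified traced variable before anything outside quadcode runs, just not every intermediate store.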
That keeps Tk (and other subsystems depending on linked variables)
working, and in fact saves them work if the same variable is hit
multiple times. It also allows us considerable latitude for moving
reads and writes around, for instance by pushing them above or below
loops in which they appear.
It's surely a change in language semantics between interpreted Tcl and
quadcode, but I think it's most likely a change for the better, given
what traces appear to be used for 'in the wild'.[2]
----
[1] There's also ::env, but I think a bunch of us concluded quite a
long time ago that the read trace on ::env is NOT desirable. I really
ought to work up the TIP for per-interp env, initialized from the
process environment at initial interp creation time, copied from the
parent interp when creating a child interp, and copied to the process
environment (under mutual exclusion) during [exec] and other C logic
that needs the environment in sync. The fact that neither we nor
POSIX defines any sort of synchronization requirement for environment
variable changes is crazy. End of digression.
[2] I've also seen traces used for debugging instrumentation, in which
case they *are* interested in each individual access or change to a
variable. I'd imagine, though, that we'd want either to have a
debugging option to the code generation to force such traces to be
left in, or else run this sort of instrumentation only on interpreted
code, rather than burden every use of quadcode with the cost of the
additional hash table lookups and assignments.
On Wed, Jan 2, 2019 at 9:58 AM Kevin Kenny <kev...@gm...> wrote:
>
> On Wed, Jan 2, 2019 at 6:35 AM Donal K. Fellows
> <don...@ma...> wrote:
> > On 02/01/2019 02:15, Kevin Kenny wrote:
> > In theory, there's no problem with having multiple CALLFRAMEs within a
> > single function once they're distinct entities (due to inlining). I've
> > no idea if this is actually practical, of course!
>
> That's something else that I wanted to discuss. Right now, I'm doing
> only the very simplest inlining. In effect, the inlined function has
> to be trivial enough that I don't need its callframe. That rules out
> having the inlined code do anything like [regexp] because without
> multiple callframes, there's no way for the C implementation of
> [regsub] to find the match variables. It also rules out throwing
> errors until and unless I figure out what to do with 'procLeave'. But
> that's an issue for another day.
>
> > > The new algorithms will be fairly aggressive about moving code around
> > > - so that, for instance, a read-only use of a namespace variable
> > > within a loop can be hoisted out of the loop if nothing reads or
> > writes the variable or any alias. In designing this, I realize that
> > > the logic is about to founder on directSet, directGet, directArraySet,
> > > directArrayGet, etc. - fourteen operations in all. All of these at
> > > least notionally modify or access the state of variables, and so need
> > > to have the callframe among their parameters and results.
> >
> > Those operations are essentially designed for handling the cases where
> > variables are accessed by qualified names, and CAN TRIGGER TRACES. We
> > must handle them correctly, alas; lots of Tcl code is strongly dependent
> > on traces. However, in the case that it is possible to prove that a
> > variable name does not contain a namespace separator, something more
> > efficient can be done; the side effects are much more contained (as we
> > explicitly state that we won't support local variable traces).
>
> We do need to constrain scope somewhat! Traces, as they are currently
> defined, are a feature about which I say, "if you need the
> interpreter, you know where to find it!" Consider the following:
>
> proc tracer {n1 n2 op} {
> uplevel 1 {set b 2}
> }
> trace add variable ::a write tracer
> proc x {} {
> set b 1
> set ::a 1
> return $b
> }
> puts [x]
>
> In the presence of traces, no assumptions about the semantics of code
> are safe. 'tracer', instead of setting an irrelevant variable, could
> have redefined 'set' or something equally nasty.
>
> What I suspect will save us is that nobody sane does radical semantic
> alterations in performance-sensitive code. There will be some core set
> of algorithms where a program spends its time that don't need the
> full functionality of the interpreter and can be aggressively
> optimized.
>
> > You're correct that they should be returning CALLFRAME FAIL STRING in
> > the general case.
>
> Right now, on trunk, none of these functions takes the callframe. It's
> just 'directSet varName value' and so on. That's the immediate
> modification I'm proposing.
>
> > > Beyond this, the next thing to think about will be breaking up direct*
> > > into smaller pieces. If 'directLappend result callframe listvar
> > > element' could turn into a sequence like
> > >
> > > directRef listRef callframe var
> > > refGet val listRef # don't know if I might
> > > lappend val2 val element
> > > refPut temp callframe listvar element
> > > extractCallframe callframe2 temp
> > > retrieveResult result temp
> > >
> > > (plus whatever error checking goes there)
> > >
> > > then I could start looking at aggregating all the directRef's to the
> > > same variable, hoisting them out of loops, and so on. Many of the
> > > instructions in straight-line sequences like this are optimized to
> > > zero-cost operations.
> >
> > Doing this is all about getting the semantics right. That's surprisingly
> > difficult. For example, the code for the equivalent of:
> >
> > lappend ::foo::bar::var $a
> > lappend ::foo::bar::var $b
> > lappend ::foo::bar::var $c
> >
> > and
> >
> > lappend ::foo::bar::var $a $b $c
> >
> > needs to be able to avoid making extra references to the list in the
> > variable if possible or we risk transforming O(N) algorithms into O(N²),
> > and nobody will thank us for that. (Early versions of 8.6 had exactly
> > this bug due to my getting things slightly wrong in the bytecode.)
>
> Indeed. I was talking about making references to the variable, not the
> value. Already, for a local variable, we're able to retain the
> variable ref - it's in the callframe after all - without trouble. As
> far as retaining the value ref goes, that's also likely not to be an
> issue because of the elimination of common subexpressions.
>
> I imagine a pattern like the following (all the error checking stuff
> suppressed):
>
> getVarRef aRef callframe "::foo::bar::var"
> lappendRef result aRef a
> unset result
> lappendRef result aRef b
> unset result
> lappendRef result aRef c
> unset aRef
>
> The only reference to the list is still in the variable, but we avoid
> the cost of resolution on all references except the first.
> Essentially, aRef is holding a Var*.
>
> But that's if we think that ::foo::bar::var must be in sync at all
> times, so read on...
>
> > Similarly, we *must* get the trace semantics mostly right on global
> > variables, at least to the point of having at least one trace call in
> > the right places so that access to variables synched into C code in
> > extensions will work correctly. Also, I vaguely recall that [append] is
> > strange in this regard. Or perhaps that was arguably a bug that is now
> > fixed (or ought to be, for consistency's sake)?
>
> I think we're in agreement here, for sufficiently small values of
> 'mostly right'. Rather than 'all variables in sync at all times', I
> already have logic that assesses known effects. There are already some
> commands that are identified as, 'this might change the value of any
> variable, but otherwise have known effects', plus some commands like
> [regexp] that have constrained effects: '[regexp] does not read any
> variable and writes exactly those variables whose names are passed as
> match variables'.
>
> So one definition of 'mostly right' could be that all variables are in
> sync before and after calls to C commands that might examine or alter
> them, and all namespace variables are in sync on procedure exit.
> Dropping traces that would have fired between those points might be
> fair game.
>
> > I suspect that these will reduce your scope for improvement a lot.
>
> > Otherwise, I'm generally happy to do this. I'm guessing that it might be
> > easiest to split things so that we have operations that produce a VAR
> > type (not a subclass of STRING) that we can then do the other changes
> > on. That'd at least let us factor out the (potentially expensive!)
> > variable lookup operations. The actual reads and writes are usually much
> > cheaper (other than that the values must always be boxed and traces are
> > a true issue).
>
> Given the problem that traces have arbitrary and unknowable side
> effects, the best we can hope for is that some benign ones work.
>
> In such a world, there's also an opportunity for quadcode that we've
> not yet properly addressed. If we allow variables to be out of sync
> momentarily (provided that we've restored consistency any time we get
> to external code that might use them), then we could mostly eliminate
> the need for the K combinator. If we allow 'x' to have a transient
> state where its value is lost, then
>
> set x [lreplace $x $a $b {*}$newelements]
>
> would not incur a performance penalty with respect to the mysterious
>
> set x [lreplace $x[set x {}] $a $b {*}$newelements]
>
> because we could detect that the new value of x is always a
> newly-created value and avoid the extra ref. (This would also involve
> proving either that the [lreplace] cannot throw, or that the value of
> $x will never again be used if it does.)
>
> Anyway, the immediate issue is that direct ops don't reference the
> CALLFRAME formally, and need to. :)
>
> Kevin
|
|
From: Kevin K. <kev...@gm...> - 2019-01-02 14:58:20
|
On Wed, Jan 2, 2019 at 6:35 AM Donal K. Fellows
<don...@ma...> wrote:
> On 02/01/2019 02:15, Kevin Kenny wrote:
> In theory, there's no problem with having multiple CALLFRAMEs within a
> single function once they're distinct entities (due to inlining). I've
> no idea if this is actually practical, of course!
That's something else that I wanted to discuss. Right now, I'm doing
only the very simplest inlining. In effect, the inlined function has
to be trivial enough that I don't need its callframe. That rules out
having the inlined code do anything like [regexp] because without
multiple callframes, there's no way for the C implementation of
[regexp] to find the match variables. It also rules out throwing
errors until and unless I figure out what to do with 'procLeave'. But
that's an issue for another day.
> > The new algorithms will be fairly aggressive about moving code around
> > - so that, for instance, a read-only use of a namespace variable
> > within a loop can be hoisted out of the loop if nothing reads or
> > writes the variable or any alias. In designing this, I realize that
> > the logic is about to founder on directSet, directGet, directArraySet,
> > directArrayGet, etc. - fourteen operations in all. All of these at
> > least notionally modify or access the state of variables, and so need
> > to have the callframe among their parameters and results.
>
> Those operations are essentially designed for handling the cases where
> variables are accessed by qualified names, and CAN TRIGGER TRACES. We
> must handle them correctly, alas; lots of Tcl code is strongly dependent
> on traces. However, in the case that it is possible to prove that a
> variable name does not contain a namespace separator, something more
> efficient can be done; the side effects are much more contained (as we
> explicitly state that we won't support local variable traces).
We do need to constrain scope somewhat! Traces, as they are currently
defined, are a feature about which I say, "if you need the
interpreter, you know where to find it!" Consider the following:
proc tracer {n1 n2 op} {
uplevel 1 {set b 2}
}
trace add variable ::a write tracer
proc x {} {
set b 1
set ::a 1
return $b
}
puts [x]
In the presence of traces, no assumptions about the semantics of code
are safe: here [x] returns 2, not 1, because the trace reached into x's
frame via [uplevel]. 'tracer', instead of setting an irrelevant variable,
could have redefined 'set' or something equally nasty.
What I suspect will save us is that nobody sane does radical semantic
alterations in performance-sensitive code. There will be some core set
of algorithms where a program spends its time that don't need the
full functionality of the interpreter and can be aggressively
optimized.
> You're correct that they should be returning CALLFRAME FAIL STRING in
> the general case.
Right now, on trunk, none of these functions takes the callframe. It's
just 'directSet varName value' and so on. That's the immediate
modification I'm proposing.
> > Beyond this, the next thing to think about will be breaking up direct*
> > into smaller pieces. If 'directLappend result callframe listvar
> > element' could turn into a sequence like
> >
> > directRef listRef callframe var
> > refGet val listRef # don't know if I might
> > lappend val2 val element
> > refPut temp callframe listvar element
> > extractCallframe callframe2 temp
> > retrieveResult result temp
> >
> > (plus whatever error checking goes there)
> >
> then I could start looking at aggregating all the directRef's to the
> > same variable, hoisting them out of loops, and so on. Many of the
> > instructions in straight-line sequences like this are optimized to
> > zero-cost operations.
>
> Doing this is all about getting the semantics right. That's surprisingly
> difficult. For example, the code for the equivalent of:
>
> lappend ::foo::bar::var $a
> lappend ::foo::bar::var $b
> lappend ::foo::bar::var $c
>
> and
>
> lappend ::foo::bar::var $a $b $c
>
> needs to be able to avoid making extra references to the list in the
> variable if possible or we risk transforming O(N) algorithms into O(N²),
> and nobody will thank us for that. (Early versions of 8.6 had exactly
> this bug due to my getting things slightly wrong in the bytecode.)
Indeed. I was talking about making references to the variable, not the
value. Already, for a local variable, we're able to retain the
variable ref - it's in the callframe after all - without trouble. As
far as retaining the value ref goes, that's also likely not to be an
issue because of the elimination of common subexpressions.
I imagine a pattern like the following (all the error checking stuff
suppressed):
getVarRef aRef callframe "::foo::bar::var"
lappendRef result aRef a
unset result
lappendRef result aRef b
unset result
lappendRef result aRef c
unset aRef
The only reference to the list is still in the variable, but we avoid
the cost of resolution on all references except the first.
Essentially, aRef is holding a Var*.
But that's if we think that ::foo::bar::var must be in sync at all
times, so read on...
> Similarly, we *must* get the trace semantics mostly right on global
> variables, at least to the point of having at least one trace call in
> the right places so that access to variables synched into C code in
> extensions will work correctly. Also, I vaguely recall that [append] is
> strange in this regard. Or perhaps that was arguably a bug that is now
> fixed (or ought to be, for consistency's sake)?
I think we're in agreement here, for sufficiently small values of
'mostly right'. Rather than 'all variables in sync at all times', I
already have logic that assesses known effects. There are already some
commands that are identified as, 'this might change the value of any
variable, but otherwise have known effects', plus some commands like
[regexp] that have constrained effects: '[regexp] does not read any
variable and writes exactly those variables whose names are passed as
match variables'.
So one definition of 'mostly right' could be that all variables are in
sync before and after calls to C commands that might examine or alter
them, and all namespace variables are in sync on procedure exit.
Dropping traces that would have fired between those points might be
fair game.
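A hypothetical sketch of what that guarantee would mean in practice (logWrite and the coalescing behaviour here are illustrative assumptions, not implemented semantics):

```
# Illustrative only: assumes the proposed trace coalescing between sync points.
proc logWrite {name1 name2 op} {
    puts "trace: $op on $name1"
}
trace add variable ::counter write logWrite

proc hot {} {
    for {set i 0} {$i < 3} {incr i} {
        set ::counter $i    ;# three writes, no escape point between them
    }
}
# Guaranteed under the proposal: ::counter holds 2 at every escape point
# (here, procedure exit), and logWrite fires for at least the last write.
# Not guaranteed: separate logWrite firings for the writes of 0 and 1.
```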
> I suspect that these will reduce your scope for improvement a lot.
> Otherwise, I'm generally happy to do this. I'm guessing that it might be
> easiest to split things so that we have operations that produce a VAR
> type (not a subclass of STRING) that we can then do the other changes
> on. That'd at least let us factor out the (potentially expensive!)
> variable lookup operations. The actual reads and writes are usually much
> cheaper (other than that the values must always be boxed and traces are
> a true issue).
Given the problem that traces have arbitrary and unknowable side
effects, the best we can hope for is that some benign ones work.
In such a world, there's also an opportunity for quadcode that we've
not yet properly addressed. If we allow variables to be out of sync
momentarily (provided that we've restored consistency any time we get
to external code that might use them), then we could mostly eliminate
the need for the K combinator. If we allow 'x' to have a transient
state where its value is lost, then
set x [lreplace $x $a $b {*}$newelements]
would not incur a performance penalty with respect to the mysterious
set x [lreplace $x[set x {}] $a $b {*}$newelements]
because we could detect that the new value of x is always a
newly-created value and avoid the extra ref. (This would also involve
proving either that the [lreplace] cannot throw, or that the value of
$x will never again be used if it does.)
Anyway, the immediate issue is that direct ops don't reference the
CALLFRAME formally, and need to. :)
Kevin
|
|
From: Donal K. F. <don...@ma...> - 2019-01-02 11:36:09
|
On 02/01/2019 02:15, Kevin Kenny wrote:
> I imagine that it's a pretty mechanical change because at present the
> CALLFRAME doesn't actually mean anything. It's a notional thing that
> keeps me from reordering quadcode in a way that would hoist loads
> above the corresponding stores, that sort of thing, but there is only
> one CALLFRAME per proc, ever, so there doesn't necessarily need to be
> anything in the CALLFRAME value to designate the callframe. (On the
> other hand, I think I see that you have given an internal
> representation to the CALLFRAME data type, so I'm not sure what's
> going on there.)
I need to pass actual call frames to some functions that manipulate
them, especially the ones for setting the frame up and writing to and
reading from it for external calls. IIRC, it's a 'struct CallFrame*' in
all but name.
In theory, there's no problem with having multiple CALLFRAMEs within a
single function once they're distinct entities (due to inlining). I've
no idea if this is actually practical, of course!
> The new algorithms will be fairly aggressive about moving code around
> - so that, for instance, a read-only use of a namespace variable
> within a loop can be hoisted out of the loop if nothing reads or
> writes the variable or any alias. In designing this, I realize that
> the logic is about to founder on directSet, directGet, directArraySet,
> directArrayGet, etc. - fourteen operations in all. All of these at
> least notionally modify or access the state of variables, and so need
> to have the callframe among their parameters and results.
Those operations are essentially designed for handling the cases where
variables are accessed by qualified names, and CAN TRIGGER TRACES. We
must handle them correctly, alas; lots of Tcl code is strongly dependent
on traces. However, in the case that it is possible to prove that a
variable name does not contain a namespace separator, something more
efficient can be done; the side effects are much more contained (as we
explicitly state that we won't support local variable traces).
You're correct that they should be returning CALLFRAME FAIL STRING in
the general case.
> Beyond this, the next thing to think about will be breaking up direct*
> into smaller pieces. If 'directLappend result callframe listvar
> element' could turn into a sequence like
>
> directRef listRef callframe var
> refGet val listRef # don't know if I might
> lappend val2 val element
> refPut temp callframe listvar element
> extractCallframe callframe2 temp
> retrieveResult result temp
>
> (plus whatever error checking goes there)
>
> then I could start looking at aggregating all the directRef's to the
> same variable, hoisting them out of loops, and so on. Many of the
> instructions in straight-line sequences like this are optimized to
> zero-cost operations.
Doing this is all about getting the semantics right. That's surprisingly
difficult. For example, the code for the equivalent of:
lappend ::foo::bar::var $a
lappend ::foo::bar::var $b
lappend ::foo::bar::var $c
and
lappend ::foo::bar::var $a $b $c
needs to be able to avoid making extra references to the list in the
variable if possible or we risk transforming O(N) algorithms into O(N²),
and nobody will thank us for that. (Early versions of 8.6 had exactly
this bug due to my getting things slightly wrong in the bytecode.)
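The hazard is easy to reproduce in plain Tcl (a sketch of the copy-on-write effect, not code from the compiler):

```
# Sketch only: a second reference to the list value raises its refcount,
# so the next lappend must copy the whole list before appending.
set big {}
for {set i 0} {$i < 100000} {incr i} {
    set shadow $big    ;# list value now shared: refcount is 2
    lappend big $i     ;# copy-on-write makes this O(length); loop is O(N**2)
}
# Drop the 'set shadow' line and lappend grows the list in place: O(N) overall.
```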
Similarly, we *must* get the trace semantics mostly right on global
variables, at least to the point of having at least one trace call in
the right places so that access to variables synched into C code in
extensions will work correctly. Also, I vaguely recall that [append] is
strange in this regard. Or perhaps that was arguably a bug that is now
fixed (or ought to be, for consistency's sake)?
I suspect that these will reduce your scope for improvement a lot.
Otherwise, I'm generally happy to do this. I'm guessing that it might be
easiest to split things so that we have operations that produce a VAR
type (not a subclass of STRING) that we can then do the other changes
on. That'd at least let us factor out the (potentially expensive!)
variable lookup operations. The actual reads and writes are usually much
cheaper (other than that the values must always be boxed and traces are
a true issue).
Donal.
|
|
From: Kevin K. <kev...@gm...> - 2019-01-02 02:15:42
|
Donal,
Happy 2019! Hope you're well. I've not been around on the mailing
lists or the chat the last couple of weeks - this has been a time for
spending with family and friends.
But in my odd moments, I have been looking at quadcode some. (In
particular, jump threading has a shiny new implementation that is many
times faster than the old, and in addition, generates a much smaller
explosion of the state space.) I've also done a new implementation of
Global Value Numbering/Partial Redundancy Elimination (GVN-PRE) that
subsumes common subexpression elimination and loop-invariant code
motion, and could (not yet implemented, but possible!) also include
constant folding (rather than having that in a separate pass),
expression simplification (things like 0+a=a) and reduction of
operator strength. Some of that I want to leave to LLVM, but I doubt
that it can do a very good job simplifying dictionary or list
operations!
The choice of these was guided by the issues surrounding poly1305. The
old jump threading was a huge drag on compile speed, and now that
particular issue is removed.
The next big issue in poly1305 is callframe management, so that's what
I've jumped into next. (I took a long false path into trying to
refactor varargs, and intend to get back to that, but it will get
easier once the callframe stuff is refactored.)
It's become fairly clear to me that the 'save to the callframe before
invoke, restore after' is the wrong approach. What's right is to do
what LLVM itself does: allocate slots in the callframe (analogous to
the 'alloca' operations that reserve variables in LLVM), and initially
generate code that loads ('moveFromCallFrame in quadcode notation) and
stores (moveToCallFrame) on every read and write of a variable. That
will start with all variables in sync, and I'm working on the
algorithms (very, very closely related to GVN-PRE) for determining
when it's safe to eliminate loads and stores.
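As an illustration, the initial lowering of a simple assignment might look like this (hypothetical quadcode; operand names and ordering are guesses for exposition, not the implemented notation):

```
# Naive lowering of:  set y [expr {$x + 1}]
moveFromCallFrame x0 callframe0 "x"            # explicit load for every read
add y0 x0 1
moveToCallFrame callframe1 callframe0 "y" y0   # explicit store for every write
# Later passes delete the moves that escape analysis proves redundant.
```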
The new algorithms will be fairly aggressive about moving code around
- so that, for instance, a read-only use of a namespace variable
within a loop can be hoisted out of the loop if nothing reads or
writes the variable or any alias. In designing this, I realize that
the logic is about to founder on directSet, directGet, directArraySet,
directArrayGet, etc. - fourteen operations in all. All of these at
least notionally modify or access the state of variables, and so need
to have the callframe among their parameters and results.
All of them need to get the callframe as input, and the ones that
modify variables need to return {CALLFRAME FAIL STRING} - which will
immediately fall into the conventional sequence of extractCallFrame,
retrieveResult, jumpMaybe.
In the kbk-refactor-callframe branch, I've changed translate.tcl to
generate these operations (not really tested, I'm still working on the
rest of the code!) but I've not yet even started on making the
corresponding change on the codegen/ side. If you could possibly help
with that - I imagine that it's a pretty mechanical change - that'd be
a time-saver for me.
I imagine that it's a pretty mechanical change because at present the
CALLFRAME doesn't actually mean anything. It's a notional thing that
keeps me from reordering quadcode in a way that would hoist loads
above the corresponding stores, that sort of thing, but there is only
one CALLFRAME per proc, ever, so there doesn't necessarily need to be
anything in the CALLFRAME value to designate the callframe. (On the
other hand, I think I see that you have given an internal
representation to the CALLFRAME data type, so I'm not sure what's
going on there.)
There's no particular hurry with all of this. It'll take me a little
while to get my own stuff in order. If you'd rather wait until I have
something that you could test against, I'm fine with that, too.
As a side effect, when this subproject is done, we should get a
correct implementation of [set $a] and [set $a $b] - and a possibly
substantial performance gain in access to [upvar] and namespace
variables.
Beyond this, the next thing to think about will be breaking up direct*
into smaller pieces. If 'directLappend result callframe listvar
element' could turn into a sequence like
directRef listRef callframe var
refGet val listRef # don't know if I might
lappend val2 val element
refPut temp callframe listvar element
extractCallframe callframe2 temp
retrieveResult result temp
(plus whatever error checking goes there)
then I could start looking at aggregating all the directRef's to the
same variable, hoisting them out of loops, and so on. Many of the
instructions in straight-line sequences like this are optimized to
zero-cost operations.
Kevin
|
|
From: Kevin K. <kev...@gm...> - 2018-12-17 23:34:18
|
I just committed and synced a new jump-threading pass to tcl-quadcode. This replaces the old 'nodesplit' pass, which, in a large procedure, could cause the entire optimization pipeline to rerun a great many times, since it would thread only one basic block at a time. (Moreover, it often faced a combinatorial explosion of threading opportunities - many more than were justified by the actual data types, because it did not detect converging pathways in the splits.)
The new threading pass is the first one that I've done to require SSA deconstruction and reconstruction. For the large number of changes that the pass typically makes, it's faster and simpler simply to tear things down and put them back together again.
With the new module in place, the 'renameTemps' pass, which was a thorn in the side of the compiler on large procedures (owing to a performance bug in Tcllib's 'struct::disjointset' module), can be removed. In addition, of course, the old 'nodesplit.tcl' is also no longer relevant.
Donal, feel free to ignore my earlier message about addReference on values combined with NEXIST IMPURE, and returnCode on values that are FAIL+some other type. The revised jump threading separates those combined types before they get to the offending operations. (I now realize that the jump threading is *required* to compile [catch] because the failure and success paths *must* be separated until the FAIL is dismissed.)
Ad astra!
Kevin
|
|
From: Donal K. F. <don...@ma...> - 2018-10-22 05:19:47
|
Ah, the benefits of an intercontinental flight! NRE now compiles and works with expansion (once I sync my fossil repo). It was fiddly, but was essentially just a blend of what existed under the *assumption* that by the time we hit this code, we're basically giving up on compiling deep inside. Theoretically, we might be able to do better with the case where the initial word is not expanded… but I don't check that. It assumes that arguments needing expansion are always well-formed lists. I can't remember if we check that elsewhere. :-) Donal. |
|
From: Donal K. F. <don...@ma...> - 2018-10-18 11:11:13
|
I've just published an update to llvmtcl to support LLVM 7; llvmtcl 3.9.2 is a very small change, but LLVM 7 is a slightly larger change that's forced me to do a small update to the code generator. The issue is that the type signatures of two intrinsics (for memcpy and memset) have changed; the new code is flexible between LLVM versions. People using branches other than trunk will want to merge trunk before using LLVM 7. (llvmtcl 3.9.2 also removes one API call, but we weren't making any use of it.) Donal. |
|
From: Kevin K. <kev...@gm...> - 2018-05-01 18:21:19
|
In the last few messages, I mentioned working on a mystery bug in calling procedures that returned CALLFRAME COROHANDLE. I now realize that the call-and-return mechanisms were innocent. Instead, the true culprit was the entry of a procedure that has variables that must be kept in the Tcl callframe.
The local variable table was being allocated with '$builder allocArray', which turns into an 'alloca' instruction. That instruction was not in the entry block of the function, and for whatever reason, LLVM's optimizer decided it couldn't move it there. The result was that the LVT was allocated on the stack, and promptly went away at the first coroutine suspension, and nasal daemons were promptly invoked.
People who work on this stuff should be aware that there's a variable $entryBlock in the compiler, which holds an unclosed block, and the builder has methods 'allocInBlock' and 'allocArrayInBlock' to make use of it. Anything brought in with a conventional 'alloca' cannot persist across suspending an LLVM coroutine. It appears that constant-size 'alloca' operations in the entry blocks of inlined procedures are safely moved to the entry block of the procedure in which they appear, but other 'alloca's should all appear before the first 'call', lest the 'call' be inlined and the 'alloca' appear outside the entry block as a consequence.
With the allocation issue fixed, 'rectest3' started working, so I'm back to having all the existing tests pass. I still need to do code and tests for NRE-aware invocation of commands (rather than compiled procs) and NRE-aware invokeExpanded. Then I'll turn on the change that makes the I/O commands require NRE (because several packages yield from stacked channel handlers). Last will come the implementation of 'tailcall' and 'yield'. 'yieldto' may be a bridge too far at this point; I have Absolutely No Idea what to do in a compiled caller of a coroutine that does 'yieldto'!
So, continuing to plug ahead - along with boxing up my office; they're insisting that I have to move to a different space. :(
|
|
From: Kevin K. <kev...@gm...> - 2018-04-30 19:16:06
|
On Mon, Apr 30, 2018 at 3:02 PM, Donal K. Fellows <don...@ma...> wrote:
> On 30/04/2018 18:00, Kevin Kenny wrote:
>> I'm also noticing that the code gen is starting to emit pretty loose and
>> bulky code, which is sure to cause cache pressure in the long run. I'm
>> wondering if some of the bigger sequences (setting up a callframe and
>> performing an [upvar] are two that come to mind) would actually be better
>> with 'inlinehint' as opposed to 'alwaysinline.'
>
> Right now, that's governed by the definition of the Module:local method
> (in codegen/struct.tcl) which is currently setting library functions to be
> always inlined except when explicitly told to make them 'noinline'. I've
> updated (on kbk-nre) the method to also recognise 'inlinehint' as another
> override; add it after the type signature on the functions that need the
> balanced approach (since most of the standard library depends on being
> inlined to work efficiently, I'm going to keep the default what it is right
> now).
OK, thanks!
This is bringing up another intermediate-term idea - simplifying some of the code issuer by delegating more work up front. I am increasingly hating the INT type; why don't we just go int64 and have done with it, forcing int32 only when there's some external API that needs it?
Also, I'm seeing that the jump threading that LLVM is doing on NUMERIC operations is ... incomplete, or at best inconsistent. I've half a mind to make the type promotion explicit in quadcode, so that numeric operations will always see int64 or DOUBLE, with conditional jumps enabling both if needed. That way, my jump threading can untangle the code. (I'm even starting to think about separating many procedures that now have NUMERIC args into separate int64 and double versions - so that they can in turn be threaded with their callers. That would get rid of a significant fraction of the 'must inline' stuff.) But the idea comes to me partly just because Builder:unknown frightens me.
|
|
From: Donal K. F. <don...@ma...> - 2018-04-30 19:10:09
|
On 30/04/2018 18:00, Kevin Kenny wrote:
> In doing so, I'm noticing a few things that are causing me a little bit
> of concern. I'd like to get a second opinion on them.
The original concept was that we were compiling code to replace procedures in a specific interpreter, when making everything (interpreter and Tcl_Obj literals) “global” to the compilation was a reasonable approach. I agree that if we're going to be more like building libraries, we should do it by passing the interpreter explicitly and hanging any relevant “constants” off that via the usual approaches (especially as we don't currently use the ClientData parameter in the binding layer at all). I think that can all be done via the code generator; it doesn't need the quadcode level to be aware.
> While I'm in there, I'm going to add an additional argument to
> 'invoke' and 'invokeExpanded', which will be a list giving the
> leading arguments that were replaced by the name being invoked. This
> will let us start generating correct stack traces even though the
> 'middle end' is precomputing function and ensemble resolutions.
That so reminds me of what I did with 'invokeReplace' in TEBC…
Donal.
|
|
From: Donal K. F. <don...@ma...> - 2018-04-30 19:02:25
|
On 30/04/2018 18:00, Kevin Kenny wrote:
> I'm also noticing that the code gen is starting to emit pretty loose and
> bulky code, which is sure to cause cache pressure in the long run. I'm
> wondering if some of the bigger sequences (setting up a callframe and
> performing an [upvar] are two that come to mind) would actually be
> better with 'inlinehint' as opposed to 'alwaysinline.'
Right now, that's governed by the definition of the Module:local method (in codegen/struct.tcl) which is currently setting library functions to be always inlined except when explicitly told to make them 'noinline'. I've updated (on kbk-nre) the method to also recognise 'inlinehint' as another override; add it after the type signature on the functions that need the balanced approach (since most of the standard library depends on being inlined to work efficiently, I'm going to keep the default what it is right now).
Donal.
|
|
From: Kevin K. <kev...@gm...> - 2018-04-30 17:00:08
|
I'm still plowing through debugging the interactions introduced by the
CALLFRAME COROHANDLE data type. (I'm sure that I have simply overlooked
something obvious, but it's another one of those bugs that lands into code
that should not be reachable - so I'm having to slog through a lot of
disassembly in llvm.)
In doing so, I'm noticing a few things that are causing me a little bit of
concern. I'd like to get a second opinion on them.
It appears that the Tcl interpreter pointer is being stashed statically and
then referenced throughout the library. This is clearly not going to be
multi-interp safe, defeating our apartment-threaded "multi-interp safe and
thread-oblivious" model. It appears actually to have a fairly
straightforward fix: pass the interpreter as the first parameter to
compiled commands, and stash it in the Builder immediately on generating
the entry sequence. This would be an extra parameter, but my guess is that
it's used frequently enough that it would wind up being kept in %rsi
anyway, so be effectively zero cost.
I see the same thing with the constant pool - Tcl_Obj's are not normally
safe to share among threads. The trick that many Tcl extensions do - have
the constant pool be one of the things hanging off the ClientData of their
commands - would work for us, too. Once again, the ClientData would have to
be passed as a parameter to all our command implementations, consuming
another parameter slot, but once again, it's used often enough that it
would probably just live in %rdi.
Sometime soon would be a good time for me to tackle this. I'm looking at
the possibility of refactoring the generation of calls in the code issuer
in any case - because the cross product of {function, command, command with
arg expansion} and {NRE, no NRE} and {uses callframe, doesn't use
callframe} is turning into an awful lot of apparently redundant code. If
I'm refactoring 'invoke' and 'entry' anyway, I might as well work this
stuff in.
While I'm in there, I'm going to add an additional argument to 'invoke' and
'invokeExpanded', which will be a list giving the leading arguments that
were replaced by the name being invoked. This will let us start generating
correct stack traces even though the 'middle end' is precomputing function
and ensemble resolutions.
I'm also noticing that the code gen is starting to emit pretty loose and
bulky code, which is sure to cause cache pressure in the long run. I'm
wondering if some of the bigger sequences (setting up a callframe and
performing an [upvar] are two that come to mind) would actually be better
with 'inlinehint' as opposed to 'alwaysinline.'
Anyway, as I said, a second opinion would be welcome!
|
|
From: Donal K. F. <don...@ma...> - 2018-04-30 10:30:12
|
On 29/04/2018 20:24, Kevin Kenny wrote:
> Does it fail with mcjit, or only with code gen for an external module?

External module; a representative example is this (which is the CI build
for the v3.9 tag):

https://travis-ci.org/dkfellows/llvmtcl/builds/372739935

The overall run works (because tcltest insists on doing [exit 0] despite
tests failing) but the tests external-1 and external-2 don't work. (They do
pass on my local machine.) Also there are irritatingly many warnings during
the compile, but they seem to be just annoyances and not a real problem.

> If it succeeds with mcjit, does the configurator come up with the same
> target triple that the external-module code gen is expecting? I've had a
> suspicion for a while that our target triple selection is Not Quite Right
> on some of the platforms.

Could be. The triples are annoyingly opaque, though I thought the approach
of asking for the host triple at configuration time would work reliably
(and that's easy). I'd really like to know if there's a difference between
what different parts of LLVM think the host is. (If the LLVM available on
Travis isn't capable of building properly for the host, that's
*astonishingly* nasty!)

Donal.
|
|
From: Kevin K. <kev...@gm...> - 2018-04-29 19:24:33
|
On Sun, Apr 29, 2018 at 1:43 PM, Donal K. Fellows <don...@ma...> wrote:
> I've done a release of llvmtcl, 3.9, fixing up as much as I can and
> getting everything into a good state.
>
> https://github.com/dkfellows/llvmtcl/releases/tag/v3.9
>
> There's only one niggling issue that I know of right now; the travis build
> doesn't pass its own tests because it can't issue code for the native
> platform. It seems like something deeply weird is going on.

Does it fail with mcjit, or only with code gen for an external module? If
it succeeds with mcjit, does the configurator come up with the same target
triple that the external-module code gen is expecting? I've had a suspicion
for a while that our target triple selection is Not Quite Right on some of
the platforms.
|
|
From: Donal K. F. <don...@ma...> - 2018-04-29 17:44:03
|
I've done a release of llvmtcl, 3.9, fixing up as much as I can and
getting everything into a good state.
https://github.com/dkfellows/llvmtcl/releases/tag/v3.9
There's only one niggling issue that I know of right now; the travis
build doesn't pass its own tests because it can't issue code for the
native platform. It seems like something deeply weird is going on.
Donal.
|
|
From: Kevin K. <kev...@gm...> - 2018-04-24 04:03:45
|
I hacked together the patches that:

- make sure that the target triple and target data layout are set in the
  module before any code generation takes place.
- use the LLVM DataLayout for handling 'sizeof' and 'alignof'.

With those in place, I didn't need to tell any lies about the size of the
coroutine promise - CoroSplit finds the correct size and generates correct
code. There is a companion change in llvmtcl that requires advancing the
version to 3.10 - I've sent a pull request with the changes.

With these changes in place, the mrtest crash is gone - the code worked on
the first try once I'd adapted coro.tcl to use the new 'sizeof' and
'alignof' primitives. All the tests in demos/perftest/tester.tcl pass.
Performance isn't great, but I suspect it is adequate: mrtest - which
really stresses the overhead of recursive procedure calls - goes from 960%
to only 290%; qsort - a more realistic example with actual work done in
the procedure before making recursive calls - goes from 244% to 187%. I'm
not positive, but I think I can eliminate one Tcl_NRAddCallback per NRE
procedure invocation, which would help mrtest significantly. I'll put that
one on a back burner for now.

I'm not merging to trunk yet, because I still haven't done invokeExpanded
and invocation of noncompiled commands in the NRE environment. Now that
LLVM is treating my coroutines sanely, that should be a Small Matter of
Programming.

The last five weeks have been quite a slog, but there's light at the end of
the tunnel: my mystery crash, at long last, is gone!

On Mon, Apr 23, 2018 at 6:14 PM, Kevin Kenny <kev...@gm...> wrote:
> Would it be possible to commit to 'mcjit' or 'simple' before we get to
> 'optimize' and attach the execution engine then? I'm pretty sure that the
> empty string we pass as DataLayout isn't giving the same answers for
> structure offsets. The LLVM code that's generating the incorrect offset in
> the call thunk is using the data layout to do it - and I think the data
> layout is changing when we bind the execution engine, since that's the
> place where the target data layout is actually known.
|
|
From: Kevin K. <kev...@gm...> - 2018-04-23 22:14:24
|
It occurs to me: if CoroSplit is taking the coroutine frame as an (i8*),
adding a fixed offset, and expecting that to point to a data slot beyond
the coroutine promise, then it has to be getting the size of the promise
statically from somewhere. This has to be the DataLayout associated with
the Module. But we don't bind that correctly until [$module simple] or
[$module mcjit], when we attach the execution engine.

Would it be possible to commit to 'mcjit' or 'simple' before we get to
'optimize' and attach the execution engine then? I'm pretty sure that the
empty string we pass as DataLayout isn't giving the same answers for
structure offsets. The LLVM code that's generating the incorrect offset in
the call thunk is using the data layout to do it - and I think the data
layout is changing when we bind the execution engine, since that's the
place where the target data layout is actually known.

Relevant LLVM code is at
http://llvm.org/doxygen/CoroFrame_8cpp_source.html#l00313
and the layout of the first few elements of the frame is accomplished
(using that Padder class) at
http://llvm.org/doxygen/CoroFrame_8cpp_source.html#l00384

I'm guessing that the right sequence is:

    LLVMCreateTargetMachine
    LLVMCreateTargetDataLayout
    LLVMSetDataLayout

(for separate compilation) or

    LLVMCreateMCJITCompilerForModule
    LLVMGetExecutionEngineTargetMachine
    LLVMGetExecutionEngineTargetData

so that we will have a target machine and data layout for the optimizer in
both cases. In addition, we probably need to call LLVMAddAnalysisPasses on
both the function and module pass managers, to get the target-dependent
optimizations.

It looks as if perftester/MCJIT goes through LLVM::optimise, and that the
other two places I need to worry about are LLVM::compilepackagetodll and
LLVM::compiletodll. It looks as if Module:simple and Module:interpreter
are both dead at this point? Am I missing anything else?

They don't make it easy, do they? I'll see about hacking up patches to
struct.tcl, support.tcl and jit.tcl (and whatever might spill over
elsewhere) - maybe this will clear up the issues.
|