From: Kevin K. <kev...@gm...> - 2017-06-04 03:28:39
On Sat, Jun 3, 2017 at 2:23 AM, Donal K. Fellows <
don...@ma...> wrote:
> On 03/06/2017 00:28, Donal K. Fellows wrote:
>
>> During testing, I noticed that the procedure that *really* makes LLVM
>> take ages during optimisation seems to be [parseBuiltinsTxt::main]. I
>> have no idea why beyond simple generated function length.
>>
>
> To follow up on this, commenting out that one procedure from the list of
> things to compile better than halves the time in the optimisation stage.
> This is very unexpected! (Right now, it also produces no measurable
> speed-up as it is an I/O-bound procedure, and slows things down slightly
> because of the known issues with my terrible implementation of callframes;
> these are not actually surprising, but the cost of getting to that point
> is.)
It appears to be just that: generated function length. If I turn off
node splitting, compiling that one procedure speeds up more than
fourfold. But of course we *need* node splitting: on some procedures,
the specialisation it enables is a more-than-order-of-magnitude
performance gain. (The catch is that splitting works by duplicating
basic blocks so that each copy can be compiled for one combination of
operand types, which is where all the extra blocks come from.)
I spent today trying to do better node splitting, avoiding useless
splits. It didn't do much good, I'm afraid, but the code that does it
is improved, so I'm keeping the changes.
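To make the block explosion concrete, here is a toy sketch of the
splitting step (a simplified model for illustration only, not the real
splitter, which works on quadcode CFGs and is pickier about which
splits pay off): a join block whose variable can arrive with several
different types gets one specialised copy per distinct incoming type.

    # Toy model: group the predecessors of a join block by the type the
    # variable has on each incoming edge, then make one specialised
    # copy of the block per distinct type.
    proc splitNode {block preds} {
        # $preds maps predecessor block -> incoming type of the variable
        set byType {}
        dict for {pred type} $preds {
            dict lappend byType $type $pred
        }
        set copies {}
        dict for {type plist} $byType {
            dict set copies ${block}@$type $plist
        }
        return $copies
    }

    puts [splitNode B {P1 INT P2 STRING P3 INT}]
    # => B@INT {P1 P3} B@STRING P2

Do that at every join in a large procedure and the block count
multiplies quickly, which matters for the LLVM timings discussed below.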
As a side effect, I needed to introduce a quadcode instruction,

    purify valueOut valueIn

which extracts a FOO from an IMPURE FOO. Since I was implementing that
anyway, I went ahead and guarded these instructions with it:
    add     bitand  bitnot   bitor   bitxor
    div     eq      expon    ge      gt
    land    le      lor      lshift  lt
    mod     mult    neq      rshift  sub
    uminus  uplus   dictIncr
For these instructions, you should now never see an IMPURE operand. I
hope this will allow at least a small amount of backend code to be
simplified.
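In case it helps with that simplification, here's roughly the shape of
the guarding transform as a toy sketch (the quad and operand
representations here are simplified stand-ins chosen for illustration,
not the compiler's actual data structures):

    # Toy model: quads are {opcode result arg ...}; operands are
    # {type name}, e.g. {{IMPURE INT} x0}.  For every guarded opcode,
    # rewrite each IMPURE operand to go through a purify quad first.
    set guarded {
        add bitand bitnot bitor bitxor div eq expon ge gt
        land le lor lshift lt mod mult neq rshift sub
        uminus uplus dictIncr
    }

    proc guardQuads {quads} {
        global guarded
        set out {}
        set n 0
        foreach quad $quads {
            if {[lindex $quad 0] ni $guarded} {
                lappend out $quad
                continue
            }
            set newQuad [lrange $quad 0 1]
            foreach arg [lrange $quad 2 end] {
                lassign $arg type name
                if {[string match {IMPURE *} $type]} {
                    # purify valueOut valueIn: strip the IMPURE wrapper
                    set pure [list [string range $type 7 end] tmp[incr n]]
                    lappend out [list purify $pure $arg]
                    set arg $pure
                }
                lappend newQuad $arg
            }
            lappend out $newQuad
        }
        return $out
    }

    puts [guardQuads {{add {INT x2} {{IMPURE INT} x0} {INT x1}}}]
    # => {purify {INT tmp1} {{IMPURE INT} x0}}
    #    {add {INT x2} {INT tmp1} {INT x1}}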
Back to performance. LLVM's optimisation time appears to grow
quadratically or worse in the number of basic blocks, and expanding
parseBuiltinsTxt produces a lot of them. It appears that the
'emscripten' people have resorted to introducing passes in their front
end that break up large functions, so that LLVM never has to consider
one that big.
http://mozakai.blogspot.com/2013/08/outlining-workaround-for-jits-and-big.html
The developers of an in-memory database server describe doing the same
(see slides 18ff):
http://llvm.org/devmtg/2017-03//assets/slides/using_llvm_in_a_scalable_high_available_in_memory_database_server.pdf
The issue has been reported for Julia:
https://github.com/JuliaLang/julia/issues/19158
And apparently, it's well known in the folklore that once a function
gets beyond a certain size, the LLVM optimizer's performance falls off
a cliff.
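For anyone who wants to measure the scaling directly, a
quick-and-dirty harness like the following will do (it assumes an LLVM
'opt' binary on PATH and the classic 'opt -O2' driver; the generated
chain-of-blocks function is artificial, so treat the numbers as
indicative only):

    # Generate one function that is a chain of $n basic blocks.
    proc makeIR {n} {
        set ir "define i32 @big(i32 %x) \{\nentry:\n  br label %bb0\n"
        for {set i 0} {$i < $n} {incr i} {
            if {$i + 1 < $n} {
                set next bb[expr {$i + 1}]
            } else {
                set next exit
            }
            append ir "bb$i:\n"
            append ir "  %v$i = add i32 %x, $i\n"
            append ir "  %c$i = icmp sgt i32 %v$i, 0\n"
            append ir "  br i1 %c$i, label %$next, label %exit\n"
        }
        append ir "exit:\n  ret i32 %x\n\}\n"
        return $ir
    }

    # Time `opt -O2` as the block count doubles.
    foreach n {250 500 1000 2000 4000} {
        set f [open big.ll w]
        puts $f [makeIR $n]
        close $f
        set us [lindex [time {exec opt -O2 -S -o /dev/null big.ll}] 0]
        puts [format {%5d blocks: %8.3f s} $n [expr {$us / 1.0e6}]]
    }

If the time more than doubles from one row to the next, the growth is
superlinear.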
Automated refactoring into smaller functions is further than I want to
go right now. I think I'm just going to live with the LLVM backend
performance for the moment.