From: Andreas R. <and...@gm...> - 2004-04-07 15:21:18

I'm forwarding David's reply here - I don't know if there are any
implications for the stuff we're talking about wrt. V4, but if anyone has
insight into these areas it might be worthwhile to keep some of this in
mind.

Cheers,
  - Andreas

----- Original Message -----
From: "David P. Reed" <dp...@re...>
To: "Andreas Raab" <and...@gm...>; <dav...@be...>; <al...@sq...>
Sent: Wednesday, April 07, 2004 4:37 PM
Subject: Re: Q: Lowest level VM changes

> I don't think we know what is best at this time. It's clear that the
> inter-teaparty message send has a common-case fast path, but it's too
> early to guess what the change should be.
>
> Probably the biggest other win would be around making it much more
> efficient to use floating point (which we do in tea-times as well as in
> the 3D stuff). Since floats are put on the heap, it might be worth
> looking at the techniques we used in MACLISP interpretation to put
> intermediate floats in a "number stack" that was much more efficiently
> allocated and freed (allocate = push onto the temporary number stack).
> Coupled with compiling sequences of math operations and tests into a
> "math mode" byte code stream that checks types on the inputs and then
> just runs a different byte code interpreter without any further type
> checking, this could speed up math a lot. It's a kind of optimistic or
> speculative execution concept. This, coupled with the matrix stuff,
> would make Croquet a kick-ass math interpreter.
>
> At 05:26 PM 4/6/2004, Andreas Raab wrote:
> > Hi,
> >
> > I am just in an extremely low-level discussion with some people about
> > the benefits of various kinds of lowest-level VM changes, and I was
> > wondering if there is anything in Croquet where certain modifications
> > of the VM could make huge differences. If you have anything where you
> > say "oh, it would be a *huge* improvement to have support for X, Y,
> > or Z", this would be a very good time to voice it. Note that I am not
> > making any promises here - just that I might be able to throw
> > something in that helps us support what you think is needed.
> >
> >  - Andreas
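To make the "number stack" idea concrete, here is a minimal C sketch.
It is purely illustrative - the names (numStack, numPush, sumOfProducts)
are hypothetical and nothing below is actual Squeak VM or MACLISP code.
Allocating an intermediate float is just a pointer bump, and all
intermediates are freed at once by resetting the stack pointer:

#include <stdio.h>

#define NUM_STACK_SIZE 256
static double numStack[NUM_STACK_SIZE];
static int    numSP = 0;                 /* next free slot */

/* "Allocate" an intermediate float: just a push. No heap, no GC. */
static double *numPush(double v) {
    numStack[numSP] = v;
    return &numStack[numSP++];
}

/* Evaluate a*b + c*d with intermediates on the number stack; only the
   final result would need to be boxed onto the heap (not shown). */
static double sumOfProducts(double a, double b, double c, double d) {
    int mark = numSP;                    /* remember the stack depth */
    double *t1 = numPush(a * b);         /* intermediate 1 */
    double *t2 = numPush(c * d);         /* intermediate 2 */
    double result = *t1 + *t2;
    numSP = mark;                        /* free all intermediates at once */
    return result;
}

int main(void) {
    printf("%g\n", sumOfProducts(1.5, 2.0, 3.0, 4.0));  /* prints 15 */
    return 0;
}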
From: Ian P. <ian...@in...> - 2004-04-07 15:44:21

On 07 Apr 2004, at 17:21, Andreas Raab wrote:
> From: "David P. Reed" <dp...@re...>
>
> > Probably the biggest other win would be around making it much more
> > efficient to use floating point [...]. Coupled with compiling
> > sequences of math operations and tests into a "math mode" byte code
> > stream that checks types on the inputs and then just runs a different
> > byte code interpreter without any further type checking, this could
> > speed up math a lot. It's a kind of optimistic or speculative
> > execution concept.

I think you could do this implicitly, at least for the special arithmetic
selectors.

Dispatch bytecodes through a pointer to the bytecode table (identical to
what gnuification generates for the inner loop at present anyway), and on
creation of a float result push it onto the float stack and switch the
dispatch pointer to the "floating bytecode set". Arithmetic selectors
continue to manipulate the float stack until something non-arithmetic
comes along, triggering a pop-and-box of the float stack onto the regular
stack and a switch back to the regular dispatch pointer before continuing
with whatever bytecode we're up to.

No compiler changes needed.

Anton Ertl did something related (but different) in his vmgen, where
parallel bytecode sets are used to represent the state of caching the
topmost stack value in a register.

With a little work this could maybe even be made to look fairly pretty in
the source (with the parallel implementations generated automagically from
the same source methods, with compile-time conditionalised sections), and
extended to work for SmallIntegers too (or even matrices, if they were
ever to become a primitive type known to the arithmetic selectors
directly).

(Of course, the right solution is to generate and execute native code and
do minimal dataflow analysis and method splitting to keep everything
unboxed and in registers as much as possible. But I digress...)

Cheers,
Ian
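The table-switching scheme can be modeled in a few lines of C. The sketch
below is a toy, not Squeak VM code: the instruction format, opcode set,
and names (dispatch, fLeave, etc.) are all invented for illustration. A
float result flips the dispatch pointer to the floating set; the first
non-arithmetic bytecode pops-and-boxes the float stack, flips back, and
re-dispatches the current instruction:

#include <stdio.h>

enum { OP_LIT, OP_ADD, OP_PRINT, OP_HALT, N_OPS };

typedef struct { int op; double arg; } Insn;
typedef void (*OpFn)(const Insn *);

static double stack[64];  static int sp  = 0;   /* regular ("boxed") stack */
static double fstack[64]; static int fsp = 0;   /* unboxed float stack */

static const OpFn *dispatch;                    /* active bytecode set */
static OpFn regularSet[N_OPS];
static OpFn floatSet[N_OPS];
static int running = 1;

/* Regular set. In this toy every literal is a float, so a literal is
   what triggers the switch; a real VM would switch only when an
   arithmetic selector actually produces a float result. */
static void rLit(const Insn *i) {
    fstack[fsp++] = i->arg;
    dispatch = floatSet;            /* enter the floating bytecode set */
}
static void rAdd(const Insn *i)   { (void)i; sp--; stack[sp-1] += stack[sp]; }
static void rPrint(const Insn *i) { (void)i; printf("%g\n", stack[--sp]); }
static void rHalt(const Insn *i)  { (void)i; running = 0; }

/* Floating set: arithmetic stays unboxed on the float stack. */
static void fLit(const Insn *i) { fstack[fsp++] = i->arg; }
static void fAdd(const Insn *i) { (void)i; fsp--; fstack[fsp-1] += fstack[fsp]; }

/* Any non-arithmetic bytecode: pop-and-box, switch back, re-dispatch. */
static void fLeave(const Insn *i) {
    for (int k = 0; k < fsp; k++) stack[sp++] = fstack[k];  /* "box" results */
    fsp = 0;
    dispatch = regularSet;          /* back to the regular set ... */
    dispatch[i->op](i);             /* ... and re-run the current bytecode */
}

static OpFn regularSet[N_OPS] = { rLit, rAdd, rPrint, rHalt };
static OpFn floatSet[N_OPS]   = { fLit, fAdd, fLeave, fLeave };

int main(void) {
    Insn prog[] = { {OP_LIT, 1.5}, {OP_LIT, 2.25}, {OP_ADD, 0},
                    {OP_PRINT, 0}, {OP_HALT, 0} };
    dispatch = regularSet;
    for (const Insn *ip = prog; running; ip++)
        dispatch[ip->op](ip);       /* inner loop: one indirect call per op */
    return 0;                       /* prints 3.75 */
}

The interpreter loop itself never checks types; the mode lives entirely
in which table the dispatch pointer aims at, which is the point of the
trick.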
From: David P. R. <dp...@re...> - 2004-04-07 16:56:22

The only reason to do compiler changes might be to reorder code to
increase the likelihood that you'd stay in the "math mode" for a long
time. This is like the way compilers reorder code to get maximum benefit
from the CPU pipeline and registers, moving loads earlier and stores
later within basic blocks.

The generic strategy - an alternative interpreter that handles certain
streams of operations optimistically and then backs up to retry with the
standard one - benefits most when there is a really fast, really common
case.

Integer calculations would also benefit, by the way, from avoiding the
checks to see whether the intermediates are bigger than small integers,
so you could get very effective integer loops.

At 11:44 AM 4/7/2004, Ian Piumarta wrote:
> I think you could do this implicitly, at least for the special
> arithmetic selectors. [...]
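A hedged C sketch of the integer side of this (the names and the 31-bit
SmallInteger range are assumptions for illustration, not Squeak
specifics): the loop runs unchecked in a machine register, and only if
the final result escapes the SmallInteger range does it back up and retry
via the checking path. In this toy the fast path already holds the exact
value, so the retry merely stands in for a real VM re-executing with
boxed, type-checked arithmetic:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical 31-bit SmallInteger range (one bit lost to the tag). */
#define SMALL_MAX  ((int64_t)0x3FFFFFFF)
#define SMALL_MIN  (-SMALL_MAX - 1)

/* Slow-path stand-in: a real VM would box into LargeIntegers here. */
static int64_t slowPathSum(const int32_t *a, int n) {
    int64_t s = 0;
    for (int i = 0; i < n; i++) s += a[i];   /* pretend these are boxed ops */
    return s;
}

/* Optimistic fast path: no per-iteration tag or range checks on the
   intermediates; one check at the end decides whether to back up. */
static int64_t optimisticSum(const int32_t *a, int n) {
    int64_t s = 0;                 /* intermediates never leave the register */
    for (int i = 0; i < n; i++) s += a[i];
    if (s < SMALL_MIN || s > SMALL_MAX)
        return slowPathSum(a, n);  /* rare case: redo with full checks */
    return s;
}

int main(void) {
    int32_t a[] = { 1000, 2000, 3000 };
    printf("%lld\n", (long long)optimisticSum(a, 3));   /* prints 6000 */
    return 0;
}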