|
From: Nicolas N. <Nic...@iw...> - 2003-11-22 15:08:02
|
Hello,

I am trying to find out if it is possible to call Fortran BLAS routines also on short vectors. I am running into the following problem:

I have put a test program on

http://cox.iwr.uni-heidelberg.de/~neuss/misc/mflop-new.lisp

When I test the Lisp ddot/daxpy code I get:

DDOT-long: 271.15 MFLOPS
DDOT-short: 679.58 MFLOPS
DAXPY-long: 143.55 MFLOPS
DAXPY-short: 488.06 MFLOPS

But when I call the Matlisp routines (not via CLOS!), I get

BLAS-DDOT-long: 267.10 MFLOPS
BLAS-DDOT-short: 63.31 MFLOPS
BLAS-DAXPY-long: 149.13 MFLOPS
BLAS-DAXPY-short: 61.01 MFLOPS

The reason is probably that the external function call is almost as costly as the daxpy for the case +N-short+=256, while calling Lisp functions is much faster. Is it possible to cut down these costs?

Thanks, Nicolas.
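(For reference: the test file is not reproduced here, but a typed Lisp DDOT kernel of the kind such a benchmark times might look like the sketch below. The function name and the (simple-array double-float (*)) argument type are assumptions:)

  (defun lisp-ddot (x y)
    ;; Dot product of two double-float vectors; the declarations let
    ;; CMUCL open-code the arithmetic with no boxing inside the loop.
    (declare (type (simple-array double-float (*)) x y)
             (optimize (speed 3) (safety 0)))
    (let ((sum 0d0))
      (declare (type double-float sum))
      (dotimes (i (length x) sum)
        (incf sum (* (aref x i) (aref y i))))))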
|
From: Nicolas N. <Nic...@iw...> - 2003-11-22 16:58:18
|
Hello,

Rereading my mail I see that I expressed myself badly again. Corrections:

> Hello,
>
> I am trying to find out if it is possible to call Fortran BLAS routines

Of course, it is possible. But is it possible without such a tremendous performance loss?

> also on short vectors. I am running into the following problem:
>
> I have put a test program on
>
> http://cox.iwr.uni-heidelberg.de/~neuss/misc/mflop-new.lisp
>
> When I test the Lisp ddot/daxpy code I get:
>
> DDOT-long: 271.15 MFLOPS
> DDOT-short: 679.58 MFLOPS
> DAXPY-long: 143.55 MFLOPS
> DAXPY-short: 488.06 MFLOPS
>
> But when I call the Matlisp routines (not via CLOS!), I get
>
> BLAS-DDOT-long: 267.10 MFLOPS
> BLAS-DDOT-short: 63.31 MFLOPS
> BLAS-DAXPY-long: 149.13 MFLOPS
> BLAS-DAXPY-short: 61.01 MFLOPS
>
> The reason is probably that the external function call is almost as costly

From the numbers it is obvious that the call is even much more expensive than a daxpy for 256 double-floats. How come?

> as the daxpy for the case +N-short+=256, while calling Lisp functions is
> much faster. Is it possible to cut down these costs?
>
> Thanks, Nicolas.
|
From: Raymond T. <to...@rt...> - 2003-11-24 15:43:19
|
>>>>> "Nicolas" == Nicolas Neuss <Nic...@iw...> writes:
Nicolas> From the numbers it is obvious that the call is even much more expensive
Nicolas> than a daxpy for 256 double-floats. How come?
>> as the daxpy for the case +N-short+=256, while calling Lisp functions is
>> much faster. Is it possible to cut down these costs?
>>
>> Thanks, Nicolas.
>>
I'll try to look into this. There's probably some improvement to be
had, but I doubt we can improve it enough for you. I think the
overhead comes from computing the necessary addresses, and also having
to turn off GC during the computation. IIRC, this involves an
unwind-protect which does add quite a bit of code.
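(For reference: the call path described above has roughly this shape in CMUCL. This is only a sketch -- %DDOT is a hypothetical stand-in for the alien-funcall into the Fortran ddot_, while SYS:WITHOUT-GCING and SYS:VECTOR-SAP are the CMUCL internals being referred to:)

  (defun blas-ddot (x y)
    (declare (type (simple-array double-float (*)) x y))
    ;; GC must be locked out so X and Y cannot move during the foreign
    ;; call; WITHOUT-GCING involves an UNWIND-PROTECT, a fixed cost
    ;; that dominates for short vectors.
    (sys:without-gcing
      ;; Compute the data addresses from the Lisp objects, load up the
      ;; argument registers, and call out.
      (%ddot (length x) (sys:vector-sap x) 1 (sys:vector-sap y) 1)))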
Note that I also noticed long ago that a simple vector add in Lisp was
at least as fast as calling BLAS. However, having everything go
through FFI to BLAS at least allows us to take advantage of any
special libraries that might be available.
I, however, am not opposed to implementing the BLAS in Lisp. Other
LAPACK routines will still use the original BLAS, and Lisp code can
get the faster versions. This will need thought, design, and
experimentation.
Ray
|
|
From: Nicolas N. <Nic...@iw...> - 2003-11-24 16:13:03
|
Raymond Toy <to...@rt...> writes:

> >>>>> "Nicolas" == Nicolas Neuss <Nic...@iw...> writes:
>
> Nicolas> From the numbers it is obvious that the call is even much more expensive
> Nicolas> than a daxpy for 256 double-floats. How come?
>
> >> as the daxpy for the case +N-short+=256, while calling Lisp functions is
> >> much faster. Is it possible to cut down these costs?
> >>
> >> Thanks, Nicolas.
> >>
>
> I'll try to look into this. There's probably some improvement to be
> had, but I doubt we can improve it enough for you. I think the
> overhead comes from computing the necessary addresses, and also having
> to turn off GC during the computation. IIRC, this involves an
> unwind-protect which does add quite a bit of code.

Yes, you are right. I see this now. If switching off multithreading is expensive, there is a problem here. I don't know enough about these things to help you here.

> Note that I also noticed long ago that a simple vector add in Lisp was
> at least as fast as calling BLAS.

Probably this was before I started using Matlisp.

> However, having everything go through FFI to BLAS at least allows us to
> take advantage of any special libraries that might be available.
>
> I, however, am not opposed to implementing the BLAS in Lisp. Other
> LAPACK routines will still use the original BLAS, and Lisp code can
> get the faster versions. This will need thought, design, and
> experimentation.

I will have to do this at least for a small part of the routines, if the foreign call cannot be achieved with really low overhead (say, twice a Lisp function call). I want to implement flexible sparse block matrices, and choosing Matlisp data for the blocks would be a possibility. But the blocks can be small, therefore I cannot make compromises when operating on those blocks.

Thanks, Nicolas.

P.S.: BTW, how does ACL perform in this respect? Just today I read Duane writing about interoperability of ACL with C and C++. If the overhead we are suffering from is necessary in general, this might be quite a problem for some applications.
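(One way to implement "a small part of the routines" in Lisp while keeping BLAS for long vectors is a size cutoff, as in this sketch. +BLAS-CUTOFF+ and BLAS-DAXPY are hypothetical names, and the threshold would have to be measured:)

  (declaim (inline daxpy!))
  (defun daxpy! (a x y)
    ;; Y <- a*X + Y.  Open-coded loop for short vectors, where the
    ;; foreign-call overhead dominates; FFI BLAS for long ones.
    (declare (type double-float a)
             (type (simple-array double-float (*)) x y)
             (optimize (speed 3) (safety 0)))
    (if (> (length x) +blas-cutoff+)
        (blas-daxpy a x y)
        (dotimes (i (length x) y)
          (incf (aref y i) (* a (aref x i))))))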
|
From: Raymond T. <to...@rt...> - 2003-11-24 16:35:49
|
>>>>> "Nicolas" == Nicolas Neuss <Nic...@iw...> writes:
Nicolas> Raymond Toy <to...@rt...> writes:
>> I'll try to look into this. There's probably some improvement to be
>> had, but I doubt we can improve it enough for you. I think the
>> overhead comes from computing the necessary addresses, and also having
>> to turn off GC during the computation. IIRC, this involves an
>> unwind-protect which does add quite a bit of code.
Nicolas> Yes, you are right. I see this now. If switching off multithreading is
Nicolas> expensive, there is a problem here. I don't know enough about these things to
Nicolas> help you here.
It's not multithreading, per se. It's because we can't have GC
suddenly move the vectors before doing the foreign call, otherwise the
foreign function will be reading and writing to some random place in
memory.
>> Note that I also noticed long ago that a simple vector add in Lisp was
>> at least as fast as calling BLAS.
Nicolas> Probably this was before I started using Matlisp.
Yeah, probably before matlisp became matlisp.
Nicolas> I will have to do this at least for a small part of the routines, if the
Nicolas> foreign call cannot be achieved with really low overhead (say, twice a
Nicolas> Lisp function call). I want to implement flexible sparse block matrices,
A factor of 2 will be very difficult to achieve, since a Lisp function
call basically loads up a bunch of pointers and calls the function.
We need to compute addresses, do the without-gc/unwind-protect stuff,
load up the registers for a foreign call and then call it.
Nicolas> and choosing Matlisp data for the blocks would be a possibility. But the
Nicolas> blocks can be small, therefore I cannot make compromises when operating on
Nicolas> those blocks.
I assume you've profiled it so that the small blocks really are the
bottleneck?
Nicolas> P.S.: BTW, how does ACL perform in this respect? Just today I read Duane
Don't know since I don't have a version of ACL that can run matlisp.
Ray
|
|
From: Nicolas N. <Nic...@iw...> - 2003-11-24 18:12:05
|
Raymond Toy <to...@rt...> writes:

> It's not multithreading, per se. It's because we can't have GC
> suddenly move the vectors before doing the foreign call, otherwise the
> foreign function will be reading and writing to some random place in
> memory.

OK. But if GC is done by the same thread, my simple mind would think that switching it off means setting one global variable to NIL.

> A factor of 2 will be very difficult to achieve, since a Lisp function
> call basically loads up a bunch of pointers and calls the function. We
> need to compute addresses, do the without-gc/unwind-protect stuff, load
> up the registers for a foreign call and then call it.

Yes. Here I assume (along the lines of what Duane posted) that the Lisp compiler also works with addresses and has them readily available.

> Nicolas> and choosing Matlisp data for the blocks would be a possibility. But the
> Nicolas> blocks can be small, therefore I cannot make compromises when operating on
> Nicolas> those blocks.
>
> I assume you've profiled it so that the small blocks really are the
> bottleneck?

I'm still more or less in the design phase. I now have a compact row-ordered scheme (which is as fast as the C version) and want to make it more general without destroying too much performance. It is a very safe bet that I cannot bear too much overhead here. Could be that I will have to handle the very small blocks (1x1--3x3) even without any function call.

Nicolas.
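(For blocks of size 1x1--3x3, one possibility is macro-generated, fully unrolled code with no function call at all. A sketch, assuming the block size is a literal integer and the block A is stored row-major in a flat double-float vector:)

  (defmacro small-block-gemv (n a x y)
    ;; Expands Y <- Y + A*X for a fixed N x N block, fully unrolled.
    ;; A, X and Y should be variables, since they are evaluated
    ;; several times in the expansion.
    (check-type n (integer 1 3))
    `(progn
       ,@(loop for i below n
               collect `(incf (aref ,y ,i)
                              (+ ,@(loop for j below n
                                         collect `(* (aref ,a ,(+ (* i n) j))
                                                     (aref ,x ,j))))))
       ,y))

For example, (small-block-gemv 3 a x y) expands into nine multiplies and nine additions inline, with no call overhead at all.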
|
From: Raymond T. <to...@rt...> - 2003-11-24 19:02:54
|
>>>>> "Nicolas" == Nicolas Neuss <Nic...@iw...> writes:
Nicolas> OK. But if GC is done by the same thread, my simple mind
Nicolas> would think that switching it off means setting one
Nicolas> global variable to NIL.
Yes, I think that's true. I don't use a multithreaded system, though,
so I don't know.
>> A factor of 2 will be very difficult to achieve, since a Lisp function
>> call basically loads up a bunch of pointers and calls the function. We
>> need to compute addresses, do the without-gc/unwind-protect stuff, load
>> up the registers for a foreign call and then call it.
Nicolas> Yes. Here I assume (along the lines of what Duane posted) that the
Nicolas> Lisp compiler also works with addresses and has them readily available.
Yes, we have addresses, but we need to figure out from the Lisp object
address where the actual data is. I would think that in a threaded system,
locking out GC is even more important since other threads can start GC
even if the current thread wouldn't.
But I'll look to see what we can do.
Nicolas> bet that I cannot bear too much overhead here. Could be that I will
Nicolas> have to handle the very small blocks (1x1--3x3) even without any function
Nicolas> call.
I think even normal BLAS overhead would hurt quite a bit if your
blocks are this small. Putting 5 args, say, onto the call stack
probably costs as much as the computation in such a small block.
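(For scale: a daxpy on a 3-vector is 6 flops, while the Fortran DAXPY signature already carries 6 arguments -- N, DA, DX, INCX, DY, INCY -- so marshalling the arguments alone is comparable to the arithmetic in such a block.)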
Ray
|
|
From: Nicolas N. <Nic...@iw...> - 2003-11-25 09:12:46
|
Raymond Toy <to...@rt...> writes:

> >>>>> "Nicolas" == Nicolas Neuss <Nic...@iw...> writes:
>
> Nicolas> OK. But if GC is done by the same thread, my simple mind
> Nicolas> would think that switching it off means setting one
> Nicolas> global variable to NIL.
>
> Yes, I think that's true. I don't use a multithreaded system, though,
> so I don't know.

Even switching off GC should probably not be necessary if everything is working fine. I guess that GC is triggered when objects need to be heap-allocated, but for these low-level calls no consing should occur. (Admittedly, this will probably make the foreign-function interfaces of CL implementations tricky. But it would give us seamless cooperation with the Fortran and C world.)

> Nicolas> bet that I cannot bear too much overhead here. Could be that I will
> Nicolas> have to handle the very small blocks (1x1--3x3) even without any function
> Nicolas> call.
>
> I think even normal BLAS overhead would hurt quite a bit if your
> blocks are this small. Putting 5 args, say, onto the call stack
> probably costs as much as the computation in such a small block.

Yes, you are right here. I don't yet have a perfect solution. But the problem is not that much different for C/C++ and so on, and with the power of Lisp I hope to do at least as well as those languages. Up to now I have accepted a lot of performance degradation in several places. But I want to announce Femlisp to the scientific computing community next year and therefore cannot do this any longer.

Nicolas.
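(The no-consing claim is easy to check in CMUCL, since TIME reports bytes consed. Using the hypothetical DAXPY! sketched earlier in the thread:)

  (let ((x (make-array 3 :element-type 'double-float :initial-element 1d0))
        (y (make-array 3 :element-type 'double-float :initial-element 2d0)))
    ;; The open-coded branch writes into Y in place, so the report
    ;; should show (essentially) zero bytes consed.
    (time (daxpy! 2d0 x y)))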