|
From: Nicolas N. <Nic...@iw...> - 2003-11-22 15:08:02
|
Hello,

I am trying to find out if it is possible to call Fortran BLAS routines also on short vectors. I am running into the following problem:

I have put a test program on

http://cox.iwr.uni-heidelberg.de/~neuss/misc/mflop-new.lisp

When I test the Lisp ddot/daxpy code I get:

DDOT-long: 271.15 MFLOPS
DDOT-short: 679.58 MFLOPS
DAXPY-long: 143.55 MFLOPS
DAXPY-short: 488.06 MFLOPS

But when I call the Matlisp routines (not via CLOS!), I get

BLAS-DDOT-long: 267.10 MFLOPS
BLAS-DDOT-short: 63.31 MFLOPS
BLAS-DAXPY-long: 149.13 MFLOPS
BLAS-DAXPY-short: 61.01 MFLOPS

The reason is probably that the external function call is almost as costly as the daxpy for the case +N-short+=256, while calling Lisp functions is much faster. Is it possible to cut down these costs?

Thanks, Nicolas.
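(For reference: the test file is not reproduced here, but a typed Lisp DDOT kernel of the kind such a benchmark times might look like the sketch below. The function name and the (simple-array double-float (*)) argument type are assumptions:)

  (defun lisp-ddot (x y)
    ;; Dot product of two double-float vectors; the declarations let
    ;; CMUCL open-code the arithmetic with no boxing inside the loop.
    (declare (type (simple-array double-float (*)) x y)
             (optimize (speed 3) (safety 0)))
    (let ((sum 0d0))
      (declare (type double-float sum))
      (dotimes (i (length x) sum)
        (incf sum (* (aref x i) (aref y i))))))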
|
From: Nicolas N. <Nic...@iw...> - 2003-11-22 16:58:18
|
Hello,

Rereading my mail I see that I expressed myself badly again. Corrections:

> Hello,
>
> I am trying to find out if it is possible to call Fortran BLAS routines

Of course, it is possible. But is it possible without such a tremendous performance loss?

> also on short vectors. I am running into the following problem:
>
> I have put a test program on
>
> http://cox.iwr.uni-heidelberg.de/~neuss/misc/mflop-new.lisp
>
> When I test the Lisp ddot/daxpy code I get:
>
> DDOT-long: 271.15 MFLOPS
> DDOT-short: 679.58 MFLOPS
> DAXPY-long: 143.55 MFLOPS
> DAXPY-short: 488.06 MFLOPS
>
> But when I call the Matlisp routines (not via CLOS!), I get
>
> BLAS-DDOT-long: 267.10 MFLOPS
> BLAS-DDOT-short: 63.31 MFLOPS
> BLAS-DAXPY-long: 149.13 MFLOPS
> BLAS-DAXPY-short: 61.01 MFLOPS
>
> The reason is probably that the external function call is almost as costly

From the numbers it is obvious that the call is even much more expensive than a daxpy for 256 double-floats. How come?

> as the daxpy for the case +N-short+=256, while calling Lisp functions is
> much faster. Is it possible to cut down these costs?
>
> Thanks, Nicolas.
|
From: Raymond T. <to...@rt...> - 2003-11-24 15:43:19
|
>>>>> "Nicolas" == Nicolas Neuss <Nic...@iw...> writes:
Nicolas> From the numbers it is obvious that the call is even much more expensive
Nicolas> than a daxpy for 256 double-floats. How come?
>> as the daxpy for the case +N-short+=256, while calling Lisp functions is
>> much faster. Is it possible to cut down these costs?
>>
>> Thanks, Nicolas.
>>
I'll try to look into this. There's probably some improvement to be
had, but I doubt we can improve it enough for you. I think the
overhead comes from computing the necessary addresses, and also having
to turn off GC during the computation. IIRC, this involves an
unwind-protect which does add quite a bit of code.
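(For reference: the call path described above has roughly this shape in CMUCL. This is only a sketch -- %DDOT is a hypothetical stand-in for the alien-funcall into the Fortran ddot_, while SYS:WITHOUT-GCING and SYS:VECTOR-SAP are the CMUCL internals being referred to:)

  (defun blas-ddot (x y)
    (declare (type (simple-array double-float (*)) x y))
    ;; GC must be locked out so X and Y cannot move during the foreign
    ;; call; WITHOUT-GCING involves an UNWIND-PROTECT, a fixed cost
    ;; that dominates for short vectors.
    (sys:without-gcing
      ;; Compute the data addresses from the Lisp objects, load up the
      ;; argument registers, and call out.
      (%ddot (length x) (sys:vector-sap x) 1 (sys:vector-sap y) 1)))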
Note that I also noticed long ago that a simple vector add in Lisp was
at least as fast as calling BLAS. However, having everything go
through FFI to BLAS at least allows us to take advantage of any
special libraries that might be available.
I, however, am not opposed to implementing the BLAS in Lisp. Other
LAPACK routines will still use the original BLAS, and Lisp code can
get the faster versions. This will need thought, design, and
experimentation.
Ray
|
|
From: Nicolas N. <Nic...@iw...> - 2003-11-24 16:13:03
|
Raymond Toy <to...@rt...> writes:

> >>>>> "Nicolas" == Nicolas Neuss <Nic...@iw...> writes:
>
> Nicolas> From the numbers it is obvious that the call is even much more expensive
> Nicolas> than a daxpy for 256 double-floats. How come?
>
> >> as the daxpy for the case +N-short+=256, while calling Lisp functions is
> >> much faster. Is it possible to cut down these costs?
> >>
> >> Thanks, Nicolas.
> >>
>
> I'll try to look into this. There's probably some improvement to be
> had, but I doubt we can improve it enough for you. I think the
> overhead comes from computing the necessary addresses, and also having
> to turn off GC during the computation. IIRC, this involves an
> unwind-protect which does add quite a bit of code.

Yes, you are right. I see this now. If switching off multithreading is expensive, there is a problem here. I don't know enough about these things to help you here.

> Note that I also noticed long ago that a simple vector add in Lisp was
> at least as fast as calling BLAS.

Probably this was before I started using Matlisp.

> However, having everything go through FFI to BLAS at least allows us to
> take advantage of any special libraries that might be available.
>
> I, however, am not opposed to implementing the BLAS in Lisp. Other
> LAPACK routines will still use the original BLAS, and Lisp code can
> get the faster versions. This will need thought, design, and
> experimentation.

I will have to do this at least for a small part of the routines, if the foreign call cannot be achieved with really low overhead (say, twice a Lisp function call). I want to implement flexible sparse block matrices, and choosing Matlisp data for the blocks would be a possibility. But the blocks can be small, therefore I cannot make compromises when operating on those blocks.

Thanks, Nicolas.

P.S.: BTW, how does ACL perform in this respect? Just today I read Duane writing about interoperability of ACL with C and C++. If the overhead we are suffering from is necessary in general, this might be quite a problem for some applications.
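(One way to implement "a small part of the routines" in Lisp while keeping BLAS for long vectors is a size cutoff, as in this sketch. +BLAS-CUTOFF+ and BLAS-DAXPY are hypothetical names, and the threshold would have to be measured:)

  (declaim (inline daxpy!))
  (defun daxpy! (a x y)
    ;; Y <- a*X + Y.  Open-coded loop for short vectors, where the
    ;; foreign-call overhead dominates; FFI BLAS for long ones.
    (declare (type double-float a)
             (type (simple-array double-float (*)) x y)
             (optimize (speed 3) (safety 0)))
    (if (> (length x) +blas-cutoff+)
        (blas-daxpy a x y)
        (dotimes (i (length x) y)
          (incf (aref y i) (* a (aref x i))))))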
|
From: Raymond T. <to...@rt...> - 2003-11-24 16:35:49
|
>>>>> "Nicolas" == Nicolas Neuss <Nic...@iw...> writes:
Nicolas> Raymond Toy <to...@rt...> writes:
>> I'll try to look into this. There's probably some improvement to be
>> had, but I doubt we can improve it enough for you. I think the
>> overhead comes from computing the necessary addresses, and also having
>> to turn off GC during the computation. IIRC, this involves an
>> unwind-protect which does add quite a bit of code.
Nicolas> Yes, you are right. I see this now. If switching off multithreading is
Nicolas> expensive, there is a problem here. I don't know enough about these things to
Nicolas> help you here.
It's not multithreading, per se. It's because we can't have GC
suddenly move the vectors before doing the foreign call, otherwise the
foreign function will be reading and writing to some random place in
memory.
>> Note that I also noticed long ago that a simple vector add in Lisp was
>> at least as fast as calling BLAS.
Nicolas> Probably this was before I started using Matlisp.
Yeah, probably before matlisp became matlisp.
Nicolas> I will have to do this at least for a small part of the routines, if the
Nicolas> foreign call cannot be achieved with really low overhead (say, twice a
Nicolas> Lisp function call). I want to implement flexible sparse block matrices,
A factor of 2 will be very difficult to achieve, since a Lisp function
call basically loads up a bunch of pointers and calls the function.
We need to compute addresses, do the without-gc/unwind-protect stuff,
load up the registers for a foreign call and then call it.
Nicolas> and choosing Matlisp data for the blocks would be a possibility. But the
Nicolas> blocks can be small, therefore I cannot make compromises when operating on
Nicolas> those blocks.
I assume you've profiled it so that the small blocks really are the
bottleneck?
Nicolas> P.S.: BTW, how does ACL perform in this respect? Just today I read Duane
Don't know since I don't have a version of ACL that can run matlisp.
Ray
|
|
From: Nicolas N. <Nic...@iw...> - 2003-11-24 18:12:05
|
Raymond Toy <to...@rt...> writes:

> It's not multithreading, per se. It's because we can't have GC
> suddenly move the vectors before doing the foreign call, otherwise the
> foreign function will be reading and writing to some random place in
> memory.

OK. But if GC is done by the same thread, my simple mind would think that switching it off means setting one global variable to NIL.

> A factor of 2 will be very difficult to achieve, since a Lisp function
> call basically loads up a bunch of pointers and calls the function. We
> need to compute addresses, do the without-gc/unwind-protect stuff, load
> up the registers for a foreign call and then call it.

Yes. Here I assume (along the lines of what Duane posted) that the Lisp compiler also works with addresses and has them readily available.

> Nicolas> and choosing Matlisp data for the blocks would be a possibility. But the
> Nicolas> blocks can be small, therefore I cannot make compromises when operating on
> Nicolas> those blocks.
>
> I assume you've profiled it so that the small blocks really are the
> bottleneck?

I'm still more or less in the design phase. I now have a compact row-ordered scheme (which is as fast as the C version) and want to make it more general without destroying too much performance. It is a very safe bet that I cannot bear too much overhead here. Could be that I will have to handle the very small blocks (1x1--3x3) even without any function call.

Nicolas.
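(For blocks of size 1x1--3x3, one possibility is macro-generated, fully unrolled code with no function call at all. A sketch, assuming the block size is a literal integer and the block A is stored row-major in a flat double-float vector:)

  (defmacro small-block-gemv (n a x y)
    ;; Expands Y <- Y + A*X for a fixed N x N block, fully unrolled.
    ;; A, X and Y should be variables, since they are evaluated
    ;; several times in the expansion.
    (check-type n (integer 1 3))
    `(progn
       ,@(loop for i below n
               collect `(incf (aref ,y ,i)
                              (+ ,@(loop for j below n
                                         collect `(* (aref ,a ,(+ (* i n) j))
                                                     (aref ,x ,j))))))
       ,y))

For example, (small-block-gemv 3 a x y) expands into nine multiplies and nine additions inline, with no call overhead at all.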
|
From: Raymond T. <to...@rt...> - 2003-11-24 19:02:54
|
>>>>> "Nicolas" == Nicolas Neuss <Nic...@iw...> writes:
Nicolas> OK. But if GC is done by the same thread, my simple mind
Nicolas> would think that switching it off means setting one
Nicolas> global variable to NIL.
Yes, I think that's true. I don't use a multithreaded system, though,
so I don't know.
>> A factor of 2 will be very difficult to achieve, since a Lisp function
>> call basically loads up a bunch of pointers and calls the function. We
>> need to compute addresses, do the without-gc/unwind-protect stuff, load
>> up the registers for a foreign call and then call it.
Nicolas> Yes. Here I assume (along the lines of what Duane posted) that the
Nicolas> Lisp compiler also works with addresses and has them readily available.
Yes, we have addresses, but we need to figure out from the Lisp object
address where the actual data is. I would think that in a threaded system,
locking out GC is even more important since other threads can start GC
even if the current thread wouldn't.
But I'll look to see what we can do.
Nicolas> bet that I cannot bear too much overhead here. Could be that I will
Nicolas> have to handle the very small blocks (1x1--3x3) even without any function
Nicolas> call.
I think even normal BLAS overhead would hurt quite a bit if your
blocks are this small. Putting 5 args, say, onto the call stack
probably costs as much as the computation in such a small block.
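(For scale: a daxpy on a 3-vector is 6 flops, while the Fortran DAXPY signature already carries 6 arguments -- N, DA, DX, INCX, DY, INCY -- so marshalling the arguments alone is comparable to the arithmetic in such a block.)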
Ray
|
|
From: Nicolas N. <Nic...@iw...> - 2003-11-25 09:12:46
|
Raymond Toy <to...@rt...> writes:

> >>>>> "Nicolas" == Nicolas Neuss <Nic...@iw...> writes:
>
> Nicolas> OK. But if GC is done by the same thread, my simple mind
> Nicolas> would think that switching it off means setting one
> Nicolas> global variable to NIL.
>
> Yes, I think that's true. I don't use a multithreaded system, though,
> so I don't know.

Even switching off GC should probably not be necessary if everything is working fine. I guess that GC is triggered when objects need to be heap-allocated, but for these low-level calls no consing should occur. (Admittedly, this will probably make the foreign-function interfaces of CL implementations tricky. But it would give us seamless cooperation with the Fortran and C world.)

> Nicolas> bet that I cannot bear too much overhead here. Could be that I will
> Nicolas> have to handle the very small blocks (1x1--3x3) even without any function
> Nicolas> call.
>
> I think even normal BLAS overhead would hurt quite a bit if your
> blocks are this small. Putting 5 args, say, onto the call stack
> probably costs as much as the computation in such a small block.

Yes, you are right here. I don't yet have a perfect solution. But the problem is not that much different for C/C++ and so on, and with the power of Lisp I hope to do at least as well as those languages. Up to now I have accepted a lot of performance degradation in several places. But I want to announce Femlisp to the scientific computing community next year and therefore cannot do this any longer.

Nicolas.
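(The no-consing claim is easy to check in CMUCL, since TIME reports bytes consed. Using the hypothetical DAXPY! sketched earlier in the thread:)

  (let ((x (make-array 3 :element-type 'double-float :initial-element 1d0))
        (y (make-array 3 :element-type 'double-float :initial-element 2d0)))
    ;; The open-coded branch writes into Y in place, so the report
    ;; should show (essentially) zero bytes consed.
    (time (daxpy! 2d0 x y)))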