[cgkit-user] cgkit matrix multiply performance - 25x slower than numpy?

I'm working on a simple BVH player that currently uses Tk for display
and cgkit for the matrix math, but I found that the matrix routines
are killing me on performance.  My program takes over
6 minutes on a modern machine to do some moderately simple joint
rotation precomputations on a 2700-frame BVH file, converting
all joint positions for each keyframe to worldspace.

Wondering if I was doing something wrong, I installed numpy and wrote
a few comparison programs.  Numpy was a bit tricky because it
supports both "array" and "matrix" types, and they each have their
own way of invoking matrix multiplication.
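
For anyone who hasn't run into this distinction, here is a minimal sketch of the two invocation styles. It assumes a numpy version that still ships the "matrix" class, and the variable names are illustrative only. With the "array" type, `*` multiplies entry by entry and the matrix product goes through dot(); with the "matrix" type, `*` is the matrix product.

```python
import numpy as np

# Two small 2x2 inputs so the results are easy to check by hand.
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# "array" type: * is elementwise, dot() is the matrix product.
elementwise = a * b
array_product = np.dot(a, b)

# "matrix" type: * is the matrix product.
ma = np.matrix(a)
mb = np.matrix(b)
matrix_product = ma * mb

print(elementwise.tolist())     # [[5, 12], [21, 32]]
print(array_product.tolist())   # [[19, 22], [43, 50]]
print(matrix_product.tolist())  # [[19, 22], [43, 50]]
```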

Here's a mini table of the results I got:
---------------------------------------------------------
(Context: Core 2 Quad 6600, 2.4GHz/core, Fedora 8, Python 2.5.1;
all scripts used only 1 core)

4x4 matrix creation, all zeroes:
   cgkit mat4(): 28000 creations/sec.
   numpy "array": 28000 creations/sec
   numpy "matrix": 8800 creations/sec

4x4 identity matrix creation:
   cgkit mat4(1): 19900 creations/sec
   numpy "array": 28000 creations/sec
   numpy "matrix": 8700 creations/sec

4x4 matrix multiplication:
   cgkit mat4(): 3100 matrix multiplies/sec     (ouch)
   numpy "array": 79300 matrix multiplies/sec
   numpy "matrix": 10800 matrix multiplies/sec
------------------------------------------------------------

So at least for the test code snippets I used, it seems like numpy's
"array" type does matrix multiplication about 25x faster than cgkit.
I also reran the numpy test using some simple non-zero floating-point
entries in the 4x4 arrays and results were the same.

It seems like I ought to switch the matrix math to numpy -- I can't
live with only 3100 matrix multiplies per second when I need to
do precomputations for a few thousand keyframes on a 40-joint skeleton.
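
A rough back-of-the-envelope check, using only the figures quoted in this message (2700 frames, 40 joints, 3100 and 79300 multiplies/sec). The count of exactly one multiply per joint per frame is an assumption made for illustration and is only a lower bound; the real program clearly does more work than that, since it runs for over 6 minutes.

```python
frames = 2700
joints = 40
cgkit_rate = 3100.0    # multiplies/sec, from the table above
numpy_rate = 79300.0   # multiplies/sec, numpy "array" with dot()

# Assumption: one 4x4 multiply per joint per frame (a lower bound).
multiplies = frames * joints

cgkit_seconds = multiplies / cgkit_rate
numpy_seconds = multiplies / numpy_rate
speedup = numpy_rate / cgkit_rate

print("%d multiplies" % multiplies)        # 108000 multiplies
print("cgkit: %.1f sec" % cgkit_seconds)   # cgkit: 34.8 sec
print("numpy: %.2f sec" % numpy_seconds)   # numpy: 1.36 sec
print("ratio: %.1fx" % speedup)            # ratio: 25.6x
```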

Is this pretty much what I should expect?  Anything horribly wrong
in my test code?

(Of course, when I installed numpy, its installer fired up gcc and
built all kinds of crud, so it's probably going down to the bare
metal with its matrix multiply optimizations.)

A few code extracts and results are below so that people can reproduce
them on their own machines as desired.

Bruce Hahne
hahne at io dot com
Disclaimer: programming is not my day job, so I don't know what I'm 
doing :-)

------------------------
SAMPLE 1: 100,000 4x4 matrix multiplies using cgkit

#!/usr/bin/python
import profile
from cgkit.cgtypes import mat4

def profile_me():
    mymat1 = mat4()
    mymat2 = mat4()
    multiply_me(mymat1, mymat2)

def multiply_me(mat1, mat2):
    for x in range(100000):
        out = mat1 * mat2

profile.run('profile_me()')
------------------------------

RESULT OF SAMPLE 1:  (100K multiplies in 32.5 seconds)

          2700012 function calls in 32.504 CPU seconds
    Ordered by: standard name

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    300000    3.443    0.000    3.443    0.000 :0(isinstance)
    500004    2.903    0.000    2.903    0.000 :0(len)
    100000    7.734    0.000   14.905    0.000 :0(map)
         1    0.003    0.003    0.003    0.003 :0(range)
         1    0.000    0.000    0.000    0.000 :0(setprofile)
         1    0.000    0.000   32.504   32.504 <string>:1(<module>)
    100000    5.911    0.000   31.295    0.000 mat4.py:161(__mul__)
    100002    4.134    0.000   21.942    0.000 mat4.py:60(__init__)
   1600000    7.171    0.000    7.171    0.000 mat4.py:97(<lambda>)
         1    1.205    1.205   32.503   32.503 profile3.py:12(multiply_me)
         1    0.000    0.000   32.504   32.504 profile3.py:7(profile_me)
         1    0.000    0.000   32.504   32.504 profile:0(profile_me())
         0    0.000             0.000          profile:0(profiler)

------------------------------------------------

SAMPLE 2: 100,000 4x4 multiplies using numpy "array"

#!/usr/bin/python
import profile
from numpy import *

def profile_me():
    mymat1 = array([ [0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0] ])
    mymat2 = array([ [0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0] ])
    multiply_me(mymat1, mymat2)

def multiply_me(mat1, mat2):
    for x in range(100000):
        out = dot(mat1, mat2)

profile.run('profile_me()')

---------------------------------------

RESULT OF SAMPLE 2:  (100K multiplies in 1.261 sec.)

          100008 function calls in 1.261 CPU seconds
    Ordered by: standard name

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         2    0.000    0.000    0.000    0.000 :0(array)
    100000    0.742    0.000    0.742    0.000 :0(dot)
         1    0.003    0.003    0.003    0.003 :0(range)
         1    0.000    0.000    0.000    0.000 :0(setprofile)
         1    0.000    0.000    1.261    1.261 <string>:1(<module>)
         1    0.516    0.516    1.261    1.261 profile6.py:12(multiply_me)
         1    0.000    0.000    1.261    1.261 profile6.py:7(profile_me)
         1    0.000    0.000    1.261    1.261 profile:0(profile_me())
         0    0.000             0.000          profile:0(profiler)

-----------------------------------------------
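
One caveat about the timings above: the `profile` module is written in pure Python and adds overhead to every function call it records, so code that makes many Python-level calls (2.7 million in Sample 1) is penalized more than code that makes one call per multiply (about 100,000 in Sample 2). A minimal sketch of a measurement without a profiler attached, using the standard `timeit` module, follows. It is written for Python 3. `multiplies_per_sec` and `pure_python_mul` are hypothetical helpers introduced here, and the pure-Python multiply is only a stand-in so that the sketch runs without cgkit or numpy installed.

```python
import timeit

def pure_python_mul(a, b):
    """4x4 matrix product on nested lists (stand-in for a library call)."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def multiplies_per_sec(mul, a, b, count=10000):
    """Time `count` calls of mul(a, b) and return calls per second."""
    elapsed = timeit.timeit(lambda: mul(a, b), number=count)
    return count / elapsed

# 4x4 identity as nested lists, just to have something to multiply.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

rate = multiplies_per_sec(pure_python_mul, identity, identity)
print("%.0f multiplies/sec" % rate)
```

To compare the libraries themselves, the same helper could be given `lambda a, b: a * b` with two mat4 objects, or numpy's dot with two arrays.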