I'm working on a simple BVH player that presently uses Tk for display
and cgkit for the matrix math, but I've found that the matrix routines
are killing me on performance. My program presently takes over
6 minutes on a modern machine to do some moderately simple joint
rotation precomputations on a 2700-frame BVH file, converting
all joint positions for each keyframe to worldspace.
Wondering if I was doing something wrong, I installed numpy and wrote
a few comparison programs. Numpy was a bit tricky because it
supports both "array" and "matrix" types, and they each have their
own way of invoking matrix multiplication.
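
For readers who have not run into the distinction, here is a minimal sketch of the two invocation styles, assuming a reasonably current numpy install. The input values are illustrative only and are not taken from the benchmarks in this post.

```python
import numpy as np

# Two 4x4 inputs with simple nonzero entries (illustrative values only).
a = np.arange(16, dtype=float).reshape(4, 4)
b = np.eye(4) * 2.0

# "array" type: * is elementwise, so matrix multiplication goes through dot().
array_product = np.dot(a, b)
elementwise = a * b

# "matrix" type: * itself means matrix multiplication.
ma = np.matrix(a)
mb = np.matrix(b)
matrix_product = ma * mb

# Both routes should agree on the matrix product.
print(np.allclose(array_product, np.asarray(matrix_product)))
```

The practical trap is that `a * b` on plain arrays runs without error but silently computes the elementwise product instead.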
Here's a mini table of the results I got:

(Context: Core 2 Quad 6600, 2.4GHz/core, Fedora 8, Python 2.5.1;
all scripts used only 1 core)

4x4 matrix creation, all zeroes:
  cgkit mat4():     28000 creations/sec
  numpy "array":    28000 creations/sec
  numpy "matrix":    8800 creations/sec

4x4 identity matrix creation:
  cgkit mat4(1):    19900 creations/sec
  numpy "array":    28000 creations/sec
  numpy "matrix":    8700 creations/sec

4x4 matrix multiplication:
  cgkit mat4():      3100 matrix multiplies/sec (ouch)
  numpy "array":    79300 matrix multiplies/sec
  numpy "matrix":   10800 matrix multiplies/sec

So at least for the test code snippets I used, it seems like numpy's
"array" type does matrix multiplication about 25x faster than cgkit.
I also reran the numpy test using some simple nonzero floating-point
entries in the 4x4 arrays, and the results were the same.

It seems like I ought to switch the matrix math to numpy: I can't
live with only 3100 matrix multiplies per second when I need to
do precomputations for a few thousand keyframes on a 40-joint skeleton.
Is this pretty much what I should expect? Anything horribly wrong
in my test code?
(Of course, when I installed numpy, its installer fired up gcc and
built all kinds of crud, so it's probably going down to the bare
metal with its matrix multiply optimizations.)
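
As a rough illustration of how the keyframe precomputation mentioned above might look once moved to numpy, here is a minimal sketch that composes joint transforms for all frames at once. Everything in it is an assumption made for illustration: the `(n_frames, n_joints, 4, 4)` layout, the `parents` list, the column-vector convention with translation in the last column, and the helper names `world_transforms` and `translation`. None of it comes from the actual player or from cgkit.

```python
import numpy as np

def world_transforms(local, parents):
    """Compose per-joint local 4x4 transforms into worldspace for every frame.

    local   -- array of shape (n_frames, n_joints, 4, 4)
    parents -- parent index per joint, -1 for the root; each parent must
               appear before its children (hypothetical layout)
    """
    world = np.empty_like(local)
    for j, p in enumerate(parents):
        if p < 0:
            world[:, j] = local[:, j]
        else:
            # matmul treats the leading axis as a stack of 4x4 matrices,
            # so this is one call per joint rather than one per joint per frame.
            world[:, j] = np.matmul(world[:, p], local[:, j])
    return world

def translation(x, y, z):
    """Hypothetical helper: 4x4 translation matrix, column-vector convention."""
    m = np.eye(4)
    m[:3, 3] = (x, y, z)
    return m

# Toy 3-joint chain held constant over 2700 frames (made-up data).
n_frames = 2700
parents = [-1, 0, 1]
local = np.empty((n_frames, len(parents), 4, 4))
local[:, 0] = translation(0.0, 1.0, 0.0)
local[:, 1] = translation(0.0, 2.0, 0.0)
local[:, 2] = translation(1.0, 0.0, 0.0)

world = world_transforms(local, parents)
tip_positions = world[:, 2, :3, 3]  # worldspace position of the last joint
print(tip_positions[0])
```

The design point is that the Python-level loop runs once per joint, while the per-frame work happens inside numpy. Whether that helps in practice depends on how the real joint data is laid out.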
A few code extracts and results are below so that people can reproduce
them on their own machines as desired.
Bruce Hahne
hahne at io dot com
Disclaimer: programming is not my day job, so I don't know what I'm
doing :)

SAMPLE 1: 100,000 4x4 matrix multiplies using cgkit
#!/usr/bin/python
import profile
from cgkit.cgtypes import mat4

def profile_me():
    # Two default-constructed (all-zero) 4x4 matrices
    mymat1 = mat4()
    mymat2 = mat4()
    multiply_me(mymat1, mymat2)

def multiply_me(mat1, mat2):
    # 100,000 multiplies; the product is discarded each time
    for x in range(100000):
        out = mat1 * mat2

profile.run('profile_me()')

RESULT OF SAMPLE 1: (100K multiplies in 32.5 seconds)
2700012 function calls in 32.504 CPU seconds

Ordered by: standard name

 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
 300000    3.443    0.000    3.443    0.000  :0(isinstance)
 500004    2.903    0.000    2.903    0.000  :0(len)
 100000    7.734    0.000   14.905    0.000  :0(map)
      1    0.003    0.003    0.003    0.003  :0(range)
      1    0.000    0.000    0.000    0.000  :0(setprofile)
      1    0.000    0.000   32.504   32.504  <string>:1(<module>)
 100000    5.911    0.000   31.295    0.000  mat4.py:161(__mul__)
 100002    4.134    0.000   21.942    0.000  mat4.py:60(__init__)
1600000    7.171    0.000    7.171    0.000  mat4.py:97(<lambda>)
      1    1.205    1.205   32.503   32.503  profile3.py:12(multiply_me)
      1    0.000    0.000   32.504   32.504  profile3.py:7(profile_me)
      1    0.000    0.000   32.504   32.504  profile:0(profile_me())
      0    0.000             0.000           profile:0(profiler)

SAMPLE 2: 100,000 4x4 multiplies using numpy "array"
#!/usr/bin/python
import profile
from numpy import array, dot

def profile_me():
    # Two all-zero 4x4 arrays (integer entries, so these are integer arrays)
    mymat1 = array([ [0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0] ])
    mymat2 = array([ [0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0] ])
    multiply_me(mymat1, mymat2)

def multiply_me(mat1, mat2):
    # 100,000 multiplies; for arrays, dot() is the matrix product
    for x in range(100000):
        out = dot(mat1, mat2)

profile.run('profile_me()')

RESULT OF SAMPLE 2: (100K multiplies in 1.261 sec.)
100008 function calls in 1.261 CPU seconds

Ordered by: standard name

 ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      2    0.000    0.000    0.000    0.000  :0(array)
 100000    0.742    0.000    0.742    0.000  :0(dot)
      1    0.003    0.003    0.003    0.003  :0(range)
      1    0.000    0.000    0.000    0.000  :0(setprofile)
      1    0.000    0.000    1.261    1.261  <string>:1(<module>)
      1    0.516    0.516    1.261    1.261  profile6.py:12(multiply_me)
      1    0.000    0.000    1.261    1.261  profile6.py:7(profile_me)
      1    0.000    0.000    1.261    1.261  profile:0(profile_me())
      0    0.000             0.000           profile:0(profiler)
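
One caveat on reading both results: the `profile` module adds overhead to every Python-level function call it records. The cgkit run made about 2.7 million calls against roughly 100 thousand for numpy, so the profiler likely inflates the cgkit figure more than the numpy one. A minimal sketch of timing the same loop with `timeit` instead, so that no per-call profiling cost is included, might look like this. It uses float arrays from `np.zeros` rather than the integer arrays in Sample 2, and the names are illustrative.

```python
import timeit
import numpy as np

# All-zero 4x4 float arrays, standing in for the Sample 2 inputs.
mat1 = np.zeros((4, 4))
mat2 = np.zeros((4, 4))

def multiply_me(n):
    """Run n matrix multiplies and return the last product."""
    for _ in range(n):
        out = np.dot(mat1, mat2)
    return out

n = 100000
# number=1: time a single pass through the whole loop, wall-clock.
elapsed = timeit.timeit(lambda: multiply_me(n), number=1)
rate = n / elapsed
print("%d multiplies in %.3f s (%.0f per second)" % (n, elapsed, rate))
```

The same harness could wrap the cgkit loop from Sample 1 to get a like-for-like comparison. The ratio between the two libraries may come out different from the profiled numbers above.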

