[cgkit-user] cgkit matrix multiply performance - 25x slower than numpy?
From: Bruce H. <ha...@io...> - 2008-08-31 23:17:22
I'm working on a simple BVH player that presently uses Tk for display and cgkit for the matrix math, but I found that the matrix routines are killing me on performance. My program presently takes over 6 minutes on a modern machine to do some moderately simple joint rotation precomputations on a 2700-frame BVH file, to preconvert all joint positions for each keyframe to worldspace.

Wondering if I was doing something wrong, I installed numpy and wrote a few comparison programs. Numpy was a bit tricky because it supports both "array" and "matrix" types, and they each have their own way of invoking matrix multiplication. Here's a mini table of the results I got:

---------------------------------------------------------
(Context: Core 2 Quad 6600 2.4GHz/core, Fedora 8, Python 2.5.1,
all scripts used only 1 core)

4x4 matrix creation, all zeroes:
    cgkit mat4():    28000 creations/sec
    numpy "array":   28000 creations/sec
    numpy "matrix":   8800 creations/sec

4x4 identity matrix creation:
    cgkit mat4(1):   19900 creations/sec
    numpy "array":   28000 creations/sec
    numpy "matrix":   8700 creations/sec

4x4 matrix multiplication:
    cgkit mat4():     3100 matrix multiplies/sec (ouch)
    numpy "array":   79300 matrix multiplies/sec
    numpy "matrix":  10800 matrix multiplies/sec
------------------------------------------------------------

So at least for the test code snippets I used, it seems like numpy's "array" type does matrix multiplication about 25x faster than cgkit. I also reran the numpy test using some simple non-zero floating-point entries in the 4x4 arrays and results were the same.

It seems like I ought to switch the matrix math to numpy -- I can't live with only 3100 matrix multiplies per second when I need to do precomputations for a few thousand keyframes on a 40-joint skeleton.

Is this pretty much what I should expect? Anything horribly wrong in my test code?

(Of course, when I installed numpy, its installer fired up gcc and built all kinds of crud, so it's probably going down to the bare metal with its matrix multiply optimizations.)

A few code extracts and results are below so that people can reproduce this on their own machines as desired.

Bruce Hahne
hahne at io dot com
Disclaimer: programming is not my day job, so I don't know what I'm doing :-)

------------------------
SAMPLE 1: 100,000 4x4 matrix multiplies using cgkit

    #!/usr/bin/python

    import profile
    from cgkit.cgtypes import mat4

    def profile_me():
        mymat1 = mat4()
        mymat2 = mat4()
        multiply_me(mymat1, mymat2)

    def multiply_me(mat1, mat2):
        for x in range(100000):
            out = mat1 * mat2

    profile.run('profile_me()')

------------------------------
RESULT OF SAMPLE 1: (100K multiplies in 32.5 seconds)

         2700012 function calls in 32.504 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   300000    3.443    0.000    3.443    0.000 :0(isinstance)
   500004    2.903    0.000    2.903    0.000 :0(len)
   100000    7.734    0.000   14.905    0.000 :0(map)
        1    0.003    0.003    0.003    0.003 :0(range)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        1    0.000    0.000   32.504   32.504 <string>:1(<module>)
   100000    5.911    0.000   31.295    0.000 mat4.py:161(__mul__)
   100002    4.134    0.000   21.942    0.000 mat4.py:60(__init__)
  1600000    7.171    0.000    7.171    0.000 mat4.py:97(<lambda>)
        1    1.205    1.205   32.503   32.503 profile3.py:12(multiply_me)
        1    0.000    0.000   32.504   32.504 profile3.py:7(profile_me)
        1    0.000    0.000   32.504   32.504 profile:0(profile_me())
        0    0.000             0.000          profile:0(profiler)

------------------------------------------------
SAMPLE 2: 100,000 4x4 multiplies using numpy "array"

    #!/usr/bin/python

    import profile
    from numpy import *

    def profile_me():
        mymat1 = array([ [0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0] ])
        mymat2 = array([ [0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0] ])
        multiply_me(mymat1,mymat2)

    def multiply_me(mat1, mat2):
        for x in range(100000):
            out = dot(mat1,mat2)

    profile.run('profile_me()')

---------------------------------------
RESULT OF SAMPLE 2: (100K multiplies in 1.261 sec.)

         100008 function calls in 1.261 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.000    0.000 :0(array)
   100000    0.742    0.000    0.742    0.000 :0(dot)
        1    0.003    0.003    0.003    0.003 :0(range)
        1    0.000    0.000    0.000    0.000 :0(setprofile)
        1    0.000    0.000    1.261    1.261 <string>:1(<module>)
        1    0.516    0.516    1.261    1.261 profile6.py:12(multiply_me)
        1    0.000    0.000    1.261    1.261 profile6.py:7(profile_me)
        1    0.000    0.000    1.261    1.261 profile:0(profile_me())
        0    0.000             0.000          profile:0(profiler)

-----------------------------------------------
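
One caveat on the numbers above: the "profile" module is itself written in Python and adds overhead to every function call it records. The cgkit trace shows about 27 recorded calls per multiply (2700012 calls for 100,000 multiplies), while the numpy trace shows one, so the profiler probably penalizes the cgkit run more heavily and the real gap may be somewhat smaller than 25x. Below is a minimal sketch of the same comparison using "timeit" instead, which runs without profiler hooks. The names matmul4, rate and IDENTITY are hypothetical helpers made up for this illustration; matmul4 is only a pure-Python stand-in, not cgkit's mat4 code, and the numpy part is skipped if numpy is not installed.

```python
import timeit

try:
    import numpy as np
except ImportError:  # numpy is optional for this sketch
    np = None


def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists (pure Python).

    Hypothetical stand-in for a pure-Python matrix class; this is not
    cgkit's implementation.
    """
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rate(func, n=10000):
    """Return calls per second for func(), timed with timeit."""
    elapsed = timeit.timeit(func, number=n)
    return n / elapsed


# 4x4 identity matrix as nested lists of floats
IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

if __name__ == "__main__":
    pure = rate(lambda: matmul4(IDENTITY, IDENTITY))
    print("pure Python: %.0f multiplies/sec" % pure)
    if np is not None:
        m = np.identity(4)
        print("numpy dot:   %.0f multiplies/sec" % rate(lambda: np.dot(m, m)))
```

To time cgkit itself this way, the lambda passed to rate() would wrap "mat1 * mat2" on two mat4 objects, as in SAMPLE 1.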