2009-05-05 10:18:15 UTC
Development has mostly ceased. Some contributors made a big change to add PPC, x86-64, and Itanium support, but I'm afraid it left the code base a bit messy and incomplete. The code overlaying engine works, though it has some problems. As far as efficiency goes, overlaid code is as fast as static linking, which is kind of hard to beat (compare D3DX, which goes through function pointers).
One major problem is that C has no way to do sizeof(function) to get the size of a function's code in bytes. In assembly, it is trivial. That leaves me two options: (1) copy a hard-coded number of bytes and hope it is >= the size of the code the compiler generated, or (2) rewrite the function in assembly, which wastes time since I'd be rewriting non-SIMD C code. In practice, option 1 seems to work OK. On x86-64, though, GCC emits RIP-relative addressing that I can't seem to turn off, and that makes code-copying impossible. So effectively, only 32-bit x86 will be supported for a while.
To get 0.5.0 running, the inline assembly needs to move out of line. MSVC can't understand GCC inline asm, and vice versa, so in order to target any environment, separate asm files are used. A lot of the inline assembly has already been moved out of line, but more is still left. There is also quite a bit of assembly code that has not been written yet (e.g. some of the 3DNow! code). Since SSE seems to be the most common target at the moment, I'd say that focusing on SSE makes the most sense.
A nice, but not strictly necessary, thing would be to implement better testing. I've committed some code that tests the math functionality and gives a per-function pass/fail. Each module could really use a test like that: it makes sure the code actually performs the correct calculation. Always a winner when writing volumes of error-prone assembly code.
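To give an idea of what a per-function pass/fail test might look like, here is a minimal sketch. It is not the test code that was committed: vec3_dot, vec3_cross, the test table, and run_all_tests are all hypothetical names made up for illustration, and the functions under test are plain C stand-ins for the real routines.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical functions under test; in the library these would be
   the C or assembly routines of one module. */
static float vec3_dot(const float *a, const float *b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

static void vec3_cross(float *out, const float *a, const float *b)
{
    out[0] = a[1] * b[2] - a[2] * b[1];
    out[1] = a[2] * b[0] - a[0] * b[2];
    out[2] = a[0] * b[1] - a[1] * b[0];
}

/* Compare with a tolerance, since SIMD code may round differently. */
static int nearly_equal(float got, float want, float tol)
{
    float diff = got - want;
    if (diff < 0.0f)
        diff = -diff;
    return diff <= tol;
}

/* One test per function; each returns 1 on pass, 0 on fail. */
static int test_vec3_dot(void)
{
    const float a[3] = { 1.0f, 2.0f, 3.0f };
    const float b[3] = { 4.0f, 5.0f, 6.0f };
    return nearly_equal(vec3_dot(a, b), 32.0f, 1e-6f);
}

static int test_vec3_cross(void)
{
    const float x[3] = { 1.0f, 0.0f, 0.0f };
    const float y[3] = { 0.0f, 1.0f, 0.0f };
    float z[3];
    vec3_cross(z, x, y);
    return nearly_equal(z[0], 0.0f, 1e-6f)
        && nearly_equal(z[1], 0.0f, 1e-6f)
        && nearly_equal(z[2], 1.0f, 1e-6f);
}

struct test_case {
    const char *name;
    int (*run)(void);
};

static const struct test_case tests[] = {
    { "vec3_dot",   test_vec3_dot },
    { "vec3_cross", test_vec3_cross },
};

/* Prints one PASS/FAIL line per function and returns the failure count. */
static int run_all_tests(void)
{
    int failures = 0;
    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; ++i) {
        int ok = tests[i].run();
        printf("%-12s %s\n", tests[i].name, ok ? "PASS" : "FAIL");
        if (!ok)
            ++failures;
    }
    return failures;
}
```

A module's test program would just call run_all_tests() from main and use the return value as its exit status, so a failing function shows up both in the log and in the build.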
There are lots of x86 math libs that have more or less the same functions, though many use intrinsics rather than assembly. In theory (and probably in practice too), these would be faster since the compiler can schedule the instructions statically. So far, though, I don't think anyone else has gone as far as code overlays, so either you shrink your audience by supporting only a few processors (e.g. "only processors with SSE3 or better") or you use function pointers and incur overhead per call. I'd expect the former to be the more common one.
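For anyone unfamiliar with the function-pointer approach mentioned above, a rough sketch follows. It shows the alternative to overlays, not how this library works. The names dot3, dot3_generic, dot3_sse_standin, and mathlib_init are hypothetical, and the "SSE" variant is a plain C stand-in so the sketch builds on any compiler.

```c
#include <stddef.h>

/* Baseline implementation that runs on any processor. */
static float dot3_generic(const float *a, const float *b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* Stand-in for an SSE routine. A real library would keep this in a
   separate asm file; plain C is used here so the sketch is portable. */
static float dot3_sse_standin(const float *a, const float *b)
{
    float sum = 0.0f;
    for (size_t i = 0; i < 3; ++i)
        sum += a[i] * b[i];
    return sum;
}

typedef float (*dot3_fn)(const float *, const float *);

/* Every call site goes through this pointer, so each call is an
   indirect call. That is the per-call overhead in question. */
static dot3_fn dot3 = dot3_generic;

/* has_sse is an assumed input; a real library would query CPUID once
   at startup and pick the best routine for the processor. */
static void mathlib_init(int has_sse)
{
    dot3 = has_sse ? dot3_sse_standin : dot3_generic;
}
```

Client code calls mathlib_init once and then dot3(a, b) everywhere. With overlays, the selected routine's code is instead copied over a statically linked function, so call sites stay direct calls.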
Patrick