From: Benno S. <be...@ga...> - 2002-11-13 16:35:37
Hi,

over the last couple of days I ran benchmarks to analyze the speed of the resampling/mixing routines that will make up the core of the RAM sampler module. Since we will probably go all floating point (for its higher precision, headroom and flexibility compared to integer), we need to be careful to optimize the code because, as we all know, x86 FPUs do suck a bit.

I ran the benchmarks on a Celeron, a P4 and an Athlon, and I must admit that the Athlon will make a damn good sampler box since it seems to have a speedy FPU. The difference is especially notable with cubic interpolation: an Athlon 1400 matches the performance of a 1.8 GHz P4.

Anyway, if you want to play a bit with my benchmark (it's only a quick hack to test a few routines), just download it from
http://www.linuxdj.com/benno/rspeed4.tgz

Steve H: I have added stereo mixing with volume support to better reflect the behaviour of a real sampler with pan support. Fortunately the performance drop from the mono version is minimal thanks to caching.

The strange thing is that on most modern x86 CPUs using doubles is as fast as or faster than using floats. That's good :-)

Regarding the RAM sampler module I proposed earlier: I studied some event-based stuff David Olofson proposed a long time ago, and since Steve H. said "we will probably need both event based stuff and control values but the control value frequency does not need that high", I made a few calculations and it seems to pay off to implement the control values as fine-grained events. One might say this is a waste of CPU, but as Steve wrote in an earlier posting on this list, the rate of CV values is usually much lower than the sample rate (1/4 to 1/16 of it). This means that even if the event stream is very dense, the added overhead is minimal.

I think the best way to find a good compromise between flexibility and speed is to try out several methods and pick those with the best price/performance ratio.
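To make the "control values as fine-grained events" idea concrete, here is a minimal sketch in C. It is only an illustration under assumptions: the `cv_event` type, the `render_block` function and the block-splitting scheme are hypothetical and are not taken from the benchmark or from David Olofson's proposal. The point it shows is that the block is split at event timestamps, so the inner loop carries no per-sample control checks and the overhead is paid once per event.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical control event: set the voice volume at a frame offset. */
typedef struct {
    size_t frame;   /* offset inside the current block */
    float  volume;  /* volume to use from that frame on */
} cv_event;

/* Render one block of a voice into out[], honouring volume events.
 * Events must be sorted by frame and lie inside [0, nframes).
 * Returns the volume in effect at the end of the block so the caller
 * can carry it over to the next block. */
static float render_block(const float *in, float *out, size_t nframes,
                          const cv_event *ev, size_t nev, float volume)
{
    size_t pos = 0;
    for (size_t e = 0; e <= nev; e++) {
        /* Run until the next event, or to the end of the block. */
        size_t end = (e < nev) ? ev[e].frame : nframes;
        assert(end >= pos && end <= nframes);
        for (size_t i = pos; i < end; i++)
            out[i] += volume * in[i];   /* tight loop, no event checks */
        pos = end;
        if (e < nev)
            volume = ev[e].volume;      /* apply the control change */
    }
    return volume;
}
```

With a control rate of 1/4 to 1/16 of the sample rate, each inner run in this sketch would be 4 to 16 samples long, which is where the "minimal overhead" estimate comes from.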

I have an important question regarding the effect sends (since I am not an expert here): are FX in soft samplers/synths usually stereo or mono? Since we are using recompilation this can be made flexible, but I have noticed that FX send channels can chew up quite some CPU. See this data from my Celeron 366 (cubic interpolation with looping, mono voices, stereo output with pan):

no FX sends:
  samples/sec = 4879341.532237
  mono voices at 44.1kHz = 110.642665
  efficiency: 74.957245 CPU cycles/sample

one stereo FX send:
  samples/sec = 4104676.981704
  mono voices at 44.1kHz = 93.076576
  efficiency: 89.103723 CPU cycles/sample

two stereo FX sends:
  samples/sec = 3508911.444682
  mono voices at 44.1kHz = 79.567153
  efficiency: 104.232326 CPU cycles/sample

The CPU cost of two mono sends is about the same as that of a single stereo send, so I was just wondering which way we should go initially (mono, I guess?).

The innermost mixing loop with 2 stereo FX sends looks like this:

  sample_val = CUBIC_INTERPOLATOR;
  output_sum_left[u]   += volume_left      * sample_val;
  output_sum_right[u]  += volume_right     * sample_val;
  effect_sum_left[u]   += fx_volume_left   * sample_val;
  effect_sum_right[u]  += fx_volume_right  * sample_val;
  effect2_sum_left[u]  += fx2_volume_left  * sample_val;
  effect2_sum_right[u] += fx2_volume_right * sample_val;

Makes sense? (output_sum_left/right is the dry component, effect_sum and effect2_sum are the FX sends.)

Some other numbers I got from a P4 1.8 GHz vs. an Athlon 1400 (cubic, looping and 2 stereo FX sends):

P4:
  samples/sec = 12528321.035306
  mono voices at 44.1kHz = 284.088912
  efficiency: 144.401951 CPU cycles/sample

Athlon:
  samples/sec = 14626412.219113
  mono voices at 44.1kHz = 331.664676
  efficiency: 95.721219 CPU cycles/sample

This is with both gcc 3.2 and 2.96. The P4 seems to suck quite a bit. Using the Intel C / gcc compilers with SSE optimizations did not provide any speedup; in some cases the performance was even worse.
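For readers who want to experiment with the inner loop quoted earlier in this mail, here is a self-contained sketch in C. Several things in it are assumptions and not taken from the benchmark: `CUBIC_INTERPOLATOR` is not defined in the mail, so a 4-point Catmull-Rom cubic is used as a stand-in; the `bus` struct and `mix_voice` function are hypothetical names; the sends are handled by a loop over buses where the mail's version unrolls them by hand; and looping is left out.

```c
#include <assert.h>
#include <stddef.h>

/* 4-point Catmull-Rom cubic between s[1] and s[2], t in [0,1).
 * Assumed stand-in for the CUBIC_INTERPOLATOR macro. */
static inline float cubic(const float *s, float t)
{
    float a = -0.5f * s[0] + 1.5f * s[1] - 1.5f * s[2] + 0.5f * s[3];
    float b = s[0] - 2.5f * s[1] + 2.0f * s[2] - 0.5f * s[3];
    float c = 0.5f * (s[2] - s[0]);
    return ((a * t + b) * t + c) * t + s[1];
}

/* Hypothetical stereo bus: bus 0 is the dry output, the rest FX sends. */
typedef struct {
    float *left, *right;        /* accumulation buffers */
    float  vol_left, vol_right; /* per-voice send levels */
} bus;

/* Mix nframes of one mono voice into nbus stereo buses.
 * pos is the fractional read position, step the pitch ratio.
 * The caller must keep every read position inside the waveform with
 * one sample of margin before it and two after it (no looping here).
 * Returns the read position after the last frame. */
static double mix_voice(const float *wave, size_t wave_len,
                        double pos, double step,
                        bus *buses, size_t nbus, size_t nframes)
{
    for (size_t u = 0; u < nframes; u++) {
        size_t i = (size_t)pos;
        assert(i >= 1 && i + 2 < wave_len);
        float sample_val = cubic(&wave[i - 1], (float)(pos - (double)i));
        for (size_t b = 0; b < nbus; b++) {
            buses[b].left[u]  += buses[b].vol_left  * sample_val;
            buses[b].right[u] += buses[b].vol_right * sample_val;
        }
        pos += step;
    }
    return pos;
}
```

Calling `mix_voice` with `nbus` set to 1, 2 or 3 corresponds to the "no FX sends", "one stereo FX send" and "two stereo FX sends" cases. Each extra stereo bus adds two multiply-accumulates per sample.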

I heard the P4 relies heavily on well-optimized SSE2 code to deliver maximum performance, and it seems that neither gcc nor icc does an optimal job in this regard. (If I get my hands on a Visual C++ compiler on a P4 box, I will run the benchmark there to see what the performance looks like.)

Let me know your thoughts about all the issues I raised in this (boring) mail :-)

cheers,
Benno

--
http://linuxsampler.sourceforge.net
Building a professional grade software sampler for Linux.
Please help us design and develop it.