[Linuxsampler-devel] resampling benchmarks, sampler module, effect sends

Hi,
during the last couple of days I ran benchmarks to analyze the speed
of the resampling/mixing routines that will make up the core of the
RAM sampler module.
Since we will probably go all floating point (because it offers higher
precision, more headroom and more flexibility than integer), we need to
be careful to optimize the code because, as we all know, x86 FPUs do
suck a bit.
I ran the benchmarks on a Celeron, a P4 and an Athlon and must admit
that the Athlon will make for a damn good sampler box since it seems to
have a speedy FPU. The difference is notable especially when using cubic
interpolation: an Athlon 1400 matches the performance of a 1.8GHz P4.
Anyway if you want to play a bit with my benchmark (it's only a quick
hack to test a few routines) just download it from
http://www.linuxdj.com/benno/rspeed4.tgz  
Steve H: I have added stereo mixing with volume support to better
reflect the behaviour of a real sampler with pan support. Fortunately
the performance drop from the mono version is minimal thanks to caching.
The strange thing is that on most modern x86 CPUs using doubles is as
fast as or faster than using floats. That's good :-)

Regarding the RAM sampler module I proposed earlier:
I studied some event-based stuff David Olofson proposed a long time ago
and since Steve H. said "we will probably need both event based stuff
and control values but the control value frequency does not need that
high", I made a few calculations and it seems to pay off to implement
the control values as fine-grained events.
One might say this is a waste of CPU but as Steve wrote in an earlier
posting on this list, the rate of CV values is usually much lower (1/4
to 1/16 of the sample rate). This means that even if the event stream
is very dense the added overhead is minimal.
I think the best way to find a good compromise between flexibility
and speed is to try out several methods and pick those with the best
price/performance ratio.

I have an important question regarding the effect sends: (since I am not
an expert here)
Are FX sends in soft samplers/synths usually stereo or mono?
Since we are using recompilation this can be made flexible, but I have
noticed that FX send channels can chew up quite a bit of CPU.
See this:

Data from my Celeron 366: cubic interpolation with looping, mono voices
but output is stereo (with pan)

no fx sends:
samples/sec = 4879341.532237  mono voices at 44.1kHz = 110.642665
efficency: 74.957245 CPU cycles/sample

one FX stereo send:
samples/sec = 4104676.981704  mono voices at 44.1kHz = 93.076576
efficency: 89.103723 CPU cycles/sample

two FX stereo sends:
samples/sec = 3508911.444682  mono voices at 44.1kHz = 79.567153
efficency: 104.232326 CPU cycles/sample
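For anyone who wants to check the figures, the derived numbers follow
from samples/sec with simple arithmetic. This is only a sketch of how I
read the output; the function names are made up here and the CPU clock
is whatever you plug in (I use the nominal 366 MHz below, which is an
assumption, the real clock is slightly different):

```c
/* Mono voices that can be rendered in real time at a given sample rate. */
double voices_at_rate(double samples_per_sec, double sample_rate_hz)
{
    return samples_per_sec / sample_rate_hz;
}

/* CPU cycles spent per generated sample for a given CPU clock in Hz. */
double cycles_per_sample(double cpu_clock_hz, double samples_per_sec)
{
    return cpu_clock_hz / samples_per_sec;
}

/* Fraction of throughput lost relative to a baseline configuration. */
double relative_drop(double baseline_sps, double slower_sps)
{
    return 1.0 - slower_sps / baseline_sps;
}
```

For example, voices_at_rate(4879341.532237, 44100.0) gives the 110.64
voices shown above, and relative_drop() on the three runs says that one
stereo FX send costs roughly 16% of the throughput and two sends
roughly 28%.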

The CPU power needed for two mono sends is about the same as for one
single stereo send, so I was just wondering which way we should go
initially (mono, I guess?).

The innermost mixing loop with 2 stereo FX sends looks like this:

              sample_val = CUBIC_INTERPOLATOR;
              output_sum_left[u] += volume_left * sample_val;
              output_sum_right[u] += volume_right * sample_val;
              effect_sum_left[u] += fx_volume_left * sample_val;
              effect_sum_right[u] += fx_volume_right * sample_val;
              effect2_sum_left[u] += fx2_volume_left * sample_val;
              effect2_sum_right[u] += fx2_volume_right * sample_val;

Does that make sense?
(output_sum_left/right is the dry component , effect_sum and effect2_sum
the FX sends)
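For context, here is a self-contained sketch of what the loop around
those lines could look like. It is not the code from rspeed4: the
MixState struct, cubic_interp() and mix_voice() are names I made up,
the interpolator is a standard 4-point Catmull-Rom cubic standing in
for CUBIC_INTERPOLATOR (an assumption), and looping is left out to keep
it short (the voice simply stops near the end of the sample):

```c
#include <stddef.h>

/* Hypothetical bundle of the six output buffers and their gains. */
typedef struct {
    float *output_sum_left,  *output_sum_right;   /* dry */
    float *effect_sum_left,  *effect_sum_right;   /* FX send 1 */
    float *effect2_sum_left, *effect2_sum_right;  /* FX send 2 */
    float volume_left,     volume_right;
    float fx_volume_left,  fx_volume_right;
    float fx2_volume_left, fx2_volume_right;
} MixState;

/* 4-point Catmull-Rom cubic; p[0..3] are the samples at -1, 0, +1, +2. */
static float cubic_interp(const float *p, float frac)
{
    float a = (3.0f * (p[1] - p[2]) - p[0] + p[3]) * 0.5f;
    float b = 2.0f * p[2] + p[0] - (5.0f * p[1] + p[3]) * 0.5f;
    float c = (p[2] - p[0]) * 0.5f;
    return ((a * frac + b) * frac + c) * frac + p[1];
}

/*
 * Render up to nframes of one mono voice into dry + 2 stereo FX sends.
 * *pos is the fractional read position (must start at >= 1.0), step is
 * the pitch ratio. Returns the number of frames actually rendered.
 */
size_t mix_voice(const float *wave, size_t wave_len, double *pos,
                 double step, MixState *m, size_t nframes)
{
    size_t u;
    for (u = 0; u < nframes; u++) {
        size_t i = (size_t)*pos;
        if (i < 1 || i + 2 >= wave_len)
            break;  /* not enough neighbours for the cubic: stop */

        float frac = (float)(*pos - (double)i);
        float sample_val = cubic_interp(&wave[i - 1], frac);

        m->output_sum_left[u]   += m->volume_left      * sample_val;
        m->output_sum_right[u]  += m->volume_right     * sample_val;
        m->effect_sum_left[u]   += m->fx_volume_left   * sample_val;
        m->effect_sum_right[u]  += m->fx_volume_right  * sample_val;
        m->effect2_sum_left[u]  += m->fx2_volume_left  * sample_val;
        m->effect2_sum_right[u] += m->fx2_volume_right * sample_val;

        *pos += step;
    }
    return u;
}
```

The six accumulate lines are kept unrolled on purpose: each extra send
adds two multiply-adds and two memory streams per sample, which is
where the per-send cost in the benchmarks comes from.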

Some other numbers I got from a P4 1.8GHz vs. an Athlon 1400,
with cubic interpolation, looping and 2 stereo FX sends:

P4:
samples/sec = 12528321.035306  mono voices at 44.1kHz = 284.088912
efficency: 144.401951 CPU cycles/sample

Athlon:
samples/sec = 14626412.219113  mono voices at 44.1kHz = 331.664676
efficency: 95.721219 CPU cycles/sample

This is with both gcc 3.2 and 2.96. The P4 seems to suck quite a bit.
Using the Intel C and gcc compilers with SSE optimizations did not
provide any speedup; in some cases the performance was even worse.

I heard the P4 relies heavily on good SSE2 optimization in order to
deliver maximum performance, and it seems that neither gcc nor icc
works optimally in this regard.
(If I get my hands on a Visual C++ compiler on a P4 box I will try to
run the benchmark there to see what the performance looks like.)

Let me know your thoughts about all the issues I raised in this (boring)
mail :-)

cheers,
Benno

-- 
http://linuxsampler.sourceforge.net
Building a professional grade software sampler for Linux.
Please help us design and develop it.