|
From: Benno S. <be...@ga...> - 2002-11-13 16:35:37
|
Hi,

during the last couple of days I ran benchmarks in order to analyze the speed of the resampling/mixing routines which will make up the core of the RAM sampler module. Since we will probably go all floating point (because of high precision, headroom and flexibility over integer) we need to be careful to optimize the code because, as we all know, x86 FPUs do suck a bit.

I ran the benchmarks on a Celeron, a P4 and an Athlon and must admit that the Athlon will make a damn good sampler box since it seems to have a speedy FPU. The difference is notable especially when using cubic interpolation: an Athlon 1400 matches the performance of a 1.8GHz P4.

Anyway, if you want to play a bit with my benchmark (it's only a quick hack to test a few routines) just download it from
http://www.linuxdj.com/benno/rspeed4.tgz

Steve H: I have added stereo mixing with volume support to better reflect the behaviour of a real sampler with pan support. Fortunately the performance drop from the mono version is minimal thanks to caching.

The strange thing is that on most modern x86 CPUs using doubles is as fast as or faster than floats. That's good :-)

Regarding the RAM sampler module I proposed earlier: I studied some event based stuff David Olofson proposed a long time ago, and since Steve H. said "we will probably need both event based stuff and control values but the control value frequency does not need that high", I made a few calculations and it seems to pay off to implement the control values as fine grained events. One might say this is a waste of CPU but as Steve wrote in an earlier posting on this list, the rate of CV values is usually much lower (1/4 - 1/16) than the samplerate. This means that even if the event stream is very dense the added overhead is minimal. I think the best way to find a good compromise between flexibility and speed is to try out several methods and pick those with the best price/performance ratio.
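A minimal sketch of the "control values as fine grained events" idea, for illustration only. It is not the LinuxSampler design: the `cv_event` struct, the `render_block` function and the restriction to a single volume parameter are all hypothetical. The block is rendered in runs between event timestamps, so the per-sample loop carries no control logic and the cost of an event is paid once per event. With CV rates of 1/4 to 1/16 of the samplerate, each run covers several samples on average.

```c
#include <stddef.h>

/* Hypothetical event: at sample offset `frame` inside the current block,
   the voice volume changes to `value`. */
typedef struct {
    size_t frame;
    float  value;
} cv_event;

/* Mix `in` into `out`, applying volume events that are sorted by frame.
   `volume` holds the current value and is carried over to the next block. */
static void render_block(const float *in, float *out, size_t nframes,
                         const cv_event *ev, size_t nev, float *volume)
{
    size_t pos = 0;
    for (size_t e = 0; e <= nev; e++) {
        /* Run up to the next event, or to the end of the block. */
        size_t end = (e < nev && ev[e].frame < nframes) ? ev[e].frame
                                                        : nframes;
        for (; pos < end; pos++)
            out[pos] += *volume * in[pos];
        if (e < nev)
            *volume = ev[e].value;   /* event takes effect from `frame` on */
    }
}
```

The inner `for` is the same multiply-accumulate as in a loop without control values, so a dense event stream only adds loop restarts, not per-sample work.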
I have an important question regarding the effect sends (since I am not an expert here): are FXes in soft samplers/synths usually stereo or mono? Since we are using recompilation this can be made flexible, but I have noticed that FX send channels can chew up quite some CPU. See this:

Data from my Celeron 366: cubic interpolation with looping, mono voices but stereo output (with pan).

no FX sends:
samples/sec = 4879341.532237 mono voices at 44.1kHz = 110.642665
efficency: 74.957245 CPU cycles/sample

one FX stereo send:
samples/sec = 4104676.981704 mono voices at 44.1kHz = 93.076576
efficency: 89.103723 CPU cycles/sample

two FX stereo sends:
samples/sec = 3508911.444682 mono voices at 44.1kHz = 79.567153
efficency: 104.232326 CPU cycles/sample

The CPU power needed for two mono sends is about the same as for one single stereo send, so I was just wondering which way we should go initially (mono, I guess?).

The innermost mixing loop with 2 stereo FX sends looks like this:

  sample_val=CUBIC_INTERPOLATOR;
  output_sum_left[u] += volume_left * sample_val;
  output_sum_right[u] += volume_right * sample_val;
  effect_sum_left[u] += fx_volume_left * sample_val;
  effect_sum_right[u] += fx_volume_right * sample_val;
  effect2_sum_left[u] += fx2_volume_left * sample_val;
  effect2_sum_right[u] += fx2_volume_right * sample_val;

Makes sense? (output_sum_left/right is the dry component, effect_sum and effect2_sum are the FX sends.)

Some other numbers I got from a P4 1.8GHz vs an Athlon 1400, with cubic interpolation, looping and 2 stereo FX sends:

P4:
samples/sec = 12528321.035306 mono voices at 44.1kHz = 284.088912
efficency: 144.401951 CPU cycles/sample

Athlon:
samples/sec = 14626412.219113 mono voices at 44.1kHz = 331.664676
efficency: 95.721219 CPU cycles/sample

This is with both gcc 3.2 and 2.96. The P4 seems to suck quite a bit. Using the Intel C / gcc compilers with SSE optimizations did not provide any speedup; in some cases the performance was even worse.
I heard the P4 relies heavily on optimal SSE2 optimizations in order to deliver maximum performance, and it seems that neither gcc nor icc works optimally in this regard. (If I get my hands on a Visual C++ compiler on a P4 box I will run the benchmark there to see what the performance looks like.)

Let me know your thoughts about all the issues I raised in this (boring) mail :-)

cheers,
Benno

--
http://linuxsampler.sourceforge.net
Building a professional grade software sampler for Linux.
Please help us design and develop it. |
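For readers who want to see the inner loop from this mail in compilable form, here is a rough sketch under stated assumptions. The mail does not show what CUBIC_INTERPOLATOR expands to, so a 4-point Catmull-Rom interpolator is swapped in as a stand-in. The names `cubic_interp`, `mix_bus` and `mix_voice` are hypothetical, looping is left out, and the sends are walked with a run-time loop, whereas the mail hard-codes them (the recompilation approach).

```c
#include <stddef.h>

/* Assumed stand-in for CUBIC_INTERPOLATOR: 4-point Catmull-Rom through
   s[i-1], s[i], s[i+1], s[i+2], evaluated at fraction t in [0, 1). */
static float cubic_interp(const float *s, size_t i, float t)
{
    float ym1 = s[i - 1], y0 = s[i], y1 = s[i + 1], y2 = s[i + 2];
    float c1 = 0.5f * (y1 - ym1);
    float c2 = ym1 - 2.5f * y0 + 2.0f * y1 - 0.5f * y2;
    float c3 = 0.5f * (y2 - ym1) + 1.5f * (y0 - y1);
    return ((c3 * t + c2) * t + c1) * t + y0;
}

/* One stereo destination: the dry output or an FX send. */
typedef struct {
    float *left, *right;        /* accumulation buffers */
    float  vol_left, vol_right; /* pan / send levels */
} mix_bus;

/* Resample one mono voice at `pitch` and accumulate it into every bus.
   The caller must keep 1 <= *pos and *pos + 2 < sample length. */
static void mix_voice(const float *sample, double *pos, double pitch,
                      mix_bus *buses, size_t nbuses, size_t nframes)
{
    for (size_t u = 0; u < nframes; u++) {
        size_t i = (size_t)*pos;
        float sample_val = cubic_interp(sample, i, (float)(*pos - (double)i));
        for (size_t b = 0; b < nbuses; b++) {
            buses[b].left[u]  += buses[b].vol_left  * sample_val;
            buses[b].right[u] += buses[b].vol_right * sample_val;
        }
        *pos += pitch;
    }
}
```

Each additional stereo bus adds two multiply-accumulates and two more buffers to touch per sample, which fits the falling throughput in the figures above as sends are added.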
|
From: Richard A. S. <rs...@bi...> - 2002-11-13 17:36:11
|
On 13 Nov 2002 17:47:14 +0100, Benno Senoner wrote:

> The strange thing is that on most modern x86 CPUs using doubles is as
> fast as or faster than floats. That's good :-)

Perhaps that's due to the data bus size and the FPU size. Modern x86 FPUs are 80-bit IIRC. The data bus is 64 bits wide, so fetching a double or a float from memory takes the same amount of cycles. Perhaps going from a 32-bit float to the 80-bit FPU format involves a conversion that uses more cycles than going from a 64-bit double to 80-bit.

--
Richard A. Smith  Bitworks, Inc.
rs...@bi...  479.846.5777 x104
Sr. Design Engineer  http://www.bitworks.com |
|
From: Steve H. <S.W...@ec...> - 2002-11-13 17:44:15
|
On Wed, Nov 13, 2002 at 11:35:58 -0600, Richard A. Smith wrote:
> On 13 Nov 2002 17:47:14 +0100, Benno Senoner wrote:
>
> > The strange thing is that on most modern x86 CPUs using doubles is as
> > fast as or faster than floats. That's good :-)
>
> Perhaps that's due to the data bus size and the FPU size. Modern x86
> FPUs are 80-bit IIRC. The data bus is 64 bits wide, so fetching a
> double or a float from memory takes the same amount of cycles.

The 80-bit format is mainly internal; it's used to maintain IEEE compatibility in the 387, as I understand it. SSE does not use it.

The problem with using doubles (64-bit) or long doubles (80-bit) in your code is the cache effects. You still have the same number of FP stack registers though. See Tim G.'s early attempts with LADSPA filters: it makes no difference when that's the only thing running, but when you add more processes it becomes much slower.

If using doubles was actually faster, you may have missed the trailing f off a constant or used, e.g., sin() instead of sinf().

- Steve |
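The point about the trailing f can be shown in a few lines. In C an unsuffixed floating constant such as 0.5 has type double, so a float operand is promoted and the multiply is done in double. The same goes for sin() from <math.h>, which takes and returns double, while sinf() stays in float. The function names below are hypothetical and only illustrate the language rule; they are not taken from the benchmark.

```c
/* Intended as single precision, but 0.5 is a double constant:
   x is promoted, the multiply runs in double, and the result is
   converted back to float on return. */
static float scale_double(float x)
{
    return x * 0.5;
}

/* The f suffix keeps the whole expression in single precision. */
static float scale_float(float x)
{
    return x * 0.5f;
}
```

Both functions return the same value here; the difference is the precision of the arithmetic and the extra conversions, which is why a "float" benchmark with a missing suffix may end up measuring double code.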
|
From: Steve H. <S.W...@ec...> - 2002-11-13 17:58:47
|
On Wed, Nov 13, 2002 at 05:47:14 +0100, Benno Senoner wrote:
> Since we will probably go all floating point (because of high precision,
> headroom and flexibility over integer) we need to be careful to
> optimize the code because, as we all know, x86 FPUs do suck a bit.

Right, but we can use SSE on P4s (and maybe P3s if it's faster) with gcc3. This just needs the flags I posted to l-a-d, no code changes.

> Steve H: I have added stereo mixing with volume support to better
> reflect the behaviour of a real sampler with pan support. Fortunately
> the performance drop from the mono version is minimal thanks to caching.

Excellent. I thought we were wasting a lot of cycles waiting for the RAM in the mono case.

[events and CV]
> One might say this is a waste of CPU but as Steve wrote in an earlier
> posting on this list, the rate of CV values is usually much lower (1/4 -
> 1/16) than the samplerate. This means that even if the event stream is
> very dense the added overhead is minimal.
> I think the best way to find a good compromise between flexibility
> and speed is to try out several methods and pick those with the best
> price/performance ratio.

OK, well events are more LADSPA-like, which is convenient I suppose. This is really an internal engine thing though, so we don't have to decide up front.

> are FXes in soft samplers/synths usually stereo or mono?
> Since we are using recompilation this can be made flexible, but I have
> noticed that FX send channels can chew up quite some CPU.
> See this:

I think on older samplers they are stereo return (to the main mix outs); newer samplers have many more outputs, so I don't know how they handle it. The number of send channels is equal to the number of channels in the sample.
> P4:
> samples/sec = 12528321.035306 mono voices at 44.1kHz = 284.088912
> efficency: 144.401951 CPU cycles/sample
>
> Athlon:
> samples/sec = 14626412.219113 mono voices at 44.1kHz = 331.664676
> efficency: 95.721219 CPU cycles/sample
>
> This is with both gcc 3.2 and 2.96. The P4 seems to suck quite a bit.

P4s really don't like branches from what I have heard (very long pipelines). The Athlon's pipeline is much shallower. What RAM systems did the two machines have?

> Using the Intel C / gcc compilers with SSE optimizations did not
> provide any speedup; in some cases the performance was even worse.

Even on P4?

> I heard the P4 relies heavily on optimal SSE2 optimizations in order to
> deliver maximum performance, and it seems that neither gcc nor icc
> works optimally in this regard.

SSE, not SSE2 IIRC. SSE2 is still only 128 bits wide, and uses 64-bit floats, so it can only go two-way.

- Steve |
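The P4 and Athlon figures quoted in this thread are tied together by simple arithmetic: the voice count is samples/sec divided by 44100, and samples/sec times cycles/sample gives back the CPU clock. The helper names below are hypothetical and only restate that arithmetic; they are not taken from the benchmark source.

```c
/* Mono voices that can be sustained at a given output sample rate. */
static double voices_at_rate(double samples_per_sec, double sample_rate_hz)
{
    return samples_per_sec / sample_rate_hz;
}

/* CPU clock implied by a throughput and a cycles-per-sample figure. */
static double implied_clock_hz(double samples_per_sec,
                               double cycles_per_sample)
{
    return samples_per_sec * cycles_per_sample;
}
```

Fed with the quoted numbers, the implied clocks come out at roughly 1.81 GHz for the P4 and 1.40 GHz for the Athlon, which matches the machines named in the benchmark. Per clock cycle the Athlon therefore does about 1.5 times the work of the P4 in this loop (144.4 versus 95.7 cycles per sample).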
|
From: Nicolas J. <nic...@fr...> - 2002-11-13 20:26:25
|
On Wednesday 13 November 2002 18:58, Steve Harris wrote:
> > I heard the P4 relies heavily on optimal SSE2 optimizations in order to
> > deliver maximum performance, and it seems that neither gcc nor icc
> > works optimally in this regard.
>
> SSE, not SSE2 IIRC. SSE2 is still only 128 bits wide, and uses 64-bit floats,
> so it can only go two-way.

Gcc and even icc are not really good at code vectorisation. IMHO it is a better idea to parallelise the code manually using the SSE instructions; you will get better performance.

I can try to look at the code and see if there is room for optimisation. But I'm very new to this project, and I think there are more experienced programmers than me on this list.

--
Nicolas Justin - <nic...@fr...> |
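As a rough idea of what manual SSE code could look like, here is a sketch of a plain volume-scaled mix that handles four 32-bit floats per step (a 128-bit register holds four floats, or only two doubles). The function name `mix_add` is hypothetical. It covers only the accumulate step, not the interpolating sample read, and it falls back to scalar code when the compiler does not define `__SSE__`.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* out[i] += vol * in[i]; four floats per step when SSE is available. */
static void mix_add(float *out, const float *in, float vol, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    __m128 v = _mm_set1_ps(vol);              /* vol in all four lanes */
    for (; i + 4 <= n; i += 4) {
        __m128 x   = _mm_loadu_ps(in + i);    /* unaligned loads */
        __m128 acc = _mm_loadu_ps(out + i);
        _mm_storeu_ps(out + i, _mm_add_ps(acc, _mm_mul_ps(v, x)));
    }
#endif
    for (; i < n; i++)                        /* tail, or no SSE at all */
        out[i] += vol * in[i];
}
```

Unaligned loads and stores are used so that the sketch works on any buffer; with 16-byte aligned buffers the aligned variants (`_mm_load_ps`, `_mm_store_ps`) could be used instead.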
|
From: Steve H. <S.W...@ec...> - 2002-11-13 20:34:57
|
On Wed, Nov 13, 2002 at 09:25:46 +0100, Nicolas Justin wrote:
> I can try to look at the code and see if there is room for optimisation.
> But I'm very new to this project, and I think there are more experienced
> programmers than me on this list.

I would wait until we have finalised the inner loop; it's likely to change a lot.

- Steve |
|
From: Nicolas J. <nic...@fr...> - 2002-11-13 20:47:49
|
On Wednesday 13 November 2002 21:25, Nicolas Justin wrote:
> I can try to look at the code and see if there is room for optimisation.
> But I'm very new to this project, and I think there are more experienced
> programmers than me on this list.

Maybe you can look at libSIMD (http://libsimd.sf.net), which is a library implementing simple maths functions with SIMD instructions.

There is also a patch by Stéphane Marchesin implementing an MMX mixer and audio converter for SDL (http://www.libsdl.org); you can find it here:
http://dea-dess-info.u-strasbg.fr/~marchesin/SDL_mmx.patch

Just my 2 cents...

--
Nicolas Justin - <nic...@fr...> |
|
From: Steve H. <S.W...@ec...> - 2002-11-13 21:06:32
|
On Wed, Nov 13, 2002 at 09:46:42 +0100, Nicolas Justin wrote:
> On Wednesday 13 November 2002 21:25, Nicolas Justin wrote:
> > I can try to look at the code and see if there is room for optimisation.
> > But I'm very new to this project, and I think there are more experienced
> > programmers than me on this list.
>
> Maybe you can look at libSIMD (http://libsimd.sf.net), which is a library
> implementing simple maths functions with SIMD instructions.

That's interesting. There only appear to be 3DNow! accelerations at the moment, but it could be useful once they get SSE done.

- Steve |