Min / Max / MinMax /... not Multi Threaded

2008-03-14
2013-04-23
  • Marthein Plat
    Marthein Plat
    2008-03-14

    I have translated the .h files for the 4 FrameWave DLL's to Pascal units for Delphi and have performed some tests with routines from fwSignal. The speed gain of some routines impressed me, but the Min, Max, MinMax, etc routines only use one thread.

    It should not be hard splitting up the array into multiple pieces and find the Minimum ( and/or Max ) in all of the pieces, put the found values in an array and run the routine on that once more.

    * Are there plans to make these routines Multi Threaded?

    * Futhermore I am missing some data types:
    - signed 8 bit int
    - (un)signed 64 bit int

    * I have taken a look at the source code for Max etc. The loop is unwound into a loop which performs 4 operations followed by code doing the rest. Wouldn't it be better to split it into 3 parts? The first part would process the values until pSrc is at an aligned memory location. This would make sure the first value in the middle loop is always properly aligned.

    If anyone is interested, I can upload the pascal units for using the FrameWave DLLs. In that case, please point me to the location on SourceForge where the files should be uploaded.

     
    • Usually if a routine is not multi-threaded, it's often because the developers have found that the single threaded version actually works faster.

      So the math needed to do the processing to align on a per vector basis can become complex enough to negate any advantage from aligning the data. Plus the complexity of course. The FE framework is complex enough already :)

      C++ Compilers * (templates + complexity) = not so good performance

       
      • Marthein Plat
        Marthein Plat
        2008-03-18

        I agree with you that keeping things as simple as possible is very important. The routines are already a bit 'over specified' in my opinion. When unrolling the loops by hand some assumptions are made about the required aligning by the used processor, which may not be optimal. What if a new processor requires 32 byte boundaries?

        I believe the unrolling should be left to the compiler (/pre-processor), which knows more about the processor. It should be able to split a for or while loop into 2 or 3 parts. The problem in this case is probably that the compilers are not smart enough yet to do this... ( or was this code already generated from one loop with a pre-processor? )

         
    • Ravindra Babu
      Ravindra Babu
      2008-03-18

      * Are there plans to make these routines Multi Threaded?
      >>We will be experimenting on this front and will consider adding Multi Threading in future releases if we get considerable performance gain. We might selectively multi thread only those routines which benefit from threading.

      * Futhermore I am missing some data types:
      - signed 8 bit int
      - (un)signed 64 bit int
      >>We will try to accommodate the additions in future releases

      * I have taken a look at the source code for Max etc. The loop is unwound into a loop which performs 4 operations followed by code doing the rest. Wouldn't it be better to split it into 3 parts? The first part would process the values until pSrc is at an aligned memory location. This would make sure the first value in the middle loop is always properly aligned.
      >>We guarantee best performance when the source buffer passed to the API is aligned. Considering the memory offset to reach the 16 byte aligned boundary could be anywhere between 1 to 15 bytes. There is no guarantee that the nearest aligned boundary starts with a new pixel/element in order to start processing. We try to accommodate most of the generic reusable features in FE framework for easy development and maintenance of the code but implementing this feature is a non trivial task using FE framework. Our current focus has been to get the best performance with aligned input and hence we recommend the users to align the data while using framewave.

       
      • Marthein Plat
        Marthein Plat
        2008-03-18

        In my application, I cannot be sure the data I am passing to the MinMax routine is aligned, because I need to know the Min and Max of parts of the array. The start of the array will be aligned, but if I need to know the Min and Max from index 621 to 1234567, the sub-array is not aligned.

        I have performed some more tests with Byte and Double arrays and the MinMax routine to find the limiting factors.

        To see the effect of caching, I have used 3 different array sizes. On my AMD64 3200 L1 cache size is 64K, L2 is 1024K. I have also measured the effect of misalignment by offsetting the array 1 or 8 bytes.

        Conclusions from the test results:

        - On this system, the limiting factor seems to be the memory bandwidth, because cacheable sizes are much faster.
        - The effect of misaligning by 1 double ( 8 bytes ) is small when values are read from memory and not too big when read from cache. ( Effect will be larger on systems with a wider memory data bus. )
        - The effect of misaligning by 1 byte is very big when the values are in cache.
        - Multi Threading is only useful when the memory bandwidth is very high compared to the processor speed. I do not have a new multi-core system with DDR3 at the moment to test if this is the case.

        Recommendation:
        - Split the loop into 3 parts, at least for variable types smaller than 8 bytes. This can lead to 30% speed gain, maybe more on other systems.

        ----------------

        Test results for searching Min + Max in array of Bytes:

        5k times 65536 Bytes ( equal to L1 cache size ):
        FrameWave ops/s : 3,899 G ( 3,301 G when misaligned 1 byte )
        Std ops/s : 277,785 M
        SpeedGain: 14,034767116323

        625 times 512 KiBytes ( fits in L2 cache ):
        FrameWave ops/s : 4,829 G ( 3,452 G when misaligned 1 byte )
        Std ops/s : 273,952 M
        SpeedGain: 17,6288116474511

        10 times 32 MiBytes ( does not fit in cache ):
        FrameWave ops/s : 2,406 G ( 2,108 G when misaligned 1 byte )
        Std ops/s : 269,632 M
        SpeedGain: 8,92295889439725

        ==========

        Test results for searching Min + Max in array of Doubles:

        5k times 65536 Bytes ( equal to L1 cache size ):
        FrameWave ops/s : 490,077 M = 3,921 GBytes/s ( 456,251 M = 3,65 GB/s when misaligned 1 double ) ( 414,156 M = 3,313 GB/s when misaligned 1 byte )
        Std ops/s : 144,754 M
        SpeedGain: 3,38557829221064

        625 times 512 KiBytes ( fits in L2 cache ):
        FrameWave ops/s : 588,589 M = 4,709 GBytes/s ( 550,362 M = 4,403 GB/s when misaligned 1 double ) ( 409,013 M = 3,272 GB/s when misaligned 1 byte )
        Std ops/s : 137,97 M
        SpeedGain: 4,26607279784477

        10 times 32 MiBytes ( does not fit in cache ):
        FrameWave ops/s : 297,603 M = 2,381 GBytes/s ( 288,325 M = 2,307 GB/s when misaligned 1 double ) ( 268,179 M = 2,145 GB/s when misaligned 1 byte )
        Std ops/s : 131,685 M
        SpeedGain: 2,25996903781752

        Reference 'Std' routine used for SpeedGain value:

        type
          //TTest = Byte;
          TTest = Double;
          PTest = ^TTest;
          PTestArray = ^TTestArray;
          TTestArray = array[0..1 shl 24] of TTest;

        function ownMinMax( pSrc: PTest ; len: integer ; var Min: TTest ; var Max: TTest ) : FwStatus;
        var
          i : integer;
          p : PTestArray;
        begin
          Result := fwStsNoErr;
          Min := pSrc^;
          Max := pSrc^;
          p := PTestArray( pSrc );
          for i := 1 to len - 1 do
            if p^[i] < Min then
              Min := p^[i]
            else
            if p^[i] > Max then
              Max := p^[i];
        end;

         
    • Kalyan
      Kalyan
      2008-03-20

      Multithreading has been enabled for Min, Max and MinMax functions.

       
    • Kalyan
      Kalyan
      2008-04-08

      I have checked-in the code for the following Functions Min8s, Max8s, Min64u and Max64u.