Swap4 SSE2/AVX2 benchmark : CPU Cache / RAM bandwidth

2023-05-09
2024-03-13
1 2 3 4 > >> (Page 1 of 4)
  • Igor Pavlov

    Igor Pavlov - 2023-05-09

    7-Zip 23.00 now supports a new benchmark command with the Swap4 filter:

    7z b -mm=swap4 -mtic=30 -bt
    

    It tests the Swap4 filter only.
    The Swap4 filter reverses the order of bytes in each 32-bit word of data.
    The idea behind the Swap4 filter is that it can increase the compression ratio for some data (like ARM64 code), because big-endian order of 32-bit data is usually better for LZMA compression than little-endian order. But currently 7-Zip doesn't use the Swap4 filter by default, because Swap4 helps only for pure code (only the executable sections of a file), and it hurts the compression ratio for other data.
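
To make the byte reordering concrete, here is a minimal Python sketch of what a Swap4-style transform does to a buffer. It is not the 7-Zip implementation (which is vectorized native code); the function name `swap4` and the choice to reject lengths that are not a multiple of 4 are assumptions made for this illustration.

```python
import struct

def swap4(data: bytes) -> bytes:
    """Reverse the byte order inside every 32-bit word of `data`.

    Hypothetical helper for illustration only: it requires the length
    to be a multiple of 4 instead of handling a trailing remainder.
    """
    if len(data) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    n = len(data) // 4
    # Read the words as little-endian, write them back as big-endian.
    words = struct.unpack(f"<{n}I", data)
    return struct.pack(f">{n}I", *words)

sample = bytes([0x11, 0x22, 0x33, 0x44, 0xAA, 0xBB, 0xCC, 0xDD])
swapped = swap4(sample)
```

Applying the transform twice returns the original data, so in this sketch the same function serves for both encoding and decoding.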

    The new version 7-Zip 23.00 has fast Swap4 code versions that use SSE2, AVX2, and NEON (on ARM).
    Also the benchmark code can change the block size and the number of threads.
    Each basic operation in Swap4 works as follows:
    1. Load a chunk of data (128-bit or 256-bit).
    2. Swap the bytes in each 32-bit word of the loaded chunk.
    3. Store the converted chunk back (128-bit or 256-bit).
    So that code can show the speed of vector instructions (SSE2/AVX2). Mostly it shows the speed of cache and memory for different block sizes.
    For small block sizes it shows the bandwidth of the processor caches.
    For large block sizes it shows the bandwidth of RAM.
    It shows the result of encoding and decoding in MB/s. But the actual memory bandwidth used by the processor is 2 times larger than the reported speed value, because each operation in the Swap4 filter includes two memory operations: one LOAD operation and one STORE operation.
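
The relation between the reported speed and the real memory traffic can be written down as a tiny Python sketch. The function name `memory_traffic_mb_s` and its parameters are hypothetical; the only fact taken from the post is that each processed byte is loaded once and stored once.

```python
def memory_traffic_mb_s(reported_mb_s: float,
                        loads_per_byte: int = 1,
                        stores_per_byte: int = 1) -> float:
    """Estimate total memory traffic from the speed printed by the benchmark.

    Assumption from the post: every processed byte costs one LOAD and
    one STORE, so the traffic is twice the reported value by default.
    """
    return reported_mb_s * (loads_per_byte + stores_per_byte)

# Illustrative input value only, not a measured result.
traffic = memory_traffic_mb_s(10_000.0)
```

With the default arguments a reported 10000 MB/s corresponds to 20000 MB/s of combined read and write traffic.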

    I want to check how the new optimized Swap4 filter works on different CPUs.
    Please run the following benchmark command with the output redirected to a file swap4.txt and attach the results here.

    mkdir c:\res
    7z b -mm=swap4 -mtic=30 -bt > c:\res\swap4.txt
    

    Please also write some information about the speed of your RAM: frequency and timings, if you know them.
    -mtic=30 is a switch that reduces the complexity of the test (the number of iterations), because the benchmark can take too long with the default complexity.
    Also it's better to close other programs (including the browser) before launching the benchmark.

     

    Last edit: Igor Pavlov 2023-05-09
  • AlexS

    AlexS - 2023-05-09

    DIMM1: Kingston 99U5471-020.A00LF 4 GB DDR3-1333 DDR3 SDRAM (9-9-9-24 @ 666 MHz) (8-8-8-22 @ 609 MHz) (7-7-7-20 @ 533 MHz) (6-6-6-17 @ 457 MHz)
    DIMM2: Kingston 99P5471-013.A00LF 4 GB DDR3-1333 DDR3 SDRAM (9-9-9-24 @ 666 MHz) (8-8-8-22 @ 609 MHz) (7-7-7-20 @ 533 MHz) (6-6-6-17 @ 457 MHz)
    DIMM3: Kingston 99U5474-015.A00LF 2 GB DDR3-1333 DDR3 SDRAM (9-9-9-24 @ 666 MHz) (8-8-8-22 @ 609 MHz) (7-7-7-20 @ 533 MHz) (6-6-6-17 @ 457 MHz)
    DIMM4: Kingston 99U5474-015.A00LF 2 GB DDR3-1333 DDR3 SDRAM (9-9-9-24 @ 666 MHz) (8-8-8-22 @ 609 MHz) (7-7-7-20 @ 533 MHz) (6-6-6-17 @ 457 MHz)

     

    Last edit: AlexS 2023-05-09
    • Igor Pavlov

      Igor Pavlov - 2023-05-09

      Does the memory work in DDR3-1333 mode?
      It was slightly unexpected that Swap4 on your FX-8300 is fast only for 4 KiB (line "12:"), while the AMD FX-8300 has a 16 KiB data cache.
      Please close other programs before the test, and try the special affinity switch -maf=1:

      7z b -mm=swap4 -mtic=30 -bt -maf=1 > c:\res\swap4_af1.txt
      

      The -maf=1 switch sets a fixed affinity for each benchmark thread, so the system will not move benchmark threads to other cores.

      Correction.
      I've read more about AMD Bulldozer, and the fast 4 KiB for FX is normal.
      The 16 KiB data cache is for reading only.
      But for writing there is a Write Coalescing Cache of 4 KB:
      https://chipsandcheese.com/2023/01/24/bulldozer-amds-crash-modernization-caching-and-conclusion/

       

      Last edit: Igor Pavlov 2023-05-09
  • AlexS

    AlexS - 2023-05-09
    7z b -mm=swap4 -mtic=30 -bt -maf=1 > c:\res\swap4_af1.txt
    
     

    Last edit: AlexS 2023-05-09
  • IDDQDesnik

    IDDQDesnik - 2023-05-11

    Intel Core i3-7100 CPU @ 3.90GHz

     
  • mdadm

    mdadm - 2023-05-14

    Old system here: DDR2 667 MHz.

     
    • Igor Pavlov

      Igor Pavlov - 2023-05-15

      Do you know why your system reports 4 cores or threads, while the E2140 has only 2 threads?
      And why did the frequency drop to 1200 MHz after 1-3 seconds of load?

       

      Last edit: Igor Pavlov 2023-05-15
      • mdadm

        mdadm - 2023-05-15

        First, my questions:
        1) What does "CPU hardware threads: 2 / 4 : 3" mean?
        2) What does "OPEN_MAX:1024" mean?

        This CPU has only 2 cores, no HT, no extra threads. I can check this in multiple ways and each returns only 2 cores (cat /proc/cpuinfo, lscpu, top, htop, getconf _NPROCESSORS_ONLN).
        nproc returns 2, but nproc --all returns 4 (don't ask me why, but this is probably also incorrect for other CPUs).

        As for the frequency, I have set the default CPU scaling governor schedutil, and it stays at 1200 MHz most of the time.

         
        • Igor Pavlov

          Igor Pavlov - 2023-05-15

          " 2 / 4 : 3"
          4 - the total number of CPUs (similar to nproc --all).
          2 - the number of threads available (enabled) for the process.
          3 - the hex mask of available threads.
          So only "4" is an unusual value in your case.
          "OPEN_MAX:1024" - how many files the program can keep open.
          So your system increases the frequency to 1600 MHz for about 1-2 seconds, and then drops it back to 1200 MHz. Why does it do that?
          1600 MHz is not a high-power frequency.
          Why doesn't it keep 1600 MHz under load?
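
The "2 / 4 : 3" notation explained in this post can be decoded with a few lines of Python. This is only a sketch: the function name `parse_hw_threads` is hypothetical, and it assumes that bit i of the hex mask corresponds to logical CPU number i.

```python
def parse_hw_threads(text: str):
    """Parse a string like "2 / 4 : 3" from the benchmark header.

    Returns (available, total, cpus), where `cpus` is the list of
    logical CPU indexes whose bit is set in the hex mask.
    Assumption: bit i of the mask maps to logical CPU i.
    """
    counts, mask_hex = text.split(":")
    available, total = (int(part) for part in counts.split("/"))
    mask = int(mask_hex, 16)
    cpus = [i for i in range(mask.bit_length()) if (mask >> i) & 1]
    return available, total, cpus

available, total, cpus = parse_hw_threads(" 2 / 4 : 3")
```

For the value quoted in the thread, mask 3 is binary 11, so two logical CPUs (0 and 1) are enabled for the process, which matches the "2" in the first position.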

           

          Last edit: Igor Pavlov 2023-05-15
          • mdadm

            mdadm - 2023-05-15

            Now done with the CPU frequency scaling governor "performance". The CPU frequency was stable. I used the default governor schedutil for power-saving reasons.

             
        • Ninimu

          Ninimu - 2023-05-15

          If cat /proc/cpuinfo reports correct information, maybe there are bugs in nproc or in other places?

          This GitHub user also has an E2140: cpuinfo

           
  • HITCHER

    HITCHER - 2023-05-15

    TV-BOX S905x4 (x96x4) has 4 GB of DDR3 RAM on 8 RAM chips DDR3-1866 CL13 (32 Bit = 8x 4 Bit chip)
    The OS is CoreELEC and I used the static build for ARM64 Linux.

     

    Last edit: HITCHER 2023-05-15
  • HITCHER

    HITCHER - 2023-05-15

    FX-8300 @3,6/4,2 GHz. RAM 4x4 GB DDR3 1600 CL7 (7-8-7-24)
    OS: Windows 10 2022H2.

     
    • HITCHER

      HITCHER - 2023-05-16

      Same PC on Linux, to see if there are any differences. Linux has fewer background tasks running.

       
  • HITCHER

    HITCHER - 2023-05-16

    R7-5700G, Fedora Linux 36. RAM 2x16 GB DDR4-3200 CL16 Single-Rank
    iGPU mode, no graphics card used.
    Slightly tuned in the BIOS with Curve Optimizer (automatic negative voltage offset, so the boost under all-core load is higher)

     

    Last edit: HITCHER 2023-05-16
    • Igor Pavlov

      Igor Pavlov - 2023-05-16

      About 16.5 GB/s (read and write) in line "27:".
      So it gives about 33 GB/s of total memory traffic, while DDR4-3200 has a peak bandwidth of about 50 GB/s (in dual-channel mode).
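
The comparison above can be reproduced with a short Python sketch. The helper name `ddr_peak_gb_s` is hypothetical, and the calculation assumes a dual-channel configuration with a 64-bit (8-byte) data bus per channel, which the thread does not state explicitly.

```python
def ddr_peak_gb_s(mega_transfers_per_s: float,
                  channels: int = 2,
                  bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (10^9 bytes per second).

    Assumptions: `channels` independent channels, each moving
    `bus_bytes` bytes per transfer.
    """
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9

peak = ddr_peak_gb_s(3200)      # DDR4-3200, assumed dual channel
measured = 2 * 16.5             # reported speed, counted as load + store
utilization = measured / peak   # fraction of the theoretical peak
```

Under these assumptions the measured 33 GB/s is roughly 64% of the theoretical peak of 51.2 GB/s.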

       
      • HITCHER

        HITCHER - 2023-05-16

        Is SSE2 code or AVX code running? The log states gcc compiler SSE2.

         
        • Igor Pavlov

          Igor Pavlov - 2023-05-16

          The compiler target is SSE2 for the whole 7-Zip code, but the Swap4 code can use AVX2 with runtime dispatching according to cpuid.
          107.047 GB/s / 4.65 GHz = 23 bytes / cycle.
          It does 3 operations: Load / Shuffle / Store. Maybe we can expect 32 bytes/cycle for Zen3.
          But 4 threads show a larger result: 501.528 GB/s, or 125 GB/s per thread. So the 4-thread version works faster per thread than the 1-thread version for some unknown reason.
          Maybe the number of iterations is not big enough.
          We can try more iterations:

          7z b -mm=swap4 -mtic=32 -md20 -bt > swap4_tic32.txt
          7z b -mm=swap4 -mtic=32 -md20 -bt -maf=1 > swap4_tic32_af1.txt
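
The bytes-per-cycle arithmetic in this post can be checked with a small Python sketch. The function name `bytes_per_cycle` is hypothetical; the input numbers are the ones quoted in the post, and the 4.65 GHz clock is taken as given there.

```python
def bytes_per_cycle(gb_per_s: float, ghz: float) -> float:
    """Throughput divided by clock: GB/s over GHz gives bytes per cycle."""
    return gb_per_s / ghz

single = bytes_per_cycle(107.047, 4.65)      # 1-thread result
per_thread = 501.528 / 4                     # 4-thread total, per thread
multi = bytes_per_cycle(per_thread, 4.65)    # assumes the same clock

print(f"1 thread : {single:.1f} B/cycle")
print(f"4 threads: {per_thread:.1f} GB/s per thread, {multi:.1f} B/cycle")
```

The per-thread figure for the 4-thread run assumes the same 4.65 GHz clock, which may not hold under multi-thread load, so it is only a rough comparison.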
          
           

          Last edit: Igor Pavlov 2023-05-16
          • HITCHER

            HITCHER - 2023-05-16

            Max CPU boost frequency is 4,7 GHz for up to two threads, with this setup.

             

            Last edit: HITCHER 2023-05-16
            • Igor Pavlov

              Igor Pavlov - 2023-05-17

              It's unexpected in your results that single-thread speed is lower for encoding and higher for decoding:

              15:    101   108102   100   132111
              

              The single-thread code is executed after the 16-thread load of the previous step. And your CPU could still keep a low frequency for the 1-thread load just after the high load at 16 threads.
              Do you know what frequency your CPU runs at under high AVX load on all threads?

              But we want to get full-frequency benchmark results for a single thread. So we can limit the load to 2 threads:

              7z b -mm=swap4 -mtic=32 -bt -md29 -mmt2 > swap4_mmt2.txt
              
               
              • HITCHER

                HITCHER - 2023-05-17

                It should be able to reach up to 4.3 GHz on all cores, but maybe only for a limited time, until the temperature increases or too much power is drawn.

                I can make a cpupower monitor log while the benchmark is running.

                 

                Last edit: HITCHER 2023-05-17
              • HITCHER

                HITCHER - 2023-05-17

                cpupower.txt is the output of the command
                watch -n 1 'cpupower monitor >> cpupower.txt'

                 
                • Igor Pavlov

                  Igor Pavlov - 2023-05-17

                  The -mmt2 results are good now for a single thread:
                  144 GB/s / 4.7 GHz = 31 B/cycle.
                  You can check the frequency when you have more AVX threads:

                  7z b -mm=swap4 -mtic=35 -mdf=14 -bt
                  
                   

                  Last edit: Igor Pavlov 2023-05-17
                  • HITCHER

                    HITCHER - 2023-05-17

                    So the frequency is not as high as expected: 4.2 GHz for 8 cores, 3.9 GHz for all threads.

                     

                    Last edit: HITCHER 2023-05-17
  • HITCHER

    HITCHER - 2023-05-17

    Old notebook with a C2D T6670 CPU, RAM 2x2 GB DDR2-800, Linux

     