
Swap4 SSE2/AVX2 benchmark : CPU Cache / RAM bandwidth

2023-05-09
2024-03-13
  • Igor Pavlov

    Igor Pavlov - 2023-11-28

    Thanks for tests!
    Memory bandwidth results are low.
    It's only 3.3 GB/s of memory bandwidth:

    27:    100     1642   100     1637 |   200     1462   200     1517 |   369     1655   368     1653
    

    Please also try these tests:

    7zz b -mtic=29 -mm=swap4 -maf=1
    7zz b -mtic=29 -mm=crc32:64
    7zz b -mtic=29 -mm=crc32:64 -maf=1
    

    swap4 reads data from memory, changes the data, and writes the changed data back to memory.
    crc32:64 only reads data from memory and updates a checksum variable using the fast crc32 instruction. Because the crc32 instruction is very fast on Cortex-A76, this test can show the full memory read bandwidth.
    -maf=1 binds threads to cores.
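
A minimal Python sketch of the two access patterns described above, for illustration only. It is not 7-Zip's code: the function names are hypothetical, `array.byteswap()` stands in for the swap4 filter (reverse the bytes of each 32-bit word, in place), and zlib's software CRC-32 stands in for the hardware crc32 instruction.

```python
import array
import zlib

def swap4_in_place(words):
    """Read-modify-write: reverse the byte order of every 32-bit word in place."""
    assert words.itemsize == 4, "this sketch assumes 4-byte array items"
    words.byteswap()

def crc32_read_only(buf):
    """Read-only: scan the buffer once and fold it into a single checksum."""
    return zlib.crc32(buf)

words = array.array("I", [0x11223344, 0xAABBCCDD])
swap4_in_place(words)
print([hex(w) for w in words])          # every word is read, changed and written back
print(hex(crc32_read_only(bytes(words))))  # memory is only read, nothing is stored
```

The first function generates both load and store traffic for every byte, while the second generates only loads, which is why the two tests can report very different bandwidth on the same machine.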

     

    Last edit: Igor Pavlov 2023-11-28
    • HITCHER

      HITCHER - 2023-11-28

      I previously posted results for a Raspberry Pi 4 @ 2 GHz, which should be slower.

       
      • Igor Pavlov

        Igor Pavlov - 2023-11-28

        Read bandwidth for crc32 with one thread is about 13 GB/s.
        So it's very unexpected that read + write is only 3.3 GB/s = 1.65 GB/s read + 1.65 GB/s write. That is about 4 times slower than pure reading.
        Probably the memory controller in the pi5 is not efficient at simultaneous reading and writing.
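
The arithmetic behind the "about 4 times" figure can be checked with a few lines of Python. The numbers are the ones quoted in this thread. The variable names are only for illustration.

```python
read_only_gbps = 13.0        # single-thread crc32 read bandwidth quoted above
swap4_processed_gbps = 1.65  # swap4 rate: each byte is read once and written once

swap4_traffic_gbps = 2 * swap4_processed_gbps    # read traffic + write traffic
slowdown = read_only_gbps / swap4_traffic_gbps   # pure reading vs. read + write

print(f"{swap4_traffic_gbps:.1f} GB/s total traffic, {slowdown:.1f}x slower than read-only")
```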

         
      • Igor Pavlov

        Igor Pavlov - 2023-11-28

        I've looked at the pi5 forum.
        https://forums.raspberrypi.com/viewtopic.php?p=2158741
        They report fast STREAM memory benchmark results of 10-12 GB/s.
        So I don't understand why swap4 is only 3.3 GB/s.
        The swap4 workload is usually simpler for a CPU core than the STREAM workload, because swap4 reads from and writes to the same array.
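
A rough way to compare the two access patterns on one machine is sketched below in Python: an in-place byte swap (read and write the same array) against a copy between two arrays (a STREAM-like pattern). This is an assumption-laden illustration, not 7-Zip's benchmark and not STREAM. The helper names are hypothetical, interpreter overhead is included, and the array must be much larger than the last-level cache for the numbers to reflect RAM rather than cache.

```python
import array
import time

def best_time(fn, repeats=3):
    """Return the shortest wall-clock time of several runs of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def compare(num_words):
    """Return estimated GB/s (read + write traffic) for both access patterns."""
    words = array.array("I", [0x11223344]) * num_words
    dst = array.array("I", [0]) * num_words
    nbytes = words.itemsize * num_words

    def swap_in_place():       # swap4-like: read, change, write to the same place
        words.byteswap()

    def copy_to_other():       # STREAM-copy-like: read one array, write another
        memoryview(dst).cast("B")[:] = memoryview(words).cast("B")

    traffic = 2 * nbytes       # each byte is read once and written once
    return {
        "in_place": traffic / max(best_time(swap_in_place), 1e-9) / 1e9,
        "copy": traffic / max(best_time(copy_to_other), 1e-9) / 1e9,
    }

# 4M words = 16 MiB per array; increase this for machines with larger caches.
for name, gbps in compare(1 << 22).items():
    print(f"{name}: {gbps:.2f} GB/s")
```

If the in-place figure is several times lower than the copy figure, that matches the pattern reported here for the pi5. Similar figures would suggest the machine handles both patterns equally well.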

         

        Last edit: Igor Pavlov 2023-11-28
        • HITCHER

          HITCHER - 2023-11-28

          I tried the alternative kernel in this distribution, which is not as well optimized for the pi5 and has a 4 KB page size.
          But there is not much difference.

           
          • Igor Pavlov

            Igor Pavlov - 2023-11-28

            Can you also try the STREAM benchmark as described here on the raspberrypi.com forum?
            https://forums.raspberrypi.com/viewtopic.php?p=2148515#p2148515

             
            • HITCHER

              HITCHER - 2023-11-28

              This is again with the default kernel for the pi5 and a 16 KB page size.

               
            • HITCHER

              HITCHER - 2023-11-28

              This is with the alternative kernel, not optimized for the pi5, and a 4 KB page size.

              I just ran everything twice ...

               

              Last edit: HITCHER 2023-11-28
              • Igor Pavlov

                Igor Pavlov - 2023-11-28

                Thanks!
                It's unusual that the bandwidth in STREAM is several times larger than the swap4 bandwidth.
                I'll try to look at swap4 results for some other arm64 computers and think about the reasons for the low swap4 results on the pi5.

                 

                Last edit: Igor Pavlov 2023-11-28
  • Igor Pavlov

    Igor Pavlov - 2023-11-28

    Also, I don't see the THP status in the log.
    7-Zip reads the file

    /sys/kernel/mm/transparent_hugepage/enabled
    

    Does RaspiOS support THP?
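
A small Python sketch for checking this by hand on a Linux system. It assumes the usual sysfs format, where the active mode is shown in brackets (for example `always [madvise] never`). How 7-Zip itself parses the file is not described in this thread, and the function names are hypothetical.

```python
import os
from pathlib import Path

THP_PATH = Path("/sys/kernel/mm/transparent_hugepage/enabled")

def parse_thp_status(text):
    """Return the bracketed (active) mode, or None if no token is bracketed."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None

def read_thp_status(path=THP_PATH):
    """Return the active THP mode, or None if the kernel does not expose the file."""
    try:
        return parse_thp_status(path.read_text())
    except OSError:
        return None

print("THP mode:", read_thp_status())
print("page size:", os.sysconf("SC_PAGE_SIZE"), "bytes")
```

A result of `None` for the THP mode means the file is missing or unreadable, which is what a kernel built without THP support would show.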

     

    Last edit: Igor Pavlov 2023-11-28
    • HITCHER

      HITCHER - 2023-11-28

      Currently I use the default kernel for the pi5 on 64-bit RaspiOS Bookworm.

      transparent_hugepage/enabled is not configured.

      Edit:
      in this kernel, the pi5 has a 16 KB page size.

       

      Last edit: HITCHER 2023-11-28
    • HITCHER

      HITCHER - 2023-11-29

      I have now tried Ubuntu 23.10 Server for the pi5.
      It has THP in its kernel. Benchmark results are mostly the same.

       

      Last edit: HITCHER 2023-11-29
      • Igor Pavlov

        Igor Pavlov - 2023-11-29

        THP doesn't affect swap4, because swap4 uses sequential access.
        THP can help with random access.
        So THP:always can help lzma compression speed.
        But 16 KB pages are also better than 4 KB pages for lzma compression speed.

        I've looked at results for some arm64 server computers.
        None of them have such low swap4 speed for block sizes that reach memory (RAM).
        Probably it's some flaw in the Broadcom processor (a flaw in the memory controller). Or maybe the L3 cache or memory controller is configured the wrong way by default.
        The access pattern of read, change, write to the same place is slow on the pi5, while most arm64 processors have no problems with such an access pattern.

         

        Last edit: Igor Pavlov 2023-11-29
  • zhgbbs

    zhgbbs - 2024-03-13

    CPU: 7945HX

     
