7-Zip / Discussion / Open Discussion: Swap4 SSE2/AVX2 benchmark : CPU Cache / RAM bandwidth

Igor Pavlov - 2023-11-28

Thanks for tests!
Memory bandwidth results are low.
It's only 3.3 GB/s of memory bandwidth:

27: 100 1642 100 1637 | 200 1462 200 1517 | 369 1655 368 1653

Please also try these tests:

7zz b -mtic=29 -mm=swap4 -maf=1
7zz b -mtic=29 -mm=crc32:64
7zz b -mtic=29 -mm=crc32:64 -maf=1

swap4 reads data from memory, changes the data, and writes the changed data back to memory.
crc32:64 only reads data from memory and updates a checksum variable with the fast crc32 instruction. So we can measure the full memory read bandwidth, because the crc32 instruction is very fast on Cortex-A76.
-maf=1 binds threads to cores.
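
The difference between the two access patterns can be illustrated with a minimal Python sketch. This is not 7-Zip's code: the function names `swap4_in_place` and `read_only_crc32` are hypothetical, and `zlib.crc32` stands in for the hardware crc32 instruction. It only shows that swap4 reads and rewrites every 32-bit word in the same buffer, while the crc32 pass only reads.

```python
import zlib


def swap4_in_place(buf: bytearray) -> None:
    """Reverse the byte order of every 4-byte word in buf.

    Every byte is read, moved, and written back to the same array,
    so the pass generates both read and write traffic.
    """
    if len(buf) % 4 != 0:
        raise ValueError("buffer length must be a multiple of 4")
    # Swap bytes 0 and 3 of each word, then bytes 1 and 2.
    buf[0::4], buf[3::4] = buf[3::4], buf[0::4]
    buf[1::4], buf[2::4] = buf[2::4], buf[1::4]


def read_only_crc32(buf: bytes) -> int:
    """Read the whole buffer and fold it into one checksum value.

    Nothing is written back, so the pass generates read traffic only.
    """
    return zlib.crc32(buf) & 0xFFFFFFFF
```

For example, calling `swap4_in_place` on the bytes `01 02 03 04` leaves `04 03 02 01` in the same buffer, while `read_only_crc32` leaves its input untouched.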

Last edit: Igor Pavlov 2023-11-28
- HITCHER - 2023-11-28
  
  I previously posted results for a Raspberry Pi 4 at 2 GHz, which should be slower.
  
  crc32-maf1.txt
  
  crc32.txt
  
  swap4-maf1.txt
  
  - Igor Pavlov - 2023-11-28
    
    The read bandwidth for crc32 with one thread is about 13 GB/s.
    So it's very unexpected that read + write is only 3.3 GB/s = 1.65 GB/s read + 1.65 GB/s write. That is about 4 times slower than pure reading.
    The memory controller in the Pi 5 is probably not efficient at simultaneous reading and writing.
    
  - Igor Pavlov - 2023-11-28
    
    I've looked at the Pi 5 forum.
    https://forums.raspberrypi.com/viewtopic.php?p=2158741
    They report fast STREAM memory benchmark results of 10-12 GB/s.
    So I don't understand why swap4 is only 3.3 GB/s.
    The swap4 workload is usually simpler for a CPU core than the STREAM workload, because swap4 reads from and writes to the same array.
    
    Last edit: Igor Pavlov 2023-11-28
    
    - HITCHER - 2023-11-28
      
      I tried the alternative kernel in this distribution, which is not as well optimized for the Pi 5 and has a 4 KB page size.
      But there is not much difference.
      
      swap4-2.txt
      
      - Igor Pavlov - 2023-11-28
        
        Can you also try the STREAM benchmark as described here on the raspberrypi.com forum?
        https://forums.raspberrypi.com/viewtopic.php?p=2148515#p2148515
        
        HITCHER - 2023-11-28
        
        This is again with the default kernel for the Pi 5 and a 16 KB page size.
        
        runs.out0
        
        runs.out1
        
        HITCHER - 2023-11-28
        
        This is with the alternative kernel, which is not optimized for the Pi 5 and has a 4 KB page size.
        
        I just ran everything twice.
        
        Last edit: HITCHER 2023-11-28
        
        runs.out.2
        
        runs.out.3
        
        Igor Pavlov - 2023-11-28
        
        Thanks!
        It's unusual that the bandwidth in STREAM is several times larger than the swap4 bandwidth.
        I'll try to look at swap4 results for some other arm64 computers and think about the reasons for the low swap4 results on the Pi 5.
        
        Last edit: Igor Pavlov 2023-11-28
        
Igor Pavlov - 2023-11-28

Also, I don't see the THP status in the log.
7-Zip reads the file

/sys/kernel/mm/transparent_hugepage/enabled

Does RaspiOS support THP?
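
A quick way to check this on a given kernel is to read that sysfs file directly. The sketch below is only an illustration and not how 7-Zip itself parses the file: the helper names `parse_thp_mode` and `read_thp_status` are hypothetical. It assumes the usual Linux format for this file, where the active mode is the one in square brackets (for example `always [madvise] never`), and it returns `None` when the file does not exist, which is the case when THP is not configured in the kernel.

```python
import re
from typing import Optional

THP_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def parse_thp_mode(text: str) -> Optional[str]:
    """Return the bracketed (active) mode, or None if no mode is marked."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_thp_status(path: str = THP_PATH) -> Optional[str]:
    """Return the active THP mode, or None if the file cannot be read."""
    try:
        with open(path, "r", encoding="ascii") as f:
            return parse_thp_mode(f.read())
    except OSError:
        # File missing or unreadable: THP is not available on this kernel.
        return None


if __name__ == "__main__":
    print("THP:", read_thp_status() or "not available")
```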

Last edit: Igor Pavlov 2023-11-28
- HITCHER - 2023-11-28
  
  Currently I use the default kernel for the Pi 5 on 64-bit RaspiOS Bookworm.
  
  transparent_hugepage/enabled is not configured.
  
  Edit:
  In this kernel, the Pi 5 has a 16 KB page size.
  
  Last edit: HITCHER 2023-11-28
  
  config-6.1.0-rpi6-rpi-2712
  
- HITCHER - 2023-11-29
  
  I have now tried Ubuntu 23.10 Server for the Pi 5.
  Its kernel has THP. The benchmark results are mostly the same.
  
  Last edit: HITCHER 2023-11-29
  
  runs.out
  
  swap4.txt
  
  - Igor Pavlov - 2023-11-29
    
    THP doesn't affect swap4, because swap4 uses sequential access.
    THP can help with random access.
    So THP:always can help lzma compression speed.
    But 16 KB pages are also better than 4 KB pages for lzma compression speed.
    
    I've looked at results for some arm64 server computers.
    None of them has such a low swap4 speed at memory (RAM) block sizes.
    It's probably some flaw in the Broadcom processor (a flaw in the memory controller). Or maybe the L3 cache or the memory controller is configured the wrong way by default.
    The access pattern of read, change, and write to the same place is slow on the Pi 5, while most arm64 processors have no problems with such an access pattern.
    
    Last edit: Igor Pavlov 2023-11-29
    
zhgbbs - 2024-03-13

CPU: 7945HX

swap4.txt
