Thanks for tests!
Memory bandwidth results are low.
It's only 3.3 GB/s of memory bandwidth:
Please also try these tests:
swap4 reads data from memory, changes data and writes changed data back to memory.
crc32:64 only reads data from memory and updates checksum variable with fast crc32 instruction. So we can get full memory reading bandwidth, because crc32 instruction is very fast in Cortex-A76.
-maf=1 binds threads to cores.
Last edit: Igor Pavlov 2023-11-28
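The difference between the two access patterns can be sketched as follows. This is only an illustration, not 7-Zip's code: it assumes swap4 reverses the byte order of each 32-bit word, and it uses Python's zlib.crc32 as a stand-in for a read-only checksum loop. The function names are hypothetical, and Python will not come close to memory bandwidth limits.

```python
import array
import zlib


def swap4_inplace(words: array.array) -> None:
    """Read-modify-write: every 32-bit word is read, byte-reversed,
    and written back to the same location (assumed swap4 behavior)."""
    assert words.itemsize == 4, "expects 32-bit array items"
    words.byteswap()


def read_only_checksum(buf) -> int:
    """Read-only: the buffer is read once and folded into a checksum;
    nothing is written back to the buffer."""
    return zlib.crc32(buf)


if __name__ == "__main__":
    data = array.array("I", [0x11223344, 0xAABBCCDD])
    swap4_inplace(data)
    print([hex(w) for w in data])
    print(hex(read_only_checksum(data)))
```

The first function generates both read and write traffic on the same buffer, while the second generates read traffic only.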
I posted before results for raspberry pi4 @2 GHz, which should be slower.
The reading bandwidth of crc32 for one thread is about 13 GB/s.
So it's very unexpected that read + write is only 3.3 GB/s = 1.65 GB/s read + 1.65 GB/s write. That is about 4 times slower than pure reading.
Probably the memory controller in pi5 is not efficient at simultaneous reading and writing.
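The arithmetic behind these figures, as a small sketch using only the numbers quoted above (variable names are illustrative):

```python
read_only_gbps = 13.0    # crc32:64, one thread (figure from the post)
swap4_total_gbps = 3.3   # swap4, read + write traffic combined (figure from the post)

# swap4 reads and writes the same amount of data, so the total splits evenly.
per_direction_gbps = swap4_total_gbps / 2

# How much slower the combined read + write traffic is than pure reading.
slowdown = read_only_gbps / swap4_total_gbps

print(f"{per_direction_gbps:.2f} GB/s read + {per_direction_gbps:.2f} GB/s write")
print(f"{slowdown:.1f}x slower than pure reading")
```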
I've looked at the pi5 forum:
https://forums.raspberrypi.com/viewtopic.php?p=2158741
They have fast memory STREAM benchmark results of 10-12 GB/s.
So I don't understand why swap4 is only 3.3 GB/s.
The swap4 workload is usually simpler for a cpu core than the STREAM workload, because swap4 reads from and writes to the same array.
Last edit: Igor Pavlov 2023-11-28
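To make the comparison concrete, here is a rough sketch of the two loop shapes. The triad form a[i] = b[i] + q*c[i] is one of the standard STREAM kernels. The in-place loop is only a stand-in for a swap4-like workload (the actual transform is an assumption), and both function names are hypothetical.

```python
def stream_triad(a, b, c, q):
    """STREAM-style triad: two source arrays are read, a third is written."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def in_place_update(a):
    """swap4-like shape: each element is read, changed and written back
    to the same array (the XOR is just a placeholder transform)."""
    for i in range(len(a)):
        a[i] = a[i] ^ 0xFFFFFFFF


if __name__ == "__main__":
    a, b, c = [0.0] * 4, [1.0] * 4, [2.0] * 4
    stream_triad(a, b, c, 3.0)
    print(a)

    x = [0x00000000, 0x0000FFFF]
    in_place_update(x)
    print([hex(v) for v in x])
```

The triad loop touches three separate memory streams, while the in-place loop touches one, which is why the second is expected to be the easier case for the core.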
I tried the alternative kernel in this distribution, which is not as well optimized for pi5; it has a 4 KB page size.
But there is not much difference.
Can you also try the STREAM benchmark as described here on the raspberrypi.com forum?
https://forums.raspberrypi.com/viewtopic.php?p=2148515#p2148515
This is again with default kernel for pi5 and 16KB pagesize.
This is with alternative kernel not optimized for pi5 and 4KB pagesize.
I just let everything run twice ...
Last edit: HITCHER 2023-11-28
Thanks!
It's unusual that bandwidth in STREAM is several times larger than swap4 bandwidth.
I'll try to look at swap4 results for some other arm64 computers and think about the reasons for the low swap4 results on pi5.
Last edit: Igor Pavlov 2023-11-28
Also I don't see THP status in log.
7-Zip reads file
Does RaspiOS support THP?
Last edit: Igor Pavlov 2023-11-28
Currently I use the default kernel for pi5 on 64-bit RaspiOS Bookworm.
transparent_hugepage/enabled is not configured.
Edit:
In this kernel, pi5 has a 16 KB page size.
Last edit: HITCHER 2023-11-28
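For anyone who wants to check both settings on their own system, a minimal sketch follows. It assumes the standard Linux sysfs location /sys/kernel/mm/transparent_hugepage/enabled, where the active mode is shown in square brackets. The helper names are hypothetical, and a missing file is reported as None (for example, on a kernel built without THP).

```python
import mmap
import re
from pathlib import Path
from typing import Optional

# Standard Linux sysfs location for the THP setting (assumed to apply here).
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text: str) -> Optional[str]:
    """Return the bracketed mode from e.g. 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def thp_mode(path: Path = THP_ENABLED) -> Optional[str]:
    """Return the active THP mode, or None if the file cannot be read."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("page size:", mmap.PAGESIZE, "bytes")
    print("THP mode:", thp_mode())
```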
I have now tried Ubuntu 23.10 Server for pi5.
It has THP in its kernel. Benchmark results are mostly the same.
Last edit: HITCHER 2023-11-29
THP doesn't affect swap4, because swap4 is sequential access.
THP can help for random access.
So THP:always can help for lzma compression speed.
But 16 KB pages are also better than 4 KB pages for lzma compression speed.
I've looked results for some server arm64 computers.
And none of them has such a low swap4 speed at block sizes that reach main memory (RAM).
Probably it's some flaw in the Broadcom processor (a flaw in the memory controller). Or maybe the L3 cache or the memory controller is configured the wrong way by default.
The access pattern of read, change, write to the same place is slow on pi5, while most arm64 processors have no problems with such an access pattern.
Last edit: Igor Pavlov 2023-11-29
CPU: 7945HX