7-Zip 23.00 now supports a new benchmark command with the Swap4 filter:
7z b -mm=swap4 -mtic=30 -bt
It tests the Swap4 filter only. The Swap4 filter reverses the order of the bytes in each 32-bit word of data.
The idea behind the Swap4 filter is that it can increase the compression ratio for some data (like ARM64 code), because big-endian order of 32-bit data is usually better for LZMA compression than little-endian order. But currently 7-Zip doesn't use the Swap4 filter by default, because Swap4 helps only for pure code (only the executable sections of a file), and it hurts the compression ratio for other data.
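To check the compression claim above on your own data, a small experiment can compare the LZMA-compressed size of a buffer with and without the byte swap. This is only a sketch under assumptions: it uses Python's standard lzma module (which writes .xz streams, not 7-Zip's own pipeline), the names swap4 and compare_sizes are made up for illustration, and the sample buffer is synthetic. The result can go either way depending on the data, so no expected sizes are given.

```python
import lzma
import struct

def swap4(data: bytes) -> bytes:
    """Reverse the byte order inside every 32-bit word (length must be a multiple of 4)."""
    if len(data) % 4:
        raise ValueError("length must be a multiple of 4")
    out = bytearray(len(data))
    out[0::4] = data[3::4]
    out[1::4] = data[2::4]
    out[2::4] = data[1::4]
    out[3::4] = data[0::4]
    return bytes(out)

def compare_sizes(data: bytes):
    """Return (compressed size of data as-is, compressed size after swap4)."""
    return len(lzma.compress(data)), len(lzma.compress(swap4(data)))

# Synthetic stand-in for real input: 4096 little-endian 32-bit words.
# Replace it with the bytes of a real file section to test real data.
sample = b"".join(struct.pack("<I", 0x91000000 + 4 * i) for i in range(4096))
plain_size, swapped_size = compare_sizes(sample)
print("as-is:", plain_size, "after swap4:", swapped_size)
```

Applying swap4 twice gives back the original bytes, so the same routine serves for both encoding and decoding.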
The new version 7-Zip 23.00 has fast Swap4 code versions that use SSE2, AVX2, and NEON on ARM.
The benchmark code can also change the block size and the number of threads.
Each basic operation in Swap4 works as follows:
1. Load a chunk of data (128-bit or 256-bit).
2. Swap the bytes in each 32-bit word of the loaded chunk.
3. Store the converted chunk back (128-bit or 256-bit).
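The three steps above can be sketched in scalar Python. This is not 7-Zip's vector code, only an illustration of the same load / swap / store pattern on 16-byte (128-bit) chunks; the names CHUNK, swap4_chunk and swap4_inplace are made up for the example.

```python
CHUNK = 16  # bytes per chunk, the size of one 128-bit vector register

def swap4_chunk(chunk: bytes) -> bytes:
    """Reverse the bytes inside each 32-bit word of one chunk."""
    out = bytearray(len(chunk))
    for i in range(0, len(chunk), 4):
        out[i:i + 4] = chunk[i:i + 4][::-1]
    return bytes(out)

def swap4_inplace(buf: bytearray) -> None:
    """Convert a whole buffer chunk by chunk, in place."""
    if len(buf) % CHUNK:
        raise ValueError("buffer length must be a multiple of CHUNK")
    for off in range(0, len(buf), CHUNK):
        chunk = bytes(buf[off:off + CHUNK])  # 1. load a chunk
        swapped = swap4_chunk(chunk)         # 2. swap bytes in each 32-bit word
        buf[off:off + CHUNK] = swapped       # 3. store the chunk back

buf = bytearray(range(32))
swap4_inplace(buf)
print(list(buf[:8]))  # [3, 2, 1, 0, 7, 6, 5, 4]
```

A real SSE2/AVX2/NEON version does step 2 with one or a few vector instructions per chunk, which is why the benchmark ends up limited mostly by loads and stores.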
So this code can show the speed of vector instructions (SSE2/AVX2). Mostly, though, it shows the speed of the caches and memory for different block sizes.
For small block sizes it shows the bandwidth of the processor caches.
For large block sizes it shows the bandwidth of RAM.
It reports the encoding and decoding results in MB/s. But the actual memory traffic is 2 times larger than the reported speed value, because each operation in the Swap4 filter includes two memory operations: one LOAD and one STORE.
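As a worked example of that factor of 2, the helper below converts a reported speed into total memory traffic. The function name and the sample value are only illustrative.

```python
def memory_traffic_mb_s(reported_mb_s):
    """Total memory traffic implied by a reported Swap4 speed.

    Every processed byte is loaded once and stored once,
    so the traffic is the reported speed counted twice.
    """
    loads = reported_mb_s
    stores = reported_mb_s
    return loads + stores

# A reported 16500 MB/s means about 33000 MB/s of combined read and write traffic.
print(memory_traffic_mb_s(16500))  # 33000
```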
I want to check how the new optimized Swap4 filter works on different CPUs.
Please run the following benchmark command with output redirected to a file such as swap4.txt, and attach the results here.
mkdir c:\res
7z b -mm=swap4 -mtic=30 -bt > c:\res\swap4.txt
Please also write some information about the speed of your RAM: frequency and timings, if you know them.
-mtic=30 is a switch that reduces the complexity of the test (the number of iterations), because the benchmark can take too long to execute with the default complexity.
It's also better to close other programs (including the browser) before launching the benchmark.
Last edit: Igor Pavlov 2023-05-09
DIMM1: Kingston 99U5471-020.A00LF 4 GB DDR3-1333 DDR3 SDRAM (9-9-9-24 @ 666 MHz) (8-8-8-22 @ 609 MHz) (7-7-7-20 @ 533 MHz) (6-6-6-17 @ 457 MHz)
DIMM2: Kingston 99P5471-013.A00LF 4 GB DDR3-1333 DDR3 SDRAM (9-9-9-24 @ 666 MHz) (8-8-8-22 @ 609 MHz) (7-7-7-20 @ 533 MHz) (6-6-6-17 @ 457 MHz)
DIMM3: Kingston 99U5474-015.A00LF 2 GB DDR3-1333 DDR3 SDRAM (9-9-9-24 @ 666 MHz) (8-8-8-22 @ 609 MHz) (7-7-7-20 @ 533 MHz) (6-6-6-17 @ 457 MHz)
DIMM4: Kingston 99U5474-015.A00LF 2 GB DDR3-1333 DDR3 SDRAM (9-9-9-24 @ 666 MHz) (8-8-8-22 @ 609 MHz) (7-7-7-20 @ 533 MHz) (6-6-6-17 @ 457 MHz)
Last edit: AlexS 2023-05-09
Does the memory work in DDR3-1333 mode?
It was slightly unexpected that Swap4 on your FX-8300 is fast only for 4 KiB (line 12:), while the AMD FX-8300 has a 16 KiB data cache.
Please close other programs before the test, and try the special affinity switch -maf=1.
The -maf=1 switch sets a fixed affinity for each benchmark thread, so the system will not move benchmark threads to other cores.
Correction.
I've read more about AMD Bulldozer, and the fast 4 KiB result for FX is normal.
The 16 KiB data cache is for reading only.
For writing there is a Write Coalescing Cache of 4 KB:
https://chipsandcheese.com/2023/01/24/bulldozer-amds-crash-modernization-caching-and-conclusion/
Last edit: Igor Pavlov 2023-05-09
Last edit: AlexS 2023-05-09
Intel Core i3-7100 CPU @ 3.90GHz
Old system here: DDR2 667Mhz.
Do you know why your system reports 4 cores or threads, while the E2140 has only 2 threads?
And why did the frequency drop to 1200 MHz after 1-3 seconds of load?
Last edit: Igor Pavlov 2023-05-15
First my questions:
1) What does "CPU hardware threads: 2 / 4 : 3" mean?
2) What does "OPEN_MAX:1024" mean?
This CPU has only 2 cores, with no hyper-threading. I checked this in multiple ways and they all report only 2 cores (cat /proc/cpuinfo, lscpu, top, htop, getconf _NPROCESSORS_ONLN).
nproc returns 2, but nproc --all returns 4 (don't ask me why, but this is probably also incorrect for other CPUs).
As for frequency, I have the default CPU scaling governor schedutil set, and it stays at 1200 MHz most of the time.
" 2 / 4 : 3"
4 - total number of CPUs (similar to nproc --all).
2 - number of threads available (enabled) for the process.
3 - hex mask of the available threads.
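The three fields can be decoded mechanically. The sketch below parses the text after "CPU hardware threads:" and expands the hex mask into CPU indices. It assumes the usual affinity-mask convention that bit i stands for CPU i; the function name is made up for the example.

```python
def parse_thread_summary(text):
    """Parse 'available / total : hexmask' into (available, total, cpu_indices)."""
    counts, mask_hex = text.split(":")
    available, total = (int(part) for part in counts.split("/"))
    mask = int(mask_hex, 16)
    # Assumption: bit i of the mask corresponds to CPU number i.
    cpus = [i for i in range(mask.bit_length()) if (mask >> i) & 1]
    return available, total, cpus

print(parse_thread_summary(" 2 / 4 : 3"))  # (2, 4, [0, 1])
```

Mask 3 is binary 11, so CPUs 0 and 1 are enabled, which agrees with the "2" in the first field.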
So only "4" is an unusual value in your case.
"OPEN_MAX:1024" - how many files the program can keep open.
So your system raises the frequency to 1600 MHz for about 1-2 seconds and then drops it back to 1200 MHz. Why does it do that?
1600 MHz is not a frequency that needs much power.
Why doesn't it keep 1600 MHz under load?
Last edit: Igor Pavlov 2023-05-15
Now done with the CPU frequency scaling governor "performance". The CPU frequency was stable. I had been using the default governor schedutil for power-saving reasons.
If cat /proc/cpuinfo reports correct information, maybe there are bugs in nproc or in other places?
This GitHub user also has an E2140, cpuinfo
TV-BOX S905x4 (x96x4) has 4GB of DDR3 RAM on 8 RAM chips DDR3-1866 CL13 (32 Bit = 8x 4 Bit chip)
The OS is CoreELEC, and I used the static build for ARM64 Linux.
Last edit: HITCHER 2023-05-15
FX-8300 @3,6/4,2 GHz. RAM 4x4 GB DDR3 1600 CL7 (7-8-7-24)
OS: Windows 10 2022H2.
Same PC on Linux, to see if there are any differences. Linux has fewer background tasks open.
R7-5700G, Fedora Linux 36. RAM 2x16GB DDR4-3200 CL16 Single-Rank
iGPU mode, no graphics card used.
Slightly tuned in BIOS with curve-optimizer (automatic neg. voltage offset, so boost on all core load is higher)
Last edit: HITCHER 2023-05-16
About 16.5 GB/s (read and write) in line "27:".
So that gives about 33 GB/s of total memory bandwidth, while DDR4-3200 has a peak bandwidth of about 50 GB/s.
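The peak figure can be reproduced with simple arithmetic. The sketch below assumes a dual-channel setup with a 64-bit (8-byte) bus per channel, which is typical for 2 DIMMs on a desktop platform but is not stated in the post; the function name is made up.

```python
def ddr_peak_gb_s(mega_transfers_per_s, channels=2, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s (assumes 8-byte channels, dual channel by default)."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9

peak = ddr_peak_gb_s(3200)   # DDR4-3200
measured = 2 * 16.5          # read + write traffic from the benchmark line
print(peak)                          # 51.2
print(round(measured / peak, 2))     # 0.64
```

So under these assumptions the benchmark reaches roughly two thirds of the theoretical peak.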
Is the SSE2 code or the AVX code running? The log states SSE2 for the gcc compiler.
The compiler target is SSE2 for the whole 7-Zip code, but the Swap4 code can use AVX2 through runtime dispatching according to cpuid.
107.047 GB/s / 4.65 GHz = 23 bytes/cycle.
It does 3 operations: Load / Shuffle / Store. Maybe we can expect 32 bytes/cycle for Zen3.
But 4 threads show a larger result: 501.528 GB/s, or 125 GB/s per thread. So the 4-thread version works faster per thread than the 1-thread version for some unknown reason.
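The arithmetic above, written out as a small sketch. The numbers are the ones quoted in this post; the helper name is made up, and the 32 bytes/cycle value is only the hoped-for rate mentioned above, not a measured one.

```python
def bytes_per_cycle(gb_per_s, ghz):
    """Throughput per clock cycle: (GB/s) / (G cycles/s) = bytes per cycle."""
    return gb_per_s / ghz

print(round(bytes_per_cycle(107.047, 4.65), 1))  # 23.0  (1 thread)
print(round(501.528 / 4, 1))                     # 125.4 (GB/s per thread with 4 threads)
print(round(32 * 4.65, 1))                       # 148.8 (GB/s if 32 bytes/cycle were reached)
```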
Maybe the number of iterations is not big enough.
We can try more iterations:
Last edit: Igor Pavlov 2023-05-16
Max CPU boost frequency is 4,7 GHz for up to two threads, with this setup.
Last edit: HITCHER 2023-05-16
It's unexpected in your results that the single-thread speed is lower for encoding and higher for decoding:
15:101108102100132111
The single-thread code is executed right after the 16-thread load of the previous step, so your CPU could still be holding a low frequency for the 1-thread load just after the high load at 16 threads.
Do you know what frequency your CPU holds under high AVX load on all threads?
But we want full-frequency benchmark results for a single thread. So we can limit the load to 2 threads:
7z b -mm=swap4 -mtic=32 -bt -md29 -mmt2 > swap4_mmt2.txt
It should be able to reach up to 4,3 GHz on all cores, but maybe only for a limited time, until the temperature increases or too much power is drawn.
I can record a cpupower monitor log while the benchmark is running.
Last edit: HITCHER 2023-05-17
cpupower.txt is the output of the command
watch -n 1 'cpupower monitor >> cpupower.txt'
The -mmt2 results are good now for a single thread:
144 GB/s / 4.7 GHz = 31 B/cycle.
You can look at the frequency when you have more AVX threads:
Last edit: Igor Pavlov 2023-05-17
So the frequency is not as high as expected: 4,2 GHz for 8 cores, and 3,9 GHz for all threads.
Last edit: HITCHER 2023-05-17
old notebook with C2D CPU T6670, RAM 2x2 GB DDR2-800, Linux