
7-cpu benchmark on Skylake-X

2017-10-19
2020-08-31
  • ZiNgA BuRgA

    ZiNgA BuRgA - 2017-10-19

    Not sure if useful to anyone, and couldn't find a more appropriate place to post this, but I ran the benchmark found at 7-cpu.com on the following system:

    CPU: Intel Core i7 7820X @3.6GHz
    RAM: 4x8GB DDR4 3400MHz CL16
    OS: Windows 10 x64

Turbo/powersaving options are disabled in the BIOS, so the CPU runs at a flat 3.6GHz across all cores. I installed 7-Zip according to the instructions and even ticked 'Use large memory pages' in the options dialog, but some tests still returned 'Not enough memory'.

    Anyway, I imagine results should be largely similar to Skylake/Kabylake, but hopefully still useful.
    Thanks for 7-cpu by the way!

     

    Last edit: ZiNgA BuRgA 2017-10-19
  • Igor Pavlov

    Igor Pavlov - 2017-10-19

Thanks for the tests!
Please do a few more:

1) Also try the "Large Memory Pages" tests as described in readme.txt; this requires admin rights and one reboot.
The results should be in the 64L/32L folder.
Quick test to check that large pages work with admin rights:

MemLat64.exe 16 l > 16_l.txt
MemLat64.exe 16 > 16.txt
    

The 16_l.txt results should be better (smaller latency), since large pages reduce TLB misses.

2) Also run test.bat at the default frequency (Turbo Boost switched on), so we can see how the L3 latency changes when the core frequency is higher.

3) Do additional benchmark tests with the latest 7-Zip 17.01 beta x64 at stock frequency (Turbo Boost switched on):

    7z b -mmt1 > mmt1.txt
    7z b > mmt.txt
    7z b -mm=* -mmt=* > bench.txt
    

4) Also write what mesh (uncore/L3) frequency was used in the tests.

     
  • ZiNgA BuRgA

    ZiNgA BuRgA - 2017-10-19

Oh, I see now: the batch file needs to be run with admin rights (I thought only 7-Zip needed admin rights to install large page support).

Updated results (CPU fixed at 3.6GHz, mesh fixed at 2.4GHz) are in results-stock.7z

    And here's a run with the following BIOS options enabled:

    • EIST
    • C-States / C1E
    • Intel Turbo Boost ("Enhanced Turbo" disabled)

Clock rate: 1.2GHz idle, 4GHz under load across all cores; a single core seems to jump between 4.1-4.3GHz, though this is just what I'm seeing in HWiNFO when running Prime95.
Mesh clock: 2GHz idle, 2.4GHz under load

The file is results-turbo.7z, which also includes the 7-Zip 17.01 benchmarks.

     

    Last edit: ZiNgA BuRgA 2017-10-19
  • Igor Pavlov

    Igor Pavlov - 2017-10-19

Thanks for the test!
"Large pages" now work OK.

    Please do additional tests with Turbo Boost enabled in admin mode:

    MemLat64.exe l p z23 x22 512m > 512m_x22.txt
    MemLat64.exe l p z23 x22 2g > 2g_x22.txt
    MemLat64.exe l p z23 x30 2g > 2g_x30.txt
    MemLat64.exe l p z23 x30 8g > 8g_x30.txt
    MemLat64.exe l p z23 x30 24g > 24g_x30.txt
    

It's probably better to run these tests right after a reboot.
The last test tries to allocate 26GB of large pages,
and right after a reboot the memory is not yet fragmented.

I'm trying to learn about 1 GB pages.
Windows 10 can probably allocate 1 GB pages instead of 2 MB pages in some cases.
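
For context, here is a minimal sketch (not part of the memlat tools) of how a Windows program requests large pages; it assumes the "Lock pages in memory" privilege is already granted and enabled for the process, and whether the kernel backs the region with 2 MB or 1 GB pages is the kernel's own decision:

#include <windows.h>
#include <stdio.h>

int main(void) {
    SIZE_T large = GetLargePageMinimum();  /* 0 if large pages are unsupported */
    if (large == 0) {
        printf("Large pages not supported.\n");
        return 1;
    }
    /* The request size must be a multiple of the large-page minimum
       (typically 2 MB on x64). */
    SIZE_T size = 64 * large;
    /* Needs SeLockMemoryPrivilege enabled in the process token and enough
       unfragmented physical memory - hence the run-after-reboot advice. */
    void* p = VirtualAlloc(NULL, size,
                           MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                           PAGE_READWRITE);
    if (p == NULL) {
        printf("VirtualAlloc failed: %lu\n", GetLastError());
        return 1;
    }
    printf("Got %zu bytes backed by large pages.\n", size);
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}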

     
  • ZiNgA BuRgA

    ZiNgA BuRgA - 2017-10-19

    Here you go!

I'll be away from the machine for the next few days, so if there are any more tests, it'll be a few days before I can run them.

     
  • Igor Pavlov

    Igor Pavlov - 2017-10-20

The default memlat setting (z20) was too small for the new L3 cache in Skylake-X. The latest results with z23 are more correct.

Please write more information about the RAM and the exact RAM settings (CPU-Z).

     

    Last edit: Igor Pavlov 2017-10-20
  • ZiNgA BuRgA

    ZiNgA BuRgA - 2017-10-20

    The RAM is this: http://www.corsair.com/en-us/vengeance-led-32gb-4-x-8gb-ddr4-dram-3466mhz-c16-memory-kit-white-led-cmu32gx4m4c3466c16

The XMP profile is applied, but I can't get it working at the rated 3466MHz, so it's running at 3400MHz instead. Everything else is stock; timings are 16-18-18-36, running in a quad-channel configuration.

I'm not at the computer at the moment, so I can't post a CPU-Z screenshot, but hopefully that's most of the info.

     
  • Igor Pavlov

    Igor Pavlov - 2017-10-20

OK. Write when the computer is available again.
I'll prepare another script of memlat benchmarks for Skylake-X.

     
  • ZiNgA BuRgA

    ZiNgA BuRgA - 2017-10-23

    CPU-Z screenshots attached.
Interestingly, the DRAM frequency seems to jump around: it's usually at 1700MHz, but sometimes drops to 739MHz - some powersaving feature? (I never knew DRAM would throttle down, though it's usually at the right speed.)

     
  • Igor Pavlov

    Igor Pavlov - 2017-10-23

Please run the attached bat file after a reboot and with admin rights.
It tests the L3 latency of different cores, and the RAM latency.
Note that it can be a long test - maybe 30 minutes.

     
  • ZiNgA BuRgA

    ZiNgA BuRgA - 2017-10-24

    Results attached

     
  • Igor Pavlov

    Igor Pavlov - 2017-10-24

    Thanks!
Core 3 is the fastest for L3 - 77 cycles.
Core 7 is the slowest for L3 - 81 cycles.
Cores 2 and 4 can probably work at 4.5 GHz. Is Turbo Boost 3.0 only for cores 2 and 4?

Please run the attached bat file.
I hope it will show the exact performance and frequency of each core.

     
  • ZiNgA BuRgA

    ZiNgA BuRgA - 2017-10-24

That's rather interesting. Do you think it may depend on the positioning of the cores and on which slice of the L3 the data is stored in? From what I know, the mesh can require 1 or 2 hops to get the data.
I believe the Skylake-X LCC die supports 12 cores, so presumably 4 of them are disabled on the 8-core CPU. I don't know whether it's possible to determine the physical positioning of the cores, or how much of an effect that'd have.
Also, don't you think that turbo would distort the clock timings a bit?

After running your script, I ran Prime95 (one worker thread), with affinity set to one thread at a time. I've attached the HWiNFO screenshot - as you can see from the maximum clock column, it appears that cores 0 and 6 are assigned as the TB3 cores. Interestingly though, it seems the other cores don't really boost beyond 4.1GHz for whatever reason. I did leave it running on core 7 for a bit longer than the rest, and it did mostly stay at 4.1GHz, but the data shows that it did go to 4.2GHz at some point. Maybe Prime95 is doing some AVX load or similar, so the clock runs a bit lower? (I'm just using the official v29.3; no mention of AVX.)

     
  • Igor Pavlov

    Igor Pavlov - 2017-10-25

1) Yes, core position is important. Cores in the center require fewer hops on average to reach L3.
Core 3 may be in the center, and core 7 may be in a corner.
But the difference between them is small:
5% - in L3 latency
1.3% - in RAM latency
0.5% - average performance difference in the 7-Zip benchmark.

Skylake-X LCC is 10 cores (2 disabled in the 7820X) plus 2 cells with memory controllers. We don't know which cores are disabled.

2) I don't know why TB3 goes to cores 0 and 6 for Prime.
TB3 for 7-Zip was on core 2 (affinity mask 30) and on core 4 (affinity mask 300).

    You can do both tests again:
    1) test with Prime affinity
    2) test with 7-Zip affinity (test_affinity.bat)

Also run Task Manager and check which core is under load.
Also run HWiNFO and check which core is under load there.
Maybe the HWiNFO sensor core numbers are different from the Windows core numbers?

     
  • Igor Pavlov

    Igor Pavlov - 2017-10-25

    Another question about TB3.
As I understand it, Windows 10 should move a single-threaded application to the TB3 core (4.5 GHz).
But it was about 4.3 GHz in your tests without affinity:

    7z b -mm=* -mmt1
    ...
    CPU                      100   4279   4279
    CPU                      100   4246   4245
    CPU                      100   4279   4279   100   100
    

So why didn't TB3 work in your tests without manual affinity?
Did you disable some TB3 settings in the BIOS or in Windows?

     
  • ZiNgA BuRgA

    ZiNgA BuRgA - 2017-10-25

Oops, my mistake - so only 2 cores are disabled on the 7820X, but it's unknown which ones.

The core numbers do all match up across Task Manager, HWiNFO and 7z's affinity mask (I was originally setting affinity using Task Manager). Using -stm3 -mmt2 I can see the first two threads in Task Manager using all the CPU, while HWiNFO shows core 0's frequency scaling up, and I've repeated this for all cores with the same result.

I left HWiNFO running whilst I ran test_affinity, and it does show cores 2 and 4 boosting to 4.5GHz, with the rest staying at 4.3GHz (I was going to post a screenshot, but I accidentally hit some key which closed the monitoring window and lost the results). Hence, it matches your results.

I reran Prime95, but with 2 worker threads, assigning affinity to the 2 threads of each core in turn. It seems that this time cores 0 and 7 got selected for TB3 (though it seems core 0 didn't quite reach 4.5GHz). I also noticed that the worker windows mention FMA3, so it seems that Prime95 is using AVX (though I'm unsure if it's 256 or 512-bit).
So it's interesting to note that the CPU does throttle on AVX, but TB3 seems to ignore that. Maybe TB3 is also workload dependent?

The version of Windows 10 I'm running is very old (probably the first RTM build, in fact), so if Microsoft needed to update it to support TB3, that support won't be in there. From what I can tell, Windows 10 shifts single-threaded applications amongst all the cores, so maybe the 4.3GHz you're seeing has something to do with that?

    I don't know of any option specifically relating to TB3 in the BIOS (there's just an option to enable/disable turbo).

     
  • Igor Pavlov

    Igor Pavlov - 2017-10-25

7-Zip loads the integer units,
and Prime95 uses the FPU/AVX units.
Maybe that's the reason for the TB3 core difference.
Try testing affinity with other integer/FPU programs, like the WinRAR benchmark or Cinebench.

     
  • ZiNgA BuRgA

    ZiNgA BuRgA - 2017-10-26

    I notice that in CPU-Z's 'Clocks' window, cores 2 and 4 are marked in red, so I wonder if it's managed to detect something there.

I tried WinRAR's benchmark, setting affinity to one core, leaving it for a while, and repeating for each core. Screenshot of HWiNFO attached. This benchmark doesn't seem to stress the CPU much, as the CPU fan didn't spin up at all, and none of the cores seem to really go beyond 4.1GHz. Despite some of the readings showing a higher max clock rate, I suspect they may be reading errors: watching the clocks whilst running the tests, I noticed that they rarely crossed 4.1GHz.

    I may carry out some more benchmarks in a few days.

     
  • Igor Pavlov

    Igor Pavlov - 2017-10-26

4.1 GHz is not good.
Maybe WinRAR switches the load between the 2 threads on the same core, so the load is not 200%.
Try affinity to just one thread, and disable the "multithreading" option in the WinRAR benchmark. That way it must use only one thread at full load.

     
  • ZiNgA BuRgA

    ZiNgA BuRgA - 2017-10-28

    Largely the same result unfortunately.

I decided to set up a test using the following program:

    #include <immintrin.h>
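
// Spins one of several two-chain arithmetic kernels, selected by the first
// command-line argument, so per-core boost clocks can be watched in HWiNFO.
// Note: argv[1] is dereferenced unchecked, so an argument must be supplied.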
    
    int main(int argc, char** argv) {
        switch (argv[1][0]) {
        case 'i': { // scalar int
            int n = 1, m = 2;
            for (int i = 0; i<100; i++)
                for (int j = 0; j<1e9; j++) {
                    n += n;
                    m += m;
                }
            return n + m;
        }
        case 'f': { // scalar float
            float n = 1, m = 2;
            for (int i = 0; i<100; i++)
                for (int j = 0; j<1e9; j++) {
                    n += n;
                    m += m;
                }
            return n + m;
        }
        case 'x': { // 128b int
            __m128i n = _mm_set1_epi8(1), m = _mm_set1_epi8(2);
            for (int i = 0; i<100; i++)
                for (int j = 0; j<1e9; j++) {
                    n = _mm_add_epi8(n, n);
                    m = _mm_add_epi8(m, m);
                }
            return _mm_extract_epi16(n, 0) + _mm_extract_epi16(m, 0);
        }
        case 'y': { // 256b int
            __m256i n = _mm256_set1_epi8(1), m = _mm256_set1_epi8(2);
            for (int i = 0; i<100; i++)
                for (int j = 0; j<1e9; j++) {
                    n = _mm256_add_epi8(n, n);
                    m = _mm256_add_epi8(m, m);
                }
            return _mm_extract_epi16(_mm256_castsi256_si128(n), 0) + _mm_extract_epi16(_mm256_castsi256_si128(m), 0);
        }
        case 'a': { // 128b float FMA
            __m128 n = _mm_set1_ps(1), m = _mm_set1_ps(2);
            for (int i = 0; i<100; i++)
                for (int j = 0; j<1e9; j++) {
                    n = _mm_fmadd_ps(n, n, n);
                    m = _mm_fmadd_ps(m, m, m);
                }
            return _mm_extract_ps(n, 0) + _mm_extract_ps(m, 0);
        }
        case 'b': { // 256b float FMA
            __m256 n = _mm256_set1_ps(1), m = _mm256_set1_ps(2);
            for (int i = 0; i<100; i++)
                for (int j = 0; j<1e9; j++) {
                    n = _mm256_fmadd_ps(n, n, n);
                    m = _mm256_fmadd_ps(m, m, m);
                }
            return _mm_extract_ps(_mm256_castps256_ps128(n), 0) + _mm_extract_ps(_mm256_castps256_ps128(m), 0);
        }
    
        default: return argv[0][0];
        }
    }
    

Compiled using MSVC 2015, x86, but with /arch:AVX2 (so float should use SSE instead of x87).
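
For reference, a plausible command line for such a build, assuming the source file is named burn.c to match the batch script below:

cl /O2 /arch:AVX2 burn.c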
    Ran this batch script with HWiNFO active:

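REM start /affinity takes a hexadecimal mask. With Hyper-Threading, logical
REM processors are typically paired per core, so the even bits select the
REM first thread of each physical core: masks 1,4,10,...,4000 pin each run
REM to one of the eight cores in turn.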
    start /b /wait /realtime /affinity 1 burn %1
    start /b /wait /realtime /affinity 4 burn %1
    start /b /wait /realtime /affinity 10 burn %1
    start /b /wait /realtime /affinity 40 burn %1
    start /b /wait /realtime /affinity 100 burn %1
    start /b /wait /realtime /affinity 400 burn %1
    start /b /wait /realtime /affinity 1000 burn %1
    start /b /wait /realtime /affinity 4000 burn %1
    

Not the best test, but it should be indicative enough.
I didn't really watch it whilst it ran, so I don't know whether the max-reading column values were sustained for long (cross-referencing with the VID seems to get rid of some measurement problems).

So it does seem that FPU loads boost differently from integer loads (or maybe I'm not doing enough concurrent FPU operations to saturate the units).

     
  • Igor Pavlov

    Igor Pavlov - 2017-10-28

I suppose only the "integer" test is good.
Your other tests probably have problems:
1) Float value overflow (n += n doubles the value each iteration, so a float hits infinity after roughly 128 iterations). I'm not sure about this one.
2) The load is too small for the FPU.
Also check what you have for the "AVX offset" option in the BIOS.

     
  • ZiNgA BuRgA

    ZiNgA BuRgA - 2017-10-29

Ah, good points - I forgot about overflows. The load is small, but I'd have thought that it shouldn't affect clock speeds much?

The AVX offset is at its default ("Auto"). I'm not sure what value it selects - the motherboard manual doesn't say much either: "If set to Auto, BIOS will configure this setting automatically". I could select a value, I suppose, but I've been trying to test at stock settings.

    I retried with the modified program:

    #include <immintrin.h>
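
/* Changes from the first version: the scalar-float case resets its values
   before they can overflow, and the FMA cases multiply by zero (n*0 + n = n,
   so the values never grow) while keeping 8 independent chains in flight,
   since two chains were nowhere near saturating the FMA units. */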
    
    int main(int argc, char** argv) {
        switch (argv[1][0]) {
        case 'i': { // scalar int
            int n = 1, m = 2;
            for (int i = 0; i<100; i++)
                for (int j = 0; j<1e9; j++) {
                    n += n;
                    m += m;
                }
            return n + m;
        }
        case 'f': { // scalar float
            float n = 1.1, m = 2.2;
            for (int i = 0; i<100; i++)
                for (int j = 0; j<1e9; j++) {
                    n += n;
                    m += m;
                    if (n > 1e6) n = 1.1;
                    if (m > 1e6) m = 2.2;
                }
            return n + m;
        }
        case 'x': { // 128b int
            __m128i n = _mm_set1_epi8(1), m = _mm_set1_epi8(2);
            for (int i = 0; i<100; i++)
                for (int j = 0; j<1e9; j++) {
                    n = _mm_add_epi8(n, n);
                    m = _mm_add_epi8(m, m);
                }
            return _mm_extract_epi16(n, 0) + _mm_extract_epi16(m, 0);
        }
        case 'y': { // 256b int
            __m256i n = _mm256_set1_epi8(1), m = _mm256_set1_epi8(2);
            for (int i = 0; i<100; i++)
                for (int j = 0; j<1e9; j++) {
                    n = _mm256_add_epi8(n, n);
                    m = _mm256_add_epi8(m, m);
                }
            return _mm_extract_epi16(_mm256_castsi256_si128(n), 0) + _mm_extract_epi16(_mm256_castsi256_si128(m), 0);
        }
        case 'a': { // 128b float FMA
            __m128 n = _mm_set1_ps(1), m = _mm_set1_ps(2);
            __m128 o = _mm_set1_ps(3), p = _mm_set1_ps(4);
            __m128 q = _mm_set1_ps(5), r = _mm_set1_ps(6);
            __m128 s = _mm_set1_ps(7), t = _mm_set1_ps(8);
            for (int i = 0; i<100; i++)
                for (int j = 0; j<1e9; j++) {
                    n = _mm_fmadd_ps(n, _mm_set1_ps(0), n);
                    m = _mm_fmadd_ps(m, _mm_set1_ps(0), m);
                    o = _mm_fmadd_ps(o, _mm_set1_ps(0), o);
                    p = _mm_fmadd_ps(p, _mm_set1_ps(0), p);
                    q = _mm_fmadd_ps(q, _mm_set1_ps(0), q);
                    r = _mm_fmadd_ps(r, _mm_set1_ps(0), r);
                    s = _mm_fmadd_ps(s, _mm_set1_ps(0), s);
                    t = _mm_fmadd_ps(t, _mm_set1_ps(0), t);
                }
            return _mm_extract_ps(n, 0) + _mm_extract_ps(m, 0)
                + _mm_extract_ps(o, 0) + _mm_extract_ps(p, 0)
                + _mm_extract_ps(q, 0) + _mm_extract_ps(r, 0)
                + _mm_extract_ps(s, 0) + _mm_extract_ps(t, 0);
        }
        case 'b': { // 256b float FMA
            __m256 n = _mm256_set1_ps(1), m = _mm256_set1_ps(2);
            __m256 o = _mm256_set1_ps(3), p = _mm256_set1_ps(4);
            __m256 q = _mm256_set1_ps(5), r = _mm256_set1_ps(6);
            __m256 s = _mm256_set1_ps(7), t = _mm256_set1_ps(8);
            for (int i = 0; i<100; i++)
                for (int j = 0; j<1e9; j++) {
                    n = _mm256_fmadd_ps(n, _mm256_set1_ps(0), n);
                    m = _mm256_fmadd_ps(m, _mm256_set1_ps(0), m);
                    o = _mm256_fmadd_ps(o, _mm256_set1_ps(0), o);
                    p = _mm256_fmadd_ps(p, _mm256_set1_ps(0), p);
                    q = _mm256_fmadd_ps(q, _mm256_set1_ps(0), q);
                    r = _mm256_fmadd_ps(r, _mm256_set1_ps(0), r);
                    s = _mm256_fmadd_ps(s, _mm256_set1_ps(0), s);
                    t = _mm256_fmadd_ps(t, _mm256_set1_ps(0), t);
                }
            return _mm_extract_ps(_mm256_castps256_ps128(n), 0) + _mm_extract_ps(_mm256_castps256_ps128(m), 0)
                + _mm_extract_ps(_mm256_castps256_ps128(o), 0) + _mm_extract_ps(_mm256_castps256_ps128(p), 0)
                + _mm_extract_ps(_mm256_castps256_ps128(q), 0) + _mm_extract_ps(_mm256_castps256_ps128(r), 0)
                + _mm_extract_ps(_mm256_castps256_ps128(s), 0) + _mm_extract_ps(_mm256_castps256_ps128(t), 0);
        }
    
        default: return argv[0][0];
        }
    }
    

(the compiler doesn't realize that the multiply by zero leaves the value unchanged, so it doesn't optimize the loops out)
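
A sturdier way to keep such kernels alive, rather than relying on the compiler missing the multiply by zero, would be to route the value through a volatile variable each outer iteration. A minimal standalone sketch (the printed value itself is meaningless; the point is the sustained load):

#include <stdio.h>

int main(void) {
    volatile float sink = 1.5f;      /* volatile: loads/stores must happen */
    float n = sink;
    for (int i = 0; i < 100; i++) {
        for (long j = 0; j < 1000000000L; j++)
            n += n;                  /* the hot dependency chain */
        sink = n;                    /* observable store keeps the loop alive */
        n = sink;                    /* reload for the next round */
    }
    printf("%f\n", sink);
    return 0;
}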

I'm running with slightly higher voltages on the CPU, as I've noticed some instability. Results are basically the same, except for the 256b FMA test, where the cores don't exceed 4.1GHz - apart from core 0, which gets to 4.3GHz.

     
  • Joe Chang

    Joe Chang - 2017-12-27

Hello, I did not see a separate item for the Broadwell Xeon E5-2699 v4. Does the RAM latency of 65 cycles + 75ns represent an average of local and remote node memory? Does memlat recognize NUMA memory? Can it specifically target the local or remote node separately? Would local node memory access on a 2-socket system be longer than on a 1-socket (due to cache coherency)? TIA -joe

     
  • Igor Pavlov

    Igor Pavlov - 2017-12-28

65 cycles + 75ns is the local memory latency.
Remote latency will be 60-70 ns higher for all Intel CPUs.
Linux allocates local memory for memlat by default.

Would local node memory access on a 2-socket system be longer than on a 1-socket (due to cache coherency)?

Not too much longer - maybe up to 10ns longer in a 2-socket system.
There are two main modes for coherency in 2-socket systems:
1) With snoop - while it snoops the fast remote cache, it reads from the slow local RAM in parallel. The remote cache latency is almost equal to the local RAM latency, so there is no additional delay.
2) With directory bits - it doesn't read the remote node.
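
As an aside, on Linux the local and remote cases can be targeted explicitly with numactl; a sketch, where the benchmark invocation is purely illustrative:

numactl --cpunodebind=0 --membind=0 ./memlat ...    # run on node 0, node-local RAM
numactl --cpunodebind=0 --membind=1 ./memlat ...    # run on node 0, remote-node RAM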

     
  • Joe Chang

    Joe Chang - 2017-12-29

Thanks - your work on this has been incredibly helpful. If you are in the US, give me a ping.
I have a Broadwell Xeon E5-2630 v4 (1-socket) 2.2GHz 10-core (presumably the LCC die) and a Haswell Xeon E3-1275 v3 3.43GHz handy, both on Windows Server 2016.
With the l option, on the Broadwell 10c, I see L3 = 38 cycles, Mem = 165 cycles (75ns); I presume this is L3 + 58ns (165 - 38 = 127 cycles; 127 / 2.2GHz = 57.7ns).
Your 65-cycle L3 at 3.6GHz is 18ns; my 38 cycles at 2.2GHz is 17.3ns. I would have thought that the giant double ring + switch of the Broadwell HCC die would have more of a penalty, but perhaps I am not interpreting something correctly?
You have 75ns for the E5-2699 v4 (2-socket) vs my 57ns for the E5-2630 v4.
When you said 75ns was the local node, I thought maybe the ECC decode contributed, but now I do not think so.
I am thinking we're both on 2133 memory with 15 timings; per your guess, 10ns could be explained by the 2S versus 1S difference.
I would have guessed that the remaining difference might be the time to navigate the core interconnect ring, but the L3 latencies seem to indicate not?

On the E3 v3, L3 = 36 cycles (?), Mem = 275 cycles (275 - 36 = 239 cycles; 239 / 3.43GHz = 69.7ns).
This is much longer than the 57ns you posted for the i7-4770. I suppose some of this comes from Registered DIMMs, which are supposed to have 1 clock longer latency (in each of the 3 timing elements?).
The Crucial website says both ECC UDIMM and RDIMM at 1600MT/s (800MHz clock) are CL11, but the non-ECC Ballistix is as low as CL8. So 8-8-8 = 24 clocks on a 0.8GHz clock is 30ns, while 11-11-11 = 33 clocks is 41.25ns.

     
