Got it.
On the Broadwell 10c, same L3: 38 cycles.
Mem in column 6 is 174 cycles (versus 165 cycles),
so 174 - 38 = 136 cycles; divided by 2.2 GHz, that is L3 + 61.8 ns.
This brings us closer to the L3 + 75 ns in the 2-socket E5-2699 v4.
Would you guess 10 ns for the remote L3 cache, and 3 ns for the more complex ring?
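As a quick check of the cycles-to-nanoseconds conversion used above, here is a small C sketch (the 38-cycle L3 and 174-cycle memory values are simply the figures quoted in this post):

    /* convert a memlat cycle count into "L3 + x ns" at a given core clock */
    #include <stdio.h>
    int main(void)
    {
        const double l3_cycles = 38.0, mem_cycles = 174.0, core_ghz = 2.2;
        printf("RAM latency = L3 + %.1f ns\n", (mem_cycles - l3_cycles) / core_ghz);  /* 61.8 ns */
        return 0;
    }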
The reason I am looking into this is that I am making the case for 1-socket systems for database transaction processing. Almost nobody architects the DB for memory locality on NUMA, which gives a single socket an advantage in performance per core; and if the mere presence of a second socket adds latency even to local-node access, then the advantage is even bigger.
I will try to find a cheap refurbished 2-socket Broadwell, or bite the bullet and get a Skylake 2S.
Yes, 75 ns on the E5-2699 v4 is not such a good result. I don't know the exact reasons why it's slow; I don't know the exact timings and snoop mode of that server.
62 ns for a 1-socket system is OK.
Also note that Windows can probably use 1 GB pages instead of 2 MB pages if the array is big enough.
Try 2 tests:
memlat p z22 l 512 > p.txt
memlat p z22 l 8g x30 > p_x30.txt
If there is a difference after the 64m line, then p.txt (2 MB pages) also includes 8 clocks for TLB misses, and p_x30.txt (1 GB pages) gives results without TLB misses - that will be the pure RAM latency.
L3 + 75 ns may suck, but it's not far off your guess of a 10 ns 1S-to-2S penalty; and my system is the 10c (LCC) die versus the 22c HCC die - there must be extra latency on the bigger die?
Three years ago, I was doing work at a client; the test system was an HP blade, 2S Xeon E5-2680, 8c 2.7 GHz. The infrastructure team did BIOS/UEFI updates on three separate occasions. After each one, I noticed that the average CPU time of key SQL statements went up by about 3X. It turned out the update had set the system to power-save mode, operating at 135 MHz.
Unfortunately, I assumed it was a bad test and did not archive the data. Only afterwards did I realize that a 3X performance difference over a 20X frequency change (135 MHz to 2700 MHz) was something that could be explained by memory latency effects.
Almost every data center uses 2S as the standard system; also consider that two 14c processors are less expensive than one 28c. But if there is a significant difference in per-core performance efficiency between 1S and 2S, that would negate the lower HW cost.
Latency for your system is 38 cycles + 58 ns.
It's from the 1 GB pages results (p10c_x30.txt).
Please write about your RAM configuration and timings.
Also try memlat with turbo-boost enabled.
If memory latency is important for your tasks and they use a large amount of memory (more than 100 MB), then check that your programs can use large pages (2 MB and 1 GB pages) instead of the default 4 KB pages. Large pages really can improve performance in such cases.
Example for large pages in 7-zip:
7z b -md27
7z b -md27 -slp
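For your own programs, here is a minimal sketch of how a Windows process might request large pages (it assumes the account already holds the "Lock pages in memory" privilege; it is only an illustration, not how 7-zip or memlat does it):

    /* request large pages on Windows, falling back to normal 4 KB pages */
    #include <windows.h>
    #include <stdio.h>
    int main(void)
    {
        SIZE_T large = GetLargePageMinimum();          /* typically 2 MB on x64 */
        SIZE_T size = 128 * large;                     /* must be a multiple of the large-page size */
        void *p = VirtualAlloc(NULL, size,
                               MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                               PAGE_READWRITE);
        if (p == NULL) {                               /* fails without SeLockMemoryPrivilege */
            printf("large-page allocation failed (error %lu), using 4 KB pages\n", GetLastError());
            p = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        }
        /* ... run the latency-sensitive work on p here ... */
        if (p) VirtualFree(p, 0, MEM_RELEASE);
        return 0;
    }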
About the difference between 1S and 2S:
The RAM latency difference is up to 15-20%, if NUMA works OK.
But many programs don't depend on RAM latency, so there is no difference at all for such programs.
But if some task uses threads on both sockets to access the same memory blocks, there will be heavy traffic between the sockets and performance losses.
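As an illustration of keeping memory node-local when NUMA does matter, a hedged Windows sketch using VirtualAllocExNuma (the node number and thread placement here are assumptions for the example, not a recipe from this thread):

    /* allocate a buffer preferentially on a specific NUMA node (Windows) */
    #include <windows.h>
    #include <stdio.h>
    int main(void)
    {
        ULONG highestNode = 0;
        GetNumaHighestNodeNumber(&highestNode);        /* 0 on a 1-socket system */
        DWORD preferredNode = 0;                       /* keep data on the node running the thread */
        void *p = VirtualAllocExNuma(GetCurrentProcess(), NULL, 256 << 20,
                                     MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                     preferredNode);
        printf("highest NUMA node: %lu, allocation %s\n",
               highestNode, p ? "succeeded" : "failed");
        if (p) VirtualFree(p, 0, MEM_RELEASE);
        return 0;
    }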
Memory is 4 x 16 GB, Crucial, DDR4-2166, CL15.
So your E5-2630 v4, 1S result is:
RAM Latency = 38 cycles + 58 ns (2.2 GHz)
RAM Latency = 46 cycles + 58 ns (3.1 GHz)
Your RAM timing is 14 ns (CL15 at 2166 MT/s: 15 / 1083 MHz). Reading from RAM requires two operations, each of 14 ns, so 28 ns is the minimum possible latency for 2166 CL15.
58 ns - 28 ns = 30 ns; that's the overhead of the ring and the memory controller delays.
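The same arithmetic as a small C sketch, assuming the DDR4-2166 CL15 timings above (tRCD is assumed equal to tCL here):

    /* DDR4 timing clocks -> ns, and the remaining ring + memory-controller overhead */
    #include <stdio.h>
    int main(void)
    {
        const double clock_mhz = 2166.0 / 2.0;              /* 1083 MHz command clock */
        const double tCL_ns  = 15.0 * 1000.0 / clock_mhz;   /* ~13.9 ns */
        const double tRCD_ns = 15.0 * 1000.0 / clock_mhz;   /* ~13.9 ns */
        const double measured_ns = 58.0;                    /* memlat result above */
        printf("tRCD + tCL = %.1f ns\n", tRCD_ns + tCL_ns);                      /* ~28 ns */
        printf("ring + memory controller ~ %.1f ns\n", measured_ns - (tRCD_ns + tCL_ns));
        return 0;
    }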
The Broadwell LCC die has 1 memory controller connected to all 4 channels, unlike the MCC and HCC dies, which have 2 controllers with 2 channels each. So the LCC reads 32 bytes per data transfer, and the cache line fills on the second transfer, which occurs 0.46 ns (1/2166) after the first data.
My understanding of DDR DRAM is that memory latency is RAS + CL + extra, depending on how much of the 8-long burst is necessary to fill a cache line?
However, for sustained random access, not re-hits to the same row, it ends up being RP + RAS + CL + extra.
In SQL Server or any other page-row database, there is a read of the page header, then the row offsets, then the row itself, which might span more than one cache line; in that case variable-length columns might need an extra access, and these might just incur the extra CL hit.
This is at the DRAM interface. There is also transmission delay to the memory controller, then whatever the ring overhead is (which I wish Intel would explain, beyond what they gave to say they are faster than EPYC).
1) Broadwell and most other modern CPUs read a full cache line (64 bytes) from the same channel. A full read is 8/2166 = 3.7 ns.
Maybe it even reads 2 lines (128 bytes) from the same channel on Broadwell.
2) A smart memory controller can close the RAM row after each access if it sees that most accesses are random. So it takes only tRCD + CL for the next random access to another bank (which will have its row closed, if the policy is closed-page).
Last edit: Igor Pavlov 2018-01-02
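A rough worked example of points 1) and 2), using the DDR4-2166 CL15-15-15 timings discussed above (the closed-page policy itself is the assumption here):

    /* burst transfer time and random-access timing under the two row policies */
    #include <stdio.h>
    int main(void)
    {
        const double clk_mhz = 2166.0 / 2.0;                 /* 1083 MHz command clock */
        const double tCL = 15e3 / clk_mhz, tRCD = 15e3 / clk_mhz, tRP = 15e3 / clk_mhz;
        printf("full 64-byte burst: 8 / 2166 MT/s = %.1f ns\n", 8.0 * 1000.0 / 2166.0);
        printf("closed-page random access: tRCD + tCL       = %.1f ns\n", tRCD + tCL);
        printf("open-page row miss:        tRP + tRCD + tCL = %.1f ns\n", tRP + tRCD + tCL);
        return 0;
    }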
I just purchased a Core i7-10700K (Comet Lake) with an Asus ROG Maximus XII Formula motherboard and G.SKILL F4-4000C17-16GTRS memory. Unfortunately, the system does not boot with both memory channels populated, so I did the initial testing with only a single DIMM. The default DIMM setting is 2133, 15-15-15. I got different results with the one DIMM in B1 vs. B2. Also tested at 2933 MT/s, 18-18-18 (let me verify this). For a baseline: Xeon E3-1245 v5 (Skylake) using unbuffered ECC, 2133, 15-15-15.
The results at fixed base frequency are interesting for the low-level numbers.
But also test with the default frequency (turbo) to see the real maximum speed. Write also about the ring frequency during the benchmark; it affects L3 cache speed, which was about 50 cycles in your tests.
You can check the RAM frequency with CPU-Z if you see some difference between the results for DIMMs in different slots.
My mistake on the DIMM slot; this is my first gaming system in which the memory settings are adjustable, and I did not understand the settings. Attached are two runs with memory at 2133 MT/s, 15-15-15: one locked to 3800, the other unlimited (but the BIOS is set to normal, not aggressive overclocking).
Also, I am using a Noctua heat-pipe cooler instead of liquid cooling; not sure if this impacts OC.
This is an AMD slide showing 67 ns raw memory latency for DDR4-3773: https://www.reddit.com/r/Amd/comments/bz5b7x/ryzen_3000_ddr_memory_latency_comparison/
vs. https://7-cpu.com/cpu/Zen2.html - AMD 3800X (Zen2), 7 nm. RAM: 32 GB DDR4-3200 16-18-18-38-56-1T (dual channel)
RAM Latency = 38 cycles + 66 ns
Later, I might look into this.
I think I may have learned how to set memory timings in the ASUS ROG UEFI. Here are tests for 1 channel at 2933 MT/s with tCL/tRCD of 21, 19, 17, 15 and 14, corresponding to 14.32, 12.956, 11.592, 10.228 and 9.547 ns. I mostly set tRAS at tCL + tRCD + 3 clocks (please advise if it should be 4?). A few tests are at the processor base frequency of 3.8 GHz; all timings were also done at turbo-boost, typically 4.8 GHz, which lowers latency by about 2 ns. The memory latencies for turbo-boost mode are: 63.25, 61.27, 57.76, 55.25 and 53.1 ns respectively.
The earlier attachments were at 2133 MT/s, tCL 15, i.e. 14.065 ns, with memory latency at 67.73 ns. The memory I used, G.SKILL, was rated for 4000, CL17 (8.5 ns), but my system did not boot. I will try 2933 with tCL = 13 (8.865 ns) later, and run the same sequence at 3200 MT/s next weekend.
If you need only RAM latency, you can also use the large_pages program, and look at the numbers for the 1 GB block lines there. It must use 1 GB pages.
The numbers before 1 GB in large_pages are for 2 MB pages, with an additional TLB-miss penalty.
Last edit: Igor Pavlov 2020-08-17
Question: say memlat64 p l 2048 shows memory latency at 58 ns, while p 2048 shows 98 ns. The locked/large-page 58 ns seems reasonable based on what we know about DRAM latency and L3, with the difference presumably in the transmission and the memory controller. But the 98 ns latency seems unreasonable. For locked pages, the memory must be allocated at startup? But for conventional memory, allocation is at first use? So does memlat force-allocate the memory before the measurement, or does the measurement include memory allocation time?
There are TLB misses for 4 KB pages.
The CPU must translate the virtual address to a physical address, and read the page table for that. That page table can be in L3 cache or in RAM.
Look at "4 KB pages mode, Windows 10" here: https://www.7-cpu.com/cpu/Skylake_X.html
Intel Core i7-10700K (8c, 3.8 GHz base)
MSI MEG Z490 ACE
G.SKILL Trident Z Royal F4-4000C17D-16GTRS
DDR4 4000 (PC4 32000)
Timing 17-17-17-37
Memory latency = 47.73 ns
(w/o z22: 44.96 ns)
For more accurate RAM latency values, look at the results from this program:
https://www.7-zip.org/a/large_pages.exe
at the 1 GB line.
Here is the large_pages run (8192M z22) and a rerun of memlat64 p z22 l from yesterday. (In between today and the above tests, I had reverted the system to non-XMP memory: 2133 MT/s, CL15.)
The latency is about 47 ns.
In memlat you also must use 1g; 1g uses 1 GB pages.
The 512m test uses 2 MB pages with about a 2 ns penalty (9 cycles), so the 512m result can be 2 ns larger than the fastest result with 1g.
For the latency test, you must call:
p z22 l 1g
and look at the numbers in column "6", which relates to the 64-byte cache line.
You must look at the second column in Latency-64; that is 46.82 ns.
Last edit: Igor Pavlov 2020-08-31