7-cpu benchmark on Skylake-X

2017-10-19
2020-08-31
  • Igor Pavlov

    Igor Pavlov - 2017-12-29

    for latency test, you must call:

    memlat p z22 l 512 > p.txt
    

    and look at the numbers in column "6", which corresponds to the 64-byte cache line.

     
  • Joe Chang

    Joe Chang - 2017-12-30

    got it
    on the Broadwell 10c, same L3 - 38 cycles,
    Mem in col 6 is 174 cycles (versus 165 cycles)
    so 174 - 38 = 136 cycles, divided by 2.2 GHz gives L3 + 61.8 ns
    this brings us closer to the L3 + 75 ns in the 2-socket v4 2699
    would you guess 10 ns for the remote L3 cache, and 3 for the more complex ring?
    The reason I am looking into this is that I am making the case for 1-socket systems for database transaction processing. Almost nobody architects the DB for memory locality on NUMA. This gives single socket an advantage in performance per core, and if the mere presence of a second socket adds latency even to local node access, then the advantage is even bigger.
    I will try to find a refurbished 2-socket Broadwell cheap, or bite the bullet and get a Skylake 2S
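
As a check of the arithmetic in this post, here is a minimal sketch that converts the cycle counts to nanoseconds. The function name is illustrative only, and it assumes the core really ran at the 2.2 GHz base clock during the test.

```python
def cycles_to_ns(cycles: float, freq_ghz: float) -> float:
    """Convert a core-cycle count to nanoseconds at the given clock in GHz."""
    return cycles / freq_ghz


l3_cycles = 38       # L3 latency reported by memlat
total_cycles = 174   # memory latency in column 6 (64-byte cache line)

# Part of the memory latency that lies beyond the L3 lookup.
beyond_l3_ns = cycles_to_ns(total_cycles - l3_cycles, 2.2)
print(f"L3 ({l3_cycles} cycles) + {beyond_l3_ns:.1f} ns")
```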

     
  • Igor Pavlov

    Igor Pavlov - 2017-12-30

    Yes, 75 ns on the E5-2699 v4 is not such a good result. I don't know the exact reasons why it's slow. I don't know the exact timings and snoop mode of that server.
    62 ns for a 1-socket system is OK.
    Also note that Windows probably can use 1 GB pages instead of 2 MB pages, if the array is big enough.
    Try 2 tests:

    memlat p z22 l 512 > p.txt
    memlat p z22 l 8g x30 > p_x30.txt
    

    If there is a difference after the 64m line, then p.txt (2 MB pages) also includes 8 clocks for a TLB miss, while p_x30.txt (1 GB pages) shows results without a TLB miss, which is pure RAM latency.

     
  • Joe Chang

    Joe Chang - 2017-12-30

    L3 + 75 ns may suck, but it's not far off your guess of a 10 ns penalty from 1S to 2S, and my system is 10c (LCC die) versus the 22c HCC die. There must be extra latency on the bigger die?

     
  • Joe Chang

    Joe Chang - 2017-12-30

    3 years ago, I was doing work at a client; the test system was an HP blade, 2S Xeon E5 2680, 8c 2.7GHz. The infrastructure did BIOS/UEFI updates on 3 separate occasions. After each event, I would notice the average CPU of key SQL statements went up by about 3X. It turned out the update set the system to power save mode, operating at 135MHz.
    Unfortunately, I assumed it was a bad test, and did not archive the data. Only afterwards did I realize that a 3X performance difference over a 20X frequency difference (135MHz to 2700MHz) was actually something that could be explained by the memory latency effect.
    Almost every data center uses 2S as the standard system, also considering that two 14c processors are less expensive than one 28c. But if there is a significant difference in the performance efficiency per core between 1S and 2S, then that would negate the lower HW cost.
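
One way to see how a 20X clock change can produce only a 3X slowdown is a two-part time model. This is a rough sketch under a stated assumption, not a measurement: run time is core work that scales with the clock plus memory wait that does not change with the clock. In a real power save mode the uncore and memory may slow down as well, so the result is only indicative. The function name is hypothetical.

```python
def memory_share(slowdown: float, freq_ratio: float) -> float:
    """Fraction of full-clock run time spent waiting on memory.

    Assumed model: time = core_time * (clock scale) + memory_time,
    with memory_time independent of the core clock.
    """
    # Normalise core_time at full clock to 1 and call memory_time m:
    #   (freq_ratio + m) / (1 + m) = slowdown
    m = (freq_ratio - slowdown) / (slowdown - 1)
    return m / (1 + m)


share = memory_share(slowdown=3.0, freq_ratio=2700 / 135)
print(f"{share:.1%} of the time at 2.7 GHz would be memory wait")
```

Under this model most of the time at full clock would be spent waiting on memory, which is consistent with the argument in the post.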

     
  • Igor Pavlov

    Igor Pavlov - 2017-12-31

    Latency for your system is 38 cycles + 58 ns.
    It's from 1 GB pages results (p10c_x30.txt).
    Please write about RAM configuration and timings.
    Also try memlat with turbo-boost enabled.

     
  • Igor Pavlov

    Igor Pavlov - 2017-12-31

    If memory latency is important for your tasks and your tasks use a big amount of memory (more than 100 MB), then check that your programs can use large pages (2 MB and 1 GB pages) instead of the default 4 KB pages. Large pages really can improve performance in such cases.
    Example for large pages in 7-zip:

    7z b -md27
    7z b -md27 -slp
    
     
  • Igor Pavlov

    Igor Pavlov - 2017-12-31

    About the difference between 1S and 2S.
    The RAM latency difference is up to 15-20%, if NUMA works OK.
    But many programs don't depend on RAM latency. So there is no difference at all for such programs.
    But if some task uses threads in both sockets to access the same memory blocks, there will be big traffic between sockets and performance losses.

     
  • Joe Chang

    Joe Chang - 2017-12-31

    memory is 4 x 16GB, Crucial, DDR4 2166, CL15

     
  • Igor Pavlov

    Igor Pavlov - 2018-01-01

    So your E5-2630 v4, 1S result is:
    RAM Latency = 38 cycles + 58 ns (2.2 GHz)
    RAM Latency = 46 cycles + 58 ns (3.1 GHz)

    Your RAM is 14 ns (CL15 / 2166 * 2). Reading from RAM requires two operations, each of 14 ns. So 28 ns is the minimum possible latency for 2166 CL15.
    58 ns - 28 ns = 30 ns. That is the overhead of the ring and memory controller delays.
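
The estimate above can be reproduced with a short sketch. It assumes, as the post does, that each of the two operations costs the same number of DRAM clocks as CL, and that the DRAM clock is half the transfer rate. The function name is illustrative.

```python
def cas_ns(cl_cycles: int, mt_per_s: float) -> float:
    """CAS latency in ns; the DRAM clock is half the transfer rate (MT/s)."""
    return cl_cycles / (mt_per_s / 2) * 1000.0


one_op = cas_ns(15, 2166)      # one DRAM operation, about 14 ns
minimum = 2 * one_op           # two operations per read, about 28 ns
measured = 58.0                # ns beyond L3, from the 1 GB page results
overhead = measured - minimum  # ring and memory controller, about 30 ns

print(round(one_op, 2), round(minimum, 1), round(overhead, 1))
```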

     
  • Joe Chang

    Joe Chang - 2018-01-01

    The Broadwell LCC die has 1 memory controller connected to all 4 channels, unlike the MCC and HCC dies, which have 2 controllers with 2 channels each. So the LCC reads 32 bytes per data transfer, and the cache line fills on the second transfer, which occurs 0.46 ns (1/2166) after the first data.
    My understanding of DDR DRAM is that memory latency is RAS + CL + extra, depending on how much of the 8-long burst is necessary to fill a cache line?
    However, for sustained random access, not rehits to the same column, it ends up being RP-RAS+CL + extra.
    In SQL Server or any other page-row database, there is a read to the page header, the row offsets, then the row, which might span more than a cache line. In that case variable length columns might need an extra access, and these might just incur the extra CL hit.
    This is at the DRAM interface. There is transmission delay to the memory controller, then whatever the ring overhead is (which I wish Intel would explain, other than what they gave to say they are faster than EPYC).

     
  • Igor Pavlov

    Igor Pavlov - 2018-01-02

    1) Broadwell and most other modern CPUs read a full cache line (64 bytes) from the same channel. A full read is 8/2166 = 3.7 ns.
    Maybe Broadwell reads 2 lines (128 bytes) from the same channel.
    2) A smart memory controller can close the RAM row after each access, if it sees that most accesses are random. So it takes only TRCD + CL for the next random access to another bank (that will have its row closed, if the policy is closed-page).
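
The burst time in point 1 can be checked with a small sketch. It assumes a 64-bit (8-byte) data path per DDR4 channel, so a 64-byte cache line takes 8 transfers. The function name is illustrative.

```python
def burst_ns(transfers: int, mt_per_s: float) -> float:
    """Time in ns to move the given number of transfers at a rate in MT/s."""
    return transfers / mt_per_s * 1000.0


line_bytes = 64    # cache line size
channel_bytes = 8  # assumed 64-bit data path of one DDR4 channel
transfers = line_bytes // channel_bytes

print(transfers, round(burst_ns(transfers, 2166), 2))
```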

     

    Last edit: Igor Pavlov 2018-01-02
  • Joe Chang

    Joe Chang - 2020-08-15

    I just purchased a Core i7-10700K (Comet Lake) with Asus ROG Maximus XII Formula motherboard and GSKILL F4-4000C17-16GTRS. Unfortunately, the system does not boot with both memory channels populated, so I did initial testing with only a single DIMM. The default DIMM setting is 2133, 15-15-15. I got different results with the 1 DIMM in B1 vs. B2. Also tested at 2933 MT/s 18-18-18 (let me verify this). For baseline - Xeon E3-1245v5 (Skylake) using unbuffered ECC, 2133, 15-15-15.

     
    • Igor Pavlov

      Igor Pavlov - 2020-08-16

      The results for fixed base frequency are interesting for low level numbers.
      But also test with default frequency (turbo) to see the real maximum speed. Write also about the ring frequency of the benchmark. It affects L3 cache speed, which was about 50 cycles in your tests.

      You can check ram frequency with CPU-Z, if you see some difference between results for DIMMs in different slots.

       
      • Joe Chang

        Joe Chang - 2020-08-16

        my mistake on the DIMM slot, this is my first gaming system in which the memory settings are adjustable and I did not understand the settings.
        attached are two runs with memory at 2133 MT/s, 15-15-15. 1 locked to 3800, the other unlimited (but BIOS is set to normal, not aggressive overclocking)
        also, I am using a Noctua heat pipe for cooling instead of liquid cooling, not sure if this impacts OC
        this is an AMD slide showing 67ns raw memory latency for DDR4-3773
        https://www.reddit.com/r/Amd/comments/bz5b7x/ryzen_3000_ddr_memory_latency_comparison/
        vs. https://7-cpu.com/cpu/Zen2.html
        AMD 3800X (Zen2), 7 nm. RAM: 32 GB, RAM DDR4-3200 16-18-18-38-56-1T (dual channel)
        RAM Latency = 38 cycles + 66 ns

        Later, I might look into this
         
  • Joe Chang

    Joe Chang - 2020-08-17

    I think I may have learned to set memory latencies in the ASUS ROG UEFI. Here are tests for 1 channel, 2933 MT/s with tCL / tRCD of 21, 19, 17, 15 and 14, corresponding to 14.32, 12.956, 11.592, 10.228 and 9.547 ns. I set tRAS at tCL + tRCD + 3 clocks mostly (please advise if it should be 4?). A few tests are at the processor base frequency of 3.8 GHz; the other timings were done at turbo-boost, typically 4.8 GHz, which lowers latency by about 2 ns. The memory latencies for turbo-boost mode are: 63.25, 61.27, 57.76, 55.25 and 53.1 ns respectively.
    The earlier attachments were at 2133 MT/s tCL 15, for 14.065 ns, with memory latency at 67.73 ns. The memory I used, GSKILL, was rated for 4000, 17CL (8.5 ns), but my system did not boot. Will try 2933 tCL = 13 (8.865 ns) later, and run the same sequence at 3200 MT/s next weekend.
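
The timing conversions in this post can be reproduced with the sketch below. It also subtracts tRCD + tCL from each measured turbo-boost latency, following the "two operations" estimate used earlier in the thread. That subtraction is an assumption about how to split the latency, not something memlat reports, and the names are illustrative.

```python
def timing_ns(cycles: int, mt_per_s: float) -> float:
    """DRAM timing in ns; one DRAM clock lasts 2000 / (MT/s) nanoseconds."""
    return cycles * 2000.0 / mt_per_s


# Measured turbo-boost latencies (ns) at 2933 MT/s, keyed by tCL = tRCD.
measured = {21: 63.25, 19: 61.27, 17: 57.76, 15: 55.25, 14: 53.1}

for tcl, latency in measured.items():
    t = timing_ns(tcl, 2933)
    # Assumed split: tRCD + tCL inside the DRAM, the rest elsewhere.
    rest = latency - 2 * t
    print(f"tCL {tcl}: {t:.3f} ns, remainder {rest:.2f} ns")
```

If the split is reasonable, the remainder should stay roughly constant across the timing settings.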

     
  • Igor Pavlov

    Igor Pavlov - 2020-08-17

    If you need only RAM latency, you can use also:

    https://www.7-zip.org/a/large_pages.exe
    

    and look at the numbers on the lines for the 1 GB block. It must use 1 GB pages.
    The numbers before 1 GB in large_pages are for 2 MB pages, with an additional TLB miss penalty.

     

    Last edit: Igor Pavlov 2020-08-17
  • Joe Chang

    Joe Chang - 2020-08-29

    Question: say memlat64 p l 2048 shows memory latency at 58 ns, while p 2048 shows 98 ns. The lock/large page 58 ns seems reasonable based on what we know about DRAM latency and L3, with the difference presumably coming from the transmission and the memory controller. But the latency of 98 ns seems unreasonable. For locked pages, the memory must be allocated at startup? But for conventional memory, allocation is at first use? So does memlat first force allocation of the memory before the measurement, or does the measurement include memory allocation time?

     
  • Igor Pavlov

    Igor Pavlov - 2020-08-29

    There is a TLB miss for 4 KB pages.
    The CPU must translate the virtual address to a physical address, and read the page table for that. That page table can be in L3 cache or in RAM.
    Look at "4 KB pages mode, Windows 10" here:
    https://www.7-cpu.com/cpu/Skylake_X.html
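
To illustrate why page size matters here, this sketch counts how many pages are needed to map one test buffer. It assumes the "2048" in the question means 2048 MB. The more pages a buffer spans, the less likely a random access finds its translation in the TLB. The names are illustrative.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3


def pages_needed(buffer_bytes: int, page_bytes: int) -> int:
    """Number of pages required to map a buffer (ceiling division)."""
    return -(-buffer_bytes // page_bytes)


buffer_bytes = 2048 * MIB  # assumed size of the memlat test buffer

for name, size in (("4 KB", 4 * KIB), ("2 MB", 2 * MIB), ("1 GB", GIB)):
    print(f"{name} pages: {pages_needed(buffer_bytes, size)}")
```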

     
  • Joe Chang

    Joe Chang - 2020-08-30

    Intel Core i7-10700K (8c, 3.8base
    MSI MEG Z490 ACE
    G.SKILL Trident Z Royal F4-4000C17D-16GTRS
    DDR4 4000 (PC4 32000)
    Timing 17-17-17-37
    Memory latency = 47.73 ns
    (w/o z22: 44.96 ns)

     
  • Igor Pavlov

    Igor Pavlov - 2020-08-31

    for more accurate RAM latency values, look at the results from this program:
    https://www.7-zip.org/a/large_pages.exe
    at 1 GB line.

     
  • Joe Chang

    Joe Chang - 2020-08-31

    here is the large_pages (8192M z22) output and a rerun of memlat64 p z22 l from yesterday. (Between today and the above tests, I had reverted the system to non-XMP memory: 2133 MT/s, CL-15)

     
  • Igor Pavlov

    Igor Pavlov - 2020-08-31

    the latency is about 47 ns.
    in memlat you also must use 1g.
    1g uses 1 GB pages.
    The 512m test uses 2 MB pages, with about 2 ns penalty (9 cycles). So the 512m result can be 2 ns larger than the fastest result with 1g:

     512m 00C00000     1    32 138  50  15.11   48.55  48.63   48.52  48.83 
    1024m 80000000    18    47 115  25  13.72   46.97  46.82   46.51  47.47 
    

    you must look at the second column in Latency-64, which is 46.82 ns.
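
A small sketch for pulling that value out of the two rows above. It assumes the Latency-64 group is the last four numbers on each row, which matches the 46.82 ns value named in this post. The full memlat column layout is not shown here, so treat the parsing as illustrative.

```python
rows = """\
 512m 00C00000     1    32 138  50  15.11   48.55  48.63   48.52  48.83
1024m 80000000    18    47 115  25  13.72   46.97  46.82   46.51  47.47
"""


def latency64_second(line: str) -> tuple[str, float]:
    """Return (block size, second value of the assumed Latency-64 group)."""
    fields = line.split()
    return fields[0], float(fields[-4:][1])


results = dict(latency64_second(l) for l in rows.splitlines() if l.strip())

# Difference between the 2 MB page result and the 1 GB page result.
penalty_ns = results["512m"] - results["1024m"]
print(results, round(penalty_ns, 2))
```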

     

    Last edit: Igor Pavlov 2020-08-31
  • Joe Chang

    Joe Chang - 2020-08-31

    p z22 l 1g

     