
#3663 uCsim slow on 64-bit big-endian system

closed-wont-fix
None
other
5
2024-04-08
2023-10-03
No

At least for some inputs, uCsim is surprisingly slow on my Power 9 system. For most tasks (compiling SDCC; regression testing) that system is by far the fastest one I have, but there are a few oddities. The following are all times from systems with otherwise practically no load running Debian GNU/Linux:

Command used: time ../../sim/ucsim/mos6502.src/ucsim_mos6502 gen/uc6502/tst_gcc-torture-execute-arith-rand-ll.ihx < ./ports/uc6502/uCsim.cmd

My laptop (Ryzen 4800H):
Runtime: 3.145461 sec
real 0m3,150s
user 0m3,138s
sys 0m0,012s

My Raspi 4:
Runtime: 12.779123 sec
real 0m12,787s
user 0m12,737s
sys 0m0,016s

nemesis (dual 22-core SMT4 Power 9):
Runtime: 30.998519 sec
real 0m31,003s
user 0m30,993s
sys 0m0,013s

Apparently uCsim is so slow here that this test times out during normal regression testing. I'll try to look into details (where in uCsim do we spend the time) later. But I also noticed a general trend: On most of my systems, during regression testing far more time is spent in SDCC than in uCsim. But for power64 it is more balanced. Using the default timeouts, usually two uc6502 tests fail due to timeouts, and a few z80-related ones are on the edge of failing (they usually pass as long as there is not too much load on the system - make -j 80 is still fine, make -j 120 tends to fail).

Now, that Power 9 system might not have the WOF tables configured correctly, so the CPUs might be running at a lower TDP, and Power 9 is a somewhat older architecture. Being a bit slower than the Ryzen 4800H for some single-threaded workloads is no surprise. But it definitely shouldn't be slower than the Raspi.

Discussion

  • Daniel Drotos

    Daniel Drotos - 2023-10-03

    Dear Philipp!
    Can you attach the .out file of that test?
    Daniel

     
  • Philipp Klaus Krause

    I have verified that the .ihx files are identical on the Raspi 4 and the Power9 machine.

    P.S.: The number of ticks simulated is the same, too:

    nemesis (power64):

    Simulation started, PC=0x000301
    --- Running: gcc-torture-execute-arith-rand-ll.c
    Running testTortureExecute
    --- Summary: 0/0/1: 0 failed of 0 tests in 1 cases.
    
    Stop at 0x00059e: (110) Program stopped itself
    F 0x00059e
    Simulated 51017861 ticks (5.102e+01 sec)
    Host usage: 30.975409 sec, rate=1.647044
    CPU state= OK PC= 0x00059e frequency= 1000000 HZ
    Operation since last reset= 49997714 vclks
    Inst= 17243919 Fetch= 34088336 Read= 10812268 Write= 5097110
    Total time since last reset= 51.017867982543287 sec (51017868 clks)
    Time in isr = 0.000000000000000 sec (0 clks) 0.00%
    Time in idle= 0.000000000000000 sec (0 clks) 0.00%
    Most value of stack pointer= 0x000000
    Simulation: stopped
    Runtime: 30.998519 sec
    

    raspi-rebstock (aarch64):

    Simulation started, PC=0x000301
    --- Running: gcc-torture-execute-arith-rand-ll.c
    Running testTortureExecute
    --- Summary: 0/0/1: 0 failed of 0 tests in 1 cases.
    
    Stop at 0x00059e: (110) Program stopped itself
    F 0x00059e
    Simulated 51017861 ticks (5.102e+01 sec)
    Host usage: 12.739029 sec, rate=4.004847
    CPU state= OK PC= 0x00059e frequency= 1000000 HZ
    Operation since last reset= 49997714 vclks
    Inst= 17243919 Fetch= 34088336 Read= 10812268 Write= 5097110
    Total time since last reset= 51.017867982543287 sec (51017868 clks)
    Time in isr = 0.000000000000000 sec (0 clks) 0.00%
    Time in idle= 0.000000000000000 sec (0 clks) 0.00%
    Most value of stack pointer= 0x000000
    Simulation: stopped
    Runtime: 12.779123 sec
    
     

    Last edit: Philipp Klaus Krause 2023-10-03
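The two logs above differ only in host time, so the slowdown can be put into numbers directly. The following is a minimal sketch, not part of uCsim: it assumes that the `rate=` field is simulated time divided by host time (which appears to match the logged values), and the helper names `speed_ratio` and `kips` are made up for illustration.

```python
# Values copied from the two uCsim logs above.
TICKS = 51017861          # "Simulated ... ticks"
FREQ_HZ = 1_000_000       # "frequency= 1000000 HZ"
INSTRUCTIONS = 17243919   # "Inst="
HOST_SEC = {              # "Host usage: ... sec"
    "nemesis": 30.975409,
    "raspi-rebstock": 12.739029,
}


def speed_ratio(ticks, freq_hz, host_sec):
    """Simulated seconds per host second (assumed meaning of rate=)."""
    return (ticks / freq_hz) / host_sec


def kips(instructions, host_sec):
    """Simulated instructions per host second, in thousands."""
    return instructions / host_sec / 1000.0


if __name__ == "__main__":
    for host, sec in HOST_SEC.items():
        print(f"{host}: rate={speed_ratio(TICKS, FREQ_HZ, sec):.6f}, "
              f"{kips(INSTRUCTIONS, sec):.0f} kips")
    slowdown = HOST_SEC["nemesis"] / HOST_SEC["raspi-rebstock"]
    print(f"nemesis needs {slowdown:.2f}x the host time of the Raspi 4")
```

Since the simulated work is identical on both machines, the ratio of host times is the whole story: nemesis takes roughly 2.4 times as long as the Raspi 4 for the same instruction stream.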
  • Philipp Klaus Krause

    This is the gprof output on powerpc64. I also tried valgrind, but it crashed.

    P.S.: And for comparison also the gprof output on the Raspi 4. For some reason, when compiled with -pg, both nemesis and the raspi take about 112s (according to time), which is much longer than what I measured without -pg.

     

    Last edit: Philipp Klaus Krause 2023-10-03
  • Daniel Drotos

    Daniel Drotos - 2023-10-04

    Dear Philipp,

    Would you please check the effect of the cperiod value on the runtime? Write:

    cperiod=100

    into the uCsim.cmd file (before run) and try some other values as well, such as 10000, 100000, 500000.

    Daniel
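Trying several cperiod values by hand is tedious, so the edit to the command file could be scripted. This is only a sketch under assumptions: the contents of uCsim.cmd are not shown in this thread, so the code assumes that "before run" means before the first line starting with `run`, and it falls back to appending the setting when no such line exists. The function name `with_cperiod` is hypothetical.

```python
def with_cperiod(cmd_text, value):
    """Return cmd_text with a 'cperiod=<value>' line placed before 'run'.

    Assumption: the command file contains a line starting with 'run'.
    If it does not, the setting is appended at the end instead.
    """
    lines = cmd_text.splitlines()
    setting = f"cperiod={value}"
    for i, line in enumerate(lines):
        if line.strip().startswith("run"):
            lines.insert(i, setting)
            break
    else:
        lines.append(setting)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder command file text, NOT the real uCsim.cmd.
    example = "some_setup_command\nrun\n"
    for value in (100, 10000, 100000, 500000):
        print(f"--- variant for cperiod={value} ---")
        print(with_cperiod(example, value), end="")
```

Each variant would then be fed to the simulator on standard input in place of the original uCsim.cmd, with `time` wrapped around the call as in the report.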

     
    • Philipp Klaus Krause

      It doesn't make much of a difference. uCsim is apparently slightly faster for higher values of cperiod (31.1s at cperiod=100000 vs 31.3s at cperiod=100).

       
  • Daniel Drotos

    Daniel Drotos - 2023-10-04

    Dear Philipp,

    Can you tgz and send me the full content of the results/uc6502 directory, please?

    Daniel

     
    • Philipp Klaus Krause

      Here they are. If it helps, I could also give you an account on the machine.

       
      • Daniel Drotos

        Daniel Drotos - 2023-10-07

        Yes, I think an account would be useful, so I could make more tests.

         
  • Daniel Drotos

    Daniel Drotos - 2023-11-09

    I wrote a simple CPU speed measurement (1 thread, no IO) and checked several machines I can access. nemesis was surprisingly slow. I have no idea how uCsim could run faster on it.

     
    • Philipp Klaus Krause

      Did you try your "simple CPU speed measurement" on a Raspi 3 or 4? If yes, how does it perform there vs. nemesis?

       
      • Daniel Drotos

        Daniel Drotos - 2023-12-16

        This is the result of my tests:

         MFlop   kips       host   Model
            38    236     szoba2   ARM1176
            46    261        ait   sparcv9 750MHz
            74    755     szoba3   Cortex-A53
           127    784    nemesis   POWER9, altivec supported
           239   2362        mms   Intel(R) Celeron(R) CPU J3455 @ 1.50GHz
           233   3580    mazsola   Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
           293   4927         v3   Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
           381   5151  oldrender   Intel(R) Xeon(R) CPU E31240 @ 3.30GHz
           338   5416    avokado   Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
           355   5460     render   Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz
           385   5464     garazs   Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz
           394   6425         p4   Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
           430   7018       dori   Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz
           453   8545       vdeb   12th Gen Intel(R) Core(TM) i7-12700H
           385   9270     balint   AMD Ryzen 7 5700X 8-Core Processor
           440  15674        mac   Intel(R) Xeon(R) W-3223 CPU @ 3.50GHz
        

        szoba2 is an rpi2 and szoba3 is an rpi3.

         

        Last edit: Daniel Drotos 2023-12-16
        • Daniel Drotos

          Daniel Drotos - 2023-12-16

          MFlop is measured with floating point operations but it is not really relevant to uCsim. The kips column means kilo-instructions per second, and it is measured with a loop that is similar to uCsim instruction simulation.
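The benchmark program itself is not included in this thread. To illustrate what a kips measurement of this kind might look like, here is a hypothetical sketch of a single-threaded fetch/decode/execute loop with no I/O. The toy instruction set is invented for the example and is not uCsim's, and because this runs in a Python interpreter the absolute numbers are not comparable with the table above.

```python
import time

# Hypothetical toy opcodes, invented for this sketch.
OP_INC, OP_ADD, OP_JMP = 0, 1, 2

# Endless loop: increment, add 3, jump back to the start.
PROGRAM = [(OP_INC, 0), (OP_ADD, 3), (OP_JMP, 0)]


def run(program, n_instructions):
    """Execute n_instructions of the toy program; return the 8-bit accumulator."""
    pc = 0
    acc = 0
    for _ in range(n_instructions):
        opcode, operand = program[pc]      # fetch
        pc += 1
        if opcode == OP_INC:               # decode and execute
            acc = (acc + 1) & 0xFF
        elif opcode == OP_ADD:
            acc = (acc + operand) & 0xFF
        elif opcode == OP_JMP:
            pc = operand
        else:
            raise ValueError(f"unknown opcode {opcode}")
    return acc


def measure_kips(n_instructions=1_000_000):
    """Time the loop and return thousands of toy instructions per second."""
    start = time.perf_counter()
    run(PROGRAM, n_instructions)
    elapsed = time.perf_counter() - start
    return n_instructions / elapsed / 1000.0


if __name__ == "__main__":
    print(f"{measure_kips():.0f} kips (toy loop, Python)")
```

A loop like this is dominated by branching and small memory accesses rather than floating point work, which is why a kips figure says more about simulator speed than the MFlop column does.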

           
  • Daniel Drotos

    Daniel Drotos - 2024-04-08
    • status: open --> closed-wont-fix
    • assigned_to: Daniel Drotos
     
