#126 RFC on switching timer to MONOTONIC_COARSE

Milestone: Unstable_(example)
Status: closed
Labels: None
Priority: 5
Updated: 2015-01-07
Created: 2014-05-11
Creator: Andrew Bird
Private: No

Hi all,
I've run some benchmarks that show a speedup if the timer source is switched to MONOTONIC_COARSE: in my tests, at least a 2x improvement. This seems especially important as all modern CPUs are multicore and hence treated as SMP, which disables the Pentium timer.

The following two test runs were performed with a simple test program compiled both for native Linux and 16-bit DOS inside Dosemu. A value of 1x indicates native speed; 2x is half as fast.

Benchmark before applying this patch                                            
======================================================================          
CPUEMU                                                                          
       off : FAIL target <= 5.0x, result = 9.9x                                 
      vm86 : FAIL target <= 8.0x, result = 10.8x                                
      full : FAIL target <= 8.0x, result = 10.6x                                
   vm86sim : FAIL target <= 10.0x, result = 14.9x                               
   fullsim : FAIL target <= 10.0x, result = 14.9x                               
----------------------------------------------------------------------          
Ran 1 test in 879.980s                                                          

Benchmark after applying this patch                                             
======================================================================          
CPUEMU                                                                          
       off :  OK  target <= 5.0x, result = 3.8x                                 
      vm86 :  OK  target <= 8.0x, result = 5.6x                                 
      full :  OK  target <= 8.0x, result = 5.4x                                 
   vm86sim :  OK  target <= 10.0x, result = 9.9x                                
   fullsim :  OK  target <= 10.0x, result = 9.8x                                

----------------------------------------------------------------------          
Ran 1 test in 471.387s                                                          

Does anyone see any downside to switching to the COARSE timer?
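
For reference, here is a minimal standalone sketch (not dosemu code) of the two calls being compared. CLOCK_MONOTONIC_COARSE skips the hardware clock readout and returns the value cached at the last kernel tick, which is what makes it cheaper.

    /* Minimal sketch comparing the two clock sources (illustrative only).
       Build with: gcc -O2 coarse.c -o coarse   (add -lrt on older glibc) */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec fine, coarse;

        /* Precise reading: may read TSC/HPET/ACPI-PM hardware. */
        clock_gettime(CLOCK_MONOTONIC, &fine);

        /* Coarse reading: the value cached at the last kernel tick. */
        clock_gettime(CLOCK_MONOTONIC_COARSE, &coarse);

        printf("fine:   %ld.%09ld\n", (long)fine.tv_sec, fine.tv_nsec);
        printf("coarse: %ld.%09ld\n", (long)coarse.tv_sec, coarse.tv_nsec);
        return 0;
    }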

2 Attachments

Discussion

  • Stas Sergeev - 2014-05-11

    I think it is a good idea in general, but the fact that it helps probably means that dosemu is calling gettime too often. Would it be possible, before applying this patch, to find out why it calls it so often that the difference is noticeable? I mean, if we are fine with coarse timing, then there is likely no need to request it so frequently. Calling gettime very often just for coarse timing doesn't sound very logical to me.

     
  • Andrew Bird - 2014-05-12

    Hi Stas,
    I've attached some snapshots of output from kcachegrind whilst running the test. I don't really understand the use of the timer well enough to say whether there's anything wrong, but, as you say, it's being called a lot: 50 million times in my test, which at full speed takes about 50 secs, or about once every microsecond. Regarding the switch to the coarse timer, I understand that it can return the value at the last kernel tick rather than calculating a better value. Since HZ is 250 on modern kernels, I think that means it can be up to 4ms slow if it's called just before the next tick.
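
    As a quick cross-check of that claim (an illustrative snippet, not part of the patch), clock_getres(2) reports the granularity of each clock; on a HZ=250 kernel the coarse clock typically reports about 4 ms, while the fine clock reports 1 ns:

        /* Print the reported resolution of the fine and coarse clocks. */
        #include <stdio.h>
        #include <time.h>

        static void show(const char *name, clockid_t id)
        {
            struct timespec res;
            if (clock_getres(id, &res) == 0)
                printf("%-24s %ld.%09ld s\n", name, (long)res.tv_sec, res.tv_nsec);
        }

        int main(void)
        {
            show("CLOCK_MONOTONIC", CLOCK_MONOTONIC);
            show("CLOCK_MONOTONIC_COARSE", CLOCK_MONOTONIC_COARSE);
            return 0;
        }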

     
  • Stas Sergeev - 2014-05-12

    How do you attach multiple files at once?

     
    • Andrew Bird - 2014-05-12

      On 12/05/14 17:27, Stas Sergeev wrote:

      How do you attach multiple files at once?

      Browse, but select multiple files using Ctrl or Shift; it's non-intuitive and I found it out by mistake :-)

       
  • Stas Sergeev - 2014-05-12

    I also wonder if we can use coarse time for RTC.

     
  • Andrew Bird - 2014-05-12

    Hi Stas,
    Things are looking better. Here are the test results and the new call graph: clock_gettime() is now called only 12.6 million times on the test run.

    ########## Timer standard MONOTONIC                                             
    ======================================================================          
    FAIL: TestTermTestWcc                                                           
    ----------------------------------------------------------------------          
    CPUEMU                                                                          
           off : FAIL target <= 5.0x, result = 9.9x                                 
          vm86 : FAIL target <= 8.0x, result = 10.8x                                
          full : FAIL target <= 8.0x, result = 10.6x                                
       vm86sim : FAIL target <= 10.0x, result = 14.9x                               
       fullsim : FAIL target <= 10.0x, result = 14.9x                               
    
    ########## Stas timer patches, standard MONOTONIC clock                         
    ======================================================================          
    PASS: TestTermTestWcc                                                           
    ----------------------------------------------------------------------          
    CPUEMU                                                                          
           off :  OK  target <= 5.0x, result = 2.9x                                 
          vm86 :  OK  target <= 8.0x, result = 3.9x                                 
          full :  OK  target <= 8.0x, result = 3.8x                                 
       vm86sim :  OK  target <= 10.0x, result = 7.8x                                
       fullsim :  OK  target <= 10.0x, result = 7.8x                                
    
    ########## Stas timer patches, plus MONOTONIC_COARSE clock patches              
    ======================================================================          
    PASS: TestTermTestWcc                                                           
    ----------------------------------------------------------------------          
    CPUEMU                                                                          
           off :  OK  target <= 5.0x, result = 1.9x                                 
          vm86 :  OK  target <= 8.0x, result = 3.0x                                 
          full :  OK  target <= 8.0x, result = 3.0x                                 
       vm86sim :  OK  target <= 10.0x, result = 7.0x                                
       fullsim :  OK  target <= 10.0x, result = 6.9x                                
    
    ----------------------------------------------------------------------          
    

    I saw this page, which suggests the resolution of the _COARSE variants is 1 ms:
    https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Reference_Guide/sect-Realtime_Tuning_Guide-Timestamping-POSIX_Clocks.html

    Regarding the RTC, is it possible to avoid polling and use setitimer()?
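
    For concreteness, here is a minimal standalone sketch of the setitimer() idea (illustrative only, not dosemu code): a periodic SIGALRM whose handler only sets a flag, with the real work done outside the handler.

        #include <signal.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/time.h>
        #include <unistd.h>

        static volatile sig_atomic_t ticked;

        static void on_alarm(int sig)
        {
            (void)sig;
            ticked = 1;              /* async-signal-safe: just note the tick */
        }

        int main(void)
        {
            struct sigaction sa;
            memset(&sa, 0, sizeof sa);
            sa.sa_handler = on_alarm;
            sigaction(SIGALRM, &sa, NULL);

            struct itimerval it = {
                .it_interval = { .tv_sec = 0, .tv_usec = 10000 },  /* 10 ms period */
                .it_value    = { .tv_sec = 0, .tv_usec = 10000 },
            };
            setitimer(ITIMER_REAL, &it, NULL);

            for (int i = 0; i < 10; i++) {
                pause();             /* wait for the next SIGALRM */
                if (ticked) {
                    ticked = 0;
                    printf("tick %d\n", i);  /* the emulator would run its RTC code here */
                }
            }
            return 0;
        }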

     
  • Stas Sergeev - 2014-05-12

    Things are looking better. Here are the test results and the new call graph:

    Thanks. I'll improve the patch and apply it when I have some free time.

    Regarding the RTC, is it possible to avoid polling and use setitimer()?

    Not sure how difficult it is. We already use SIGALRM for a few things, so we'll need to demultiplex it somehow. Also, the RTC can generate IRQs at a rate of 8192Hz, which gives us a 122µs period; I wonder if sending SIGALRM at such a frequency will even work. So while it is likely possible in theory, the investigation of the practical implementation is fully up to you. :) It is not unlikely that with some short-cuts the practical implementation will work very well, it just needs a bit of thinking.

     
  • Andrew Bird - 2014-05-12

    Hi Stas,
    It seems the code is already there to call rtc_run() on a periodic signal basis; according to the comments it's done at 100Hz (10ms). So I tried it out by commenting out run_rtc() in hardware_run(), and it looks fine on the test run to me. Here are the results, which still use your patches and my COARSE timer. The cpu == off result is getting near to native now! The total number of calls to clock_gettime() is now only 280,000.

    ======================================================================          
    PASS: TestTermTestWcc                                                           
    ----------------------------------------------------------------------          
    CPUEMU                                                                          
           off :  OK  target <= 5.0x, result = 1.5x                                 
          vm86 :  OK  target <= 8.0x, result = 2.4x                                 
          full :  OK  target <= 8.0x, result = 2.4x                                 
       vm86sim :  OK  target <= 10.0x, result = 5.6x                                
       fullsim :  OK  target <= 10.0x, result = 5.7x                                
    
    ----------------------------------------------------------------------          
    

    One thing I considered is that, since both the poll and the signal method were in operation as per git, were they being run from separate threads? If so, would we not need locking to maintain sanity?

     
  • Stas Sergeev - 2014-05-12

    According to the comments it's done at 100Hz (10ms).

    Exactly, and that's not enough.

    It looks fine on the test run to me.

    What kind of test? Surely you haven't tested programming the RTC at 8kHz?

    One thing I considered is that since both the poll and the signal method were in
    operation as per git

    There is no signal method in git. It is simply that SIGALRM is generated periodically and some things (including the RTC) are run from there, but this is the same kind of polling. Since for the RTC this was not enough, it also polls from hardware_run().

    Threading is currently used only for sound.

     
  • Andrew Bird - 2014-05-13

    Hi Stas,
    You mention programming the RTC at 8kHz; have you any examples of DOS programs that do this? What is it used for? I'm not sure anything better than 10ms is possible with setitimer() either. If your 8kHz example is what I suspect, then perhaps using a passthrough to the HPET timer is a possibility. See the docs in the kernel:
    linux/Documentation/timers/hpet.txt
    linux/Documentation/timers/hpet_example.c

    My thought about it being run from another thread was due to this comment in SIGALRM_call().

        /* We update the RTC from here if it has not been defined as a thread */
    

    I assumed it meant hardware_run(), but I now think the comment existed before the use of rtc_run() from hardware_run(), as I don't see it in 1.2.2.

    You may wonder why I'm looking at this. Well, I have been using Dosemu 1.2.2 for several years, and I periodically test the current version in git with a view to moving my complex (source unavailable) app to run on the 1.4.x+ series. I always find 1.4.x unacceptably slower than 1.2.2, so I'm developing a set of benchmarks that test each area so we can home in on the regression.

     
  • Stas Sergeev - 2014-05-13

    You mention programming RTC for 8kHz, have you any examples of DOS
    programs that do this?

    https://sourceforge.net/p/dosemu/bugs/_discuss/thread/b3e00133/dce4/attachment/irq0test.zip
    Maybe this one is.

    then perhaps using a passthrough to HPET timer is a possibility.

    I don't think so, and a similar idea was already tried:
    https://sourceforge.net/p/dosemu/patches/38/
    Instead, I think you can consider the timer_create() or timerfd_create() APIs. They will internally use HPET too, I think.
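
    As an illustration of the timerfd_create() option (a hedged sketch, not the eventual dosemu integration), the timer becomes a file descriptor, so expirations can be picked up from an ordinary poll loop instead of a signal handler:

        #include <poll.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <sys/timerfd.h>
        #include <unistd.h>

        int main(void)
        {
            int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC);
            struct itimerspec its = {
                .it_interval = { .tv_sec = 0, .tv_nsec = 122070 },  /* ~8192 Hz */
                .it_value    = { .tv_sec = 0, .tv_nsec = 122070 },
            };
            timerfd_settime(tfd, 0, &its, NULL);

            for (int i = 0; i < 5; i++) {
                struct pollfd pfd = { .fd = tfd, .events = POLLIN };
                poll(&pfd, 1, -1);

                uint64_t expired;                   /* ticks since the last read */
                read(tfd, &expired, sizeof expired);
                printf("expired %llu time(s)\n", (unsigned long long)expired);
            }
            close(tfd);
            return 0;
        }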

    I assumed it meant hardware_run(), but I now think the comment
    existed before use of rtc_run()

    Maybe, and I have no guesses as to what it means. Perhaps some very early versions of dosemu were trying threading, but this was unlikely to succeed, so I think they were never even released. We (me, Bart, Clarence Dang) once tried really hard to get threading to work in LinuxThreads times, and we had some success, but decided not to release that. With NPTL things were much better, and so the sound code now uses threads. But the aforementioned comment pre-dates even our LinuxThreads compatibility efforts, so I really doubt someone could get threading to work at that time.

    I always find 1.4.x unacceptably slower than 1.2.2, so I'm developing a set of benchmarks that test each area so we can home in on the regression.

    This is very helpful, thanks! Fortunately, most things are easy to fix. For example, I'll split the sound code into fast and time-consuming parts, and leave only the fast path for hardware_run(). Not so easy with rtc... :(

     
  • Stas Sergeev - 2014-05-13

    Hi, I applied a few patches to git. Could you see how this works?

     
  • Stas Sergeev - 2014-05-13
    • assigned_to: Stas Sergeev
     
  • Stas Sergeev - 2014-05-13

    By the way, there may be something wrong with your config. I wonder if reading the clock is always that expensive. What does your
    cat /proc/timer_list
    say? I think at least the lapic should be very fast to read...

     
  • Andrew Bird - 2014-05-13

    Hi Stas,
    There should be nothing special or broken about the machine; it's a standard 32-bit Fedora 19. Here's the timer_list anyway. I'm running the benchmark now.

     
  • Andrew Bird - 2014-05-13

    Hi Stas,
    Here are the benchmark results, which are pretty good and better than just switching to the COARSE timer source.

    ======================================================================
    PASS: TestTermTestWcc
    ----------------------------------------------------------------------
    CPUEMU
           off :  OK  target <= 5.0x, result = 3.1x
          vm86 :  OK  target <= 8.0x, result = 3.5x
          full :  OK  target <= 8.0x, result = 3.4x
       vm86sim :  OK  target <= 10.0x, result = 5.1x
       fullsim :  OK  target <= 10.0x, result = 5.2x
    
     
  • Andrew Bird - 2014-05-13

    Hi Stas,
    Regarding the irq0test.zip: the precompiled .exe doesn't output any log to screen or disk file, so I can't tell whether it's working or not. No errors appeared in boot.log either. I also checked out tag 1.4.0.6 (the earliest that builds and runs on my machine), but I get the same result there too. Does it work for you?

     
  • Stas Sergeev - 2014-05-14

    Here are the benchmark results, which are pretty good and better
    than just switching to the COARSE timer source.

    Hi, thanks. I think we can tackle the problem from another side too. How about the attached patch?

    Regarding the irq0test.zip. The precompiled .exe doesn't output
    any log to screen or disk file

    Just checked: it works and outputs to the screen.

     
  • Andrew Bird - 2014-05-14

    Hi Stas,
    I found the problem with irq0test: it doesn't work with either FreeDOS 1.0 or 1.1, but it does work with MS-DOS 6.22. Not quite sure why that should be. I now get output; how should I interpret the results, i.e. what counts as a pass?

    BTW I've not tested the caching yet, but I'm a little unsure how you will know when to use the cache and when to fetch.

     
  • Stas Sergeev - 2014-05-14

    Hi, the output of irq0test is good if the rightmost column increments by 1. If it increments by more than one, that's a problem. As for why it doesn't work on FreeDOS, you can ask its author; here he is:
    https://sourceforge.net/users/japheth

    Caching is simple: after executing some DOS code or sleeping, fetch; otherwise, use the cache.
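
    To make that rule concrete, a rough sketch of the caching idea might look like the following (hypothetical helper names, not the actual patch): re-read the clock only after running DOS code or sleeping, and hand out the cached value everywhere else.

        #include <stdbool.h>
        #include <time.h>

        static struct timespec cached_time;
        static bool cache_valid;

        /* Call after executing DOS code or returning from a sleep. */
        void time_cache_invalidate(void)
        {
            cache_valid = false;
        }

        /* Cheap replacement for a raw clock_gettime() call. */
        struct timespec cached_gettime(void)
        {
            if (!cache_valid) {
                clock_gettime(CLOCK_MONOTONIC, &cached_time);
                cache_valid = true;
            }
            return cached_time;
        }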

     
  • Andrew Bird - 2014-05-15

    Hi Stas,
    Here are the test results, including some I did for a quick and dirty timer_create() version. As you can see, there's no time difference between the timer_create() version and the caching patch, so I'd choose the caching patch for simplicity. I've attached the timer_create() diff for reference, but I think it might have been a more interesting approach if the value of CLOCK_MONOTONIC had been available in the signal handler. As it is, it's only able to set a flag indicating that rtc_run() should be run the next time through the hardware_run() loop.
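
    For illustration, a minimal standalone sketch of such a flag-setting timer_create() handler could look like this (not the attached diff; apart from the rtc_run_required flag named in the results below, all names are made up):

        /* Build with: gcc -O2 flagtimer.c -o flagtimer -lrt */
        #include <signal.h>
        #include <string.h>
        #include <time.h>

        static volatile sig_atomic_t rtc_run_required;

        static void rtc_tick(int sig)
        {
            (void)sig;
            rtc_run_required = 1;   /* real work happens later, outside the handler */
        }

        /* Arm a periodic timer; period_ns = 100000 gives the 100 us case below. */
        static void setup_rtc_timer(long period_ns)
        {
            struct sigaction sa;
            memset(&sa, 0, sizeof sa);
            sa.sa_handler = rtc_tick;
            sigaction(SIGRTMIN, &sa, NULL);

            struct sigevent sev;
            memset(&sev, 0, sizeof sev);
            sev.sigev_notify = SIGEV_SIGNAL;
            sev.sigev_signo = SIGRTMIN;

            timer_t tid;
            timer_create(CLOCK_MONOTONIC, &sev, &tid);

            struct itimerspec its = {
                .it_interval = { .tv_sec = 0, .tv_nsec = period_ns },
                .it_value    = { .tv_sec = 0, .tv_nsec = period_ns },
            };
            timer_settime(tid, 0, &its, NULL);
        }

        int main(void)
        {
            setup_rtc_timer(100000);            /* 100 us period */
            for (int hits = 0; hits < 5; ) {
                if (rtc_run_required) {         /* polled from the main loop */
                    rtc_run_required = 0;
                    hits++;
                }
            }
            return 0;
        }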

    Thanks again.

    # latest git before ser_demux merge
    ======================================================================
    PASS: TestTermTestWcc
    ----------------------------------------------------------------------
    CPUEMU
           off :  OK  target <= 5.0x, result = 3.9x
          vm86 :  OK  target <= 8.0x, result = 4.2x
          full :  OK  target <= 8.0x, result = 4.2x
       vm86sim :  OK  target <= 10.0x, result = 6.4x
       fullsim :  OK  target <= 10.0x, result = 6.2x
    ----------------------------------------------------------------------
    
    # latest git after ser_demux merge
    ======================================================================
    PASS: TestTermTestWcc
    ----------------------------------------------------------------------
    CPUEMU
           off :  OK  target <= 5.0x, result = 4.0x
          vm86 :  OK  target <= 8.0x, result = 4.2x
          full :  OK  target <= 8.0x, result = 4.2x
       vm86sim :  OK  target <= 10.0x, result = 6.2x
       fullsim :  OK  target <= 10.0x, result = 6.2x
    ----------------------------------------------------------------------
    
    # latest git after ser_demux merge with cache patch
    ======================================================================
    PASS: TestTermTestWcc
    ----------------------------------------------------------------------
    CPUEMU
           off :  OK  target <= 5.0x, result = 2.6x
          vm86 :  OK  target <= 8.0x, result = 2.8x
          full :  OK  target <= 8.0x, result = 2.8x
       vm86sim :  OK  target <= 10.0x, result = 4.6x
       fullsim :  OK  target <= 10.0x, result = 4.6x
    ----------------------------------------------------------------------
    
    # 100 us timer_create() set rtc_run_required flag
    ======================================================================
    PASS: TestTermTestWcc
    ----------------------------------------------------------------------
    CPUEMU
           off :  OK  target <= 5.0x, result = 2.9x
          vm86 :  OK  target <= 8.0x, result = 3.1x
          full :  OK  target <= 8.0x, result = 2.9x
       vm86sim :  OK  target <= 10.0x, result = 5.1x
       fullsim :  OK  target <= 10.0x, result = 5.1x
    ----------------------------------------------------------------------
    
    # 333 us timer_create() set rtc_run_required flag
    ======================================================================
    PASS: TestTermTestWcc
    ----------------------------------------------------------------------
    CPUEMU
           off :  OK  target <= 5.0x, result = 2.7x
          vm86 :  OK  target <= 8.0x, result = 2.9x
          full :  OK  target <= 8.0x, result = 2.9x
       vm86sim :  OK  target <= 10.0x, result = 4.9x
       fullsim :  OK  target <= 10.0x, result = 4.8x
    ----------------------------------------------------------------------
    
    # 1ms timer_create() set rtc_run_required flag
    ======================================================================
    PASS: TestTermTestWcc
    ----------------------------------------------------------------------
    CPUEMU
           off :  OK  target <= 5.0x, result = 2.6x
          vm86 :  OK  target <= 8.0x, result = 2.9x
          full :  OK  target <= 8.0x, result = 2.9x
       vm86sim :  OK  target <= 10.0x, result = 4.7x
       fullsim :  OK  target <= 10.0x, result = 4.7x
    ----------------------------------------------------------------------
    
    # 5ms timer_create() set rtc_run_required flag
    ======================================================================
    PASS: TestTermTestWcc
    ----------------------------------------------------------------------
    CPUEMU
           off :  OK  target <= 5.0x, result = 2.6x
          vm86 :  OK  target <= 8.0x, result = 2.8x
          full :  OK  target <= 8.0x, result = 2.8x
       vm86sim :  OK  target <= 10.0x, result = 4.6x
       fullsim :  OK  target <= 10.0x, result = 4.7x
    ----------------------------------------------------------------------
    
    # 10ms timer_create() set rtc_run_required flag
    ======================================================================
    PASS: TestTermTestWcc
    ----------------------------------------------------------------------
    CPUEMU
           off :  OK  target <= 5.0x, result = 2.6x
          vm86 :  OK  target <= 8.0x, result = 2.6x
          full :  OK  target <= 8.0x, result = 2.6x
       vm86sim :  OK  target <= 10.0x, result = 4.7x
       fullsim :  OK  target <= 10.0x, result = 4.7x
    ----------------------------------------------------------------------
    
     
  • Stas Sergeev - 2014-05-15

    Your patch is not what I meant. This way you won't gain anything; and does the irq testing program work well? What I meant is integrating timer_create() into the RTC code, so that the event handler launches at a precise time (depending on the RTC registers), in which case you won't need to re-read the time inside the handler or to resort to hardware_run(). And it will likely be faster if the handler uses a thread instead of a signal.

    Oh c'mon, don't tell me that ser_demux regressed the performance! :) Really, I don't think it's possible.

     
  • Andrew Bird - 2014-05-15

    Hi Stas,
    Well, I didn't make it clear that the timer_create() tests were done without the benefit of the cache patch, so clearly it did something to achieve comparable results!
    Nope, the before and after ser_demux merge results are almost identical; they are included here only for completeness.
    Yes, I would like the full RTC integration too, but unless I'm doing something really wrong, what you can do in the handler is very limited: for instance, even a printf or puts causes a segfault. I definitely couldn't call clock_gettime() either.

     
  • Stas Sergeev - 2014-05-15

    What you did is basically reduce the polling frequency. Instead, there is a possibility to remove the polling completely; I just don't think it's worth the effort to do such a half-step thing. And very likely the irq test only passes at the 100us frequency, which is extra overhead, so this also has downsides.

    Which handler causes the segfault, SIGEV_SIGNAL or SIGEV_THREAD?

    You are not supposed to call clock_gettime() from the handler if the handler is dispatched at a precise time set by the RTC registers; clock_gettime() is then entirely avoided.
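
    For reference, here is a minimal SIGEV_THREAD sketch (illustrative only, not dosemu code): the expiry callback runs in an ordinary thread rather than a signal handler, so calls like printf() or clock_gettime() are legal there, unlike in a SIGEV_SIGNAL handler, which is restricted to async-signal-safe functions.

        /* Build with: gcc -O2 sigev_thread.c -o sigev_thread -lrt -pthread */
        #include <signal.h>
        #include <stdio.h>
        #include <string.h>
        #include <time.h>
        #include <unistd.h>

        static void on_expiry(union sigval sv)
        {
            (void)sv;
            struct timespec now;
            clock_gettime(CLOCK_MONOTONIC, &now);   /* fine in a thread callback */
            printf("expired at %ld.%09ld\n", (long)now.tv_sec, now.tv_nsec);
        }

        int main(void)
        {
            struct sigevent sev;
            memset(&sev, 0, sizeof sev);
            sev.sigev_notify = SIGEV_THREAD;
            sev.sigev_notify_function = on_expiry;

            timer_t tid;
            timer_create(CLOCK_MONOTONIC, &sev, &tid);

            struct itimerspec its = {
                .it_interval = { .tv_sec = 0, .tv_nsec = 100000000 },  /* 100 ms */
                .it_value    = { .tv_sec = 0, .tv_nsec = 100000000 },
            };
            timer_settime(tid, 0, &its, NULL);

            sleep(1);                 /* let a few expirations fire */
            timer_delete(tid);
            return 0;
        }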

     
