
#103 optimization

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2012-09-07
Created: 2005-07-19
Creator: Keith
Private: No

$ patch -p0 < opt.diff
patching file src/cpu/core_dyn_x86/risc_x86.h
patching file src/gui/render_scalers.cpp
patching file src/gui/render_templates.h
patching file src/hardware/gus.cpp
patching file src/hardware/mpu401.cpp
patching file src/hardware/ymf262.c

Results vary by area. Some parts are about 2x faster, such as the GUS handling, and some functions see a major speedup (gen_lea is nearly 2x, finddynreg about 20%), but by my measurements the overall gain is only 3-4%. Note, though, that I'm CPU bound, and finddynreg overtakes the normalxxx video routines as the biggest consumer of CPU time once cycles pass roughly 13-16k (the exact point depends on which renderer is used). So people with beefier CPUs that can handle higher cycle settings should actually see more improvement (I know, I know, it's not fair that the guys with beefier systems get the bigger benefit :). There's also some code cleanup, so some things are more readable, simplified, or reduced. Have a look-see.

Discussion

1 2 > >> (Page 1 of 2)
  • Keith

    Keith - 2005-07-19
     
  • Keith

    Keith - 2005-07-20

    Logged In: YES
    user_id=1008467

    There's also a bug fix or two in there, like correcting the
    define order so that finddynreg uses bx.

     
  • Keith

    Keith - 2005-07-21

    Logged In: YES
    user_id=1008467

    There's a bug in the GUS optimizations, introduced at the
    last minute while tidying it up for release. I'll update when
    I figure it out.

     
  • Keith

    Keith - 2005-07-22
     
  • Keith

    Keith - 2005-07-22

    Logged In: YES
    user_id=1008467

    Yep, 'twas a tidying bug.

     
  • Keith

    Keith - 2005-09-13
     
  • Keith

    Keith - 2005-09-13

    Logged In: YES
    user_id=1008467

    Applied to latest CVS.

     
  • Keith

    Keith - 2005-10-02

    Logged In: YES
    user_id=1008467

    I'm getting a big drop in CPU utilization (2.1 GHz 256k Athlon
    XP) with the small changes in fast.diff. Rather than rattle
    off numbers, run it; it should be apparent.

    wd thinks these kinds of changes should be left up to the
    compiler. That puts too much faith in gcc's optimizing
    ability: gcc did not do this tuning, even though
    optimizations were turned on:
    export CFLAGS="-s -O3 -pipe -fomit-frame-pointer -march=athlon-xp"
    export CXXFLAGS="-s -O3 -pipe -fomit-frame-pointer -march=athlon-xp"
    ./configure --enable-core-inline

     
  • Keith

    Keith - 2005-10-02
     
  • c2woody

    c2woody - 2005-10-03

    Logged In: YES
    user_id=1304940

    The statement is from a thread about dynamic core
    optimizations, and I was talking about that part of your patch
    only.

     
  • Keith

    Keith - 2005-10-04

    Logged In: YES
    user_id=1008467

    fast.diff is exactly the type of change you were referring to:
    "negating ifs".
    Comparing the gcc assembly output, the culprit was a single
    conditional jump that I "negated" and gcc had not optimized.
    That jump and an inconsequential unconditional jump are the
    only differences between the generated code. I would venture
    to guess it had a high misprediction rate.
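
To make the "negating ifs" idea concrete, here is a minimal C++ sketch. It is not the code from fast.diff; the function names `sum_rare_first` and `sum_common_first` and the `LIKELY` macro are hypothetical, chosen only for illustration. It shows the same logic written with the rare case tested first and then with the test negated so the common case comes first, plus GCC's `__builtin_expect` hint. Both functions return the same result; only the branch layout the compiler emits may differ.

```cpp
#include <cstdint>
#include <vector>

// GCC/Clang builtin that tells the compiler which outcome to expect.
// On other compilers this (hypothetical) macro degrades to a plain test.
#if defined(__GNUC__)
#define LIKELY(x) __builtin_expect(!!(x), 1)
#else
#define LIKELY(x) (x)
#endif

// Rare case tested first: the common path sits in the else branch.
std::uint32_t sum_rare_first(const std::vector<std::uint32_t>& v) {
    std::uint32_t total = 0;
    for (std::uint32_t x : v) {
        if (x == 0xFFFFFFFFu) {   // rare: sentinel resets the sum
            total = 0;
        } else {                  // common
            total += x;
        }
    }
    return total;
}

// "Negated" test: the common path comes first and is marked as expected.
std::uint32_t sum_common_first(const std::vector<std::uint32_t>& v) {
    std::uint32_t total = 0;
    for (std::uint32_t x : v) {
        if (LIKELY(x != 0xFFFFFFFFu)) {  // common
            total += x;
        } else {                         // rare: sentinel resets the sum
            total = 0;
        }
    }
    return total;
}
```

Whether the two forms actually compile to different code depends on the compiler version and flags, so comparing the assembly output (as done in this thread) is the only reliable check.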

     
  • c2woody

    c2woody - 2005-10-05

    Logged In: YES
    user_id=1304940

    Again, fast.diff is NOT what I was talking about.
    "Negating ifs" in the dynamic core generator functions will of
    course be faster if the code path is optimized by this. But
    the generator functions are called only rarely compared to the
    blocks they generate, so it doesn't have any effect. In the
    extreme case one would even make the generator functions
    much slower if that makes the generated code faster.

    I'm certain that optimizing mispredicted conditional jumps as
    you did in fast.diff can have a great impact on speed.
    By the way, did you compare the code (fast.diff) and speed
    against a profile-optimized build? It might give some more
    insight.

     
  • Keith

    Keith - 2005-10-05

    Logged In: YES
    user_id=1008467

    Compared to which optimizations? I interpret "profile-optimized" as optimizations formulated with help from profiling. Or is it a specific build, or set of builds, that you are referring to? Or are you asking whether I profiled?

     
  • Keith

    Keith - 2005-10-06

    Logged In: YES
    user_id=1008467

    Regarding generator functions, I disagree, or more specifically, every gprof result I've seen does: it's the opposite, they use the most time. With a low cycle setting, the top consumers are the rendering routines, followed by memory and generator functions. When cycles are raised, generator and memory functions are the biggest consumers of CPU time. FindDynReg is the top entry of the whole runtime for wolf3d set at 26k cycles (26k isn't even that high), using 8.06%. The time used by the generated block, aka gen_runcode, is 41st on the list, using .55% CPU time.

     
  • c2woody

    c2woody - 2005-10-06

    Logged In: YES
    user_id=1304940

    > Compared to which optimizations?

    I was thinking about something like -fprofile-arcs and
    -fbranch-probabilities.

    > The time the block that's generated uses, aka
    > gen_runcode, 41st on the
    > list using .55% cpu time.

    Do you know if this includes the generated code, or if it
    is just plain gen_runcode (a few push/pop pairs and a jump)?
    Sounds like the latter.

    Anyways FindDynreg is called quite often as you said, so it
    might be worth playing around with the dyncore cache (got
    a lot less FindDynreg-calls in Quake when raising the cache
    size and number of cache blocks).

     
  • Keith

    Keith - 2005-10-07

    Logged In: YES
    user_id=1008467

    For gen_runcode not to include that run time would be an exception to how gprof operates. gprof sees that gen_runcode gets called, does its thing, then returns. With no subroutines, gprof won't be able to report in detail which sections of gen_runcode take how much time, but it will note the total time it waited for gen_runcode to return after being called. If you look at gen_runcode, its job is to do a little prep and jump to the code block; the code block runs and returns to gen_runcode, which does a little cleanup and returns.

     
  • Keith

    Keith - 2005-10-07

    Logged In: YES
    user_id=1008467

    It might help to remember that wolf3d ran on 386s, and this is .55% of 2.1 GHz on my machine. If anything, that's high, i.e. the generated code is inefficient.

     
  • c2woody

    c2woody - 2005-10-07

    Logged In: YES
    user_id=1304940

    Simple example: a tight loop that just contains a dec,
    a compare and a conditional jump (preceded by a cli).
    With cycles not too low (20000) this should call gen_runcode
    only a few times and stay in the generated code block most
    of the time.
    gprof says that gen_runcode is called only a few times
    (correct), but attributes nearly 0% of the time to it (this
    cannot be true). So gprof does not measure the generated code,
    only the code up to the jump.
    Maybe there's a switch to include the generated code in
    the time of gen_runcode, or some other possibility. Otherwise
    it is not useful for profiling the dynamic core.
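
One way to picture the disagreement here: a flat profile of the gprof kind is built from program-counter samples that are bucketed by the address ranges of functions known to the profiler. The sketch below is a toy model of that bucketing, not gprof's actual implementation; the `Symbol` table, the addresses, and `attribute_samples` are all made up for illustration. In this model, a sample taken while execution is inside a runtime-generated block falls outside every known range, so it is not charged to `gen_runcode`.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical symbol-table entry: a function and its address range.
struct Symbol {
    std::string name;
    std::uintptr_t start;  // first address of the function
    std::uintptr_t end;    // one past the last address
};

// Count how many program-counter samples land in each known function.
// Samples that match no range are collected under "<unattributed>".
std::map<std::string, int> attribute_samples(
        const std::vector<Symbol>& symbols,
        const std::vector<std::uintptr_t>& pc_samples) {
    std::map<std::string, int> hits;
    for (std::uintptr_t pc : pc_samples) {
        std::string owner = "<unattributed>";
        for (const Symbol& s : symbols) {
            if (pc >= s.start && pc < s.end) {
                owner = s.name;
                break;
            }
        }
        ++hits[owner];
    }
    return hits;
}
```

If the symbol table holds only statically compiled functions and most samples come from a heap-allocated code buffer, nearly all of them end up unattributed, which would be consistent with a near-zero figure for `gen_runcode` even when the generated block dominates the run.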

     
  • Keith

    Keith - 2005-10-08

    Logged In: YES
    user_id=1008467

    Actually it can; for instance, if you haven't profiled long enough.

    Post your test program/source on the "need some testers" vogons thread in the DOSBox development section. Looks like you're trying to get dosbox to live inside the code block.

     
  • c2woody

    c2woody - 2005-10-08

    Logged In: YES
    user_id=1304940

    > Looks like you're trying to get dosbox to live inside the code
    > block.

    Yes, as in this case the dyncore block is only left when the
    cycles run out (so gen_runcode uses most of the processing
    power, but is not called too often).

    The program was profiled with varying runtimes.

     
  • Keith

    Keith - 2005-10-08

    Logged In: YES
    user_id=1008467

    I suspected dosbox doesn't generate code in such a way that it could live in the code block, so I made a small loop of my own, an infinite loop, and put log_msgs before and after gen_runcode. If it could live in the code block, only the entering log_msg would show when running the test program. The dosbox session itself gets stuck in the infinite loop, but enter/exit messages are still generated continually, showing that dosbox keeps processing outside of the dyn code, which explains why your test code is still a small percentage of the runtime.

    The test is in Pascal.

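    uses dos;
    begin
      writeln('hello');
      asm
        { xor clears cx and sets ZF; mov cx,0 would leave ZF undefined }
        xor cx,cx
      @@myloop:
        { cli does not touch ZF, so the jz is always taken: infinite loop }
        cli
        jz @@myloop
      end;
      { never reached }
      writeln('world');
    end.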

     
  • c2woody

    c2woody - 2005-10-08

    Logged In: YES
    user_id=1304940

    > If it could live in the code block, only the entering log_msg
    > would show upon running the test program.

    No, the block is linked to itself but has an exit condition
    that checks the cycles. So if you have the cycles at normal
    values (like 20000 or so) the code block will be executed
    for some time until the cycle value runs out, then exit the
    block, and enter it again using gen_runcode.

    It's quite easy to see it if you set a breakpoint at gen_runcode
    and debug the generated code.

    > which explains why your test code is still a small
    > percentage of the runtime.

    Raise the cycles to some very high value and let it run
    for a very long time, and you'll still get almost-zero values
    for gen_runcode.

     
  • Keith

    Keith - 2005-10-08

    Logged In: YES
    user_id=1008467

    Quick question, how are you identifying the address of gen_runcode for setting the breakpoint?

     
  • c2woody

    c2woody - 2005-10-09

    Logged In: YES
    user_id=1304940

    I was using MSVC for this, but it should work with gdb plus
    some frontend (ddd) as well.

    In pseudocode, the generated code for the loop looks like this:

    @start_of_block:
    cmp CPU_Cycles,0
    if equal: exit block
    dec eax
    cmp eax,0
    if equal: advance eip to next instruction, exit block
    else decrease cycles, jump to start_of_block
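
The pseudocode above can be modelled in ordinary C++ to see how often the dispatcher has to re-enter the block. This is a toy simulation under stated assumptions, not DOSBox code: `RunStats`, `run_block`, `simulate`, and the per-entry refill of the cycle budget are hypothetical stand-ins for gen_runcode and the real cycle accounting.

```cpp
#include <cstdint>

struct RunStats {
    std::uint64_t block_entries;    // times the dispatcher entered the block
    std::uint64_t loop_iterations;  // guest loop iterations executed
};

// Models one entry into the generated block. Returns true when the guest
// loop finished (eax reached 0), false when the cycle budget ran out.
bool run_block(std::uint32_t& eax, std::int32_t& cycles,
               std::uint64_t& iterations) {
    for (;;) {
        if (cycles <= 0) return false;  // cmp CPU_Cycles,0 -> exit block
        --eax;                          // dec eax
        ++iterations;
        if (eax == 0) return true;      // loop done: advance eip, exit block
        --cycles;                       // decrease cycles, jump to start
    }
}

// Assumes start_eax >= 1 and cycles_per_slice >= 1; otherwise the model
// would wrap around or never make progress.
RunStats simulate(std::uint32_t start_eax, std::int32_t cycles_per_slice) {
    RunStats stats{0, 0};
    std::uint32_t eax = start_eax;
    bool done = false;
    while (!done) {
        std::int32_t cycles = cycles_per_slice;  // assumed refill per entry
        ++stats.block_entries;
        done = run_block(eax, cycles, stats.loop_iterations);
    }
    return stats;
}
```

In this model the block is entered once per cycle budget, so entries stay rare compared to loop iterations, and both an "enter" and an "exit" event still occur regularly even though nearly all the work happens inside the block.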

     
  • Keith

    Keith - 2005-10-09

    Logged In: YES
    user_id=1008467

    You're just randomly searching for the code sequence?

     