DOSBox / Patches / #103 optimization

Keith - 2005-07-19

opt.diff

Keith - 2005-07-20

Logged In: YES
user_id=1008467

There's also a bug fix or two in there, such as correcting the
define order so that finddynreg uses bx.

Keith - 2005-07-21

Logged In: YES
user_id=1008467

There's a bug in the gus optimizations, introduced at the
last minute while tidying it up for release. I'll update when
I figure it out.

Keith - 2005-07-22

optfix.diff

Keith - 2005-07-22

Logged In: YES
user_id=1008467

Yep, 'twas a tidying bug.

Keith - 2005-09-13

opt3.diff

Keith - 2005-09-13

Logged In: YES
user_id=1008467

Applied to the latest CVS.

Keith - 2005-10-02

Logged In: YES
user_id=1008467

I'm getting a big drop in CPU utilization (2.1 GHz, 256k Athlon
XP) with the small changes in fast.diff. Rather than rattle
off numbers, run it; the difference should be apparent.

wd thinks these kinds of changes should be left up to the
compiler.
wd puts too much faith in gcc's optimizing ability: it didn't
do the tuning that was assumed, even with optimizations
turned on:
export CFLAGS="-s -O3 -pipe -fomit-frame-pointer -march=athlon-xp"
export CXXFLAGS="-s -O3 -pipe -fomit-frame-pointer -march=athlon-xp"
./configure --enable-core-inline

Keith - 2005-10-02

fast.diff

c2woody - 2005-10-03

Logged In: YES
user_id=1304940

The statement is from a thread about dynamic core
optimizations, and I was talking only about that part of your
patch.

Keith - 2005-10-04

Logged In: YES
user_id=1008467

fast.diff is exactly the type of change you were referring to:
"negating ifs".
Comparing the gcc assembly output, the culprit was one
conditional jump that was "negated", which gcc didn't optimize.
That and an inconsequential unconditional jump are the only
differences between the generated code. I would venture to
guess it had a high misprediction rate.

c2woody - 2005-10-05

Logged In: YES
user_id=1304940

Again, fast.diff is NOT what I was talking about.
"Negating ifs" in the dynamic core generator functions will of
course be faster if the code path is optimized by this. But
the generator functions are called only rarely compared to the
blocks that they generate, so it doesn't have any effect. In the
extreme case one would even make the generator functions
much slower, if that makes the generated code faster.

I'm certain that optimizing mispredicted conditional jumps as
you did in fast.diff can have a great impact on speed.
Btw. did you compare the code (fast.diff) and speed against
a profile-optimized build? Might give some more insight.

Keith - 2005-10-05

Logged In: YES
user_id=1008467

Compared to which optimizations? I interpret "profile-optimized" as optimizations formulated with help from profiling. Or is it a specific build, or set of builds, that you are referring to? Or are you asking if I profiled?

Keith - 2005-10-06

Logged In: YES
user_id=1008467

Regarding generator functions, I disagree, or more specifically, the gprof results I've seen do: it's the opposite, they use the most time. With a low cycle setting, the top consumers are the rendering routines, followed by memory and generator functions. When cycles are raised, generator and memory functions are the biggest consumers of CPU time. FindDynReg is the dominant cost over the whole runtime for wolf3d set at 26k cycles (26k isn't even that high), using 8.06%. The time the generated block uses, aka gen_runcode, is 41st on the list, using 0.55% of CPU time.

c2woody - 2005-10-06

Logged In: YES
user_id=1304940

"Compared to which optimizations?"

I was thinking about something like -fprofile-arcs and
-fbranch-probabilities.

"The time the block that's generated uses, aka
gen_runcode, 41st on the
list using .55% cpu time."

Do you know if this includes the generated code, or if it
is just plain gen_runcode (a few push/pop pairs and a jump)?
Sounds like the latter.

Anyways FindDynreg is called quite often as you said, so it
might be worth playing around with the dyncore cache (got
a lot less FindDynreg-calls in Quake when raising the cache
size and number of cache blocks).
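
The effect of cache size on how often the generator has to run can be sketched with a toy model. This is only an illustration: the least-recently-used eviction policy, the function name `count_translations`, and the trace of eight block addresses are all assumptions made for the example, not taken from the DOSBox source.

```python
from collections import OrderedDict

def count_translations(block_trace, cache_blocks):
    """Count how often a block must be (re)generated for a given cache size.

    Hypothetical model: an LRU cache of translated blocks keyed by guest address.
    """
    cache = OrderedDict()  # guest address -> translated block (placeholder object)
    translations = 0
    for addr in block_trace:
        if addr in cache:
            cache.move_to_end(addr)    # hit: reuse the already generated block
            continue
        translations += 1              # miss: the generator functions run again
        cache[addr] = object()
        if len(cache) > cache_blocks:
            cache.popitem(last=False)  # evict the least recently used block
    return translations

# Hypothetical trace: a guest program cycling through 8 distinct code blocks.
trace = [addr for _ in range(1000) for addr in range(8)]
small = count_translations(trace, cache_blocks=4)
large = count_translations(trace, cache_blocks=8)
print(small, large)
```

In this model a cache that is slightly too small for the working set regenerates a block on every access, while one that fits the working set generates each block only once, which is consistent with seeing far fewer generator calls after raising the cache size.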

Keith - 2005-10-07

Logged In: YES
user_id=1008467

For gen_runcode not to include that run time would be an exception to how gprof operates. gprof sees that gen_runcode gets called, does its thing, then returns. With no subroutines, gprof won't be able to report in detail which sections of gen_runcode take how much time, but it will note the total time it waited for gen_runcode to return after being called. If you look at gen_runcode, its thing is to do a little prep and jump to the code block; the code block runs and returns to gen_runcode, which does a little cleanup and returns.

Keith - 2005-10-07

Logged In: YES
user_id=1008467

It might help to remember that wolf3d ran on 386s, and it's 0.55% of 2.1 GHz on my machine. If anything, that's high, i.e. the generated code is inefficient.

c2woody - 2005-10-07

Logged In: YES
user_id=1304940

Simple example: a tight loop that contains just a dec,
a compare and a conditional jump (preceded by a cli).
With cycles not too low (20000) this should call gen_runcode
only a few times and stay in the generated code block most
of the time.
gprof says that gen_runcode is called only a few times
(correct) but attributes nearly 0% of the time to it (this
cannot be true). So gprof does not measure the generated
code, only the code up to the jump.
Maybe there's a switch to include the generated code in
the time of gen_runcode, or some other possibility. Otherwise
it is not useful for profiling the dynamic core.
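
One way this can happen is sketched below, assuming the flat profile is built from periodic program-counter samples that are matched against the address ranges of known functions. The addresses, the sample counts, and the helper name `flat_profile` are made up for the illustration; the point is only that samples taken while the CPU is inside dynamically generated code fall outside every known symbol and so are credited to nobody.

```python
import bisect

def flat_profile(symbols, samples):
    """Attribute program-counter samples to functions by address range.

    symbols: list of (start, end, name) with non-overlapping [start, end) ranges.
    samples: list of sampled program-counter values.
    Returns (per-function sample counts, number of unattributed samples).
    """
    symbols = sorted(symbols)
    starts = [start for start, _, _ in symbols]
    counts = {name: 0 for _, _, name in symbols}
    unattributed = 0
    for pc in samples:
        i = bisect.bisect_right(starts, pc) - 1  # last symbol starting at or before pc
        if i >= 0 and pc < symbols[i][1]:
            counts[symbols[i][2]] += 1
        else:
            unattributed += 1                    # pc is not inside any known function
    return counts, unattributed

# Hypothetical layout: two functions in the executable, generated code on the heap.
symbols = [(0x1000, 0x1040, "gen_runcode"), (0x1040, 0x1400, "FindDynReg")]
samples = [0x1010] * 2 + [0x1100] * 8 + [0x900000] * 90
counts, lost = flat_profile(symbols, samples)
print(counts, lost)
```

Under these assumptions gen_runcode gets credit only for the few samples taken in its own prologue and epilogue, even though the program spent most of its time in the block it jumped to.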

Keith - 2005-10-08

Logged In: YES
user_id=1008467

Actually it can; for instance, if you haven't profiled long enough.

Post your test program/source on the "need some testers" Vogons thread in the DOSBox development section. Looks like you're trying to get dosbox to live inside the code block.

c2woody - 2005-10-08

Logged In: YES
user_id=1304940

"Looks like you're trying to get dosbox to live inside the code
block."

Yes, as in this case the dyncore block is only left when the
cycles run out (so gen_runcode uses most of the processing
power, but is not called very often).

The program was profiled with varying runtimes.

Keith - 2005-10-08

Logged In: YES
user_id=1008467

I suspected dosbox doesn't generate code such that it could live in the code block, so I made a small loop of my own, an infinite loop, and put log_msgs before and after gen_runcode. If it could live in the code block, only the entering log_msg would show when running the test program. The dosbox session itself gets stuck in the infinite loop, but enter/exit messages are still generated continually, showing that dosbox keeps processing outside of the dyn code, which explains why your test code is still a small percentage of the runtime.

The test is in Pascal.

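uses dos;
begin
  writeln('hello');
  asm
    xor cx,cx        { clears cx and sets ZF; a plain mov leaves the flags untouched }
  @@myloop:
    cli              { cli does not change ZF }
    jz @@myloop      { ZF is still set, so this always jumps: infinite loop }
  end;
  writeln('world');  { never reached }
end.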

c2woody - 2005-10-08

Logged In: YES
user_id=1304940

"If it could live in the code block, only the entering log_msg
would show upon running the test program."

No, the block is linked to itself but has an exit condition
that checks the cycles. So if you have the cycles at normal
values (like 20000 or so) the code block will be executed
for some time until the cycle value runs out, then exit the
block, and enter it again using gen_runcode.

It's quite easy to see it if you set a breakpoint at gen_runcode
and debug the generated code.

"which explains why your test code is still a small
percentage of the runtime."

Raise the cycles to some very high value and let it run
for a very long time, and you'll still get almost-zero values
for gen_runcode.

Keith - 2005-10-08

Logged In: YES
user_id=1008467

Quick question, how are you identifying the address of gen_runcode for setting the breakpoint?

c2woody - 2005-10-09

Logged In: YES
user_id=1304940

I was using msvc for this, but it should work with gdb+some
frontend (ddd) as well.

In pseudocode, the generated code for the loop looks like this:

@start_of_block:
cmp CPU_Cycles,0
if equal: exit block
dec eax
cmp eax,0
if equal: advance eip to next instruction, exit block
else decrease cycles, jump to start_of_block
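
The pseudocode above can be turned into a small simulation to see how rarely the block is entered. This is a sketch under simplifying assumptions: one cycle is charged per loop pass, `run_block` stands in for the generated code, and `emulate` stands in for the caller that re-enters it through gen_runcode; both names are made up for the example.

```python
def run_block(eax, cycles):
    """One entry into the self-linked block; mirrors the pseudocode step by step.

    Returns (eax, iterations, reason), where reason says why the block was left.
    """
    iterations = 0
    while True:
        if cycles == 0:                  # cmp CPU_Cycles,0 / if equal: exit block
            return eax, iterations, "cycles"
        eax = (eax - 1) & 0xFFFFFFFF     # dec eax (32-bit wraparound)
        iterations += 1
        if eax == 0:                     # loop finished: exit block
            return eax, iterations, "loop done"
        cycles -= 1                      # decrease cycles, jump to start_of_block

def emulate(eax, cycles_per_slice):
    """Re-enter the block until the guest loop finishes; count the entries."""
    entries = 0
    total_iterations = 0
    while True:
        entries += 1                     # one call through the block runner
        eax, n, reason = run_block(eax, cycles_per_slice)
        total_iterations += n
        if reason == "loop done":
            return entries, total_iterations

print(emulate(100000, 20000))
```

With these numbers the block is entered 5 times for 100000 loop passes, so nearly all of the work happens inside the generated code rather than in the function that jumps to it.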

Keith - 2005-10-09

Logged In: YES
user_id=1008467

You're just randomly searching for the code sequence?
