$ patch -p0 < opt.diff
patching file src/cpu/core_dyn_x86/risc_x86.h
patching file src/gui/render_scalers.cpp
patching file src/gui/render_templates.h
patching file src/hardware/gus.cpp
patching file src/hardware/mpu401.cpp
patching file src/hardware/ymf262.c
It depends: some things are 2x faster, like GUS handling, and there are major speedups for some functions (gen_lea is nearly 2x, FindDynReg is 20%), but overall it's just a 3-4% speed gain by my measurements. Note, though, that I'm CPU bound, and FindDynReg overtakes the normalxxx video routines as the biggest consumer once cycles pass 13-16k or so (it varies depending on which renderer is used), so guys with beefier CPUs that can handle higher cycle settings should actually see more improvement (I know, I know, not fair that the guys with beefier systems get a bigger benefit :). There's also some code cleanup, so some things are more readable, simplified, or reduced. Have a look-see.
Logged In: YES
user_id=1008467
There's also a bug fix or two in there, like correcting the
define order so that FindDynReg uses bx.
Logged In: YES
user_id=1008467
There's a bug in the GUS optimizations, introduced at the
last minute while tidying it up for release. I'll update when
I figure it out.
Logged In: YES
user_id=1008467
Yep, 'twas a tidying bug.
Logged In: YES
user_id=1008467
Applied to latest CVS.
Logged In: YES
user_id=1008467
I'm getting a big drop in CPU utilization (2.1GHz 256k Athlon
XP) with the small changes in fast.diff. Rather than rattle
off numbers, run it; it should be apparent.
wd thinks these kinds of changes should be left up to the
compiler.
wd puts too much faith in gcc's optimizing ability: it
didn't do this tuning, as was assumed, even with
optimizations turned on:
export CFLAGS="-s -O3 -pipe -fomit-frame-pointer -march=athlon-xp"
export CXXFLAGS="-s -O3 -pipe -fomit-frame-pointer -march=athlon-xp"
./configure --enable-core-inline
Logged In: YES
user_id=1304940
The statement is from a thread about dynamic core
optimizations, and I was talking about that part of your patch
only.
Logged In: YES
user_id=1008467
fast.diff is exactly the type of change you were referring to:
"negating ifs".
Comparing the gcc assembly output, the culprit was one
conditional jump that was "negated" and that gcc didn't
optimize. That jump and an inconsequential unconditional jump
are the only differences in the generated code. I would
venture to guess it had a high misprediction rate.
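
A minimal sketch of what "negating an if" means here. It is not the code from fast.diff: the function names and the cycle-counting scenario are hypothetical, and `__builtin_expect` is a GCC/Clang builtin used only to state which outcome is assumed to be common. Both variants compute the same result; only the shape of the branch differs.

```cpp
#include <cassert>

// Hypothetical variant 1: the rare case (no cycles left) is tested first,
// so the common case lives in the else branch.
int step_rare_first(int cycles_left, int cost) {
    if (cycles_left <= 0) {
        return 0;                      // rare: nothing left to run
    } else {
        return cycles_left - cost;     // common: keep running
    }
}

// Hypothetical variant 2: the condition is negated so the common case is
// the fall-through path. __builtin_expect tells GCC/Clang that the
// condition is assumed to be true most of the time.
int step_common_first(int cycles_left, int cost) {
    if (__builtin_expect(cycles_left > 0, 1)) {
        return cycles_left - cost;     // common: keep running
    }
    return 0;                          // rare: nothing left to run
}
```

Whether the two variants compile to different jumps depends on the gcc version and flags, so the only reliable check is the one described above: compare the assembly output (for example from `g++ -S`) of both versions.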
Logged In: YES
user_id=1304940
Again, fast.diff is NOT what I was talking about.
"Negating ifs" in the dynamic core generator functions will of
course be faster if the code path is optimized by this. But
the generator functions are called only rarely compared to the
blocks that they generate, so it doesn't have any effect. In the
extreme case one would even make the generator functions
much slower, if that makes the generated code faster.
I'm certain that optimizing mispredicted conditional jumps the
way you did in fast.diff can have a great impact on speed.
By the way, did you compare the code (fast.diff) and speed against
a profile-optimized build? That might give some more insight.
Logged In: YES
user_id=1008467
Compared to which optimizations? I interpret "profile-optimized" as optimizations formulated with help from profiling. Or is it a specific build, or set of builds, that you are referring to? Or are you asking if I profiled?
Logged In: YES
user_id=1008467
Regarding generator functions, I disagree, or more specifically, the gprof results I've seen do: it's the opposite, they use the most time. With a low cycle setting, the top consumers are the rendering routines, followed by memory and generator functions. When cycles are raised, generator and memory functions are the biggest consumers of CPU time. FindDynReg is the single biggest consumer of the whole runtime for wolf3d set at 26k cycles (26k isn't even that high), using 8.06%. The time used by the block that's generated, aka gen_runcode, is 41st on the list, using .55% of CPU time.
Logged In: YES
user_id=1304940
I was thinking about something like -fprofile-arcs and
-fbranch-probabilities.
Do you know if this includes the generated code, or if it
is just plain gen_runcode (a few push/pop pairs and a jump)?
Sounds like the latter.
Anyway, FindDynReg is called quite often, as you said, so it
might be worth playing around with the dyncore cache (I got
a lot fewer FindDynReg calls in Quake when raising the cache
size and number of cache blocks).
Logged In: YES
user_id=1008467
For gen_runcode not to include that run time would be an exception to how gprof operates. gprof sees that gen_runcode gets called, does its thing, then returns. With no subroutines, gprof won't be able to report in detail which sections of gen_runcode take what amount of time, but it will note the total time it waited for gen_runcode to return after being called. If you look at gen_runcode, its thing is to do a little prep and jump to the code block; the code block runs and returns back to gen_runcode; gen_runcode does a little cleanup and returns.
Logged In: YES
user_id=1008467
It might help to remember that wolf3d ran on 386s, and it's .55% of 2.1GHz on my machine. If anything, that's high, i.e. the generated code is inefficient.
Logged In: YES
user_id=1304940
Simple example: a tight loop that just contains a dec,
a compare and a conditional jump (preceded by a cli).
With cycles not too low (20000), this should call gen_runcode
only a few times and stay in the generated code block most
of the time.
gprof says that gen_runcode is called only a few times
(correct), but attributes nearly 0% of the time to it (this
cannot be true). So gprof does not measure the generated
code, only the code up to the jump.
Maybe there's a switch to include the generated code in
the time of gen_runcode, or some other possibility. Otherwise
it is not useful for profiling the dynamic core.
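
A toy model of the effect described above, assuming the profiler builds its flat profile by sampling the program counter and crediting each sample to the function whose address range contains it. All addresses, the `Range` struct and `flat_profile` are made up for illustration; this is not gprof's actual implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical symbol table entry: a function and its address range [lo, hi).
struct Range {
    std::uintptr_t lo;
    std::uintptr_t hi;
    std::string name;
};

// Credit each sampled program counter to the function that contains it.
// Samples that fall outside every known range (for example inside a
// runtime-generated code cache) are credited to nobody.
std::map<std::string, int> flat_profile(const std::vector<Range>& symbols,
                                        const std::vector<std::uintptr_t>& samples) {
    std::map<std::string, int> hits;
    for (std::uintptr_t pc : samples) {
        for (const Range& r : symbols) {
            if (pc >= r.lo && pc < r.hi) {
                ++hits[r.name];
                break;
            }
        }
    }
    return hits;
}
```

Under this assumption, a program that spends most of its samples at addresses inside the generated block would still show gen_runcode with almost no time, because only the few samples inside gen_runcode's own address range count towards it.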
Logged In: YES
user_id=1008467
Actually it can, for instance if you haven't profiled long enough.
Post your test program/source in the "need some testers" VOGONS thread in the DOSBox development section. It looks like you're trying to get DOSBox to live inside the code block.
Logged In: YES
user_id=1304940
Yes, as in this case the dyncore block is only left when the
cycles run out (so gen_runcode uses most of the processing
power, but is not called very often).
The program was profiled with varying runtimes.
Logged In: YES
user_id=1008467
I suspected DOSBox doesn't generate code such that it could live in the code block, so I made a small loop of my own, an infinite loop, and put log_msg calls before and after gen_runcode. If it could live in the code block, only the entering log_msg would show up when running the test program. The DOSBox session itself gets stuck in the infinite loop, but enter/exit messages are still continually generated, showing that DOSBox keeps processing outside of the dyn code, which explains why your test code is still a small percentage of the runtime.
The test is in Pascal:
uses dos;
begin
  writeln('hello');
  asm
    mov cx,0
  @@myloop:
    cli
    { mov does not change the flags, so jz depends on ZF from earlier code }
    jz @@myloop
  end;
  writeln('world');
end.
Logged In: YES
user_id=1304940
No, the block is linked to itself but has an exit condition
that checks the cycles. So if you have the cycles at normal
values (like 20000 or so) the code block will be executed
for some time until the cycle value runs out, then exit the
block, and enter it again using gen_runcode.
It's quite easy to see if you set a breakpoint at gen_runcode
and debug the generated code.
Raise the cycles to some very high value and let it run
for a very long time, and you'll still get almost-zero values
for gen_runcode.
Logged In: YES
user_id=1008467
Quick question, how are you identifying the address of gen_runcode for setting the breakpoint?
Logged In: YES
user_id=1304940
I was using MSVC for this, but it should work with gdb plus
some frontend (ddd) as well.
In pseudocode, the generated code for the loop looks like this:
@start_of_block:
    cmp CPU_Cycles,0
    if equal: exit block
    dec eax
    cmp eax,0
    if equal: advance eip to next instruction, exit block
    else: decrease cycles, jump to start_of_block
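
The pseudocode can be modelled in a few lines of C++ to see how rarely the block is re-entered. This is only a sketch under assumptions: `simulate` and `RunStats` are hypothetical names, one loop iteration is assumed to cost exactly one cycle, and the cycle counter is assumed to be refilled to a fixed value each time the block is entered.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counters: how often the block was entered (one entry stands
// for one gen_runcode call) and how many loop iterations ran inside it.
struct RunStats {
    std::uint64_t entries;
    std::uint64_t iterations;
};

// Model of the self-linked block: it keeps looping until either the guest
// counter (eax) reaches zero or the cycle budget for this entry runs out.
RunStats simulate(std::uint32_t eax, std::int32_t cycles_per_entry) {
    RunStats s{0, 0};
    if (cycles_per_entry <= 0) return s;           // avoid spinning forever
    bool done = false;
    while (!done) {
        std::int32_t cpu_cycles = cycles_per_entry; // refilled on entry (assumption)
        ++s.entries;                                // enter the block
        for (;;) {
            if (cpu_cycles == 0) break;             // cmp CPU_Cycles,0: exit block
            --eax;                                  // dec eax
            ++s.iterations;
            if (eax == 0) { done = true; break; }   // loop finished: exit block
            --cpu_cycles;                           // decrease cycles, jump to start
        }
    }
    return s;
}
```

With these assumptions, `simulate(100000, 20000)` enters the block 5 times for 100000 iterations, which matches the description of gen_runcode being called only a few times while nearly all the work happens inside the generated block.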
Logged In: YES
user_id=1008467
You're just randomly searching for the code sequence?