|
From: Jeremy F. <je...@go...> - 2002-11-19 23:14:13
|
On Tue, 2002-11-19 at 11:10, Julian Seward wrote:
> > I got basic block chaining working last night. I got about 25%
> > improvement (which is nice, but I was hoping for more) in the particular
> > benchmark I tried (gcc 3.0.4's cc1 -O2 pass over vg_from_ucode). On the
> > whole, the performance was pretty dismal: the native run took about 4.6
> > seconds; the non-chained-bb nulgrind took 81.2 seconds, and the
> > chained-bb nulgrind took 60 seconds. I haven't looked into it further:
> > I was hoping it would be a largely CPU-bound test, but maybe its
> > actually spending all its time in malloc or something.
>
> Strange it's so slow. I think you should try bzip2; that is almost
> completely compute bound and does something like 7 malloc calls per
> file processed. Here's what I have:
>
> time ./Inst/bin/valgrind --skin=none ~/bzip2-1.0.2/bzip2 -v < ~/wbt00.ps
> > /dev/null
> (valgrind startup msgs deleted)
> (stdin): 13.372:1, 0.598 bits/byte, 92.52% saved, 782064 in, 58487 out.
>
> real 0m7.760s
> user 0m7.670s
> sys 0m0.030s
>
> time ~/bzip2-1.0.2/bzip2 -v < ~/wbt00.ps > /dev/null
> (stdin): 13.372:1, 0.598 bits/byte, 92.52% saved, 782064 in, 58487 out.
>
> real 0m0.738s
> user 0m0.680s
> sys 0m0.050s
>
> So more like a 11 - 12 x slowdown than the 35 x you get.
Well, I found that there is an easy way of getting V to generate
extended basic blocks: simply allow conditional jumps to not end the
basic block (the code was all set up to do it, complete with an
almost-accurate comment). I haven't done an overall measurement of what
the average increase in BB length is, but from looking at the output of
--trace-codegen, they can get quite long. It certainly allows us to see
what the effect of having long register lifetimes is vs.
indiscriminately over-compiling things. The results are far from
conclusive: they seem to vary from a few percent better to a few
percent worse (and they also seem to depend on the skin, though I haven't
really tested that yet).

These are the results I'm seeing:

baseline: bzip2 < TAGS > /dev/null
    time=0.49s

valgrind --skin=none --chain-bb=no --extended-bb=no bzip2 < TAGS > /dev/null
    time=6.15s ratio:12.5

valgrind --skin=none --chain-bb=yes --extended-bb=no bzip2 < TAGS > /dev/null
    time=4.95s ratio:10.1

valgrind --skin=none --chain-bb=no --extended-bb=yes bzip2 < TAGS > /dev/null
    time=6.17s ratio:12.5

valgrind --skin=none --chain-bb=yes --extended-bb=yes bzip2 < TAGS > /dev/null
    time=5.48s ratio:11.1

baseline: /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 -fpreprocessed coregrind/x.i \
    -quiet -dumpbase x.i -O2 -version -o /dev/null
    time=4.44s

valgrind --skin=none --chain-bb=no --extended-bb=no /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 \
    -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
    time=79.88s ratio:17.9

valgrind --skin=none --chain-bb=yes --extended-bb=no /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 \
    -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
    time=59.14s ratio:13.3

valgrind --skin=none --chain-bb=no --extended-bb=yes /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 \
    -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
    time=73.77s ratio:16.6

valgrind --skin=none --chain-bb=yes --extended-bb=yes /usr/lib/gcc-lib/i386-redhat-linux/3.0.4/cc1 \
    -fpreprocessed coregrind/x.i -quiet -dumpbase x.i -O2 -version -o /dev/null
    time=58.16s ratio:13.0
> > My next experiment might be [...]
>
> Let me encourage you to make measurements (direct vs indirect jump counts)
> to gain insight into your current hackery, before embarking on more. I for
> one would like to be assured with numbers that the Right Thing is happening
> and that our assumptions about costs, event frequencies, etc, are justified.
Oh, yes, I quite agree. The results above make me think that doing
naive trace caching probably won't help, but eliminating dispatch-loop
calls probably will.
I've put the patch up in the usual place.
J
|