From: James B. <ja...@ex...> - 2000-04-28 06:44:40
I had a slightly weird experience this afternoon. I was running some
benchmarks and found that results changed by about 5% when I built with
or without the ycrcb_to_rgb32.o module, even though it wasn't being
called. I took a look at module sizes and found that we're at about 16k
of text: we're blowing the code cache, and when code moves around (like
when you remove a module) different functions are caching against each
other and changing performance in surprising ways. The code should get
smaller as we optimize it, though, so this effect will go away. We
should be safe if the decode loop fits in 12k.

-- James Bowman ja...@ex...
From: Erik W. <om...@cs...> - 2000-04-28 06:48:40
|
On Thu, 27 Apr 2000, James Bowman wrote:
> I took a look at module sizes and found that we're at about 16k of text:
> we're blowing the code cache, and when code moves around (like when you
> remove a module) different functions are caching against each other and
> changing performance in surprising ways.
Whee! ;-)
> The code should get smaller as we optimize it, though, so this effect
> will go away. We should be safe if the decode loop fits in 12k.
Yeah, we can do that pretty easily. Eventually I expect that a sufficient
percentage of this will be written in ASM to keep it well below that.
Then of course we have to worry about blowing the data cache. That means
all sorts of tricks, most of which aren't set up yet (such as using
non-cacheable pages, which means a kernel hack).
Erik Walthinsen <om...@cs...> - Staff Programmer @ OGI
Quasar project - http://www.cse.ogi.edu/DISC/projects/quasar/
Video4Linux Two drivers and stuff - http://www.cse.ogi.edu/~omega/v4l2/
__
/ \ SEUL: Simple End-User Linux - http://www.seul.org/
| | M E G A Helping Linux become THE choice
_\ /_ for the home or office user
From: Scott F. J. <sc...@fl...> - 2000-04-28 19:16:21
Which is faster: unrolling the loops and growing past 12K, or leaving
the loops in and keeping it under?

Switching to a block-based ycrcb_to_rgb gave me about a 5% speed
improvement over full-frame conversion. This and other changes are with
Erik for review. (Got rid of place.c, broke PAL decoding, repackaged
closer to library form, added dv2ppm.c, ...)

I propose we stay away from kernel hacks for as long as possible.
Ideally we should keep maintaining C versions of each routine, to
assist in cross-platform development. I'm sure the LinuxPPC folks will
want this code, and if we keep it populated with too much ia32, they
may revolt!

We may also want to offer speed vs. quality options for our users: one
improvement is to skip the third-pass AC decoding and just return from
dv_parse_video_segment() without calling dv_parse_ac_coeffs(seg). In
playback, the additional error is barely noticeable. (I had to use
dv2ppm to grab frames and compare the results.)

For some uses, like DV editing, where speed is more important than
quality, I'd even be willing to forgo *ALL* the AC decoding. Just give
me 8x8 blocks of DC, which my tests show runs more than 3x faster --
those ducks look awfully blocky, though!

There may be other intermediate "exit points" in the decoder that we'll
want to maintain as options. (Y_ONLY is another example: great for
video editing when detail is needed, but color isn't.)

Erik Walthinsen wrote:
> On Thu, 27 Apr 2000, James Bowman wrote:
> > I took a look at module sizes and found that we're at about 16k of text:
> > we're blowing the code cache, and when code moves around (like when you
> > remove a module) different functions are caching against each other and
> > changing performance in surprising ways.
> Whee! ;-)
> > The code should get smaller as we optimize it, though, so this effect
> > will go away. We should be safe if the decode loop fits in 12k.
> Yeah, we can do that pretty easily. Eventually I expect that a sufficient
> percentage of this will be written in ASM to keep it well below that.
> Then of course we have to worry about blowing the data cache. That means
> all sorts of tricks, most of which aren't set up yet (such as using
> non-cacheable pages, which means a kernel hack).
From: Erik W. <om...@cs...> - 2000-04-28 20:19:13
On Fri, 28 Apr 2000, Scott F. Johnston wrote:
> Which is faster: unrolling the loops and growing past 12K
> or leaving the loops in and keeping it under?
It depends on the loops involved. If the bodies are very small (say,
<10 cycles), it's worth unrolling them. If not, leave them as loops.
I'm guessing we won't have too much trouble fitting under 12k
regardless, but if we have to make a choice, it would come down to
total branch count when picking which ones to unroll.
> Switching to a block-based ycrcb_to_rgb gave me about 5%
> speed improvement over full-frame conversion. This
> and other changes are with Erik for review.
> (Got rid of place.c, broke PAL decoding, repackaged
> closer to library form, added dv2ppm.c, ...)
I hope to look through that today and try to merge everything into CVS.
> I propose we stay away from kernel hacks for as long
> as possible.
Right, we won't ever depend on kernel hacks. Besides, they don't even
exist yet... They just happen to make certain things faster in certain
situations.
> Ideally we should keep maintaining C-versions
> of each routine, to assist in cross-platform
> development. I'm sure the LinuxPPC folks will want this
> code, and if we keep it populated with too much
> ia32, they may revolt!
Exactly. The C version will always exist for everything, and be the
fallback position. If someone compiles it for an arch for which not
everything has been optimized, they get lots of C....
> We may also want to offer speed vs. quality options for
> our users: One improvement is to skip the third pass AC
> decoding and just return from dv_parse_video_segment()
> without calling dv_parse_ac_coeffs(seg). In playback,
> the additional error is barely noticeable. (I had to use
> dv2ppm to grab frames and compare the results.)
> For some uses, like DV editing, where speed is more
> important than quality, I'd even be willing to forgo
> *ALL* the AC decoding. Just give me 8x8 blocks of DC,
> which my tests show runs more than 3x faster-- those
> ducks look awfully blocky, though!
> There may be other intermediate "exit points" in the decoder that
> we'll want to maintain as options. (Y_ONLY is another
> example: great for video editing when detail is needed,
> but color isn't.)
Yup. I can imagine quite a few options. The problem is the fact that
this means branches. This is where specialization comes into play. You
have various forms of the functions, and at some point a vtable is filled
with pointers to the currently appropriate ones (based on the criteria),
which are called as needed. This method is probably the preferred way of
supporting all Intel ia32 chips from one binary. The way the bitstream
code would be set up would let one compile multiple copies of the same
code with different names, each with some level of processor support
(ia32, MMX, SSE, etc.), which would then be specialized at a higher level.
I need to write up my ideas on the bitstream API and how specialization of
that sort works.
Now I just need to figure out what to do first ;-)
Erik Walthinsen <om...@cs...> - Staff Programmer @ OGI
Quasar project - http://www.cse.ogi.edu/DISC/projects/quasar/
Video4Linux Two drivers and stuff - http://www.cse.ogi.edu/~omega/v4l2/