Re: [libdv-dev] Code size

On Fri, 28 Apr 2000, Scott F. Johnston wrote:

> Which is faster: unrolling the loops and growing past 12K
> or leaving the loops in and keeping it under?
It depends on the loops involved.  If the bodies are very small (say,
<10 cycles), it's worth unrolling them.  If not, leave them as loops.  I'm
guessing we won't have too much trouble fitting under 12K regardless, but
if we have to make a choice, picking which ones to unroll would come down
to the total number of branches each loop takes.
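
To make the tradeoff concrete, here is a rough sketch (not libdv code; the
function names are made up for illustration).  The rolled loop takes one
compare-and-branch per element, while the version unrolled by four takes
roughly one per four elements at the cost of more code bytes:

```c
#include <stddef.h>
#include <stdint.h>

/* Rolled form: one compare-and-branch per element. */
int32_t sum_rolled(const int16_t *v, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Unrolled by four: one compare-and-branch per four elements,
 * plus a short tail loop for the leftover 0..3 elements. */
int32_t sum_unrolled4(const int16_t *v, size_t n)
{
    int32_t acc = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    for (; i < n; i++)
        acc += v[i];
    return acc;
}
```

With a body this small the branch is a large share of each iteration, so
this is the kind of loop worth unrolling; a loop with a long body gains
little and only eats into the code-size budget.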

> Switching to a block-based ycrcb_to_rgb gave me about 5%
> speed improvement over full-frame conversion. This
> and other changes are with Erik for review.
> (Got rid of place.c, broke PAL decoding, repackaged
> closer to library form, added dv2ppm.c, ...)
I hope to look through that today and try to merge everything into CVS.

> I propose we stay away from kernel hacks for as long
> as possible.
Right, we won't ever depend on kernel hacks.  Besides, they don't even
exist yet...  They just happen to make certain things faster in certain
situations.

> Ideally we should keep maintaining C-versions
> of each routine, to assist in cross-platform
> development. I'm sure the LinuxPPC folks will want this
> code, and if we keep it populated with too much
> ia32, they may revolt!
Exactly.  The C version will always exist for everything, and be the
fallback position.  If someone compiles it for an arch for which not
everything has been optimized, they get lots of C....

> We may also want to offer speed vs. quality options for
> our users: One improvement is to skip the third pass AC
> decoding and just return from dv_parse_video_segment()
> without calling dv_parse_ac_coeffs(seg). In playback,
> the additional error is barely noticeable. (I had to use
> dv2ppm to grab frames and compare the results.)
> For some uses, like DV editing, where speed is more
> important than quality, I'd even be willing to forego
> *ALL* the AC decoding. Just give me 8x8 blocks of DC,
> which my tests show runs more than 3x faster-- those
> ducks look awfully blocky, though!
> There may be other intermediate "exit-points" in the decoder that
> we'll want to maintain as options. (Y_ONLY is another
> example: great for video editing when detail is needed,
> but color isn't.)
Yup.  I can imagine quite a few options.  The problem is that every
option means branches.  This is where specialization comes into play.  You
have various forms of the functions, and at some point a vtable is filled
with pointers to the currently appropriate ones (based on the criteria),
which are called as needed.  This method is probably the preferred way of
supporting all Intel ia32 chips from one binary.  The bitstream code would
be set up so that one can compile multiple copies of the same code under
different names, each with some level of processor support (ia32, MMX,
SSE, etc.), which would then be specialized at a higher level.
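
A minimal sketch of the vtable idea, assuming made-up names throughout
(none of these types or functions exist in libdv, and the "decode"
bodies are stand-ins rather than real IDCT code).  The branches on CPU
level and quality are taken once, when the table is filled; after that
every call goes straight through a pointer:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical selection criteria; illustrative only. */
enum cpu_level { CPU_C, CPU_MMX };
enum quality   { QUALITY_DC_ONLY, QUALITY_FULL };

typedef void (*block_fn)(const int16_t coeffs[64], int16_t out[64]);

struct decoder_vtable {
    block_fn decode_block;
};

/* DC-only variant: fill the whole 8x8 block with the DC term. */
static void block_dc_only_c(const int16_t coeffs[64], int16_t out[64])
{
    for (size_t i = 0; i < 64; i++)
        out[i] = coeffs[0];
}

/* Full variant, plain C.  Stand-in: copies the coefficients where a
 * real decoder would run the AC passes and the IDCT. */
static void block_full_c(const int16_t coeffs[64], int16_t out[64])
{
    for (size_t i = 0; i < 64; i++)
        out[i] = coeffs[i];
}

/* Placeholder for an MMX build of the same routine under a different
 * name.  Here it just calls the C version so the sketch stays portable. */
static void block_full_mmx(const int16_t coeffs[64], int16_t out[64])
{
    block_full_c(coeffs, out);
}

/* Fill the table once.  Anything without an optimized version falls
 * back to the C routine. */
void decoder_vtable_init(struct decoder_vtable *vt,
                         enum cpu_level cpu, enum quality q)
{
    if (q == QUALITY_DC_ONLY)
        vt->decode_block = block_dc_only_c;   /* no MMX variant: C fallback */
    else if (cpu == CPU_MMX)
        vt->decode_block = block_full_mmx;
    else
        vt->decode_block = block_full_c;
}
```

The caller then does something like vt.decode_block(coeffs, out) in the
inner loop, with no per-block test of which options are in effect.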

I need to write up my ideas on the bitstream API and how specialization of
that sort works.

Now I just need to figure out what to do first ;-)

         Erik Walthinsen <om...@cs...> - Staff Programmer @ OGI
        Quasar project - http://www.cse.ogi.edu/DISC/projects/quasar/
   Video4Linux Two drivers and stuff - http://www.cse.ogi.edu/~omega/v4l2/
        __
       /  \             SEUL: Simple End-User Linux - http://www.seul.org/
      |    | M E G A           Helping Linux become THE choice
      _\  /_                          for the home or office user