As you know, I have been working on my own implementation of a Dirac
codec during Google's Summer of Code for the FFmpeg
project. When my work got its first review, I got some feedback from
Michael Niedermayer, who is the maintainer of FFmpeg and has quite some
experience with video codecs.
One thing he mentioned was that it would be better to use an 8-tap
filter for half-pixel motion compensation instead of a 10-tap filter.
Here is the email with this comment; I hope it can help you:
It's maybe slow in C or C++, but in assembler it's reasonably fast I think - basically you can do 4 product and adds at once. So 8 would be 1.5 times faster than 10, and 12 would take no more time than 10. Whether or not the coeffs factorise or not doesn't really matter - and 6 taps is the same speed as 8.
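The speed ratios quoted above follow from counting how many groups of 4 multiply-adds a filter needs. A minimal sketch of that arithmetic, assuming exactly 4 products per pass as stated in the post (the helper name `madd_passes` is made up for illustration):

```c
/* Hypothetical helper: number of 4-wide multiply-add passes needed for a
 * filter with `taps` coefficients, assuming one pass handles 4 products. */
static int madd_passes(int taps)
{
    return (taps + 3) / 4;   /* ceil(taps / 4) for positive taps */
}
```

Under this assumption 6 and 8 taps both need 2 passes, while 10 and 12 taps both need 3, which is where the factor of 1.5 between 8 and 10 taps comes from.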
So why did we do 10 and not 8 or 12? Well 10 taps gave about 0.1 - 0.15 dB improvement over 8 taps and 12 taps gave no improvement overall that I could measure. 0.1dB isn't very much, but it's not nothing. (BTW, the filter was designed by minimising interpolation error over all 8 phases assuming 1/8 pixel interpolation - i.e. by following the half-pel interpolation with linear interpolation. It adds a little bit of passband boost in order to compensate for the saggy response of the linear filter in the other phases.)
So the upconversion code is 2/3 the speed of what it could be for 0.1dB improvement. This sounds maybe a bad compromise, but we found in Schrodinger that upconversion takes virtually no time in the encoder or the decoder. Motion compensation and the wavelet transform don't take much time in Schrodinger either. What really takes the time is arithmetic coding/decoding. The loss in time in Schrodinger is probably a small fraction of 1%.
In _Dirac_, motion compensation and the wavelet transform are slow - but even here upconversion isn't a big deal. But the Dirac code isn't intended to be real time and doesn't behave like a really optimised implementation. So we were careful not to optimise things in a C++ implementation that wouldn't improve things in a really fast implementation.
Maybe it still wasn't worth that 0.1dB to have 10 taps, but in speed terms it doesn't matter very much at all.
> It's maybe slow in C or C++, but in assembler it's reasonably fast I think - basically you can do 4
> product and adds at once. So 8 would be 1.5 times faster than 10, and 12 would take no more time
> than 10.
> Whether or not the coeffs factorise or not doesn't really matter - and 6 taps is the same speed as 8.
i think you mean mmx here not assembler in general as it really doesnt make sense otherwise
and with mmx i would normally rather filter 4 pixels at a time than filter 1 pixel, the reason here is that mmx tends to be inefficient with the latter, and if you do work with 4 pixels at a time then there is a difference between a *1 and a *3, also there is a difference between 6 taps and 8 taps
but in case of dirac the 4 pixel at a time wont be easy as the intermediate sum doesnt fit in 16bit (167*2*255 is too large)
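The 16-bit limit can be checked with a few lines of C. This is only a sketch of the worst-case arithmetic: two 8-bit pixels that share a coefficient are added first, then multiplied, and the product has to fit in a signed 16-bit lane. The helper names are made up for illustration; 167 is the coefficient magnitude quoted in the post.

```c
#include <stdint.h>

/* Worst-case magnitude of one symmetric term: two 8-bit pixels
 * (each at most 255) added together, then multiplied by |coeff|. */
static int32_t worst_pair_product(int32_t abs_coeff)
{
    return abs_coeff * 2 * 255;
}

/* Nonzero if the value fits in a signed 16-bit lane. */
static int fits_int16(int32_t v)
{
    return v >= INT16_MIN && v <= INT16_MAX;
}
```

With a coefficient of 167 the product is 85170, well above the signed 16-bit maximum of 32767, whereas a largest coefficient of 21 gives 10710, which fits.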
as a simple comparison (lets assume all intermediates would fit in 16bit)
your 1 pixel suggestion:
read 4 pixels (movd)
unpack them to 16bit (punpcklbw)
multiply and add to 32bit (pmaddwd)
add the first 2 pairs of 32bit (paddd)
add the third pair of 32bit (paddd)
duplicate the result (movq)
shift right (psrlq $32)
add the pair (paddd)
add bias (paddd)
shift right (psraw)
move to integer register (movd)
store byte (mov)
18 instructions for 1 pixel
4 pixels at a time:
read "left" 4 pixels (movd)
read "right" 4 pixels (movd)
unpack left to 16bit (punpcklbw)
unpack right to 16bit (punpcklbw)
add pixels with same coeff (paddw)
multiply pixels as needed (pmullw)
add to sum (paddw) (not needed in the first iteration)
add bias (paddw)
shift right (psraw)
pack to 8bit (packuswb)
38 instructions for 4 pixels (9.5 per pixel)
> So why did we do 10 and not 8 or 12? Well 10 taps gave about 0.1 - 0.15 dB improvement over 8 taps and
> 12 taps gave no improvement overall that I could measure.
have you tried:
-1 3 -7 21 21 -7 3 -1
this one has only about 0.02 dB loss with foreman over what dirac uses in a crude test (though that test was with snow not dirac)
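For reference, a scalar C sketch of half-pel interpolation with this filter. The thread does not spell out the normalisation, so the rounding bias of 16 and the right shift by 5 are assumptions that follow from the coefficients summing to 32; the function name is also made up. Pixels that share a coefficient are added before multiplying, as in the 4-pixel scheme described above.

```c
#include <stdint.h>

/* Half of the symmetric filter -1 3 -7 21 21 -7 3 -1, innermost tap first. */
static const int half_coeffs[4] = { 21, -7, 3, -1 };

/* Interpolate the half-pel sample between src[0] and src[1].
 * The caller must guarantee that src[-3] .. src[4] are readable.
 * Assumed normalisation: add 16, shift right by 5, clip to 0..255. */
static uint8_t halfpel_8tap(const uint8_t *src)
{
    int sum = 16;                        /* rounding bias, half of 32 */
    for (int i = 0; i < 4; i++)
        sum += half_coeffs[i] * (src[-i] + src[1 + i]);
    if (sum < 0)
        return 0;                        /* clip before shifting a negative value */
    sum >>= 5;                           /* the 8 coefficients sum to 32 */
    return sum > 255 ? 255 : (uint8_t)sum;
}
```

On a flat area the output equals the input value, and on a hard 0-to-255 edge the sample at the edge comes out as 128.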
PS: is there some way to use a normal mail user agent with these SF forums? not that i dislike iceweasel but normal mailing lists have some advantages like having them available offline ...
After some internal discussion and some experiments by Thomas Davies we have decided to go with your suggestion of using a -1 3 -7 21 21 -7 3 -1 upconversion filter.
The coding loss was a little more than we had hoped, typically 0.08dB, but this seems justified by the reduction in computation and word width. It is also much easier to implement in hardware (requiring, as you probably realise, only a few shifts and adds). The filter you suggest appears to be significantly better than the H.264 filter.
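The "few shifts and adds" remark can be illustrated in C. This is one possible decomposition, not necessarily the one used in any hardware design: 3 = 2 + 1, 7 = 8 - 1 and 21 = 16 + 4 + 1. All function names are made up for illustration, and the inputs are assumed to be non-negative sums of two 8-bit pixels.

```c
#include <stdint.h>

/* Multiplications by 3, 7 and 21 written as shifts and adds.
 * x is assumed to be in the range 0..510 (sum of two 8-bit pixels). */
static int mul3(int x)  { return (x << 1) + x; }             /* 2x + x       */
static int mul7(int x)  { return (x << 3) - x; }             /* 8x - x       */
static int mul21(int x) { return (x << 4) + (x << 2) + x; }  /* 16x + 4x + x */

/* Unnormalised filter sum for the half-pel sample between src[0] and src[1];
 * src[-3] .. src[4] must be readable. No rounding or clipping is applied. */
static int halfpel_sum_shift_add(const uint8_t *src)
{
    int a = src[0]  + src[1];   /* pair weighted by 21 */
    int b = src[-1] + src[2];   /* pair weighted by -7 */
    int c = src[-2] + src[3];   /* pair weighted by  3 */
    int d = src[-3] + src[4];   /* pair weighted by -1 */
    return mul21(a) - mul7(b) + mul3(c) - d;
}
```

No general multiplier is needed in this form, only adders and fixed shifts.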
Many thanks for your suggestion, which I think will significantly improve the optimisation of the code.
Ok, that makes more sense to me now - thanks :-)
I'll try your filter. In my experience the hits tend to be worst at low res (CIF/QCIF), so we need to try a range of resolutions and qualities.