From: Reinhard N. <rn...@gm...> - 2006-12-29 00:47:29
|
Hi,

xxmc typically allocates 8 buffers for frames. This can be switched to 15, but my GF6600GT fails to allocate more than 10 frames.

Question 1: where does this number 8 come from?
Question 2: what do you think about making it configurable / autodetectable?

frame_drop_limit typically has a value of 3, so there must be 4 frames in the buffer or the decoder will be informed to drop some frames.

In the above scenario it is hardly possible to always have these 4 frames in the buffer, as 3 frames are in use while decoding and 1-2 frames are in use while displaying.

That's why activating the bob deinterlacer caused frame drops over and over on my machine, although CPU load didn't change at all.

Question 3: what do you think about introducing frame_drop_limit_max to replace the fixed constant 3?

frame_drop_limit_max should be set in relation to num_frame_buffers, e.g.

  frame_drop_limit_max = min(3, num_frame_buffers - 3 - 2 - 2);

The three constants reserve buffers for decoding, displaying and buffer fluctuation. So for the above case, frame_drop_limit_max will be 1. I've tried that and it works properly, e.g. for watching the ASTRA HD demo loop.

Bye.
--
Dipl.-Inform. (FH) Reinhard Nissl
mailto:rn...@gm...
|
From: <th...@tu...> - 2006-12-29 15:56:45
|
Reinhard Nissl wrote:

> Hi,
>
> xxmc typically allocates 8 buffers for frames. This can be switched to
> 15, but my GF6600GT fails to allocate more than 10 frames.
>
> Question 1: where does this number 8 come from?
> Question 2: what do you think about making it configurable /
> autodetectable?

Hi,

IIRC, the number 8 comes from the Nvidia hardware I had accessible when the code was written. I think it's a good idea to make it configurable. Autodetecting might be a bit harder to do, and I'm not sure that it will be completely robust on all supported hardware. If we autodetect, we at least need to provide an upper limit on the number of frames.

> frame_drop_limit typically has a value of 3 so there must be 4 frames in
> the buffer or the decoder will be informed to drop some frames.
>
> In the above scenario it is hardly possible to always have these 4
> frames in the buffer as 3 frames are in use while decoding and 1-2
> frames are in use while displaying.
>
> That's why activating the bob deinterlacer caused frame drops over and
> over on my machine although CPU load didn't change at all.
>
> Question 3: what do you think about introducing frame_drop_limit_max to
> replace the fixed constant 3?
>
> frame_drop_limit_max should be set in relation to num_frame_buffers, e.g.
>
>   frame_drop_limit_max = min(3, num_frame_buffers - 3 - 2 - 2);
>
> The three constants reserve buffers for decoding, displaying and buffer
> fluctuation. So for the above case, frame_drop_limit_max will be 1. I've
> tried that and it works properly, e.g. for watching the ASTRA HD demo loop.

I'm not familiar with that code, so I can't really comment. Anyone else?

> Bye.

/Thomas
|
From: Reinhard N. <rn...@gm...> - 2006-12-30 00:23:55
|
Hi,

Thomas Hellström wrote:

>> xxmc typically allocates 8 buffers for frames. This can be switched to
>> 15, but my GF6600GT fails to allocate more than 10 frames.
>>
>> Question 1: where does this number 8 come from?
>> Question 2: what do you think about making it configurable /
>> autodetectable?
>
> IIRC, the number 8 comes from the Nvidia hardware I had accessible when
> the code was written. I think it's a good idea to make it configurable.
> Autodetecting might be a bit harder to do, and I'm not sure that it will
> be completely robust on all supported hardware. If we autodetect, we at
> least need to provide an upper limit on the number of frames.

Hmm, I don't know what I counted yesterday: after having implemented a range value from 8 to 15, I realized that I cannot allocate more than 8 frames. So this NVIDIA limit is still true and I dropped that implementation again.

>> frame_drop_limit typically has a value of 3 so there must be 4 frames in
>> the buffer or the decoder will be informed to drop some frames.
>>
>> In the above scenario it is hardly possible to always have these 4
>> frames in the buffer as 3 frames are in use while decoding and 1-2
>> frames are in use while displaying.
>>
>> That's why activating the bob deinterlacer caused frame drops over and
>> over on my machine although CPU load didn't change at all.
>>
>> Question 3: what do you think about introducing frame_drop_limit_max to
>> replace the fixed constant 3?
>>
>> frame_drop_limit_max should be set in relation to num_frame_buffers,
>> e.g.
>>
>>   frame_drop_limit_max = min(3, num_frame_buffers - 3 - 2 - 2);
>>
>> The three constants reserve buffers for decoding, displaying and buffer
>> fluctuation. So for the above case, frame_drop_limit_max will be 1. I've
>> tried that and it works properly, e.g. for watching the ASTRA HD demo
>> loop.
>
> I'm not familiar with that code, so I can't really comment. Anyone else?

I once again had a look into this with regard to xxmc.
Basically, the above code in its original form tries to detect that the CPU does not have enough power to decode the stream. The test is as simple as this: when the decoder pushes a decoded frame into the video output buffer, a test is made how far this frame is ahead in time. If it is less than 4 times its duration, then it is assumed that the decoder will hardly be able to supply frames with a time stamp in the future, so it has to drop decoding of some frames in order to soon decode and deliver a frame with a fitting time stamp.

As more complex scenes or different frame types seem to take longer to decode, it is likely that a decoded frame is only 3 or 2 times its duration ahead in time. Telling the decoder to drop a frame each time this happens would make the stream unwatchable. Therefore a further test checks whether there are at least 4 frames in the video output queue, and in this case the decoder will not be asked to drop some frames.

So, xxmc supplies only 8 frames, 3 will be used in the decoder (e.g. to hold the related I, P and B frames at the same time) and 1 will be used in the output device while displaying the frame, resulting in a maximum of 4 frames which can be in the video output queue when the decoder wants to push a further decoded frame into this queue. After that the decoder has to wait for the video output device to free one frame so that it can get it and continue decoding.

With the bob deinterlacer deactivated and a large enough input buffer, I can see that the number of available frames in the video output queue is typically 4. But with the bob deinterlacer activated, I can see this number drop to 3 or even lower, depending on the complexity of the scene. It seems like the decoder has to wait too long to get a frame to continue decoding.

Having a look into xxmc, I found that drawing the frame takes about 0.3 ms when the bob deinterlacer is disabled.
But it sometimes takes more than 25 ms to paint the frame twice in the bob deinterlacer implementation, as the second draw has to be delayed by half the frame duration, which is in this case 20 ms. This means the decoder must be able to decode at twice the normal speed to keep the video output queue filled, and this does not seem to be possible for complex scenes.

As a solution might be rather complex, I'd first like to hear your thoughts about the following idea: initially, one frame should be taken away from the frame pool. As a result the maximum number of available frames in the video output queue will decrease by one, so the above detection code must be adjusted by one (but that's not complicated). The xxmc frame drawing code must be changed so that it puts this spare frame into the frame pool just before the usleep(). This will let the decoder get a frame while the current frame is still to be displayed a second time. Furthermore, when the current frame would be disposed to the frame pool, it will be taken away to be the new spare one.

I think it should be possible to implement this by modifying only the xxmc implementation, shouldn't it?

Bye.
--
Dipl.-Inform. (FH) Reinhard Nissl
mailto:rn...@gm...
|
From: Reinhard N. <rn...@gm...> - 2007-01-04 23:54:16
|
Hi,

Thomas Hellström wrote:

> In theory, the decoding thread should be able to work during the
> video-out usleep(), but might be prevented from doing so by the
> context_reader_lock being held. Please try the attached patch to see
> if that improves the situation.
>
> Also the long usleep() before displaying the next field is meaningless
> on some hardware since the hardware doesn't allow a new frame to be
> displayed before the next vblank (to stop tearing). A configurable
> option to shorten the usleep() could be useful here. A short usleep()
> will always be needed to avoid busy-wait loops in the driver.
> Unichromes don't have interrupts for this.

I've tested your patch on my EPIA MII6000E, but do not see a noticeable gain.

I've instrumented the code (ax1.txt) with gettimeofday(), and the attached result ax3.txt shows that after several seconds of smooth replay (dt ~ 25 ms), XvMCPutSurface() needs much longer to process (dt9, dt4). With the fixed usleep() of 18 ms (dt6), the total processing time (dt) exceeds 30 ms, which results in a lack of free frames for the decoder. The decoder may not even be able to work at full speed when it has to wait for the above mentioned lock, which is held during dt3, dt4 and dt9.

Result ax4.txt from a different test run also shows that the issue first appears at the second XvMCPutSurface() (dt9), but this time XvMCSyncSurface() seems to consume some time too, resulting in a total of more than 30 ms.

As I wrote, prior to your changes I used a while loop with shorter usleep()s, using gettimeofday() to determine the time passed so far. The result was that the time slept was shorter when XvMCSyncSurface() and/or XvMCPutSurface() took longer to process. As a result it didn't happen that often that the total time spent in xxmc_display_frame() exceeded 35 ms. But it happened repeatedly.
top -d1 -H -p `pidofproc xine` showed me that before the issue, the video decoder thread took about 25 % CPU time and the audio decoder thread about 20 %, and the system was about 35 % idle. When the above issue happens, the video out thread consumes almost all of the remaining CPU time.

Could it be that this issue is the result of having the X server running at a refresh rate of 50 Hz?

Bye.
--
Dipl.-Inform. (FH) Reinhard Nissl
mailto:rn...@gm...
|
From: Petri H. <phi...@cc...> - 2007-01-05 07:09:48
|
On Fri, 5 Jan 2007, Reinhard Nissl wrote:

> Thomas Hellström wrote:
>
>> In theory, the decoding thread should be able to work during the
>> video-out usleep(),

In my experience small usleep()s are always busy-waits, and video-out tries to increase its priority with nice(-2) when started.

> But it happened repeatedly. top -d1 -H -p `pidofproc xine` showed me
> that before the issue, the video decoder thread took about 25 % CPU time
> and the audio decoder thread about 20 % and the system was about 35 %
> idle. When the above issue happens, then the video out thread consumes
> almost the whole remaining CPU time.
>
> Could it be that this issue is the result of having the X server running
> at a refresh rate of 50 Hz?

I've got several reports of similar problems with xine-lib + vdr-xineliboutput when running the tvtime deinterlacer with 50Hz Xv output. If data is fed at ~50Hz (25Hz interlaced) but a little bit faster than the video card can consume it, the deinterlacer prevents the decoder from dropping frames. When frames are fed to the Xv driver faster than it can consume them, it ends up busy-waiting for the next buffer slot from the hardware, causing the CPU usage of the X server to rise towards 100% over a few seconds. Then the input buffer overflows, the engine is reset and everything starts over again ...

I believe XvMC might have similar problems with frame dropping if it cannot drop frames ... ?

You could try to change the engine sync method to audio resampling and modify metronom to run 1...2% slower to verify this. Or change the display refresh to 51Hz...

- Petri
|
From: <th...@tu...> - 2007-01-05 09:26:46
|
Petri Hintukainen wrote:

> On Fri, 5 Jan 2007, Reinhard Nissl wrote:
>
>> Thomas Hellström wrote:
>>
>>> In theory, the decoding thread should be able to work during the
>>> video-out usleep(),
>
> In my experience small usleeps are always busy-wait, and video-out
> tries to increase its priority with nice(-2) when started.

These usleeps are quite long and should really be sleeps.

>> But it happened repeatedly. top -d1 -H -p `pidofproc xine` showed me
>> that before the issue, the video decoder thread took about 25 % CPU time
>> and the audio decoder thread about 20 % and the system was about 35 %
>> idle. When the above issue happens, then the video out thread consumes
>> almost the whole remaining CPU time.
>>
>> Could it be that this issue is the result of having the X server running
>> at a refresh rate of 50 Hz?
>
> I've got several reports of similar problems with xine-lib +
> vdr-xineliboutput when running the tvtime deinterlacer with 50Hz Xv
> output. If data is fed at ~50Hz (25Hz interlaced) but a little bit
> faster than the video card can consume it, the deinterlacer prevents
> the decoder from dropping frames. When frames are fed to the Xv driver
> faster than it can consume them, it ends up busy-waiting for the next
> buffer slot from the hardware, causing the CPU usage of the X server
> to rise towards 100% over a few seconds. Then the input buffer
> overflows, the engine is reset and everything starts over again ...
>
> I believe XvMC might have similar problems with frame dropping if it
> cannot drop frames ... ?

Certainly, on Unichromes, the Xv / XvMC code doesn't allow displaying frames faster than the refresh rate. The result will be a busy-wait in the driver code and later a dropped frame in xine. This is not because the video engine is too slow, but because the overlay hardware only allows updating during vblank to avoid tearing. I'm not sure how nvidia XvMC handles this.

> You could try to change the engine sync method to audio resampling and
> modify metronom to run 1...2% slower to verify this. Or change the
> display refresh to 51Hz...
>
> - Petri

/Thomas
|
From: Reinhard N. <rn...@gm...> - 2006-12-30 07:37:13
|
Hi,

Reinhard Nissl wrote:

> As a solution might be rather complex I'd first like to hear your
> thoughts about the following idea: initially one frame should be taken
> away from the frame pool. As a result the maximum number of available
> frames in the video output queue will decrease by one so the above
> detection code must be adjusted by one (but that's not complicated). The
> xxmc frame drawing code must be changed so that it puts this spare frame
> into the frame pool just before the usleep(). This will let the decoder
> get a frame while the current frame is still to be displayed a second
> time. Furthermore, when the current frame would be disposed to the frame
> pool, it will be taken away to be the new spare one.

I have to add some information to explain the effect of the change:

One may wonder what the benefit of this change is, as frames are still displayed every 40 ms and put back to the frame pool every 40 ms. The difference is the phase at which these actions happen. Before the change the phase offset is more than 20 ms (> 180 °) and afterwards it is almost 0 ms (~ 0 °).

Using more frame buffers is actually no solution to this issue, as it seems to protect you from buffer underruns (= frame drops), but complex streams may still result in a buffer underrun as not every decoder is able to operate at twice the normal speed.

This makes me think of not touching xxmc in this regard, but putting this functionality into the video out loop of xine-engine, as this phase offset concerns all implementations -- decoders and output devices.

Bye.
--
Dipl.-Inform. (FH) Reinhard Nissl
mailto:rn...@gm...
|
From: Reinhard N. <rn...@gm...> - 2006-12-31 15:27:44
|
Hi,

Reinhard Nissl wrote:

> Using more frame buffers is actually no solution to this issue as it
> seems to protect you from buffer underruns (= frame drops) but complex
> streams may still result in a buffer underrun as not every decoder is
> able to operate at twice the normal speed.
>
> This makes me think of not touching xxmc in this regard, but to put this
> functionality into the video out loop of xine-engine as this phase
> offset concerns all implementations -- decoders and output devices.

Well, while talking to myself, I have to correct some statements from the previous emails.

Concerning the phase issue: it is already addressed in the video out loop, as it keeps a reference to the current frame (in that context named last_frame) which is released immediately before showing the next frame.

Furthermore, xxmc keeps references (in that context named recent_frames) to the last two frames. These references are updated just before showing the current frame, so releasing the no longer needed reference is in correct phase too.

This leads to the question: why hold these two references if they are never used? A comment says they are for deinterlacing, but I cannot see any code yet. So by holding just one reference (or even no reference, as the video out loop does this already), an extra frame buffer could be freed up and made available for the video output queue.

So this means just 1 frame buffer is in use while displaying, 2 are in use by the decoder (I was wrong to assume 3, as the forward reference frame has already been pushed to the video output queue) and as a result 5 frame buffers can be ready for display while a 6th is pushed to the queue. After that the decoder has to wait for a frame to get disposed.

Measuring how much time the decoder needs to decode a frame occasionally shows a value of 35 to 40 ms for the ASTRA HD 1080i demo loop. This means that a buffer fluctuation of at least 1 should be considered when determining frame_drop_limit_max.
In my current experiments with xxmc's bob deinterlacer, it turns out that the simple usleep() of almost half the frame duration seems to be of major concern. Occasionally it happens that showing a frame deinterlaced takes up to 45 ms. As a result, the decoder gets a free buffer later than expected. Although the video out loop tries to handle this delay, xxmc doesn't know of it and takes at least a further 20 ms to display the next frame deinterlaced. Finally the buffer fill level dips for a short time, but the detection logic already orders a frame drop.

At the moment I've replaced the single usleep() by a loop which uses gettimeofday() and usleep(10) to delay more precisely. The result is that it now takes roughly 25 ms to show the frame deinterlaced. CPU load on my P4 2.8 GHz HT hasn't changed much, but I still need to run tests on my EPIA 6000.

Bye.
--
Dipl.-Inform. (FH) Reinhard Nissl
mailto:rn...@gm...
|
From: <th...@tu...> - 2007-01-02 13:45:06
Attachments:
xxmc_timings.patch
|
Reinhard Nissl wrote:

> Hi,
>
> Reinhard Nissl wrote:
>
>> Using more frame buffers is actually no solution to this issue as it
>> seems to protect you from buffer underruns (= frame drops) but complex
>> streams may still result in a buffer underrun as not every decoder is
>> able to operate at twice the normal speed.
>>
>> This makes me think of not touching xxmc in this regard, but to put this
>> functionality into the video out loop of xine-engine as this phase
>> offset concerns all implementations -- decoders and output devices.
>
> Well, while talking to myself, I have to correct some statements from
> the previous emails.
>
> Concerning the phase issue: it is already addressed in the video out
> loop as it keeps a reference to the current frame (in that context named
> last_frame) which is released immediately before showing the next frame.
>
> Furthermore, xxmc keeps references (in that context named recent_frames)
> to the last two frames. These references are updated just before showing
> the current frame so releasing the no longer needed reference is in
> correct phase too.
>
> This leads to the question, why holding these two references if they are
> never used?

There are two reasons.

1) The deinterlacing hardware in some Unichromes needs the previous frame as a reference. Since there is no XvMC support for this hardware yet, this is not a good reason, so one frame can be dropped.

2) We're always keeping the previous frame since we cannot be sure that the hardware is finished with it until the current frame is displayed. XvMCPutSurface makes sure the previous frame is displayed and tells the hardware to display the current frame as soon as possible.

So if we want to go down to only one previous frame, we need to call xxmc_add_recent_frame() after the surface has been put at least once. I've done that in the attached patch.

> A comment says for deinterlacing but I cannot see any code yet.
> So by holding just one reference (or even no reference, as the video out
> loop does this already) an extra frame buffer could be freed up and made
> available for the video output queue.
>
> So this means just 1 frame buffer is in use while displaying, 2 are in
> use by the decoder (I was wrong to assume 3 as the forward reference
> frame has already been pushed to the video output queue) and as a result
> 5 frame buffers can be ready for display while a 6th is pushed to the
> queue. After that the decoder has to wait for a frame to get disposed.
>
> Measuring how much time the decoder needs to decode a frame shows
> occasionally a value of 35 to 40 ms for the ASTRA HD 1080i demo loop.
> This means that a buffer fluctuation of at least 1 should be considered
> when determining frame_drop_limit_max.
>
> In my current experiments with xxmc's bob deinterlacer it turns out that
> the simple usleep() of almost half the frame duration seems to be of
> major concern. Occasionally it happens that showing a frame deinterlaced
> takes up to 45 ms. As a result, the decoder gets a free buffer later
> than expected. Although the video out loop tries to handle this delay,
> xxmc doesn't know of that and takes at least a further 20 ms to display
> the next frame deinterlaced. Finally the buffer fill level dips for a
> short time but the detection logic orders a frame drop already.
>
> At the moment I've replaced the single usleep() by a loop which uses
> gettimeofday() and usleep(10) to delay more precisely. The result is
> that it now takes roughly 25 ms to show the frame deinterlaced. CPU load
> on my P4 2.8 GHz HT hasn't changed much, but I still need to run tests
> on my EPIA 6000.

In theory, the decoding thread should be able to work during the video-out usleep(), but might be prevented from doing so by the context_reader_lock being held. Please try the attached patch to see if that improves the situation.
Also, the long usleep() before displaying the next field is meaningless on some hardware, since the hardware doesn't allow a new frame to be displayed before the next vblank (to stop tearing). A configurable option to shorten the usleep() could be useful here. A short usleep() will always be needed to avoid busy-wait loops in the driver. Unichromes don't have interrupts for this.

> Bye.

/Thomas
|