From: Reinhard N. <rn...@gm...> - 2006-12-29 00:47:29
|
Hi,

xxmc typically allocates 8 buffers for frames. This can be switched to 15, but my GF6600GT fails to allocate more than 10 frames.

Question 1: where does this number 8 come from?
Question 2: what do you think about making it configurable / autodetectable?

frame_drop_limit typically has a value of 3, so there must be 4 frames in the buffer or the decoder will be informed to drop some frames.

In the above scenario it is hardly possible to always have these 4 frames in the buffer, as 3 frames are in use while decoding and 1-2 frames are in use while displaying.

That's why activating the bob deinterlacer caused frame drops over and over on my machine, although CPU load didn't change at all.

Question 3: what do you think about introducing frame_drop_limit_max to replace the fixed constant 3?

frame_drop_limit_max should be set in relation to num_frame_buffers, e.g.

  frame_drop_limit_max = min(3, num_frame_buffers - 3 - 2 - 2);

The three constants reserve buffers for decoding, displaying and buffer fluctuation. So for the above case, frame_drop_limit_max will be 1. I've tried that and it works properly, e.g. for watching the ASTRA HD demo loop.

Bye.
--
Dipl.-Inform. (FH) Reinhard Nissl
mailto:rn...@gm...
|
From: <th...@tu...> - 2006-12-29 15:56:45
|
Reinhard Nissl wrote:

> Hi,
>
> xxmc typically allocates 8 buffers for frames. This can be switched to
> 15, but my GF6600GT fails to allocate more than 10 frames.
>
> Question 1: where does this number 8 come from?
> Question 2: what do you think about making it configurable /
> autodetectable?

Hi,

IIRC, the number 8 comes from the Nvidia hardware I had accessible when the code was written. I think it's a good idea to make it configurable. Autodetecting might be a bit harder to do, and I'm not sure that it will be completely robust on all supported hardware. If we autodetect, we at least need to provide an upper limit on the number of frames.

> frame_drop_limit typically has a value of 3 so there must be 4 frames in
> the buffer or the decoder will be informed to drop some frames.
>
> In the above scenario it is hardly possible to always have these 4
> frames in the buffer as 3 frames are in use while decoding and 1-2
> frames are in use while displaying.
>
> That's why activating the bob deinterlacer caused frame drops over and
> over on my machine although CPU load didn't change at all.
>
> Question 3: what do you think about introducing frame_drop_limit_max to
> replace the fixed constant 3?
>
> frame_drop_limit_max should be set in relation to num_frame_buffers, e.g.
>
>   frame_drop_limit_max = min(3, num_frame_buffers - 3 - 2 - 2);
>
> The three constants reserve buffers for decoding, displaying and buffer
> fluctuation. So for the above case, frame_drop_limit_max will be 1. I've
> tried that and it works properly, e.g. for watching the ASTRA HD demo loop.

I'm not familiar with that code, so I can't really comment. Anyone else?

> Bye.

/Thomas
|
From: Reinhard N. <rn...@gm...> - 2006-12-30 00:23:55
|
Hi,

Thomas Hellström wrote:

>> xxmc typically allocates 8 buffers for frames. This can be switched to
>> 15, but my GF6600GT fails to allocate more than 10 frames.
>>
>> Question 1: where does this number 8 come from?
>> Question 2: what do you think about making it configurable /
>> autodetectable?
>
> IIRC, the number 8 comes from the Nvidia hardware I had accessible when
> the code was written. I think it's a good idea to make it configurable.
> Autodetecting might be a bit harder to do, and I'm not sure that it will
> be completely robust on all supported hardware. If we autodetect, we at
> least need to provide an upper limit on the number of frames.

Hmm, I don't know what I counted yesterday: after having implemented a range value from 8 to 15, I realized that I cannot allocate more than 8 frames. So this NVIDIA limit is still true and I dropped that implementation again.

>> frame_drop_limit typically has a value of 3 so there must be 4 frames in
>> the buffer or the decoder will be informed to drop some frames.
>>
>> In the above scenario it is hardly possible to always have these 4
>> frames in the buffer as 3 frames are in use while decoding and 1-2
>> frames are in use while displaying.
>>
>> That's why activating the bob deinterlacer caused frame drops over and
>> over on my machine although CPU load didn't change at all.
>>
>> Question 3: what do you think about introducing frame_drop_limit_max to
>> replace the fixed constant 3?
>>
>> frame_drop_limit_max should be set in relation to num_frame_buffers,
>> e.g.
>>
>>   frame_drop_limit_max = min(3, num_frame_buffers - 3 - 2 - 2);
>>
>> The three constants reserve buffers for decoding, displaying and buffer
>> fluctuation. So for the above case, frame_drop_limit_max will be 1. I've
>> tried that and it works properly, e.g. for watching the ASTRA HD demo
>> loop.
>
> I'm not familiar with that code, so I can't really comment. Anyone else?

I once again had a look into this with regard to xxmc.
Basically, the above code in its original form tries to detect that the CPU does not have enough power to decode the stream. The test is as simple as this: when the decoder pushes a decoded frame into the video output buffer, a test is made how far this frame is ahead in time. If it is less than 4 times its duration, then it is assumed that the decoder will hardly be able to supply frames with a time stamp in the future, so it has to drop decoding of some frames in order to soon decode and deliver a frame with a fitting time stamp.

As more complex scenes or different frame types seem to take longer to decode, it is likely that a decoded frame is only 3 or 2 times its duration ahead in time. Telling the decoder to drop a frame each time this happens would make the stream unwatchable. Therefore a further test checks whether there are at least 4 frames in the video output queue, and in this case the decoder will not be asked to drop some frames.

So, xxmc supplies only 8 frames, 3 will be used in the decoder (e.g. to hold the related I, P and B frames at the same time) and 1 will be used in the output device while displaying the frame, resulting in a maximum of 4 frames which can be in the video output queue when the decoder wants to push a further decoded frame into this queue. After that the decoder has to wait for the video output device to free one frame so that it can get it and continue decoding.

With the bob deinterlacer deactivated and a large enough input buffer, I can see that the number of available frames in the video output queue is typically 4. But with the bob deinterlacer activated, I can see this number drop to 3 or even lower, depending on the complexity of the scene. It seems like the decoder has to wait too long to get a frame to continue decoding.

Having a look into xxmc, I found that drawing the frame takes about 0.3 ms when the bob deinterlacer is disabled.
But it sometimes takes more than 25 ms to paint the frame twice in the bob deinterlacer implementation, as the second draw has to be delayed by half the frame duration, which is in this case 20 ms. This means the decoder must be able to decode at twice the normal speed to keep the video output queue filled, and this does not seem to be possible for complex scenes.

As a solution might be rather complex, I'd first like to hear your thoughts about the following idea: initially, one frame should be taken away from the frame pool. As a result the maximum number of available frames in the video output queue will decrease by one, so the above detection code must be adjusted by one (but that's not complicated). The xxmc frame drawing code must be changed so that it puts this spare frame into the frame pool just before the usleep(). This will let the decoder get a frame while the current frame is still to be displayed a second time. Furthermore, when the current frame would be disposed to the frame pool, it will be taken away to be the new spare one.

I think it should be possible to implement this by modifying only the xxmc implementation, shouldn't it?

Bye.
--
Dipl.-Inform. (FH) Reinhard Nissl
mailto:rn...@gm...
|
From: Reinhard N. <rn...@gm...> - 2007-01-04 23:54:16
|
Hi,

Thomas Hellström wrote:

> In theory, the decoding thread should be able to work during the
> video-out usleep(), but might be prevented from doing so by the
> context_reader_lock being held. Please try the attached patch to see
> if that improves the situation.
>
> Also the long usleep() before displaying the next field is meaningless
> on some hardware since the hardware doesn't allow a new frame to be
> displayed before the next vblank (to stop tearing). A configurable
> option to shorten the usleep() could be useful here. A short usleep()
> will always be needed to avoid busy-wait loops in the driver.
> Unichromes don't have interrupts for this.

I've tested your patch on my EPIA MII6000E, but do not see a noticeable gain.

I've instrumented the code (ax1.txt) with gettimeofday(), and the attached result ax3.txt shows that after several seconds of smooth replay (dt ~ 25 ms), XvMCPutSurface() needs much longer to process (dt9, dt4). With the fixed usleep() of 18 ms (dt6), the total processing time (dt) exceeds 30 ms, which results in a lack of free frames for the decoder. The decoder may not even be able to work at full speed when it has to wait for the above mentioned lock, which is held during dt3, dt4 and dt9.

Result ax4.txt from a different test run also shows that the issue first appears at the second XvMCPutSurface() (dt9), but this time XvMCSyncSurface() seems to consume some time too, resulting in a total of more than 30 ms.

As I wrote, prior to your changes I used a while loop with shorter usleep()s, using gettimeofday() to determine the time passed so far. The result was that the time slept was shorter when XvMCSyncSurface() and/or XvMCPutSurface() took longer to process. As a result it didn't happen that often that the total time spent in xxmc_display_frame() exceeded 35 ms. But it happened repeatedly.
top -d1 -H -p `pidofproc xine` showed me that before the issue, the video decoder thread took about 25 % CPU time and the audio decoder thread about 20 %, and the system was about 35 % idle. When the above issue happens, the video out thread consumes almost all of the remaining CPU time.

Could it be that this issue is the result of having the X server running at a refresh rate of 50 Hz?

Bye.
--
Dipl.-Inform. (FH) Reinhard Nissl
mailto:rn...@gm...
|
From: Petri H. <phi...@cc...> - 2007-01-05 07:09:48
|
On Fri, 5 Jan 2007, Reinhard Nissl wrote:

> Thomas Hellström wrote:
>
>> In theory, the decoding thread should be able to work during the
>> video-out usleep(),

In my experience small usleep()s are always busy-waits, and video-out tries to increase its priority with nice(-2) when started.

> But it happened repeatedly. top -d1 -H -p `pidofproc xine` showed me
> that before the issue, the video decoder thread took about 25 % CPU time
> and the audio decoder thread about 20 % and the system was about 35 %
> idle. When the above issue happens, then the video out thread consumes
> almost the whole remaining CPU time.
>
> Could it be that this issue is the result of having the X server running
> at a refresh rate of 50 Hz?

I've got several reports of similar problems with xine-lib + vdr-xineliboutput when running the tvtime deinterlacer with 50Hz Xv output. If data is fed at ~50Hz (25Hz interlaced) but a little bit faster than the video card can consume it, the deinterlacer prevents the decoder from dropping frames. When frames are fed to the Xv driver faster than it can consume them, it ends up busy-waiting for the next buffer slot from the hardware, causing the CPU usage of the X server to rise towards 100% over a few seconds. Then the input buffer overflows, the engine is reset and everything starts over again ...

I believe XvMC might have similar problems with frame dropping if it cannot drop frames ... ?

You could try to change the engine sync method to audio resampling and modify metronom to run 1...2% slower to verify this. Or change the display refresh to 51Hz...

- Petri
|
From: <th...@tu...> - 2007-01-05 09:26:46
|
Petri Hintukainen wrote:

> On Fri, 5 Jan 2007, Reinhard Nissl wrote:
>
>> Thomas Hellström wrote:
>>
>>> In theory, the decoding thread should be able to work during the
>>> video-out usleep(),
>
> In my experience small usleeps are always busy-wait, and video-out
> tries to increase its priority with nice(-2) when started.

These usleeps are quite long and should really be sleeps.

>> But it happened repeatedly. top -d1 -H -p `pidofproc xine` showed me
>> that before the issue, the video decoder thread took about 25 % CPU time
>> and the audio decoder thread about 20 % and the system was about 35 %
>> idle. When the above issue happens, then the video out thread consumes
>> almost the whole remaining CPU time.
>>
>> Could it be that this issue is the result of having the X server running
>> at a refresh rate of 50 Hz?
>
> I've got several reports of similar problems with xine-lib +
> vdr-xineliboutput when running the tvtime deinterlacer with 50Hz Xv
> output. If data is fed at ~50Hz (25Hz interlaced) but a little bit
> faster than the video card can consume it, the deinterlacer prevents
> the decoder from dropping frames. When frames are fed to the Xv driver
> faster than it can consume them, it ends up busy-waiting for the next
> buffer slot from the hardware, causing the CPU usage of the X server
> to rise towards 100% over a few seconds. Then the input buffer
> overflows, the engine is reset and everything starts over again ...
>
> I believe XvMC might have similar problems with frame dropping if it
> cannot drop frames ... ?

Certainly, on Unichromes, the Xv / XvMC code doesn't allow displaying frames faster than the refresh rate. The result will be a busy-wait in the driver code and later a dropped frame in xine. This is not because the video engine is too slow, but because the overlay hardware only allows updating during vblank to avoid tearing. I'm not sure how nvidia XvMC handles this.

> You could try to change the engine sync method to audio resampling and
> modify metronom to run 1...2% slower to verify this. Or change the
> display refresh to 51Hz...
>
> - Petri

/Thomas
|
From: Reinhard N. <rn...@gm...> - 2006-12-30 07:37:13
|
Hi,

Reinhard Nissl wrote:

> As a solution might be rather complex I'd first like to hear your
> thoughts about the following idea: initially one frame should be taken
> away from the frame pool. As a result the maximum number of available
> frames in the video output queue will decrease by one so the above
> detection code must be adjusted by one (but that's not complicated). The
> xxmc frame drawing code must be changed so that it puts this spare frame
> into the frame pool just before the usleep(). This will let the decoder
> get a frame while the current frame is still to be displayed a second
> time. Furthermore, when the current frame would be disposed to the frame
> pool, it will be taken away to be the new spare one.

I have to add some information to explain the effect of the change:

One may wonder what the benefit of this change is, as frames are still displayed every 40 ms and put back to the frame pool every 40 ms. The difference is the phase at which these actions happen. Before the change the phase offset is more than 20 ms (> 180 °) and afterwards it is almost 0 ms (~ 0 °).

Using more frame buffers is actually no solution to this issue, as it seems to protect you from buffer underruns (= frame drops), but complex streams may still result in a buffer underrun as not every decoder is able to operate at twice the normal speed.

This makes me think of not touching xxmc in this regard, but putting this functionality into the video out loop of xine-engine, as this phase offset concerns all implementations -- decoders and output devices.

Bye.
--
Dipl.-Inform. (FH) Reinhard Nissl
mailto:rn...@gm...
|
From: Reinhard N. <rn...@gm...> - 2006-12-31 15:27:44
|
Hi,

Reinhard Nissl wrote:

> Using more frame buffers is actually no solution to this issue as it
> seems to protect you from buffer underruns (= frame drops) but complex
> streams may still result in a buffer underrun as not every decoder is
> able to operate at twice the normal speed.
>
> This makes me think of not touching xxmc in this regard, but to put this
> functionality into the video out loop of xine-engine as this phase
> offset concerns all implementations -- decoders and output devices.

Well, while talking to myself, I have to correct some statements from the previous emails.

Concerning the phase issue: it is already addressed in the video out loop, as it keeps a reference to the current frame (in that context named last_frame) which is released immediately before showing the next frame.

Furthermore, xxmc keeps references (in that context named recent_frames) to the last two frames. These references are updated just before showing the current frame, so releasing the no longer needed reference is in correct phase too.

This leads to the question: why hold these two references if they are never used? A comment says they are for deinterlacing, but I cannot see any code yet. So by holding just one reference (or even no reference, as the video out loop does this already), an extra frame buffer could be freed up and made available for the video output queue.

So this means just 1 frame buffer is in use while displaying, 2 are in use by the decoder (I was wrong to assume 3, as the forward reference frame has already been pushed to the video output queue) and as a result 5 frame buffers can be ready for display while a 6th is pushed to the queue. After that the decoder has to wait for a frame to get disposed.

Measuring how much time the decoder needs to decode a frame occasionally shows a value of 35 to 40 ms for the ASTRA HD 1080i demo loop. This means that a buffer fluctuation of at least 1 should be considered when determining frame_drop_limit_max.
In my current experiments with xxmc's bob deinterlacer, it turns out that the simple usleep() of almost half the frame duration seems to be of major concern. Occasionally it happens that showing a frame deinterlaced takes up to 45 ms. As a result, the decoder gets a free buffer later than expected. Although the video out loop tries to handle this delay, xxmc doesn't know of it and takes at least a further 20 ms to display the next frame deinterlaced. Finally the buffer fill level dips for a short time, but the detection logic already orders a frame drop.

At the moment I've replaced the single usleep() by a loop which uses gettimeofday() and usleep(10) to delay more precisely. The result is that it now takes roughly 25 ms to show the frame deinterlaced. CPU load on my P4 2.8 GHz HT hasn't changed much, but I still need to run tests on my EPIA 6000.

Bye.
--
Dipl.-Inform. (FH) Reinhard Nissl
mailto:rn...@gm...
|
From: <th...@tu...> - 2007-01-02 13:45:06
Attachments:
xxmc_timings.patch
|
Reinhard Nissl wrote:

> Hi,
>
> Reinhard Nissl wrote:
>
>> Using more frame buffers is actually no solution to this issue as it
>> seems to protect you from buffer underruns (= frame drops) but complex
>> streams may still result in a buffer underrun as not every decoder is
>> able to operate at twice the normal speed.
>>
>> This makes me think of not touching xxmc in this regard, but to put this
>> functionality into the video out loop of xine-engine as this phase
>> offset concerns all implementations -- decoders and output devices.
>
> Well, while talking to myself, I have to correct some statements from
> the previous emails.
>
> Concerning the phase issue: it is already addressed in the video out
> loop as it keeps a reference to the current frame (in that context named
> last_frame) which is released immediately before showing the next frame.
>
> Furthermore, xxmc keeps references (in that context named recent_frames)
> to the last two frames. These references are updated just before showing
> the current frame so releasing the no longer needed reference is in
> correct phase too.
>
> This leads to the question, why holding these two references if they are
> never used?

There are two reasons.

1) The deinterlacing hardware in some Unichromes needs the previous frame as a reference. Since there is no XvMC support for this hardware yet, this is not a good reason, so one frame can be dropped.

2) We're always keeping the previous frame since we cannot be sure that the hardware is finished with it until the current frame is displayed. XvMCPutSurface makes sure the previous frame is displayed and tells the hardware to display the current frame as soon as possible.

So if we want to go down to only one previous frame, we need to call xxmc_add_recent_frame() after the surface has been put at least once. I've done that in the attached patch.

> A comment says for deinterlacing but I cannot see any code yet.
> So by holding just one reference (or even no reference, as the video out
> loop does this already) an extra frame buffer could be freed up and made
> available for the video output queue.
>
> So this means just 1 frame buffer is in use while displaying, 2 are in
> use by the decoder (I was wrong to assume 3 as the forward reference
> frame has already been pushed to the video output queue) and as a result
> 5 frame buffers can be ready for display while a 6th is pushed to the
> queue. After that the decoder has to wait for a frame to get disposed.
>
> Measuring how much time the decoder needs to decode a frame shows
> occasionally a value of 35 to 40 ms for the ASTRA HD 1080i demo loop.
> This means that a buffer fluctuation of at least 1 should be considered
> when determining frame_drop_limit_max.
>
> In my current experiments with xxmc's bob deinterlacer it turns out that
> the simple usleep() of almost half the frame duration seems to be of
> major concern. Occasionally it happens that showing a frame deinterlaced
> takes up to 45 ms. As a result, the decoder gets a free buffer later
> than expected. Although the video out loop tries to handle this delay,
> xxmc doesn't know of that and takes at least a further 20 ms to display
> the next frame deinterlaced. Finally the buffer fill level dips for a
> short time but the detection logic orders a frame drop already.
>
> At the moment I've replaced the single usleep() by a loop which uses
> gettimeofday() and usleep(10) to delay more precisely. The result is
> that it now takes roughly 25 ms to show the frame deinterlaced. CPU load
> on my P4 2.8 GHz HT hasn't changed much, but I still need to run tests
> on my EPIA 6000.

In theory, the decoding thread should be able to work during the video-out usleep(), but might be prevented from doing so by the context_reader_lock being held. Please try the attached patch to see if that improves the situation.
Also, the long usleep() before displaying the next field is meaningless on some hardware, since the hardware doesn't allow a new frame to be displayed before the next vblank (to stop tearing). A configurable option to shorten the usleep() could be useful here. A short usleep() will always be needed to avoid busy-wait loops in the driver. Unichromes don't have interrupts for this.

> Bye.

/Thomas
|