Hi there,
follow-up of the mailing list discussion.
On my machine, as soon as I try to encode a video, the fps drops to unacceptable values. The video encoding seems to work, at very low fps, as files are created and playable.
Certainly this is some issue with the encoder, my hardware is quite reasonable:
Resolution is 2560x1600
Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (12 cores, 24 threads)
NVIDIA Corporation GP104 [GeForce GTX 1080]
fgfs does not use the extra available threads when encoding, it sticks at 100% (but threads are enable, I do see 200% sometimes while running the normal sim).
during the build of simgear I see that:
Found FFmpeg: /usr/lib64/libavcodec.so;/usr/lib64/libavformat.so;/usr/lib64/libavutil.so;/usr/lib64/libswscale.so found components: AVCODEC AVFORMAT AVUTIL SWSCALE
which suggests that software scaling is in used? (it is ffmpeg 4.3.3), but other softs link to it seem to be able to use hardware acceleration.
ffmpeg -codecs | grep 264 gives:
DEV.LS h264 H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10 (decoders: h264 h264_v4l2m2m h264_qsv ) (encoders: libx264 libx264rgb h264_qsv h264_v4l2m2m h264_vaapi )
Let me know If you want me to test specific thing
libswrescale is not only used for resizing the video, but also for colorspace conversions: https://ffmpeg.org/libswscale.html#Description
So, it'd be worth to find out in which color format FG is capturing the frames and which codec is using that colorspace to minimize the encoding impact on FPS as much as possible.
libswscale.so is used to convert FG's main window's RGB pixels into YUV format - all the video codecs require YUV as input. I think this is pretty standard - i've just copied the ffmpeg encoding examples, and i don't think this can be a significant source of slow-down.
I think it's suspicious that video encoding is not using your additional CPU cores here.
Is it possible that the fgfs process is being tied to a single core for example?
The FG video encoding code explicitly runs in its own thread so that FG can always get on with other things while we are encoding (because we wait for ffmpeg to make progress). And i think Ffmpeg also creates and runs its own threads - it certainly does for me, where video encoding changes
top
's cpu usage for fgfs from 50% to 380% (on a 4-core machine, i.e. it's using all available cpu cores).I might have a look at allowing the video encoding code to force the number of threads being used by ffmpeg, and we can see whether that changes things. But i'm a little suspicious of your setup here because the defaults should just work. (And you are also seeing a curious slowdown from Sentry).
- Jules
It is built with hardened compilation options, namely not higher than -O2 and sse2, threads enable, to be distributed under Mageia linux. So, it indeed may differ from the nightly build, but it is certainly closer to what distros will be distributing.
If you have a suspicion of a bug related to built options, I can provide full details of the cmake command use to do the build (simgear and/or flightgear)?
Another thing I've noted is that the default CRF (logged in the console) looks very low. At quality 1, it displays a crf of 3. For x264 codecs, this should be more around 20? Is log scale, so 3 means very close to lossless, maybe the bandwidth just become too large for 2K screen.
Last edit: eatdirt 2022-04-25
Thanks for this info.
Yes, please post your cmake command, just in case the compiler options are effecting things. Also would you be able to try a standard FG build?
However i was more concerned that there could be something about your system configuration that is restricting how many cores the FG process is allowed to use. Could you try running something like
recordmydesktop
and see what its CPU usage is intop
?The ideal test would be to try a different Linux distribution on the same or similar hardware, in case there's something odd about Mageia's ffmpeg package, but i appreciate that this might not be possible.
Thanks,
- Jules
I've just built and tested recordmydesktop, I did not know this one. It takes at most 1% of CPU during the recording, then 100% after pressing CTRL+C when dumping the ogv file.
For ffmpeg, I've never tested calling the libs, but I'am using regularly the ffmpeg binary for encoding 4K images into 4K lossless films (x264). All the 24 cores are full steam when doing so.
I can also have a look to the simgear code, maybe that would give me some clue on what could be wrong with my ffmpeg install, if any.
Log of simgear build here:
https://curl.irmp.ucl.ac.be/~chris/upload/fg/tmp/log_simgear.txt
I'll test a nightly official build as soon as I can!
Last edit: eatdirt 2022-04-25
I've noticed something weird with 2020.4.0, that could be related.
One of the nasal module for the SpaceShuttle uses an extra-thread to do some calculations. With 2020.4.0, I have a huge fps drop when this thread is running. The problem may not be in ffmpeg but the way these threads are implemented? Did something change in our threading implementation, like some extra locking?
James will have a much better idea that me about potentially relevant changes on next.
The only thing i can think of is that properties have been made thread-safe on next, using read/write locks. Various were made to ensure that this didn't slow things down measurably (which for example lead to a change to not call parent nodes' listeners for value-changed events by default), so i'd hope that this isn't causing the problem. I certainly can't see how it could cause a large slow-down to threaded code.
One can temporarily disable the read/write locking by setting environmental variable SG_PROPERTY_LOCKING=0. Or at runtime by setting /sim/property-locking/active=false. Assuming this doesn't fix the problem, please don't keep this setting, as it opens us up to potential SEGVs in code that uses properties.
Thanks,
- Jules
I'm also pretty clueless about Nasal threads. But if the code was using them and property locking is impacting performance, this more or less means the previous was massively thread-unsafe, becuase it's exactly this case whcih requires properties to be locked.
It's worth trying the test Jules suggests to disable locking, of course - to establish if properpty locking is really the issue or not.
All right, thanks for the feedback, I did the test and property locking is not the cause, the fps drops is about the same when I switch on video recording (or when that nasal space shuttle (STS) thread runs).
I have not been able to test the nightly appimage, it segfaults when I start it, in dbus. That's most probably some incompatible dbus library versionning between the nightly build and my system, so nothing fancy there.
Looking for ressources, I've tried the FG system monitor, but that looks normal. "events" take most of the resources for the STS, "gui" for the UFO, but fps drop is huge for both when video is on (from more than 100 to 10 for the UFO).
When starting videos, perf top shows that most resources are taken by the encoder. What may not be normal is that these are not using the extra threads efficiently. This is top (H engaged to see threads) for the fgfs process when video encoding is running
Only 2 threads are active, and seems to share a single core. The other guys sleep. Perf top shows this, so encoding indeed burns most of the ressources:
I don't have this specific issue with other threaded programs. When I stop the video encoding, I am getting that:
PS: When running the nasal STS module, the new thread properly takes 100% (fgfs at 200%), I am getting that. Most of the calls are allocating and freeing integer, but that may be just nasal specifics.
Ok that's looking like there's some non-trivial thread CPU affinity going on.
Could you run this command, with
PID
replaced by the pid of the fgfs process, and post the output for the three cases - normal, running STS thread, and video encoding?for i in `ls /proc/PID/task/`; do taskset -p $i; done
Interestingly, for me this is showing some threads tied to particular CPUs (i've only got 4), which i wasn't expecting. For basic fgfs, i see:
So the main thread 17992 is tied to cpu 0 (
1
), other threads are the same or run on all cpus (f
) or are tied to cpu 1 (2
).Then if i start video encoding, the extra threads look like:
So all but one of the ffmpeg threads run on all cores.
I think that new threads default to inheriting their parent's affinity. But
openscenegraph/src/OpenThreads/pthreads/PThread.cpp
seems to be making calls topthread_setaffinity_np()
orsched_setaffinity()
.I haven't investigated further yet.
- Jules
Ok, here we go without running the video encoding:
and running it, the same!
But, I am interested in your comment about openscenegraph, I had assume that osg threads were unrelated to simgear thread. Maybe I should toy with this?
--prop:/sim/rendering/multithreading-mode=
Same thing with:
--prop:/sim/rendering/multithreading-mode=DrawThreadPerContext
no video:
video on, the same!
The interesting part is that there is an extra affinity, to cpu 4, which is used, my frame rate just doubled, instead of going from 100 to 10, it went from 100 to 20. So that's it, we should understand why this affinity are set by default!
Just wanted to add that I suspect it is dramatic for me alone due to the resolution. For instance, even with 1 thread, when I resized the window to 800x600, the fps goes down but remain usable, like 20 fps. But full screen, that is a down to 4-5. Since the bandwitdh required scales as the square of the resolution, that goes down fast with high res!
It's very odd that activating video encoding only creates one new thread for you, and it's not surprising that encoding your large window using just this single thread slows things down a lot - Flightgear has to wait for the encoding of the current frame to complete before if can move to the next thread.
One other thing - i think that new threads default to inheriting the affinities of their parent thread, so OSG's affinity settings on startup for the main thread and others, are likely to effect the affinities of newly-created threads later on.
I've just pushed a change to the flightgear and simgear repositories that gives control over whether we tell OSG to tie threads to cpu cores. Running with
--prop:bool:/sim/thread-cpu-affinity=false
will disable these affinities; it would be interesting to see what difference this makes on your machine.However unless this effects how many threads ffmpeg uses, i can't see it helping your slow-down.
I guess there's an outside chance that, when starting encoding, ffmpeg looks at the affinity of the current thread and, if it's set to a single core, figures that there's no point in creating more than one (similarly tied) new thread for encoding, because only one will get to run at a time.
Thanks,
- Jules
Ok, that definitely unlocked things.
I'll start with the sure thing: This guy unset:
--prop:/sim/rendering/multithreading-mode
with this:
--prop:bool:/sim/thread-cpu-affinity=false
for the UFO.
Not encoding:
Switching on x265 encoding, fps from 100 to ~25, but that works. Clearly ffmpeg is now forking new threads and uses them:
For x264, the same happens, but the threads are not very much efficient, the fps goes the same, from 100 to ~25, I guess that's normal, x264 being less demanding.
That also suggests that 25 fps is not really due to encoding but something else. Independently of the fps, moving the view while encoding is quite laggy, but that is not as catastrophic as before.
Weird thing is that one thread appears to be locked to core 2?
NB: The unsure things: I've been seeing segfaults if the multithreading-mode is changed, like this
--prop:/sim/rendering/multithreading-mode=DrawThreadPerContext
when stopping x264 encoding after a change from full screen to windowed mode. The rest is the same, CPU affinity is free up to number 2 and ffmpeg succeeds to fork!
PS: It is still not perfect, the lag while encoding is quite hard, but I suspect the main issue is solved, so feel free to close the bug. I'll explore on my side what osg is doing with thread, that looks ugly, but there may be some compilation options I've missed (osg-3.6)
Thanks for the help!
Seems to be an old story, long to read, but that's exactly what we are facing.
https://osg-users.openscenegraph.narkive.com/685ht2bm/singlethreaded-leading-to-whole-application-just-running-on-one-core
Last edit: eatdirt 2022-04-30
Thanks again for trying things, good to see changing affinities is helping.
I've just pushed some more changes:
Tweak the handling of
/sim/thread-cpu-affinity
at startup - set to "none" to turn off thread affinities (same as previous handling offalse
), "osg" to allow OSG to set affinities but then attempt to remove main thread's affinity (not sure this works very well), or unset/"" for default behaviour (previously unset or true).New property
/sim/affinity-control
allows crude runtime control of thread affinities on Linux - setting to "clear" remembers the current affinities and removes all threads' affinities; "revert" restores the remembered affinities.New property
/sim/video/log_sws_scale_stats
- set to true before starting encoding to enable periodic statistics written to terminal about average time spent insws_scale()
each frame. We callsws_scale()
on a single thread before sending the resulting YUV data to the encoder, so this will tell us whether it is the bottleneck in your case with a very large window.- Jules
Hello there,
I've played with the new options. Concerning
/sim/thread-cpu-affinity
set to 'none' or 'osg' does the same for me, it has, as before, one thread locked onto core 2 (see previous posts).
However, resetting the affinity in sim with
works, this affinity to core 2 disappears. Setting "revert" also works, affinity to core 2 comes back. Starting with default OSG affinity settings in which all threads are on core 1 & 2, this affinity-control prop works as expected.
It works only in sim though, I've tried to set it on the command line but then it does nothing, I suspect threads are not created yet.
For the video encoding, this is the log I am getting for x264. This is the UFO at BIKF, the scenery was quite heavy with rain and looking at the city (30fps without encoding, 13 with encoding on on multithreads).
the same with x265:
As far as I've tested, without video encoding, I do not see any difference between OSG affinity set by default and OSG affinity removed. I am pretty sure the folks at OSG know their business, and I am old enough to remember that, at the beginning of multicore processors, OS scheduling on threads was really a problem. But this was quite some time ago :)
Thanks for testing things.
I've seen the same thing as you with no difference between 'none' and 'osg'; see recent post on flightgear-devel@.
I wouldn't expect
/sim/affinity-control
to work on the command-line; it's only for investigation purposes anyway and is limited in what it can do - if OSG creates a thread after 'clear', the new thread will have no cpu affinity and 'revert' will not change its affinity because we won't have any remembered affinity setting to restore.Looking at your figures, the reduction of frame rate from 30fps to 13fps implies extra time per frame of
1/13-1/30 = 0.044s
. (So upper bound for framerate when recording is1/0.044 = 22.9fps
.)Of the 0.044s per frame, it looks like 0.014s is from
sws_scale()
, i.e. about 1/3 of the video encoding slowdown is fromsws_scale()
(running in a single-thread), and the remaining 2/3 is from the actual encoder (running on multiple threads).If we were to somehow eliminate
sws_scale()
's delay (e.g. by running in multiple threads, giving each thread a portion of the image to convert from RGB to YUV, or by using the GPU somehow), the lower bound for encoding time per frame would be2/3*0.044=0.029s
, which would reduce 30fps frame rate to1/(1/30+0.029) = 16.0fps
.I notice you were ok for me to close this bug in an earlier post. The bug is mostly understood now, and fixing things properly will require wider discussion and experimentation. Also i find it much easier to communicate via emails to the mailing list than a web-based interface, so i'll close it after this post.
Thanks,
- Jules
Yea, bug is closed. Thanks for the fixes!