SD is not able to use the multiple processors / cores of modern PCs, as it is currently a single-threaded program, as far as the CPU and the graphics and simulation engines are concerned (the network code actually uses multiple threads, but only for asynchronous communications, and it seems that the OpenAL sound library also uses multiple threads internally).
As the CPU is generally the bottleneck for SD (except when the GPU is really weak), this means that gamers can't enjoy as many AI opponents or as rich-looking tracks as they could expect from a multi-core computer.
The central place for analysing CPU consumption while racing in SD is the race engine main loop callback, and particularly its core located in the ReUpdate() and ReOneStep() functions in raceengine.cpp.
In the normal display mode (RM_DISP_MODE_NORMAL), the code that is run right after the display function, at the end of each event loop (see guieventloop.cpp), can be summed up this way:
    // svn 2402
    // raceengine.h
    RCM_MAX_DT_SIMU = 0.002
    RCM_MAX_DT_ROBOTS = 0.02
    // raceengine.cpp
    global tSimu    // Current simulation "real" time (initialized at race start, only incremented by RCM_MAX_DT_SIMU steps)
    global tRob     // Current robot time (initialized at race start, only incremented by RCM_MAX_DT_SIMU steps)
    global tRobLast // Last robot update time (initialized at race start, fed with tRob after robots update)
    // Loop as many times as necessary for tSimu to make up the time
    // (tSimu is always late unless the computer can achieve 1/RCM_MAX_DT_SIMU = 500 ReUpdate calls per second!).
    while (clock() - tSimu > RCM_MAX_DT_SIMU)
    {
        tRob += RCM_MAX_DT_SIMU
        if (tRob - tRobLast >= RCM_MAX_DT_ROBOTS) // Only RCM_MAX_DT_ROBOTS/RCM_MAX_DT_SIMU times per loop.
        {
            for each robot racing
                robot->rbDrive(...)
            tRobLast = tRob
        }
        tSimu += RCM_MAX_DT_SIMU
        SimuItf.update(...)
    }
    GraphicItf.refresh(...) // Note: The displayed frame rate is computed here (1 more frame each time this function is called).
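To make the catch-up behaviour of this loop easier to reason about, here is a minimal C++ sketch of the same dispatching logic. It is only an illustration, not the actual race engine code: `RaceClock` and `updateOnce` are hypothetical names, the calls to `rbDrive()`, `SimuItf.update()` and `GraphicItf.refresh()` are replaced by counters, and times are expressed in integer milliseconds (an assumption made here so that the arithmetic is exact).

```cpp
#include <cstdint>

// Time steps in integer milliseconds (assumption of this sketch; the
// summary above uses seconds: 0.002 and 0.02).
constexpr std::int64_t kDtSimuMs = 2;    // stands for RCM_MAX_DT_SIMU
constexpr std::int64_t kDtRobotsMs = 20; // stands for RCM_MAX_DT_ROBOTS

// Hypothetical container for the three global times, plus counters that
// replace the real robot / simulation / graphics calls.
struct RaceClock {
    std::int64_t tSimu = 0;    // current simulation time
    std::int64_t tRob = 0;     // current robot time
    std::int64_t tRobLast = 0; // last robot update time
    int simuSteps = 0;         // number of simulated SimuItf.update() calls
    int robotUpdates = 0;      // number of simulated rbDrive() rounds
    int frames = 0;            // number of simulated GraphicItf.refresh() calls
};

// One pass of the dispatching scheme: catch up with the wall clock by
// fixed simulation steps, then draw exactly one frame.
void updateOnce(RaceClock& rc, std::int64_t nowMs) {
    while (nowMs - rc.tSimu > kDtSimuMs) {
        rc.tRob += kDtSimuMs;
        if (rc.tRob - rc.tRobLast >= kDtRobotsMs) {
            ++rc.robotUpdates; // here every racing robot would drive
            rc.tRobLast = rc.tRob;
        }
        rc.tSimu += kDtSimuMs;
        ++rc.simuSteps; // here the physics engine would advance one step
    }
    ++rc.frames; // here the graphics engine would refresh the display
}
```

With these step sizes, a single call made 101 ms after race start runs 50 simulation steps and 5 robot rounds before drawing one frame, which shows how a slow frame makes the next pass proportionally more expensive on the CPU.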
This hard-to-read dispatching scheme actually results in the following (see code profiling for evidence):
The current code profiling task already gives us big facts about the normal (usual) racing mode (with 3D graphics) :
On the other hand, from the gamer's point of view :
Finally,
In any case, we plan to use the threading API of the SDL library, as the simplest choice. We'll have to check first if the offered API fits our simple needs and works well, but we are quite confident about it.
As for the remainder, the first idea that comes to mind is to use multiple threads to execute robot->rbDrive(), SimuItf.update() and GraphicItf.refresh() simultaneously.
The bad thing is that these 3 (or more) game components heavily use shared data !
But the good thing is that, at each step :
Note: The first assertion has one exception : the car priv.collision bit field, that is sometimes reset by the graphics engine. We need to check if this is a real write, that is some kind of data acknowledgment, or simply something local to the graphics engine. And find a solution in the first (bad) case.
In this scenario, as a workaround for the concurrent reads / writes to the shared race engine data (race state + situation):
- the graphics thread works on the situation of the previous time step,
- while the other thread (the situation updater) is preparing the situation for the current time step.
So, the race engine dispatching scheme is mainly as follows:
Of course, this is no use if the target computer has only 1 core / processor : in this case, the dispatching code will detect this and run everything under the main and only thread (just as before).
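The idea of scenario 1 (graphics reading the previous situation while the updater prepares the current one, with a single-thread fallback) can be sketched as follows. This is a simplified illustration under several assumptions, not the implemented code: `Situation`, `updateSituation`, `runRace` and `hasSeveralCores` are hypothetical names, standard C++ threads are used instead of the SDL threading API so that the sketch is self-contained, and a thread is created at each step only for brevity (a real implementation would rather keep one persistent updater thread).

```cpp
#include <array>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical, much reduced stand-in for the shared race situation.
struct Situation {
    int step = 0;
    double carX = 0.0;
};

// Stand-in for "robots + physics": builds the next situation from the previous one.
void updateSituation(const Situation& prev, Situation& next) {
    next.step = prev.step + 1;
    next.carX = prev.carX + 0.5;
}

// Runs nSteps time steps and returns the step numbers the "graphics" saw.
// With two threads, the updater writes only the back buffer while the
// graphics side reads only the front buffer, so no locking is needed.
std::vector<int> runRace(int nSteps, bool useTwoThreads) {
    std::array<Situation, 2> buffers{};
    std::vector<int> rendered;
    int front = 0; // index of the situation the graphics engine may read
    for (int i = 0; i < nSteps; ++i) {
        const int back = 1 - front;
        if (useTwoThreads) {
            std::thread updater(updateSituation,
                                std::cref(buffers[front]), std::ref(buffers[back]));
            rendered.push_back(buffers[front].step); // "draw" the previous step
            updater.join();                          // sync point before swapping
        } else {
            rendered.push_back(buffers[front].step);
            updateSituation(buffers[front], buffers[back]);
        }
        front = back; // the freshly prepared situation becomes readable
    }
    return rendered;
}

// Single-core detection; hardware_concurrency() may return 0 when unknown,
// in which case this sketch falls back to the single-thread path.
bool hasSeveralCores() {
    return std::thread::hardware_concurrency() >= 2;
}
```

A caller would typically write `runRace(n, hasSeveralCores())`. Both paths display the same sequence of situations; the dual-threaded one simply overlaps drawing with the preparation of the next step.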
CPU affinity: To keep the 2 threads from moving from one CPU/core to another too often, we'll have to check whether it is possible to pin them where they are at startup (such moves may hurt performance, as they probably imply cache resets). We are quite confident about this, as APIs for that exist at least on Windows and Linux.
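As an illustration of the kind of API meant here, the following sketch pins the calling thread to the first CPU it is currently allowed to run on, using `SetThreadAffinityMask` on Windows and `sched_setaffinity` on Linux. The function name `pinCurrentThreadToFirstAllowedCpu` is hypothetical, and the choice of "first allowed CPU" is an assumption made for the example; it is not necessarily what the game does.

```cpp
#if defined(_WIN32)
#include <windows.h>
#elif defined(__linux__)
#include <sched.h>
#endif

// Pins the calling thread to the first CPU of its current affinity mask.
// Returns the index of that CPU, or -1 on failure / unsupported platform.
int pinCurrentThreadToFirstAllowedCpu() {
#if defined(_WIN32)
    DWORD_PTR processMask = 0;
    DWORD_PTR systemMask = 0;
    if (!GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask))
        return -1;
    for (int cpu = 0; cpu < static_cast<int>(sizeof(DWORD_PTR) * 8); ++cpu) {
        const DWORD_PTR bit = static_cast<DWORD_PTR>(1) << cpu;
        if (processMask & bit) {
            // SetThreadAffinityMask returns the previous mask, or 0 on failure.
            return SetThreadAffinityMask(GetCurrentThread(), bit) != 0 ? cpu : -1;
        }
    }
    return -1;
#elif defined(__linux__)
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    // A pid of 0 designates the calling thread.
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t single;
            CPU_ZERO(&single);
            CPU_SET(cpu, &single);
            return sched_setaffinity(0, sizeof(single), &single) == 0 ? cpu : -1;
        }
    }
    return -1;
#else
    return -1; // no affinity API handled by this sketch
#endif
}
```

Each of the 2 threads would call such a function once at startup, with a different target CPU for each; picking those targets is left out of this sketch.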
This scenario is quite similar to the 1st one, except that a new thread takes care of the mirror in the graphics engine. There should not be any concurrent data access between the 2 graphics threads (as they only read the situation), and the solution found in scenario 1 to solve this issue still applies.
This should normally bring us some FPS on triple-core (or more) configurations, provided the graphics middleware supports this as efficiently as expected.
This scenario is quite similar to the 1st one, except that 2 separate threads take care of
Of course, this is of no use if the target computer has fewer than 3 cores / processors.
Only scenario 1 was implemented and tested (1 thread for the graphics engine, 1 for the remainder).
Here are the frame rate gains (or losses) we get after implementing the first scenario.
All test sessions are run in the following common conditions:
Warnings:
Feel free, everyone, to add lines for your configurations, as things could be quite different from one to another.
Additional test condition details:
Configuration (frames per second, CPU affinity Off) | SGL1 Mono | SGL1 Dual | SGL1 Gain | SGM1 Mono | SGM1 Dual | SGM1 Gain | SGH1 Mono | SGH1 Dual | SGH1 Gain |
---|---|---|---|---|---|---|---|---|---|
WinXP 32 SP2, Intel Core 2 Duo E8400 3.0GHz, nVidia Quadro FX 1700 512 Mb | 102 | 97 | -5% | 69 | 78 | +13% | 62 | 72 | +16% |
Linux 64 2.6.31, AMD Athlon 64x2 4600+ 2.4GHz, nVidia 8800 GT 512 Mb (driver 185.18.36) | 167 | 152 | -9% | 57 | 53 | -7% | 39 | 47 | +20% |
WinXP 32 SP2, AMD Athlon 64x2 4600+ 2.4GHz, nVidia 8800 GT 512 Mb (driver 195.62) (1) | - | - | - | - | - | - | 28 | 48 | +71% |
Notes: (1) Frame rates vary quite a lot from one race to the next, for unknown reasons; sometimes the 2 cores are used at 50% each, sometimes only 1 at 100%; each figure is a mean value computed from 5 or 6 races. Sync to VBlank could not be turned off, so 1) only SGH1 was tested, 2) the given figure for the dual-threaded mode may be slightly underestimated (but not much).
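The Gain columns are not defined explicitly on this page, but the figures are consistent with a relative frame rate difference, rounded to the nearest percent. A minimal sketch of that computation, assuming this interpretation (`gainPercent` is a hypothetical helper, not part of the game code):

```cpp
#include <cmath>

// Relative frame rate gain of the dual-threaded mode over the mono-threaded
// one, in percent, rounded to the nearest integer (assumed definition).
long gainPercent(double monoFps, double dualFps) {
    return std::lround((dualFps - monoFps) / monoFps * 100.0);
}
```

For instance, `gainPercent(62, 72)` gives 16 and `gainPercent(102, 97)` gives -5, matching the first row of the table above.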
Configuration (frames per second, CPU affinity On) | SGL1 Mono | SGL1 Dual | SGL1 Gain | SGM1 Mono | SGM1 Dual | SGM1 Gain | SGH1 Mono | SGH1 Dual | SGH1 Gain |
---|---|---|---|---|---|---|---|---|---|
WinXP 32 SP2, Intel Core 2 Duo E8400 3.0GHz, nVidia Quadro FX 1700 512 Mb | - | - | - | - | - | - | % | | |
Linux 64 2.6.31, AMD Athlon 64x2 4600+ 2.4GHz, nVidia 8800 GT 512 Mb (driver 185.18.36) (1) | - | - | - | - | - | - | 37.5 | 47 | +25% |
WinXP 32 SP2, AMD Athlon 64x2 4600+ 2.4GHz, nVidia 8800 GT 512 Mb (driver 195.62) (1) | - | - | - | - | - | - | 39 | 50 | +28% |
Notes: (1) CPU affinity is clearly good for performance under Windows XP, especially in mono-thread mode, while it seems to be of no use under Linux.
We are eager to see what happens under Windows Vista / 7, which are said to handle multi-threaded applications far better.
Additional test condition details:
Configuration (frames per second, CPU affinity Off) | SGL1 Mono | SGL1 Dual | SGL1 Gain | SGM1 Mono | SGM1 Dual | SGM1 Gain | SGH1 Mono | SGH1 Dual | SGH1 Gain |
---|---|---|---|---|---|---|---|---|---|
WinXP 32 SP2, Intel Core 2 Duo E8400 3.0GHz, nVidia Quadro FX 1700 512 Mb (1,2,3) (driver 6.14.11.7804) | - | - | - | - | - | - | 62 | 75/81 | +21/+31% |
WinXP 32 SP3, Intel Xeon W3520 2.6GHz, nVidia Quadro FX 1800 768 Mb (driver 6.14.11.8260) | - | - | - | - | - | - | 68 | 84 | +24% |
Notes:
Configuration (frames per second, CPU affinity On) | SGL1 Mono | SGL1 Dual | SGL1 Gain | SGM1 Mono | SGM1 Dual | SGM1 Gain | SGH1 Mono | SGH1 Dual | SGH1 Gain |
---|---|---|---|---|---|---|---|---|---|
WinXP 32 SP2, Intel Core 2 Duo E8400 3.0GHz, nVidia Quadro FX 1700 512 Mb (driver 6.14.11.7804) (1) | - | - | - | - | - | - | 60/70 | 73/81 | +4/+35% |
WinXP 32 SP3, Intel Xeon W3520 2.6GHz, nVidia Quadro FX 1800 768 Mb (driver 6.14.11.8260) (2) | - | - | - | - | - | - | 68 | 81 | +19% |
Notes:
Even though we no longer plan to use Simu V3, equivalent test sessions give interesting results:
Additional test condition details:
Configuration (frames per second, SGH1) | Affinity Off Mono | Affinity Off Dual | Affinity Off Gain | Affinity On Mono | Affinity On Dual | Affinity On Gain |
---|---|---|---|---|---|---|
WinXP 32 SP2, Intel Core 2 Duo E8400 3.0GHz, nVidia Quadro FX 1700 512 Mb | 72 | 92 | | | | |
WinXP 32 SP3, Intel Xeon W3520 2.6GHz, nVidia Quadro FX 1800 768 Mb | | | | | | |
Linux 64 2.6.31, AMD Athlon 64x2 4600+ 2.4GHz, nVidia 8800 GT 512 Mb (driver 185.18.36) | 25 | 57 | +128% | | | |
Win XP32 SP2, AMD Athlon 64x2 4600+ 2.4GHz, nVidia 8800 GT 512 Mb (driver 195.62) | 19 | 55 | +190% | | | |
Comments on Simu V3 figures :
1.4.0 : As a reference, here are the figures we get with (mono-thread) 1.4.0 in the same conditions :
Physics engine | V2 | V2 | V2 | V3 | V3 | V3 |
---|---|---|---|---|---|---|
Configuration / Scenario (frames per second) | SGL1 | SGM1 | SGH1 | SGL1 | SGM1 | SGH1 |
WinXP 32 SP2, Intel Core 2 Duo E8400 3.0GHz, nVidia Quadro FX 1700 512 Mb | - | - | 62 | - | - | 71 |
WinXP 32 SP3, Intel Xeon W3520 2.6GHz, nVidia Quadro FX 1800 512 Mb | - | - | - | - | - | - |
Linux 64 2.6.31, AMD Athlon 64x2 4600+ 2.4GHz, nVidia 8800 GT 512 Mb (driver 185.18.36) | - | - | 44.5 | - | - | 52.5 |
Win XP32 SP2, AMD Athlon 64x2 4600+ 2.4GHz, nVidia 8800 GT 512 Mb (driver 195.62) | - | - | 34 | - | - |
1) Reported by Ocirne94 :
CPU: Intel Core 2 Quad Q9400 @ 2.66 GHz GPU: nVidia GeForce GTS 250 OS: Kubuntu 10.04 64-bit Speed Dreams r2526, on Manton Speedway, 35 drivers (Supercars, 36GP, gp1600, F1).
Simuv3: 24.7 FPS with multi-threading, 9.3 without.
Simuv2: 25.1 FPS with multi-threading, 19.0 without.
2) Reported by Xavier
CPU: AMD Athlon 64x2 7550 (dual core) GPU: nVidia GeForce 210 OS: Linux Mandriva 2010.1 64 bits Speed Dreams r2544, on e-track-6, 28 drivers (all simplix_trb1, usr_trb1).
Simuv2 : 25.3 FPS with multi-threading (from 25 to 40) with mirror and 45.0 fps without mirror, 6.5 fps without multithread and 13.2 fps without mirror
Simuv3 : 6.2 FPS with multi-threading (from 5 to 19) with mirror and 8.5 FPS without mirror, 0.3 FPS without multithread and 3.2 FPS without mirror
On My ACER Notebook: CPU: Intel Core2 T5800 2.0 Ghz (dual core) GPU: nVidia GeForce Mobile 9600 GT OS: Linux Mandriva 2010.1 64 bits Speed Dreams r2545, on e-track-6, 28 drivers (all simplix_trb1, usr_trb1).
Simuv2 : 25.3 FPS with multithreading (from 19 to 35) with mirror and 37.3 FPS without mirror, 4.5 fps without multi-thread and 11.2 FPS without mirror
Simuv3 : 0.2 FPS with multithreading (from 1 to 8) with mirror and 1.1 FPS without mirror, 0.1 FPS without multi-thread and 0.9 FPS without mirror
On My Top Level PC: CPU: Intel I7 3.07 Ghz (quad core with hyper-threading (8 logical cores)) GPU: nVidia GeForce GTX 285 OS: XP Pack3 32 bits Speed Dreams r2545, on e-track-6, 28 drivers (all simplix_trb1, usr_trb1).
Simuv2 : 43.3 FPS with multi-threading (from 30 to 60) with mirror and 120 FPS without mirror, 35.2 fps without multi-thread and 70 FPS without mirror
Simuv3 : 45.5 FPS with multi-threading (from 43 to 70) with mirror and 90.2 FPS without mirror, 23.2 FPS without multi-thread and 62 FPS without mirror
Note : ALL my tests are run with ssgasky and dynamic sky activated.
Only kept here as the log of our thoughts.
After running some code profiling, it has become partly nonsense.
The basic idea is that we can imagine using multiple threads for simultaneously executing robot->rbDrive(), SimuItf.update() and GraphicItf.refresh().
Different multithreading scenarios can be chosen, from the simplest to the most elaborate one ; some examples :
To decide which is the best, some ideas :
But the really difficult thing here is to make sure that the functions we plan to run simultaneously can really be executed at the same time while keeping the data they share in a consistent state.
What we have to avoid here is one function writing non-atomic data while another function is reading it at the same time.
This can only be achieved through a very careful and self disciplined analysis of the data streams (read/write) among the robots and the physics and graphics engines.
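To illustrate the hazard described above, here is a small sketch where one thread writes a 2-field structure while another reads it. A mutex makes each write and each read atomic as a whole, so the reader can never observe a half-written state. All names (`CarState`, `SharedCar`, `countTornReads`) are hypothetical, and standard C++ threads are used only to keep the example self-contained; locking is just one possible answer and says nothing about which solution fits the race engine data best.

```cpp
#include <mutex>
#include <thread>

// Hypothetical non-atomic shared data; the writer always keeps x == y.
struct CarState {
    int x = 0;
    int y = 0;
};

// Shared state whose reads and writes are serialized by a mutex.
class SharedCar {
public:
    void write(int v) {
        std::lock_guard<std::mutex> lock(mutex_);
        state_.x = v;
        state_.y = v; // without the lock, a reader could see x updated but not y
    }
    CarState read() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return state_; // consistent copy taken under the lock
    }
private:
    mutable std::mutex mutex_;
    CarState state_;
};

// Runs a writer thread against a reading loop and counts how many times
// the reader saw an inconsistent (torn) state.
int countTornReads(int nWrites) {
    SharedCar car;
    int torn = 0;
    std::thread writer([&car, nWrites] {
        for (int i = 1; i <= nWrites; ++i) car.write(i);
    });
    for (int i = 0; i < nWrites; ++i) {
        const CarState s = car.read();
        if (s.x != s.y) ++torn;
    }
    writer.join();
    return torn;
}
```

With the lock in place `countTornReads` always returns 0; removing the two `lock_guard` lines would turn the example into exactly the kind of data race the analysis has to rule out.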
In any case, we plan to use the threading API of the SDL library, as the simplest choice. We'll have to check first if the offered API fits our simple needs and works well, but we are quite confident about it.
The 2 main tasks to achieve here are :
We can achieve this through 2 progressive steps, each of them being a possible end-step for the multi threading task if time comes to be lacking, things don't work as expected or become too complicated.
Commit: [r2526]
Commit: [r2544]
Commit: [r2545]
Wiki: CodeProfiling
Wiki: Index
Wiki: MeetingJune2010
Wiki: MeetingMay2010b
Wiki: TheWayToRelease2