PreparingMultithreading

Multithreading

  1. The problem
  2. Analysis of the main dispatching scheme
  3. Facts and figures
  4. Implementation scenarios
  5. Implementation results
  6. Old material (archives)

The problem

SD is not able to use the multiple processors / cores of modern PCs, as it is currently a single-threaded program as far as the graphics and simulation engines are concerned (the network code does use multiple threads, but only for asynchronous communications, and the OpenAL sound library also seems to use multiple threads internally).

As the CPU is generally the bottleneck for SD (except when the GPU is really weak), gamers can't enjoy as many AI opponents, or tracks as rich-looking, as they could expect from a multi-core computer.

Analysis of the main dispatching scheme

The central place for analysing CPU consumption while racing in SD is the race engine main loop callback, and particularly its core located in the ReUpdate() and ReOneStep() functions in raceengine.cpp.

In the normal display mode (RM_DISP_MODE_NORMAL), the code that is run right after the display function, at the end of each event loop iteration (see guieventloop.cpp), can be summarized as follows:

// svn 2402
// raceengine.h
RCM_MAX_DT_SIMU = 0.002
RCM_MAX_DT_ROBOTS = 0.02

// raceengine.cpp
global tSimu    // Current simulation "real" time (initialized at race start, only incremented by RCM_MAX_DT_SIMU steps)
global tRob     // Current robot time (initialized at race start, only incremented by RCM_MAX_DT_SIMU steps)
global tRobLast // Last robot update time (initialized at race start, fed with tRob after robots update)

// Loop as many times as necessary for tSimu to make up the time 
// (tSimu is always late unless the computer can achieve 1/RCM_MAX_DT_SIMU = 500 ReUpdate calls per second!).
while (clock() - tSimu > RCM_MAX_DT_SIMU)
{
   tRob += RCM_MAX_DT_SIMU
   if (tRob - tRobLast >= RCM_MAX_DT_ROBOTS) // True only once every RCM_MAX_DT_ROBOTS/RCM_MAX_DT_SIMU (= 10) iterations.
   {
      for each robot racing   
         robot->rbDrive(...)
      tRobLast = tRob
   }

   tSimu += RCM_MAX_DT_SIMU
   SimuItf.update(...)
}

GraphicItf.refresh(...) // Note: The displayed frame rate is computed here (1 more frame each time this function is called).

Although hard to read, this dispatching scheme actually results in the following (see the code profiling results for evidence):

  • a mean robots update rate of about 1 / RCM_MAX_DT_ROBOTS (50 Hz), whatever the number of drivers, the simulation engine, the track, ...
  • a mean simu update rate of about 1 / RCM_MAX_DT_SIMU (500 Hz), whatever the number of drivers, the simulation engine, the track, ...
  • a variable graphics update rate (of course, something has to suffer from more drivers, a heavier simulation engine, etc.).
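The catch-up loop summarized above can be turned into a small self-contained sketch that counts how many simulation and robot updates a given amount of elapsed time triggers. This is only an illustration under stated assumptions: `DispatchCounts` and `runDispatch` are hypothetical names that do not exist in raceengine.cpp, the clocks start at 0, and times are expressed in integer milliseconds to avoid floating-point accumulation errors.

```cpp
#include <cassert>

// Time steps from the pseudocode above, in integer milliseconds
// (an assumption of this sketch, to keep the arithmetic exact).
constexpr long kDtSimuMs = 2;     // RCM_MAX_DT_SIMU   = 0.002 s
constexpr long kDtRobotsMs = 20;  // RCM_MAX_DT_ROBOTS = 0.02 s

// Hypothetical helper type: number of updates performed by one catch-up.
struct DispatchCounts {
    long simuUpdates = 0;
    long robotUpdates = 0;
};

// Hypothetical stand-in for the catch-up loop: advances the simulation
// clock from 0 until it is no more than one simu step behind `nowMs`.
DispatchCounts runDispatch(long nowMs) {
    assert(nowMs >= 0);
    DispatchCounts counts;
    long tSimu = 0, tRob = 0, tRobLast = 0;
    while (nowMs - tSimu > kDtSimuMs) {
        tRob += kDtSimuMs;
        if (tRob - tRobLast >= kDtRobotsMs) {
            ++counts.robotUpdates;  // stands for: for each robot, rbDrive(...)
            tRobLast = tRob;
        }
        tSimu += kDtSimuMs;
        ++counts.simuUpdates;       // stands for: SimuItf.update(...)
    }
    return counts;
}
```

With these assumptions, `runDispatch(1002)` performs 500 simulation steps and 50 robot updates, which matches the 500 Hz and 50 Hz rates listed above. The graphics refresh is not part of the loop, so it only gets whatever time is left.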

Facts and figures

The ongoing code profiling task has already established some key facts about the normal (usual) racing mode (with 3D graphics):

  1. CPU sharing :
    1. the main CPU eater is generally the graphics engine (from 30 to 80 %, when moving from few opponents/Simu V3 to many opponents Simu V2),
    2. the second one is the physics engine,
    3. but with Simu V3 and many opponents (AI drivers), this order can be reversed,
    4. then come the robots (AI drivers code), with nearly never more than 10 %.
  2. Even on a moderately powerful video card, Speed Dreams is not limited by the GPU, but by the CPU.
  3. Dispatching scheme :
    1. the actual mean robots update rate is about 1 / RCM_MAX_DT_ROBOTS (50 Hz), whatever the number of drivers, the simulation engine, the track, ...
    2. the actual mean simu update rate is about 1 / RCM_MAX_DT_SIMU (500 Hz), whatever the number of drivers, the simulation engine, the track, ...
    3. the graphics update rate (which gives the FPS figure) automatically adapts to the 2 constraints above, and thus decreases when more AI drivers / Simu V3 are used.

On the other hand, from the gamer's point of view :

  1. Frame rates above 60 or 75 Hz can generally be considered pure luxury,
    • because of the limited capabilities of the human eye / brain,
    • because the vast majority of flat panel screens have an actual refresh rate of 60 or 75 Hz.
  2. The game should handle more AI opponents without losing frames per second.
  3. Speed Dreams only uses one core / processor, whereas most PCs now come with 2 or more.

Finally,

  1. the current simu update rate of 500 Hz means that each simu step corresponds to a maximum car displacement of 0.2 m (the distance covered in 2 ms by a car racing at 360 km/h). Is this needed ?
  2. the current robots update rate of 50 Hz means that each robot step corresponds to a maximum car displacement of 2 m. Isn't this a bit too much ?
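Both displacement figures follow from distance = speed × time step, taking 360 km/h as the top speed as in the first item:

```latex
\begin{aligned}
v &= 360\ \text{km/h} = \frac{360\,000\ \text{m}}{3600\ \text{s}} = 100\ \text{m/s} \\
\Delta x_{\text{simu}} &= v \times 0.002\ \text{s} = 0.2\ \text{m} \\
\Delta x_{\text{robots}} &= v \times 0.02\ \text{s} = 2\ \text{m}
\end{aligned}
```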

Implementation scenarios

In any case, we plan to use the threading API of the SDL library, as the simplest choice. We'll have to check first if the offered API fits our simple needs and works well, but we are quite confident about it.

As for the remainder, the first idea that comes to mind is to use multiple threads to execute robot->rbDrive(), SimuItf.update() and GraphicItf.refresh() simultaneously.

The bad thing is that these 3 (or more) game components heavily use shared data !

But the good thing is that, at each step :

  • The graphics engine only reads this data ...
  • ... that was just written by the robots and physics engine.

Note: The first assertion has one exception: the car priv.collision bit field, which is sometimes reset by the graphics engine. We need to check whether this is a real write (some kind of data acknowledgment) or simply something local to the graphics engine, and find a solution in the first (bad) case.

Scenario 1: 1 thread for the graphics engine, 1 for the remainder

In this scenario, as a workaround for the concurrent reads / writes to the shared race engine data (race state + situation):

  • the graphics thread works on the situation of the previous time step,
  • while the other thread (the situation updater) is preparing the situation for the current time step.

So, the race engine dispatching scheme is mainly as follows:

  • the main thread continues to run the event loop (GfelEventLoop), including our main target here : the raceengine::ReUpdate function,
  • raceengine::ReUpdate is modified in order to :
    • delegate the robots + simu update loop to a dedicated "Simu + robots" thread,
    • but work on a copy of the race engine data (state + situation),
    • this copy is made at the beginning of each call by the event loop, which is the only moment when the 2 threads have to synchronise (that is, wait for each other in order to get consistent data),
    • keep the graphics refresh call as is,
    • take care of the displaying task for all the race messages in order to keep any graphics-related code in the main thread,
  • the dedicated "Simu + robots" thread is the actual owner of the race engine data (state + situation),
  • it takes care of updating this data at the same pace as before (different for the robots and the simu engine),
  • it works on the next "current step" at the same time the main thread displays (a copy of) the "previous step",
  • it does not display anything : when needed, racing messages are queued into the race engine data for the main thread to display them later.

Of course, this is of no use if the target computer has only 1 core / processor: in this case, the dispatching code will detect this and run everything in the main and only thread (just as before).
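The scheme described for scenario 1 can be sketched as follows. This is a minimal illustration, not the Speed Dreams implementation: it uses the C++ standard library (`std::thread`, `std::mutex`) as a stand-in for the SDL threading API mentioned above, and `Situation`, `SituationUpdater`, `stepOnce`, `snapshot` and `useDedicatedThread` are hypothetical names. The updater owns the data, the main thread only ever sees a copy taken under a lock, and race messages are queued in the data until the main thread picks them up.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical, heavily reduced stand-in for the race engine data
// (state + situation); the real structures are far larger.
struct Situation {
    long step = 0;                      // simu steps performed so far
    long timeMs = 0;                    // simulated time, 2 ms per step
    std::vector<std::string> messages;  // race messages awaiting display
};

// Assumption: a dedicated thread only helps with at least 2 cores.
bool useDedicatedThread(unsigned cores) { return cores >= 2; }

class SituationUpdater {
public:
    ~SituationUpdater() { stop(); }

    // One robots + simu step; can also be called directly from the main
    // thread in the single-core fallback mode.
    void stepOnce() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++current_.step;
        current_.timeMs = current_.step * 2;
        if (current_.step % 500 == 0)
            current_.messages.push_back("1 s of race simulated");
    }

    // Starts the dedicated "Simu + robots" thread.
    void start() {
        assert(!worker_.joinable());
        running_ = true;
        worker_ = std::thread([this] {
            while (running_) {
                stepOnce();
                // Stand-in for waiting until the next step is due.
                std::this_thread::sleep_for(std::chrono::microseconds(100));
            }
        });
    }

    void stop() {
        running_ = false;
        if (worker_.joinable()) worker_.join();
    }

    // Called by the main thread at the beginning of each update: the only
    // synchronisation point. Returns a consistent copy for the graphics
    // engine and hands the queued messages over to the main thread.
    Situation snapshot() {
        std::lock_guard<std::mutex> lock(mutex_);
        Situation copy = current_;
        current_.messages.clear();
        return copy;
    }

private:
    std::mutex mutex_;
    Situation current_;                 // owned by the updater
    std::atomic<bool> running_{false};
    std::thread worker_;
};
```

In this sketch the main thread would call `snapshot()` once per frame and pass the copy to the graphics refresh, while `useDedicatedThread(std::thread::hardware_concurrency())` decides between `start()` and calling `stepOnce()` from the main loop. A real implementation would copy far more data, so the cost of that copy is part of what the measurements have to show.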

CPU affinity: In order to keep the 2 threads from migrating from one CPU / core to another too often, we'll have to check whether it is possible to pin them where they are at startup (such migrations may hurt performance, as they probably imply cache flushes). We are quite confident about this, as APIs exist for that on Windows and Linux at least.
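As an illustration of the kind of API involved, here is a minimal sketch of pinning the calling thread to a single CPU, assuming the native calls `sched_setaffinity` (Linux) and `SetThreadAffinityMask` (Windows). The function names `pinCurrentThreadToOneCpu` and `allowedCpuCount` are hypothetical, and this is not necessarily how the affinity code in the SD trunk is written.

```cpp
#include <cassert>
#if defined(__linux__)
#include <sched.h>
#elif defined(_WIN32)
#include <windows.h>
#endif

// Pins the calling thread to one CPU. Returns true on success, false on
// failure or on platforms this sketch does not cover.
bool pinCurrentThreadToOneCpu() {
#if defined(__linux__)
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return false;
    // Pick the first CPU the thread is currently allowed to run on.
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t target;
            CPU_ZERO(&target);
            CPU_SET(cpu, &target);
            return sched_setaffinity(0, sizeof(target), &target) == 0;
        }
    }
    return false;
#elif defined(_WIN32)
    // Simplification: always selects logical processor 0 (mask bit 0).
    return SetThreadAffinityMask(GetCurrentThread(), 1) != 0;
#else
    return false;
#endif
}

// Number of CPUs the calling thread may run on, or -1 if unknown
// (only implemented for Linux in this sketch).
int allowedCpuCount() {
#if defined(__linux__)
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;
    return CPU_COUNT(&allowed);
#else
    return -1;
#endif
}
```

Each of the 2 threads would call such a function once at startup, each with a different target CPU; choosing which CPU goes to which thread is left out here.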

Scenario 2: 2 threads for the graphics engine, 1 for the physics engine

This scenario is quite similar to the 1st one, except that a new thread takes care of the mirror in the graphics engine. There should not be any concurrent data access between the 2 graphics threads (as they only read the situation), and the solution found in scenario 1 to solve this issue still applies.

This should normally gain us some FPS on configurations with 3 or more cores / processors, provided the graphics middleware supports this in the expected efficient way.

Scenario 3: 1 thread for the graphics engine, 1 for the robots, 1 for the physics engine

This scenario is quite similar to the 1st one, except that 2 separate threads take care of

  1. all the robots,
  2. the physics engine.

Of course, this is of no use if the target computer has fewer than 3 cores / processors.

Implementation results

Only scenario 1 was implemented and tested (1 thread for the graphics engine, 1 for the remainder).

SGL1, SGM1, SGH1 scenarios

Here are the frame rate gains (or losses) we get after implementing the first scenario.

All test sessions are run under the following common conditions:

  • Release (optimized) build with OPTION_DEBUG = ON,
  • Simu V2 physics engine,
  • Gaming scenarios : SGL1, SGM1, SGH1 (see [CodeProfiling] for details).

Warnings:

  • It is very important to set the front and rear views and the car chosen for the front view exactly as stated in the gaming scenarios, because the frame rates are very sensitive to the presence / absence of the rear view as well as to the mean number of cars that can be seen in these 2 views.
  • It is equally important to ensure that the current camera is always that of the 2nd driver (for the same reason: keep the display workload as constant as possible).

Feel free, everyone, to add lines for your configurations, as things could be quite different from one to another.
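When adding a line, the "Gain" column can be computed from the "Mono" and "Dual" frame rates. The sketch below assumes gain = (dual - mono) / mono, rounded to the nearest percent, which is consistent with most of the figures in the tables of this page; `gainPercent` and its parameter names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Relative frame rate gain of the dual-threaded mode over the
// mono-threaded mode, in percent, rounded to the nearest integer.
int gainPercent(double monoFps, double dualFps) {
    assert(monoFps > 0.0);
    return static_cast<int>(std::lround((dualFps - monoFps) / monoFps * 100.0));
}
```

For example, `gainPercent(69, 78)` gives 13 and `gainPercent(102, 97)` gives -5, matching the +13% and -5% entries of the Summer 2010 table.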

Summer 2010

Additional test condition details:

  • SD trunk svn 2522 (CPU affinity Off) and svn 2559 (CPU affinity On),
  • Anti-aliasing forced at startup (not customizable).

| Configuration (CPU affinity Off, frames per second) | SGL1 Mono | SGL1 Dual | SGL1 Gain | SGM1 Mono | SGM1 Dual | SGM1 Gain | SGH1 Mono | SGH1 Dual | SGH1 Gain |
|---|---|---|---|---|---|---|---|---|---|
| WinXP 32 SP2, Intel Core 2 Duo E8400 3.0GHz, nVidia Quadro FX 1700 512 Mb | 102 | 97 | -5% | 69 | 78 | +13% | 62 | 72 | +16% |
| Linux 64 2.6.31, AMD Athlon 64x2 4600+ 2.4GHz, nVidia 8800 GT 512 Mb (driver 185.18.36) | 167 | 152 | -9% | 57 | 53 | -7% | 39 | 47 | +20% |
| WinXP 32 SP2, AMD Athlon 64x2 4600+ 2.4GHz, nVidia 8800 GT 512 Mb (driver 195.62) (1) | - | - | - | - | - | - | 28 | 48 | +71% |

Notes: (1) Frame rates vary considerably from one race to the next, for unknown reasons; sometimes the 2 cores are used at 50% each, sometimes only 1 at 100%; each figure is a mean value computed over 5 or 6 races. Sync to VBlank could not be switched off, so 1) only SGH1 was tested, 2) the figure given for the dual-threaded mode may be slightly underestimated (but not by much).

| Configuration (CPU affinity On, frames per second) | SGL1 Mono | SGL1 Dual | SGL1 Gain | SGM1 Mono | SGM1 Dual | SGM1 Gain | SGH1 Mono | SGH1 Dual | SGH1 Gain |
|---|---|---|---|---|---|---|---|---|---|
| WinXP 32 SP2, Intel Core 2 Duo E8400 3.0GHz, nVidia Quadro FX 1700 512 Mb | - | - | - | - | - | - | | | % |
| Linux 64 2.6.31, AMD Athlon 64x2 4600+ 2.4GHz, nVidia 8800 GT 512 Mb (driver 185.18.36) (1) | - | - | - | - | - | - | 37.5 | 47 | +25% |
| WinXP 32 SP2, AMD Athlon 64x2 4600+ 2.4GHz, nVidia 8800 GT 512 Mb (driver 195.62) (1) | - | - | - | - | - | - | 39 | 50 | +28% |

Notes: (1) CPU affinity clearly improves performance under Windows XP, especially in mono-thread mode, while it seems to be of no use under Linux.

We are eager to see what happens under Windows Vista / 7, which are said to handle multi-threaded applications far better.

Early 2011

Additional test condition details:

  • SD trunk svn 3269 (Warning: Affinity setting code broken ... WIP).
  • Anti-aliasing still forced at startup (not customizable).
  • Multi-texturing On (same configuration as in previous revisions, but now configurable).

| Configuration (CPU affinity Off, frames per second) | SGL1 Mono | SGL1 Dual | SGL1 Gain | SGM1 Mono | SGM1 Dual | SGM1 Gain | SGH1 Mono | SGH1 Dual | SGH1 Gain |
|---|---|---|---|---|---|---|---|---|---|
| WinXP 32 SP2, Intel Core 2 Duo E8400 3.0GHz, nVidia Quadro FX 1700 512 Mb (1,2,3) (driver 6.14.11.7804) | - | - | - | - | - | - | 62 | 75/81 | +21/+31% |
| WinXP 32 SP3, Intel Xeon W3520 2.6GHz, nVidia Quadro FX 1800 768 Mb (driver 6.14.11.8260) | - | - | - | - | - | - | 68 | 84 | +24% |

Notes:

  • (1) Frame rates vary considerably from one race to the next, for unknown reasons; sometimes the 2 cores are used at 50% each, sometimes only 1 at 100%.
  • (2) No explanation for the improved dual-threading figures (were the graphics driver / OS updated?).
  • (3) Frame rates vary considerably from one race to the next in dual-threaded mode, for unknown reasons, as in Summer 2010.

| Configuration (CPU affinity On, frames per second) | SGL1 Mono | SGL1 Dual | SGL1 Gain | SGM1 Mono | SGM1 Dual | SGM1 Gain | SGH1 Mono | SGH1 Dual | SGH1 Gain |
|---|---|---|---|---|---|---|---|---|---|
| WinXP 32 SP2, Intel Core 2 Duo E8400 3.0GHz, nVidia Quadro FX 1700 512 Mb (driver 6.14.11.7804) (1) | - | - | - | - | - | - | 60/70 | 73/81 | +4/+35% |
| WinXP 32 SP3, Intel Xeon W3520 2.6GHz, nVidia Quadro FX 1800 768 Mb (driver 6.14.11.8260) (2) | - | - | - | - | - | - | 68 | 81 | +19% |

Notes:

  • (1) Frame rates vary considerably from one race to the next in dual-threaded mode, even with affinity On. Very strange!
  • (2) Not much better than with affinity Off (?): is it because of better multi-threading management in XP SP3? => Affinity On no longer seems to give better / more stable frame rates (?)

Summer 2010 with Simu V3

Even though we no longer plan to use Simu V3, equivalent test sessions give interesting results:

Additional test condition details:

  • SD svn 2522 (CPU affinity Off) and svn 2559 (CPU affinity On),
Configuration / Scenario (frames per second) SGH1 CPU Affinity Off SGH1 CPU Affinity On
Mono Dual Gain Mono Dual Gain
WinXP 32 SP2, Intel Core 2 Duo E8400 3.0GHz, nVidia Quadro FX 1700 512 Mb 72 92
WinXP 32 SP3, Intel Xeon W3520 2.6GHz, nVidia Quadro FX 1800 768 Mb
Linux 64 2.6.31, AMD Athlon 64x2 4600+ 2.4GHz, nVidia 8800 GT 512 Mb (driver 185.18.36) 25 57 +128%
Win XP32 SP2, AMD Athlon 64x2 4600+ 2.4GHz, nVidia 8800 GT 512 Mb (driver 195.62) 19 55 +190%

Comments on Simu V3 figures :

  • they can't be compared to those obtained with Simu V2, as the racers very often leave the track or collide, resulting in a very different and varying number of cars in the front view as well as in the rear view; actually a lower mean number, which explains the higher mean frame rate.
  • but the gain between the mono-threaded and the dual-threaded situations can be compared, and it shows that the multi-threading code is much more beneficial with a CPU-hungry physics engine than with a lighter one, especially on medium to low-end CPUs.

Reference data with 1.4.0

1.4.0 : As a reference, here are the figures we get with (mono-thread) 1.4.0 in the same conditions :

| Configuration (frames per second) | V2 SGL1 | V2 SGM1 | V2 SGH1 | V3 SGL1 | V3 SGM1 | V3 SGH1 |
|---|---|---|---|---|---|---|
| WinXP 32 SP2, Intel Core 2 Duo E8400 3.0GHz, nVidia Quadro FX 1700 512 Mb | - | - | 62 | - | - | 71 |
| WinXP 32 SP3, Intel Xeon W3520 2.6GHz, nVidia Quadro FX 1800 512 Mb | - | - | - | - | - | - |
| Linux 64 2.6.31, AMD Athlon 64x2 4600+ 2.4GHz, nVidia 8800 GT 512 Mb (driver 185.18.36) | - | - | 44.5 | - | - | 52.5 |
| Win XP32 SP2, AMD Athlon 64x2 4600+ 2.4GHz, nVidia 8800 GT 512 Mb (driver 195.62) | - | - | 34 | - | - | |

Other tests

1) Reported by Ocirne94 :

CPU: Intel Core 2 Quad Q9400 @ 2.66 GHz GPU: nVidia GeForce GTS 250 OS: Kubuntu 10.04 64-bit Speed Dreams r2526, on Manton Speedway, 35 drivers (Supercars, 36GP, gp1600, F1).

Simuv3: 24.7 FPS with multi-threading, 9.3 without.

Simuv2: 25.1 FPS with multi-threading, 19.0 without.

2) Reported by Xavier

CPU: AMD Athlon 64x2 7550 (dual core) GPU: nVidia GeForce 210 OS: Linux Mandriva 2010.1 64 bits Speed Dreams r2544, on e-track-6, 28 drivers (all simplix_trb1, usr_trb1).

Simuv2: with multi-threading, 25.3 FPS with mirror (from 25 to 40) and 45.0 FPS without mirror; without multi-threading, 6.5 FPS with mirror and 13.2 FPS without mirror.

Simuv3: with multi-threading, 6.2 FPS with mirror (from 5 to 19) and 8.5 FPS without mirror; without multi-threading, 0.3 FPS with mirror and 3.2 FPS without mirror.

On my Acer notebook: CPU: Intel Core2 T5800 2.0 GHz (dual core) GPU: nVidia GeForce Mobile 9600 GT OS: Linux Mandriva 2010.1 64 bits Speed Dreams r2545, on e-track-6, 28 drivers (all simplix_trb1, usr_trb1).

Simuv2: with multi-threading, 25.3 FPS with mirror (from 19 to 35) and 37.3 FPS without mirror; without multi-threading, 4.5 FPS with mirror and 11.2 FPS without mirror.

Simuv3: with multi-threading, 0.2 FPS with mirror (from 1 to 8) and 1.1 FPS without mirror; without multi-threading, 0.1 FPS with mirror and 0.9 FPS without mirror.

On my top-level PC: CPU: Intel i7 3.07 GHz (quad core with hyper-threading, 8 logical cores) GPU: nVidia GeForce GTX 285 OS: XP SP3 32 bits Speed Dreams r2545, on e-track-6, 28 drivers (all simplix_trb1, usr_trb1).

Simuv2: with multi-threading, 43.3 FPS with mirror (from 30 to 60) and 120 FPS without mirror; without multi-threading, 35.2 FPS with mirror and 70 FPS without mirror.

Simuv3: with multi-threading, 45.5 FPS with mirror (from 43 to 70) and 90.2 FPS without mirror; without multi-threading, 23.2 FPS with mirror and 62 FPS without mirror.

Note : ALL my tests are run with ssgasky and dynamic sky activated.


Old material

Only kept here as a log of our thoughts.

After some code profiling, part of it no longer makes sense.

Possible multi-threading scenarios

The basic idea is that we can imagine using multiple threads for simultaneously executing robot->rbDrive(), SimuItf.update() and GraphicItf.refresh().

Different multithreading scenarios can be chosen, from the simplest to the most elaborate one ; some examples :

  • parallelism only for the robots :
    • 1 thread for each robot->rbDrive(),
    • after the robots have been executed, sequentially execute SimuItf.update() and GraphicItf.refresh()
  • full parallelism with 2 threads :
    • 1 thread for sequentially executing all the robot->rbDrive(),
    • 1 thread for sequentially executing SimuItf.update() and GraphicItf.refresh()
  • full parallelism with nb robots + 2 threads :
    • 1 thread for each robot->rbDrive(),
    • 1 thread for SimuItf.update()
    • 1 thread for GraphicItf.refresh()
  • full parallelism with nb robots / (nb cores - 1) + 1 threads :
    • 1 thread for each group of nb robots / (nb cores - 1) robot->rbDrive(),
    • 1 thread for sequentially executing SimuItf.update() and GraphicItf.refresh()
  • ... etc ...
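As an illustration of the "nb robots / (nb cores - 1)" grouping idea in the list above, here is a minimal sketch that distributes robot indices over (nb cores - 1) groups, each of which would be driven by its own thread. `groupRobots` is a hypothetical name and the round-robin assignment is just one possible choice; nothing like this was actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Splits robot indices 0..nbRobots-1 into (nbCores - 1) groups whose sizes
// differ by at most one; the remaining core is left for simu + graphics.
std::vector<std::vector<std::size_t>> groupRobots(std::size_t nbRobots,
                                                  std::size_t nbCores) {
    assert(nbCores >= 2);
    const std::size_t nbGroups = nbCores - 1;
    std::vector<std::vector<std::size_t>> groups(nbGroups);
    for (std::size_t robot = 0; robot < nbRobots; ++robot)
        groups[robot % nbGroups].push_back(robot);  // round-robin assignment
    return groups;
}
```

For 10 robots on a 4-core machine, this gives 3 groups of 4, 3 and 3 robots.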

To decide which is the best, some ideas :

  • we'll have to do some code profiling to get enough information about each function's CPU consumption,
  • ideally, the actual number of available cores has to be taken into account,
  • why not a dynamic scenario with as many threads as we have cores, and a dynamic dispatching of the functions to run based on run-time statistics about their actual CPU consumption?

BUT the really difficult thing here is to make sure that the functions we plan to run simultaneously can really be executed at the same time while keeping the data they share in a consistent state.

What we have to avoid here is one function writing non-atomic data while another function is reading it at the same time.

This can only be achieved through a very careful and self-disciplined analysis of the data flows (read/write) among the robots and the physics and graphics engines.

Implementation scenarios

In any case, we plan to use the threading API of the SDL library, as the simplest choice. We'll have to check first if the offered API fits our simple needs and works well, but we are quite confident about it.

The 2 main tasks to achieve here are :

  • implement the function dispatching scenario
  • put locks in the right places everywhere data is shared in read/write mode between functions that we plan to be run at the same time.

We can achieve this in 2 progressive steps, each of them being a possible end point for the multi-threading task if time runs short, or if things don't work as expected or become too complicated.

  1. The first and simplest possible scenario :
    • think again about the RCM_MAX_DT_SIMU and RCM_MAX_DT_ROBOTS values: do they have a real impact on the actual frame rates we can experience on normal computers, even powerful ones (we already know that they actually define the maximum frame rate)? We could even rethink the whole "call frequency" scheme for robot->rbDrive, SimuItf.update and GraphicItf.refresh (introduce sensitivity to the actual frame rate?).
    • using as many threads as actual cores to execute the robots, but then sequentially executing SimuItf.update() and GraphicItf.refresh(),
    • putting locks only on the data shared in read/write mode among the robots (as there is parallelism only among the robots).
  2. More complicated :
    • full parallelism among the robots, physics engine and graphics engine
    • implies introducing more locks to read/write shared data among the robots and physics and graphics engines

Related

Commit: [r2526]
Commit: [r2544]
Commit: [r2545]
Wiki: CodeProfiling
Wiki: Index
Wiki: MeetingJune2010
Wiki: MeetingMay2010b
Wiki: TheWayToRelease2
