[Algorithms] Multi-core efficiency item.
From: Kelly B. <kb....@sb...> - 2006-07-29 22:46:26
Hi folks,

I have a pattern which I've found to be exceptionally useful in multi-core programming, and I wanted to see if anyone had other ideas and/or suggestions for creating a better solution. Currently I'm calling this an "InvertedSemaphore", since I'm implementing it with a semaphore with modified behavior. Perhaps there is an existing concept with similar behavior, though I've searched around and not found it yet.

Anyway, the concept is pretty simple, and the inverted nature should make sense with a little description. A normal semaphore maintains a count, and as long as the count is non-zero, any thread locking the semaphore continues to run. What I want is the opposite: as long as the semaphore has a non-zero count, the locking thread blocks. So if I initialize the semaphore with "4", the first three threads to lock will block, and only the fourth thread will lock and continue.

Using platform abstractions, here's the general implementation (abbreviated, with error checking omitted for brevity):

------------------
struct InvSemInternal
{
    volatile u32        mCount;
    Semaphore::tHandle  mSemaphore;
    u32                 mMax;
};

InvSem::tHandle InvSem::Create( const u32 count )
{
    InvSemInternal* pInvSem = new InvSemInternal;
    if( pInvSem )
    {
        pInvSem->mCount = count;
        pInvSem->mMax = count;
        // Create the semaphore with a zero count so Wait() blocks.
        pInvSem->mSemaphore = Semaphore::Create( 0 );
        if( pInvSem->mSemaphore != Semaphore::kInvalid )
            return (InvSem::tHandle)pInvSem;
        delete pInvSem;
    }
    return kInvalid;
}

const u32 InvSem::Lock( InvSem::tHandle handle )
{
    InvSemInternal* pInvSem = (InvSemInternal*)handle;
    // The thread that decrements the count to zero is the last one in;
    // it continues, while every earlier thread waits.
    const u32 result = Interlocked::Decrement( &pInvSem->mCount );
    if( result != 0 )
        Semaphore::Wait( pInvSem->mSemaphore );
    return result;
}

void InvSem::Release( InvSem::tHandle handle )
{
    InvSemInternal* pInvSem = (InvSemInternal*)handle;
    // Re-arm the counter for the next round and wake the mMax-1 waiters.
    pInvSem->mCount = pInvSem->mMax;
    Semaphore::Release( pInvSem->mSemaphore, pInvSem->mMax-1 );
}
------------------

I believe the above describes the basic logic properly.
As it is not copy/paste, please ignore typos.. :)

The intended usage is pretty simple. I need to process a bunch of things which are not interdependent, using as many CPUs as possible. Each thread uses a lock-free accessor to the "things to execute" to retrieve what it will execute, so distributing the tasks is self-balancing (subject to the underlying OS thread scheduling; 1:1 affinity masking is only a benefit depending on OS and usage intentions) up until the end, where the inverted semaphore comes into play. The inverted semaphore allows exactly one of the threads to act as the final "clean up" thread for the tasks, and it must be the LAST thread still executing on the tasks. Hence "InvertedSemaphore" in my current terminology.

My reason for posting all of this is, unfortunately, somewhat OS-specific: the efficiency of the generalized pattern is pretty poor on Win32/Win64 but screams on Linux/Unix variants (including OS X). But to keep the question non-OS-specific: what other methods can you think of to achieve the intended results? In general the pattern is working very well; on OSes other than Win32/Win64, the serial overhead (in the Amdahl's-law sense) measures under 1% for simple tasks, whereas on Win32/Win64 the cost jumps to 3-4% in all tests. Given the diversity of game code, that performance difference is very notable and, to say the least, annoying.

NOTE: All testing was performed on a quad Opteron 2.6 GHz and a dual Xeon 2.4 GHz (with and without hyperthreading), with a baseline measured on a single-CPU 2 GHz P4 machine.

KB