[Algorithms] Multi-core efficiency item.
From: Kelly B. <kb....@sb...> - 2006-07-29 22:46:26
Hi folks,

I have a pattern which I've found to be exceptionally useful in multi-core programming, and I wanted to see if anyone had other ideas and/or suggestions for creating a better solution. Currently I'm calling this an "InvertedSemaphore", since I'm implementing it with a semaphore with modified behavior. Perhaps there is an existing concept with similar behavior, though I've searched around and not found it yet.

Anyway, the concept is pretty simple, and the inverted nature should make sense with a little description. A normal semaphore maintains a count, and as long as the count is non-zero, any thread locking the semaphore continues to run. What I want is the opposite: as long as the semaphore has a non-zero count, the locking thread blocks. So if I initialize the semaphore with "4", the first three threads to lock will block, and only the fourth thread will lock and continue.

Using platform abstractions, here's the general implementation (abbreviated, with error checking omitted for brevity):

------------------
struct InvSemInternal
{
    volatile u32        mCount;
    Semaphore::tHandle  mSemaphore;
    u32                 mMax;
};

InvSem::tHandle InvSem::Create( const u32 count )
{
    InvSemInternal* pInvSem = new InvSemInternal;
    if( pInvSem )
    {
        pInvSem->mCount = count;
        pInvSem->mMax = count;
        // Create the semaphore with a zero count so Wait() blocks.
        pInvSem->mSemaphore = Semaphore::Create( 0 );
        if( pInvSem->mSemaphore != Semaphore::kInvalid )
            return (InvSem::tHandle)pInvSem;
        delete pInvSem;
    }
    return kInvalid;
}

const u32 InvSem::Lock( InvSem::tHandle handle )
{
    InvSemInternal* pInvSem = (InvSemInternal*)handle;
    // The thread that decrements the count to zero is the last one in;
    // it continues, while every earlier thread waits.
    const u32 result = Interlocked::Decrement( &pInvSem->mCount );
    if( result != 0 )
        Semaphore::Wait( pInvSem->mSemaphore );
    return result;
}

void InvSem::Release( InvSem::tHandle handle )
{
    InvSemInternal* pInvSem = (InvSemInternal*)handle;
    // Re-arm the counter for the next round and wake the mMax-1 waiters.
    pInvSem->mCount = pInvSem->mMax;
    Semaphore::Release( pInvSem->mSemaphore, pInvSem->mMax-1 );
}
------------------

I believe the above describes the basic logic properly.
As it is not copy/paste, please ignore typos.. :)

The intended usage is pretty simple. I need to process a bunch of things which are not interdependent, using as many CPUs as possible. Each thread uses a lock-free accessor to the "things to execute" to retrieve what it will execute, so distributing the tasks is self-balancing (subject to the underlying OS thread scheduling; 1:1 affinity masking is only a benefit depending on OS and usage intentions) up until the end, where the inverted semaphore comes into play. The inverted semaphore allows exactly one of the threads to act as the final "clean up" thread for the tasks, and it must be the LAST thread still executing on the tasks. Hence "InvertedSemaphore" in my current terminology.

My reason for posting all of this is, unfortunately, somewhat OS-specific: the efficiency of the generalized pattern is pretty poor on Win32/Win64 but screams on Linux/Unix variants (including OS X). But to keep the question non-OS-specific: what other methods can you think of to achieve the intended results? In general the pattern is working very well; on OSes other than Win32/Win64, the serial overhead (in the Amdahl's-law sense) measures under 1% for simple tasks, whereas on Win32/Win64 the cost jumps to 3-4% in all tests. Given the diversity of game code, that performance difference is very notable and, to say the least, annoying.

NOTE: All testing was performed on a quad Opteron 2.6 GHz and a dual Xeon 2.4 GHz (with and without hyperthreading), with a baseline measured on a single-CPU 2 GHz P4 machine.

KB