Re: [Algorithms] Multi-core efficiency item.
From: Jon W. <hp...@mi...> - 2006-07-31 01:39:43
Kelly Brock wrote:

> games. The system solves for this problem by using the inverted semaphore,
> all the tasks process the displacement portion and suspend, manual threading
> style kinda, the OS thread picks up the next task in the list and continues
> going. This continues until the task list is empty and the OS threads hit
> the inverted semaphore. The last thread through resets the task list and
> releases the semaphore, starting the process all over again except now it is
> calculating the lighting.
>
> So, the efficiency of the system is exceptionally high since there
> are no OS level locks in processing the task list except at the very end
> where it restarts the list. Given a 256x256 grid this means there is a
> single OS lock for every 65536 tasks being executed.

N.B.: The cost in an OS lock is entering the kernel, not necessarily that it
is a "lock". Suspending a thread (or releasing it) also requires entering the
kernel, and would have a similar cost.

> But to become non-OS specific, what other methods can you think of to
> optimize the intended results? In general though the pattern is working
> very well, and OSes other than Win32/Win64 show Amdahl's law as under 1%
> for simple tasks, where on Win32/Win64 the cost jumps to 3-4% in all tests.

I still don't see why you wouldn't just use a traditional set of worker
threads, with a traditional set of work items. Picking up the "next vertex"
in the grid doesn't need synchronization other than an interlocked increment,
and the worker threads can all go back to waiting on the work queue when the
task is done. To make sure all threads are done, you can use a rendez-vous
(interlocked math and a single OS lock), or something akin to
WaitForMultipleObjects() (I'd prefer the rendez-vous). That goes especially if
you're OK with suspending the original thread while the others are working
(or, perhaps, turning that thread into one more worker thread, to get even
fewer kernel entries).

So, to understand your solution, I'd like to understand the problem in a
little more detail. Is the workflow something like: "the original requester
can request that N worker threads do some work, and when all N threads are
done, the original requester gets un-blocked again"? If so, would it be
expressed by the following sketch?

#include <windows.h>

// Command is the user-supplied work item: one of your three passes,
// applied to all vertices.
struct Command {
    virtual void ThreadWork() = 0;   // runs on each worker thread
    virtual void Cleanup() = 0;      // runs once, on the last thread to finish
};

class WorkerPool {
public:
    WorkerPool(int nThreads) {
        numThreads_ = nThreads;
        numDone_ = 0;
        cmd_ = 0;
        userEvent_ = ::CreateEvent(0, 0, 0, 0);                  // auto-reset, non-signaled
        workerSemaphore_ = ::CreateSemaphore(0, 0, nThreads, 0); // count starts at 0
        for (int i = 0; i < nThreads; ++i) {
            DWORD id = 0;
            ::CreateThread(0, 65536, &ThreadFunc, this, 0, &id);
        }
    }
    // Deleting the worker pool is left as an exercise for the reader.

    // User tells threads to do something.
    void DoCommandSynchronous(Command * cmd) {
        cmd_ = cmd;
        numDone_ = 0;
        LONG pc = 0;
        // Wake all workers with a single kernel call.
        ::ReleaseSemaphore(workerSemaphore_, numThreads_, &pc);
        // Make sure threads are done.
        ::WaitForSingleObject(userEvent_, INFINITE);
    }

    // Threads run, doing something.
    static DWORD WINAPI ThreadFunc(void * ptr) {
        ((WorkerPool *)ptr)->Run();
        return 0;
    }

    void Run() {
        for (;;) {
            ::WaitForSingleObject(workerSemaphore_, INFINITE);
            cmd_->ThreadWork();
            // The last thread through cleans up and un-blocks the requester.
            if (::InterlockedIncrement(&numDone_) == numThreads_) {
                cmd_->Cleanup();
                ::SetEvent(userEvent_);
            }
        }
    }

private:
    LONG numThreads_;
    volatile LONG numDone_;
    Command * cmd_;
    HANDLE userEvent_;
    HANDLE workerSemaphore_;
};

The threads would each figure out what vertices to work on inside ThreadWork()
by using InterlockedIncrement() (or just reserve a chunk each, depending on
the problem). "Command" would be one of your three lines in the example,
processed for all vertices.
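For concreteness, here is a minimal sketch of what one such Command might look
like: it claims one vertex at a time with InterlockedIncrement(). The 256x256
grid comes from your example; the DisplacementCommand name and the
ProcessVertex() helper are made-up placeholders for whatever the displacement
pass actually does per vertex.

// Hypothetical example only: a Command that claims one vertex at a time.
class DisplacementCommand : public Command {
public:
    DisplacementCommand() : nextVertex_(-1) {}
    virtual void ThreadWork() {
        for (;;) {
            // Claim the next un-processed vertex index; no OS lock needed.
            LONG i = ::InterlockedIncrement(&nextVertex_);
            if (i >= 256 * 256)
                break;              // all vertices claimed; this thread is done
            ProcessVertex(i);       // per-vertex displacement work (placeholder)
        }
    }
    virtual void Cleanup() {
        nextVertex_ = -1;           // ready if the same command is issued again
    }
private:
    void ProcessVertex(LONG index); // user-supplied per-vertex work
    volatile LONG nextVertex_;
};

The original thread would then call DoCommandSynchronous() once per pass:
first for displacement, then for lighting, and so on.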
Thus, you'll enter the kernel the minimal number of times: once per thread for
blocking, once additionally for un-blocking all threads, and once additionally
for un-blocking the original requester. If this still gives you 4% overhead,
then that's the cost of doing Windows, I guess.

Note that interlocked operations are also "expensive" in the sense that they
may need to go to the bus, rather than just run in cache (this varies between
PPC and x86: the former uses reservations with store-conditional, the latter
an actual bus lock, so different cases run at different speeds on the two).
Thus, having the threads decide up front which chunks to work on is likely
more efficient (as well as more cache-local for each thread), but this depends
on the problem at hand.

Btw: if you want to pipeline the original thread with the workers, you can
move the wait on userEvent_ to the beginning of the DoCommand function, and
create the event in a signaled state up front (see the sketch below).

Cheers,

/ h+
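A minimal sketch of that pipelined variant, assuming the same members as in
the WorkerPool sketch above (the DoCommandAsync name is made up here):

// In the constructor: create the event already signaled (still auto-reset),
// so the first DoCommandAsync() call does not block.
userEvent_ = ::CreateEvent(0, 0, 1, 0);

// Issue a command and return immediately; the workers run while the caller
// does other work. The wait has moved to the front, so each call first waits
// for the previous command to finish.
void DoCommandAsync(Command * cmd) {
    ::WaitForSingleObject(userEvent_, INFINITE);
    cmd_ = cmd;
    numDone_ = 0;
    LONG pc = 0;
    ::ReleaseSemaphore(workerSemaphore_, numThreads_, &pc);
}

// Before reading the results of the last command, the caller would still
// need one final wait (and a re-signal, since the wait consumes the
// auto-reset event):
//   ::WaitForSingleObject(userEvent_, INFINITE);
//   ::SetEvent(userEvent_);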