Re: [Algorithms] Multi-core efficiency item.
From: Jon W. <hp...@mi...> - 2006-07-31 01:39:43
Kelly Brock wrote:

> games. The system solves for this problem by using the inverted semaphore,
> all the tasks process the displacement portion and suspend, manual threading
> style kinda, the OS thread picks up the next task in the list and continues
> going. This continues until the task list is empty and the OS threads hit
> the inverted semaphore. The last thread through resets the task list and
> releases the semaphore, starting the process all over again except now it is
> calculating the lighting.
>
> So, the efficiency of the system is exceptionally high since there
> are no OS level locks in processing the task list except at the very end
> where it restarts the list. Given a 256x256 grid this means there is a
> single OS lock for every 65536 tasks being executed.

N.B.: The cost in an OS lock is entering the kernel, not necessarily that it
is a "lock". Suspending a thread (or releasing it) also requires entering the
kernel, and would have a similar cost.

> But to become non-OS specific, what other methods can you think of to
> optimize the intended results? In general though the pattern is working
> very well, and OSes other than Win32/Win64 show Amdahl's law as under 1%
> for simple tasks, where on Win32/Win64 the cost jumps to 3-4% in all tests.

I still don't see why you wouldn't just use a traditional set of worker
threads, with a traditional set of work items. Picking up the "next vertex"
in the grid doesn't need synchronization other than an interlocked increment,
and the worker threads can all go back to waiting on the work queue when the
task is done. To make sure all threads are done, you can use a rendez-vous
(interlocked math and a single OS lock), or something akin to
WaitForMultipleObjects() (I'd prefer the rendez-vous). That goes especially if
you're OK with suspending the original thread while the others are working
(or, perhaps, turning that thread into one more worker thread, to get even
fewer kernel entries).

So, to understand your solution, I'd like to understand the problem in a
little more detail. Is the workflow something like: "the original requester
can request that N worker threads do some work, and when all N threads are
done, the original requester gets un-blocked again"? If so, would it be
expressed by the following sketch?

#include <windows.h>

// Command is the user-supplied work item: one of your three passes,
// applied to all vertices.
struct Command {
    virtual void ThreadWork() = 0;   // runs on each worker thread
    virtual void Cleanup() = 0;      // runs once, on the last thread to finish
};

class WorkerPool {
public:
    WorkerPool(int nThreads) {
        numThreads_ = nThreads;
        numDone_ = 0;
        cmd_ = 0;
        userEvent_ = ::CreateEvent(0, 0, 0, 0);                  // auto-reset, non-signaled
        workerSemaphore_ = ::CreateSemaphore(0, 0, nThreads, 0); // count starts at 0
        for (int i = 0; i < nThreads; ++i) {
            DWORD id = 0;
            ::CreateThread(0, 65536, &ThreadFunc, this, 0, &id);
        }
    }
    // Deleting the worker pool is left as an exercise for the reader.

    // User tells threads to do something.
    void DoCommandSynchronous(Command * cmd) {
        cmd_ = cmd;
        numDone_ = 0;
        LONG pc = 0;
        // Wake all workers with a single kernel call.
        ::ReleaseSemaphore(workerSemaphore_, numThreads_, &pc);
        // Make sure threads are done.
        ::WaitForSingleObject(userEvent_, INFINITE);
    }

    // Threads run, doing something.
    static DWORD WINAPI ThreadFunc(void * ptr) {
        ((WorkerPool *)ptr)->Run();
        return 0;
    }

    void Run() {
        for (;;) {
            ::WaitForSingleObject(workerSemaphore_, INFINITE);
            cmd_->ThreadWork();
            // The last thread through cleans up and un-blocks the requester.
            if (::InterlockedIncrement(&numDone_) == numThreads_) {
                cmd_->Cleanup();
                ::SetEvent(userEvent_);
            }
        }
    }

private:
    LONG numThreads_;
    volatile LONG numDone_;
    Command * cmd_;
    HANDLE userEvent_;
    HANDLE workerSemaphore_;
};

The threads would each figure out what vertices to work on inside ThreadWork()
by using InterlockedIncrement() (or just reserve a chunk each, depending on
the problem). "Command" would be one of your three lines in the example,
processed for all vertices.
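For concreteness, here is a minimal sketch of what one such Command might look
like: it claims one vertex at a time with InterlockedIncrement(). The 256x256
grid comes from your example; the DisplacementCommand name and the
ProcessVertex() helper are made-up placeholders for whatever the displacement
pass actually does per vertex.

// Hypothetical example only: a Command that claims one vertex at a time.
class DisplacementCommand : public Command {
public:
    DisplacementCommand() : nextVertex_(-1) {}
    virtual void ThreadWork() {
        for (;;) {
            // Claim the next un-processed vertex index; no OS lock needed.
            LONG i = ::InterlockedIncrement(&nextVertex_);
            if (i >= 256 * 256)
                break;              // all vertices claimed; this thread is done
            ProcessVertex(i);       // per-vertex displacement work (placeholder)
        }
    }
    virtual void Cleanup() {
        nextVertex_ = -1;           // ready if the same command is issued again
    }
private:
    void ProcessVertex(LONG index); // user-supplied per-vertex work
    volatile LONG nextVertex_;
};

The original thread would then call DoCommandSynchronous() once per pass:
first for displacement, then for lighting, and so on.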
Thus, you'll enter the kernel the minimal number of times: once per thread for
blocking, once additionally for un-blocking all threads, and once additionally
for un-blocking the original requester. If this still gives you 4% overhead,
then that's the cost of doing Windows, I guess.

Note that interlocked operations are also "expensive" in the sense that they
may need to go to the bus, rather than just run in cache (this varies between
PPC and x86: the former uses reservations with store-conditional, the latter
an actual bus lock, so different cases run at different speeds on the two).
Thus, having the threads decide up front which chunks to work on is likely
more efficient (as well as more cache-local for each thread), but this depends
on the problem at hand.

Btw: if you want to pipeline the original thread with the workers, you can
move the wait on userEvent_ to the beginning of the DoCommand function, and
create the event in a signaled state up front (see the sketch below).

Cheers,

/ h+
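A minimal sketch of that pipelined variant, assuming the same members as in
the WorkerPool sketch above (the DoCommandAsync name is made up here):

// In the constructor: create the event already signaled (still auto-reset),
// so the first DoCommandAsync() call does not block.
userEvent_ = ::CreateEvent(0, 0, 1, 0);

// Issue a command and return immediately; the workers run while the caller
// does other work. The wait has moved to the front, so each call first waits
// for the previous command to finish.
void DoCommandAsync(Command * cmd) {
    ::WaitForSingleObject(userEvent_, INFINITE);
    cmd_ = cmd;
    numDone_ = 0;
    LONG pc = 0;
    ::ReleaseSemaphore(workerSemaphore_, numThreads_, &pc);
}

// Before reading the results of the last command, the caller would still
// need one final wait (and a re-signal, since the wait consumes the
// auto-reset event):
//   ::WaitForSingleObject(userEvent_, INFINITE);
//   ::SetEvent(userEvent_);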