There is a performance barrier at 128 threads: adding just one more
thread increases the run time significantly. The problem only seems
to occur when the time to compute the image is less than the time
to spawn the N requested threads, which is why it shows up so
readily in the benchmarks on high-performance SMP machines.
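
One way to check whether a run falls into this regime is to time the thread spawn on its own and compare it with the compute time. The sketch below is only an illustration in Python, not the renderer's own code; the helper names `spawn_overhead_seconds` and `is_spawn_bound` are made up for this example, and the absolute numbers will differ from those of a native threading library.

```python
import threading
import time


def spawn_overhead_seconds(n_threads):
    """Return the wall time needed to start and join n_threads idle threads."""
    def noop():
        pass

    threads = [threading.Thread(target=noop) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


def is_spawn_bound(compute_seconds, n_threads):
    """True when spawning the threads costs more than the compute itself."""
    return spawn_overhead_seconds(n_threads) > compute_seconds


if __name__ == "__main__":
    # Hypothetical comparison on either side of the reported barrier.
    for n in (128, 129):
        print(n, "threads:", spawn_overhead_seconds(n), "s to spawn")
```

If `is_spawn_bound` returns true for the measured compute time of a frame, the run is probably in the situation described above.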
The behavior was most easily observed on a large O3k, and also on a
pmac running OS X with a build compiled to allow more threads than
the number of available CPUs.
The simple workaround is to give the tasks a little more work to do,
or to request fewer threads/procs, since there is not enough work
to go around in the first place.
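
The "request fewer threads" half of the workaround can be sketched as a small cap on the thread count. This is a minimal illustration, assuming the work can be counted in items (for example scanlines or tiles); the function name `choose_thread_count` and the default of 64 items per thread are hypothetical and would need tuning on the target machine.

```python
def choose_thread_count(work_items, requested_threads, min_items_per_thread=64):
    """Cap the requested thread count so each thread gets enough work.

    work_items: number of independent units of work (assumed countable).
    requested_threads: the thread count the user asked for.
    min_items_per_thread: hypothetical threshold below which an extra
        thread is assumed to cost more to spawn than it saves.
    """
    if work_items <= 0 or requested_threads <= 0:
        return 1
    # Number of threads that would each receive at least the minimum work.
    useful_threads = max(1, work_items // min_items_per_thread)
    return min(requested_threads, useful_threads)


if __name__ == "__main__":
    # A small image: 480 scanlines, 129 threads requested.
    print(choose_thread_count(480, 129))
    # A large job: the request is honored unchanged.
    print(choose_thread_count(100000, 128))
```

With these assumed numbers, a 480-item job asking for 129 threads is cut back to 7, while a job with plenty of work keeps the thread count it requested.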