From: Pekka J. <pek...@tu...> - 2011-10-30 16:22:14
On 10/30/2011 05:38 PM, Erik Schnetter wrote:
> Another option could be to build a small C program that uses OpenMP;
> the OpenMP run time contains logic that determines a good number of
> threads to use. You would look at omp_get_max_threads().

I wouldn't like to introduce a library dependency just because of this.
I'm sure there are OS-specific ways to find out the number of cores and
the number of hardware threads per core on the different operating
systems. Or we could just resort to a CPU info instruction of the
device, if one is available.

After all, the current need of pocl is quite simple: if we want to
exploit the task-level parallelism provided by the device to the max
while minimizing the threading overheads, it boils down to the number of
hardware threads per core times the core count (or the number of WGs,
whichever is smaller), doesn't it?

If disk or network I/O were a concern, there should be additional
threads to hide the I/O latencies (at the OS level). For now we are
mainly concerned with hiding memory latencies, because the kernels do
not access files or the network like, for example, OpenMP loops in
general can do. For memory latency hiding, only hardware threads can be
of help, AFAIK.

An additional consideration is the size of the local memory, as each
parallel WG needs a separate local memory space. Currently pocl just
assumes that the local memory malloc overhead (and the size) per thread
is tolerable. In reality, for example on memory-tight embedded targets,
this should also restrict the max number of parallel WG threads. If you
can afford only one local memory "alive" at a time, you can launch only
one WG thread.

BR,
--
--Pekka
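
The rule described in the message above (hardware threads per core times core count, capped by the number of WGs and by how many local memory spaces fit) could be sketched in C roughly as follows. This is only an illustration, not pocl code: the helper names and the local memory budget parameter are hypothetical, and sysconf(_SC_NPROCESSORS_ONLN) is assumed here as one of the OS-specific queries; it is a common extension on Linux, the BSDs and macOS rather than a POSIX guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Logical processors currently online, i.e. cores times hardware
   threads per core as seen by the OS. Assumes a system that provides
   the _SC_NPROCESSORS_ONLN extension. */
static size_t hw_thread_count(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (size_t)n : 1;  /* fall back to one thread on failure */
}

/* Hypothetical helper (not a pocl function): number of WG threads to
   launch. local_mem_per_wg == 0 means the kernel uses no local memory,
   so the memory budget does not limit the count. */
static size_t wg_thread_count(size_t hw_threads, size_t num_wgs,
                              size_t local_mem_per_wg,
                              size_t local_mem_budget)
{
    assert(hw_threads > 0);

    /* hardware threads or number of WGs, whichever is smaller */
    size_t n = hw_threads < num_wgs ? hw_threads : num_wgs;

    /* each parallel WG needs its own local memory space */
    if (local_mem_per_wg > 0) {
        size_t fit = local_mem_budget / local_mem_per_wg;
        if (fit < n)
            n = fit;
    }
    return n;
}
```

A caller would pass hw_thread_count() as the first argument. With a budget that fits exactly one local memory space the result is 1, which matches the single WG thread case mentioned in the message. A result of 0 means there are no WGs to run or not even one local memory space fits the budget.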