From: Carlos S. de La L. <car...@ur...> - 2011-10-25 16:01:59
|
Hi all,

I just committed rev. 45 with a multithreading device. It is similar to native, but creates a thread for each workgroup. This device is now also the default device.

BR,
Carlos |
From: Pekka J. <pek...@tu...> - 2011-10-30 11:34:08
|
On 10/25/2011 07:01 PM, Carlos Sánchez de La Lama wrote:
> I just committed rev. 45 with a multithreading device, similar to native
> but creates a thread for each workgroup.

I committed a modification to the multithreading code on Friday.

It now creates a "sensible number" of threads for the multicore instead of blindly creating as many threads as there are WGs.

However, parsing /proc/cpuinfo to obtain the number of hardware threads available in the processor is a bit flaky, so (if you run Linux) please test that it returns a sensible number of threads for you: enable the #define DEBUG_MAX_THREAD_COUNT in pthread.c, then compile and run one of the examples. It should print the "max thread count" for your (multi)processor before running the kernel.

For Mac (and Windows) we need to figure out some other way to get the hardware thread count, which currently defaults to 8.

--
--Pekka |
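A minimal sketch of the kind of /proc/cpuinfo parsing described in the message above. This is not the actual code in pthread.c: the function names are hypothetical, and it assumes the x86 Linux layout in which every logical CPU contributes one line beginning with "processor". Other architectures format the file differently, which is one reason this approach is flaky. The fallback of 8 and the DEBUG_MAX_THREAD_COUNT switch are taken from the message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Default used when the count cannot be determined (8, as in the message). */
#define FALLBACK_THREAD_COUNT 8

/* Hypothetical helper: counts the lines of a cpuinfo-style stream that
   begin with "processor". Long lines (such as the x86 "flags" line) may
   arrive in several fgets() chunks, so only the first chunk of each line
   is checked. */
static int count_processor_lines(FILE *f)
{
  char buf[256];
  int count = 0;
  int at_line_start = 1;

  assert(f != NULL);
  while (fgets(buf, sizeof buf, f) != NULL) {
    if (at_line_start && strncmp(buf, "processor", 9) == 0)
      ++count;
    /* The next chunk starts a new line only if this one ended in '\n'. */
    at_line_start = (strchr(buf, '\n') != NULL);
  }
  return count;
}

/* Hypothetical helper: returns the number of "processor" entries found in
   the file at path, or the fallback if the file is missing or has none. */
static int max_thread_count(const char *path)
{
  FILE *f = fopen(path, "r");
  int count = 0;

  if (f != NULL) {
    count = count_processor_lines(f);
    fclose(f);
  }
  if (count <= 0)
    count = FALLBACK_THREAD_COUNT;
#ifdef DEBUG_MAX_THREAD_COUNT
  printf("max thread count: %d\n", count);
#endif
  return count;
}
```

Calling max_thread_count("/proc/cpuinfo") on Linux would give the logical CPU count under the stated assumption. On Mac or Windows the file does not exist, so the sketch falls back to 8.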
From: Erik S. <esc...@pe...> - 2011-10-30 15:37:23
|
There is "hwloc", distributed on <http://www.open-mpi.org/>. This library determines the number of logical CPUs, as well as their association with various cache levels and NUMA properties.

-erik

2011/10/30 Pekka Jääskeläinen <pek...@tu...>:
> For Mac (and Windows) we need to figure out some other way to get the
> hardware thread count which defaults to 8 now.
>
> _______________________________________________
> Pocl-devel mailing list
> Poc...@li...
> https://lists.sourceforge.net/lists/listinfo/pocl-devel

--
Erik Schnetter <esc...@pe...>
http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
From: Erik S. <esc...@pe...> - 2011-10-30 15:38:46
|
Another option could be to build a small C program that uses OpenMP; the OpenMP runtime contains logic that determines a good number of threads to use. You would look at omp_get_max_threads().

-erik

--
Erik Schnetter <esc...@pe...>
http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm... |
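A minimal sketch of the OpenMP suggestion above. The standard OpenMP query for this is omp_get_max_threads(). The wrapper name and the fallback value are assumptions for illustration. The code is guarded by the _OPENMP macro so that it still compiles when OpenMP is not enabled (for example without -fopenmp on GCC), in which case it just returns the fallback.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Assumed fallback, matching the default of 8 mentioned in the thread. */
#define FALLBACK_THREAD_COUNT 8

/* Hypothetical wrapper: asks the OpenMP runtime how many threads it would
   use for a parallel region, or returns the fallback if the program was
   built without OpenMP support. */
static int openmp_thread_count(void)
{
  int n = FALLBACK_THREAD_COUNT;
#ifdef _OPENMP
  n = omp_get_max_threads();
#endif
  assert(n >= 1); /* both sources are expected to give a positive count */
  return n;
}
```

Note that the value reported by the OpenMP runtime can be changed by the user, for example through the OMP_NUM_THREADS environment variable, so it is not strictly a hardware property.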
From: Pekka J. <pek...@tu...> - 2011-10-30 16:22:14
|
On 10/30/2011 05:38 PM, Erik Schnetter wrote:
> Another option could be to build a small C program that uses OpenMP;
> the OpenMP run time contains logic that determines a good number of
> threads to use. You would look at omp_get_max_threads().

I would rather not introduce a library dependency just for this. I'm sure there are OS-specific ways to find out the core count and the number of hardware threads per core on the different operating systems. Or we could just resort to some CPU info instruction in the device's instruction set, if available.

After all, the current need of pocl is quite simple: if we want to exploit the task-level parallelism provided by the device to the max while minimizing the threading overheads, it boils down to the number of hardware threads per core times the core count (or the number of WGs, whichever is smaller), doesn't it?

If disk or network I/O were a concern, there should be additional threads to hide the I/O latencies (at the OS level), but for now we are mainly concerned with hiding the memory latencies, because the kernels do not access files or the network the way, for example, OpenMP loops in general can. For memory latency hiding, only hardware threads can help, AFAIK.

An additional consideration is the size of the local memory, as each parallel WG needs a separate local memory space. Currently pocl just assumes that the local memory malloc overhead (and size) per thread is tolerable. In reality, for example on memory-tight embedded targets, this should also restrict the max number of parallel WG threads. If you can afford only one local memory "alive" at a time, you can launch only one WG thread.

BR,
--
--Pekka |
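The sizing rule above can be written down as a short sketch. This is an illustration, not pocl code: the function name, its parameters, and the idea of passing an explicit local memory budget are all assumptions made here to show the arithmetic (hardware threads per core times core count, capped by the WG count and by how many local memories fit).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of parallel WG threads to launch.
   cores               - core count of the device
   hw_threads_per_core - hardware threads per core
   num_wgs             - number of workgroups in the kernel launch
   local_mem_per_wg    - local memory needed by one WG, in bytes (0 = none)
   local_mem_budget    - bytes available for all live local memories
   A result of 0 means there are no WGs, or not even one local memory fits. */
static size_t wg_thread_count(size_t cores, size_t hw_threads_per_core,
                              size_t num_wgs, size_t local_mem_per_wg,
                              size_t local_mem_budget)
{
  size_t n;

  assert(cores > 0 && hw_threads_per_core > 0);

  /* Only hardware threads help to hide memory latency. */
  n = cores * hw_threads_per_core;

  /* Never create more threads than there are WGs to run. */
  if (num_wgs < n)
    n = num_wgs;

  /* Each parallel WG needs its own local memory space. */
  if (local_mem_per_wg > 0) {
    size_t fit = local_mem_budget / local_mem_per_wg;
    if (fit < n)
      n = fit;
  }
  return n;
}
```

For example, wg_thread_count(4, 2, 100, 0, 0) gives 8, while a launch of only 3 WGs on the same device gives 3, and a budget that holds a single local memory gives 1. As for the inputs, on Linux and Mac the number of online logical processors is commonly available through sysconf(_SC_NPROCESSORS_ONLN), which may be one of the OS-specific ways meant above.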