Thread: [Pocl-devel] Multithreading support committed

Hi all,

I just committed rev. 45 with a multithreading device. It is similar to
native, but creates a thread for each workgroup.

This device is now also the default device.

BR,

Carlos

On 10/25/2011 07:01 PM, Carlos Sánchez de La Lama wrote:
> I just committed rev. 45 with a multithreading device, similar to native
> but creates a thread for each workgroup.

I committed a modification to the multithreading code on Friday.

Now it creates a "sensible number" of threads for the multicore
instead of blindly creating as many threads as there are WGs.

However, parsing /proc/cpuinfo to obtain the number of hardware
threads available in the processor is a bit flaky. If you run Linux,
please check that it returns a sensible number of threads for you:
enable the #define DEBUG_MAX_THREAD_COUNT in pthread.c, then compile
and run one of the examples. It should print the "max thread count"
for your (multi)processor before running the kernel. For Mac (and
Windows) we need to figure out some other way to get the hardware
thread count, which currently defaults to 8.
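
For readers who want to see what such a /proc/cpuinfo check might look
like, here is a minimal sketch in C. It is not the code in pthread.c; it
assumes that counting the lines beginning with "processor" is a
reasonable approximation of the logical CPU count (true on x86 Linux,
not guaranteed on other architectures). The names count_processor_lines,
max_thread_count and FALLBACK_THREAD_COUNT are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical fallback, mirroring the default of 8 mentioned above. */
#define FALLBACK_THREAD_COUNT 8

/* Count the lines that start with "processor" in a cpuinfo-style stream.
   Assumption: one such line per logical CPU, as on x86 Linux. */
static int count_processor_lines(FILE *f)
{
  char line[1024];
  int count = 0;
  int at_line_start = 1;

  while (fgets(line, sizeof line, f) != NULL) {
    if (at_line_start && strncmp(line, "processor", 9) == 0)
      ++count;
    /* A chunk without '\n' means the line continues in the next chunk. */
    at_line_start = (strchr(line, '\n') != NULL);
  }
  return count;
}

/* Hypothetical helper: the counted value, or the fallback when
   /proc/cpuinfo is missing or has no "processor" lines. */
static int max_thread_count(void)
{
  FILE *f = fopen("/proc/cpuinfo", "r");
  int count = 0;

  if (f != NULL) {
    count = count_processor_lines(f);
    fclose(f);
  }
  return count > 0 ? count : FALLBACK_THREAD_COUNT;
}
```

Keeping the parser separate from the file access makes it possible to
feed it sample cpuinfo text from other machines, which is one way to
check the "flaky" cases without having the hardware at hand.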

-- 
--Pekka

There is "hwloc", distributed on <http://www.open-mpi.org/>. This
library determines the number of logical CPUs, as well as their
association with various cache levels and NUMA properties.

-erik

-- 
Erik Schnetter <esc...@pe...>
http://www.cct.lsu.edu/~eschnett/
AIM: eschnett247, Skype: eschnett, Google Talk: sch...@gm...

Another option could be to build a small C program that uses OpenMP;
the OpenMP runtime contains logic that determines a good number of
threads to use. You would look at omp_get_max_threads().
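
A minimal sketch of that idea, assuming an OpenMP-capable compiler. The
runtime query in the OpenMP API is spelled omp_get_max_threads(); the
wrapper name openmp_thread_hint is hypothetical, and the fallback value
of 1 for non-OpenMP builds is an assumption made only so the sketch
compiles everywhere.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: ask the OpenMP runtime for the maximum number of
   threads it would use for a parallel region. Returns 1 when the file
   was compiled without OpenMP support. */
static int openmp_thread_hint(void)
{
#ifdef _OPENMP
  return omp_get_max_threads();
#else
  return 1;
#endif
}
```

With GCC this needs -fopenmp at compile and link time. Note that the
returned value follows the OMP_NUM_THREADS environment variable when it
is set, so it is a runtime policy value rather than a pure hardware
count.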

-erik

On 10/30/2011 05:38 PM, Erik Schnetter wrote:
> Another option could be to build a small C program that uses OpenMP;
> the OpenMP runtime contains logic that determines a good number of
> threads to use. You would look at omp_get_max_threads().

I would rather not introduce a library dependency just for this.
I'm sure there are OS-specific ways to find the number of cores and
the number of hardware threads per core on the different operating
systems. Alternatively, we could resort to a CPU identification
instruction on the device, if one is available.
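
As an illustration of the OS-specific route, the sketch below uses
sysconf(_SC_NPROCESSORS_ONLN) where it is defined (glibc and macOS
provide it, although it is not required by POSIX) and GetSystemInfo()
on Windows. The function name os_hw_thread_count is hypothetical. Both
calls report the total number of logical processors, that is, cores
times hardware threads per core, not the two factors separately.

```c
#if defined(_WIN32)
#include <windows.h>
#else
#include <unistd.h>
#endif

/* Hypothetical helper: number of logical processors currently online,
   or -1 if the query fails or is not available on this platform. */
static int os_hw_thread_count(void)
{
#if defined(_WIN32)
  SYSTEM_INFO info;
  GetSystemInfo(&info);
  return (int)info.dwNumberOfProcessors;
#elif defined(_SC_NPROCESSORS_ONLN)
  long n = sysconf(_SC_NPROCESSORS_ONLN);
  return n > 0 ? (int)n : -1;
#else
  return -1;
#endif
}
```

A caller would treat -1 as "unknown" and fall back to a fixed default.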

After all, the current need of pocl is quite simple: if we want to exploit
the task-level parallelism provided by the device to the max while minimizing
the threading overhead, it boils down to the number of hardware threads
per core times the core count (or the number of WGs, whichever is smaller),
doesn't it?
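
Written out, the rule above is
threads = min(hw_threads_per_core * core_count, num_workgroups).
A small C sketch, with hypothetical function and parameter names:

```c
#include <stddef.h>

static size_t min_size(size_t a, size_t b)
{
  return a < b ? a : b;
}

/* Hypothetical helper implementing
   min(hw_threads_per_core * core_count, num_workgroups). */
static size_t wg_thread_count(size_t hw_threads_per_core,
                              size_t core_count,
                              size_t num_workgroups)
{
  return min_size(hw_threads_per_core * core_count, num_workgroups);
}
```

For example, 4 cores with 2 hardware threads each and 100 WGs give 8
threads, while the same machine with only 3 WGs gives 3.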

If disk or network I/O were a concern, there should be additional threads to
hide the I/O latencies (at the OS level). For now, though, we are mainly
concerned with hiding memory latencies, because the kernels do not access
files or the network the way OpenMP loops, for example, can in general. For
memory latency hiding, only hardware threads can be of help, AFAIK.

An additional consideration is the size of the local memory, as each
parallel WG needs a separate local memory space. Currently pocl just
assumes that the local memory malloc overhead (and size) per thread is
tolerable. In reality, for example on memory-tight embedded targets, this
should also restrict the max number of parallel WG threads. If you can afford
only one local memory "alive" at a time, you can launch only one
WG thread.
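
One way such a restriction could look is sketched below. This is not
what pocl does today; the function name, the idea of a fixed local
memory budget, and the parameter names are all assumptions made for
illustration.

```c
#include <stddef.h>

/* Hypothetical helper: limit the number of parallel WG threads so that
   one local memory buffer per thread fits into a given budget.
   Returns 0 if not even a single buffer fits. */
static size_t cap_threads_by_local_mem(size_t wanted_threads,
                                       size_t local_mem_budget,
                                       size_t local_mem_per_wg)
{
  size_t fit;

  if (local_mem_per_wg == 0)
    return wanted_threads; /* the kernel uses no local memory */

  fit = local_mem_budget / local_mem_per_wg;
  return fit < wanted_threads ? fit : wanted_threads;
}
```

With a 16 KiB budget and 16 KiB of local memory per WG, a request for 8
threads is cut down to 1, which matches the "only one local memory
alive" case described above.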

BR,
-- 
--Pekka