Re: [ViennaCL-devel] Influence of the number of kernels on the behavior of a program

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Hi,

Just to let you know that the problem was indeed fixed using:

opencl/kernels/vector.hpp::avbv:
opencl_source(template,statements, binding_policy)

in opencl/vector_operations.hpp::avbv
enqueue(template, program, statements, binding_policy)

This is not the safest thing ever, since template, statements,
binding_policy have to be the same in opencl_source and enqueue. But since
these should remain internal and since we know what we're doing, this does
not seem like a big issue to me. The risk with a "safer" interface is to
make the code generation process less clear.

The compilation time on my laptop is 1.8seconds, while for the original
vector_float_double-test-opencl on the master branch it is slightly more
than 2 seconds. Sounds like having one kernel per flip/reciprocal ends up
being beneficial on the NVidia SDK. Having multiple simpler kernels sounds
like a better approach ; I'll remember that.

Philippe

2014-05-29 1:18 GMT+02:00 Philippe Tillet <phi...@gm...>:

> Hey,
>
> There is no such generator objects. Actually, the generation of the source
> for each kernel is independent. The interface of the generator is now much
> clearer and really only requires
>
> generate_opencl_source(template, list of statements corresponding to the
> template, throws an exception if they cannot be packed together). The
> generator should just generate and does not try to somewhat retrieve some
> information we already know :p
>
> I feel like I should add another parameter, symbolic_mapping_options,
> which would govern how true objects are mapped to symbolic objects. We can
> have multiple policies (same object handle to the same symbolic object, all
> objects to different symbolic objects).
>
> so that, calling:
> generate_opencl_source(template,statements,ALL_UNIQUE)
> enqueue(template, program, ALL_UNIQUE)
>
> Would safely do what we want (generate and enqueue x = y + z even when
> passed x = y + x for example).
>
> I'll do benchmarks on the multiple policies (once it works), and report,
> to know if it's worth it to add an additional parameter for the user to
> tune.
>
> Philippe
>
> 2014-05-28 16:59 GMT+02:00 Karl Rupp <ru...@iu...>:
>
> Hey,
>>
>>
>>  There is one program of around 1000 kernels. For comparison, the current
>>> vector operations program take 2seconds to compile on my laptop. Since
>>> the 1.6seconds are including all the flip, reciprocal, but not the x =
>>> a*x + b*y -like operations, I suspect that this would be better in every
>>> aspect to have multiple kernels for float/double too.
>>>
>>
>> It would be certainly more uniform and save us quite some special cases,
>> yes. So, let me ask the question differently then: How large is the
>> performance difference of 'non-optimized' versus 'optimized' for these
>> vector kernels? For BLAS level 1 operations I'd expect that it is in the
>> order of 10 percent, if at all (of course assuming that we pick suitable
>> device-specific profiles). Even for the matrix-vector product it shouldn't
>> get above ~30%. I'd also expect that the older the hardware, the more
>> pronounced the effect.
>>
>> Another suggestion: What if use the 'non-optimized' kernels by default,
>> but provide users with a switch to explicitly enable the optimized kernels?
>> This way users are made aware of the trade-off if they're hunting for the
>> last bit of performance, while the average user isn't annoyed by large
>> jit-overhead.
>
>
>>
>>
>>  I suspect that it
>>> is faster for the compiler to have multiple simple kernel, rather than
>>> one more complicated (the compiler probably has a hard time optimizing
>>> the conditional statements if it cannot assume that the condition will
>>> have the same value for all the threads). I really think that having
>>> separate kernels for flip/reciprocal is the way to go. Plus, it allows
>>> us to have the same implementation for all the numeric types, which is
>>> much, much better.
>>>
>>
>> It is a trade-off. If we can keep compilation times sufficiently low,
>> then we can follow this route. ~1 second jit-overhead is kind of a magical
>> threshold: Make it .2 and nobody will really notice, but make it 5 and
>> everybody will complain.
>>
>>
>>               - The generator interprets differently x = a*y + b*z, x =
>>>         a*y + b*x,
>>>              x = a*x + b*y, etc...
>>>
>>>
>>>     Hmm, this needs fixing. I see two possible paths:
>>>       - Only query the kernel sources from the generator, but manage
>>>     them in a separate entity just like it is done now. This way one can
>>>     deal with the necessary extra logic outside the generator, just like
>>>     it is done now in viennacl/linalg/opencl/__kernels/*.hpp
>>>
>>>
>>>
>>> I don't see it as a "bug", though.  x = a*y + b*z and x = a*y + b*x are
>>> two different expression trees. It eventually leads to two equivalent
>>> kernels (2N reads, N write) because of the triviality of avbv kernels,
>>> not because of the expression tree. I think it's normal that it is
>>> interpreted differently, but we should have a way to control it to allow
>>> for a simpler behavior.
>>>
>>
>> It's not a bug in the generator, but it's a bug in the way we handle the
>> kernels then. If jit-compilation overhead is becoming an issue, we have to
>> take appropriate measures to fight it, and not just hide behind the
>> academic barrel and justify excessive overhead via 'but the implementation
>> is nice and abstract'. Providing a flag in order to control the behavior
>> sounds like a legitimate approach here.
>>
>>
>>
>>  There are many problems with handling the enqueueing manually, without
>>> using the generator, unfortunately:
>>> - We shouldn't have to modify vector_operations.hpp when we modify a
>>> given kernel template. For example, the number of kernels required by
>>> the template (1 for AXPY, 2 for DOT, 1 for GEMV/GEMM) should remain
>>> entirely internal. If we ever find out that GEMV is better off with 2
>>> kernels on CPUs, then we shouldn't have to modify anything else than the
>>> GEMV profile. Similarly, if for some reason we realize that a kernel
>>> could be optimized by having an additional argument, we shouldn't have
>>> to modify the viennacl core. While the generation and the enqueueing
>>> should be clearly separated, it is fundamental that they remain
>>> encapsulated.
>>>
>>
>> Agreed.
>>
>>
>>
>>
>>               - Each avbv requires 2 kernel, because we need one fallback
>>>         when the
>>>              offset is not a multiple of the simd_width. There are some
>>>         trick on
>>>              AMD implementations to avoid doing this, but I know no
>>>         portable trick.
>>>
>>>
>>>     Do you have performance numbers on this? As this is heavily
>>>     memory-bandwidth limited, it may not make any notable difference
>>>     overall.
>>>     Btw: Could you please explain in a sentence or two what this new
>>>     simd_width parameter is about? I know what SIMD is, yet I want to
>>>     make sure that we are talking about the same thing.
>>>
>>>
>>> Yes, the name will change. I should call it vector_width, to conform
>>> with the OpenCL standards. It's about using float4* instead of float*,
>>> for example. It does make a huge bandwidth difference to load float4*
>>> rather than float* on my AMD HD5850. I guess you observed that as well
>>> when you auto-tuned avbv.
>>>
>>
>> Yes, I observed that. I also observed that without these vector types one
>> can still get fairly close to peak performance if the right work group
>> sizes/dimensions are chosen, particularly on newer hardware.
>>
>>
>>
>>
>>  The problem is that using float4* restricts pointer arithmetics so the
>>> offset is forced to be a multiple of 4. On AMD hardware, one may safely
>>> do
>>>
>>> union ptr{
>>>     float* fp;
>>>     float4* f4p;
>>> }
>>>
>>> ptr.f4p = my_float4_ptr;
>>> ptr.fp += offset.
>>>
>>> To handle all offsets, but it does not sounds to me like a reasonable
>>> portable solution...
>>>
>>
>> You certainly know this thread: ;-)
>> http://www.khronos.org/message_boards/showthread.php/
>> 8751-Is-it-safe-to-cast-a-pointer-to-vector-to-pointer-to-scalar
>>
>> The endianness should not be a problem here, as both float* and float4*
>> have the same size. My interpretation of the standard is that you can
>> savely go from float4 to float, but the other way only works within the
>> alignment guarantees. A more practical problem is that of performance: If a
>> compiler sees such constructs, it may quickly switch to worst-case
>> assumptions, not providing much benefit.
>>
>>
>>
>>  As a sketch of how it is implemented, it does something like this
>>>
>>> in linald/opencl/kernels/vector.hpp:generate_avbv:
>>> source.append(device_specific::generate::opencl_sources(
>>> database::get<T>(database::axpy),
>>> scheduler::preset::axpy(&x, &y, &alpha, reciprocal_a, flip_a, &z, &beta,
>>> reciprocal_b, flip_b));
>>>
>>> And in linalg/opencl/vector_operations.hpp:avbv
>>> device_specific::enqueue(database::get<T>(database::axpy),
>>> kernels::vector<T>::program_name, scheduler::preset::axpy(&x, &y,
>>> &alpha, reciprocal_a, flip_a, &z, &beta, reciprocal_b, flip_b));
>>>
>>
>> Thanks!
>>
>>
>>
>>  The whole problem could be sorted out by adding a bool parameter to
>>> opencl_sources() and enqueue(), (to tell the generator to ignore if x, y
>>> and z are the same objects/point to the same memory handle in the same
>>> statement). Initially, this feature was added to identify handles when
>>> there are multiple statements, in order to load z only once in things
>>> like:
>>>
>>> s = reduce<max>(z) - reduce<min>(z)
>>> {x = y + z, y = x + z}
>>> x = y + z + z
>>>
>>> It turns out that in the last case the interest is limited, but it's
>>> just a special case!
>>>
>>
>> Hmm, actually I find it better to provide the boolean flag to a generator
>> object rather than passing it each time when I want to enqueue. The reason
>> is that the generator knows which kernels have been compiled earlier,
>> whereas that knowledge isn't usually available in the routines taking care
>> of the enqueue().
>>
>> Best regards,
>> Karli
>>
>>
>

Re: [ViennaCL-devel] Influence of the number of kernels on the behavior of a program

Linear algebra and solver library using CUDA, OpenCL, and OpenMP

Re: [ViennaCL-devel] Influence of the number of kernels on the behavior of a program