From: Philippe T. <phi...@gm...> - 2014-05-30 03:39:41
|
Hi, Just to let you know that the problem was indeed fixed using: opencl/kernels/vector.hpp::avbv: opencl_source(template,statements, binding_policy) in opencl/vector_operations.hpp::avbv enqueue(template, program, statements, binding_policy) This is not the safest thing ever, since template, statements, binding_policy have to be the same in opencl_source and enqueue. But since these should remain internal and since we know what we're doing, this does not seem like a big issue to me. The risk with a "safer" interface is to make the code generation process less clear. The compilation time on my laptop is 1.8seconds, while for the original vector_float_double-test-opencl on the master branch it is slightly more than 2 seconds. Sounds like having one kernel per flip/reciprocal ends up being beneficial on the NVidia SDK. Having multiple simpler kernels sounds like a better approach ; I'll remember that. Philippe 2014-05-29 1:18 GMT+02:00 Philippe Tillet <phi...@gm...>: > Hey, > > There is no such generator objects. Actually, the generation of the source > for each kernel is independent. The interface of the generator is now much > clearer and really only requires > > generate_opencl_source(template, list of statements corresponding to the > template, throws an exception if they cannot be packed together). The > generator should just generate and does not try to somewhat retrieve some > information we already know :p > > I feel like I should add another parameter, symbolic_mapping_options, > which would govern how true objects are mapped to symbolic objects. We can > have multiple policies (same object handle to the same symbolic object, all > objects to different symbolic objects). > > so that, calling: > generate_opencl_source(template,statements,ALL_UNIQUE) > enqueue(template, program, ALL_UNIQUE) > > Would safely do what we want (generate and enqueue x = y + z even when > passed x = y + x for example). > > I'll do benchmarks on the multiple policies (once it works), and report, > to know if it's worth it to add an additional parameter for the user to > tune. > > Philippe > > 2014-05-28 16:59 GMT+02:00 Karl Rupp <ru...@iu...>: > > Hey, >> >> >> There is one program of around 1000 kernels. For comparison, the current >>> vector operations program take 2seconds to compile on my laptop. Since >>> the 1.6seconds are including all the flip, reciprocal, but not the x = >>> a*x + b*y -like operations, I suspect that this would be better in every >>> aspect to have multiple kernels for float/double too. >>> >> >> It would be certainly more uniform and save us quite some special cases, >> yes. So, let me ask the question differently then: How large is the >> performance difference of 'non-optimized' versus 'optimized' for these >> vector kernels? For BLAS level 1 operations I'd expect that it is in the >> order of 10 percent, if at all (of course assuming that we pick suitable >> device-specific profiles). Even for the matrix-vector product it shouldn't >> get above ~30%. I'd also expect that the older the hardware, the more >> pronounced the effect. >> >> Another suggestion: What if use the 'non-optimized' kernels by default, >> but provide users with a switch to explicitly enable the optimized kernels? >> This way users are made aware of the trade-off if they're hunting for the >> last bit of performance, while the average user isn't annoyed by large >> jit-overhead. > > >> >> >> I suspect that it >>> is faster for the compiler to have multiple simple kernel, rather than >>> one more complicated (the compiler probably has a hard time optimizing >>> the conditional statements if it cannot assume that the condition will >>> have the same value for all the threads). I really think that having >>> separate kernels for flip/reciprocal is the way to go. Plus, it allows >>> us to have the same implementation for all the numeric types, which is >>> much, much better. >>> >> >> It is a trade-off. If we can keep compilation times sufficiently low, >> then we can follow this route. ~1 second jit-overhead is kind of a magical >> threshold: Make it .2 and nobody will really notice, but make it 5 and >> everybody will complain. >> >> >> - The generator interprets differently x = a*y + b*z, x = >>> a*y + b*x, >>> x = a*x + b*y, etc... >>> >>> >>> Hmm, this needs fixing. I see two possible paths: >>> - Only query the kernel sources from the generator, but manage >>> them in a separate entity just like it is done now. This way one can >>> deal with the necessary extra logic outside the generator, just like >>> it is done now in viennacl/linalg/opencl/__kernels/*.hpp >>> >>> >>> >>> I don't see it as a "bug", though. x = a*y + b*z and x = a*y + b*x are >>> two different expression trees. It eventually leads to two equivalent >>> kernels (2N reads, N write) because of the triviality of avbv kernels, >>> not because of the expression tree. I think it's normal that it is >>> interpreted differently, but we should have a way to control it to allow >>> for a simpler behavior. >>> >> >> It's not a bug in the generator, but it's a bug in the way we handle the >> kernels then. If jit-compilation overhead is becoming an issue, we have to >> take appropriate measures to fight it, and not just hide behind the >> academic barrel and justify excessive overhead via 'but the implementation >> is nice and abstract'. Providing a flag in order to control the behavior >> sounds like a legitimate approach here. >> >> >> >> There are many problems with handling the enqueueing manually, without >>> using the generator, unfortunately: >>> - We shouldn't have to modify vector_operations.hpp when we modify a >>> given kernel template. For example, the number of kernels required by >>> the template (1 for AXPY, 2 for DOT, 1 for GEMV/GEMM) should remain >>> entirely internal. If we ever find out that GEMV is better off with 2 >>> kernels on CPUs, then we shouldn't have to modify anything else than the >>> GEMV profile. Similarly, if for some reason we realize that a kernel >>> could be optimized by having an additional argument, we shouldn't have >>> to modify the viennacl core. While the generation and the enqueueing >>> should be clearly separated, it is fundamental that they remain >>> encapsulated. >>> >> >> Agreed. >> >> >> >> >> - Each avbv requires 2 kernel, because we need one fallback >>> when the >>> offset is not a multiple of the simd_width. There are some >>> trick on >>> AMD implementations to avoid doing this, but I know no >>> portable trick. >>> >>> >>> Do you have performance numbers on this? As this is heavily >>> memory-bandwidth limited, it may not make any notable difference >>> overall. >>> Btw: Could you please explain in a sentence or two what this new >>> simd_width parameter is about? I know what SIMD is, yet I want to >>> make sure that we are talking about the same thing. >>> >>> >>> Yes, the name will change. I should call it vector_width, to conform >>> with the OpenCL standards. It's about using float4* instead of float*, >>> for example. It does make a huge bandwidth difference to load float4* >>> rather than float* on my AMD HD5850. I guess you observed that as well >>> when you auto-tuned avbv. >>> >> >> Yes, I observed that. I also observed that without these vector types one >> can still get fairly close to peak performance if the right work group >> sizes/dimensions are chosen, particularly on newer hardware. >> >> >> >> >> The problem is that using float4* restricts pointer arithmetics so the >>> offset is forced to be a multiple of 4. On AMD hardware, one may safely >>> do >>> >>> union ptr{ >>> float* fp; >>> float4* f4p; >>> } >>> >>> ptr.f4p = my_float4_ptr; >>> ptr.fp += offset. >>> >>> To handle all offsets, but it does not sounds to me like a reasonable >>> portable solution... >>> >> >> You certainly know this thread: ;-) >> http://www.khronos.org/message_boards/showthread.php/ >> 8751-Is-it-safe-to-cast-a-pointer-to-vector-to-pointer-to-scalar >> >> The endianness should not be a problem here, as both float* and float4* >> have the same size. My interpretation of the standard is that you can >> savely go from float4 to float, but the other way only works within the >> alignment guarantees. A more practical problem is that of performance: If a >> compiler sees such constructs, it may quickly switch to worst-case >> assumptions, not providing much benefit. >> >> >> >> As a sketch of how it is implemented, it does something like this >>> >>> in linald/opencl/kernels/vector.hpp:generate_avbv: >>> source.append(device_specific::generate::opencl_sources( >>> database::get<T>(database::axpy), >>> scheduler::preset::axpy(&x, &y, &alpha, reciprocal_a, flip_a, &z, &beta, >>> reciprocal_b, flip_b)); >>> >>> And in linalg/opencl/vector_operations.hpp:avbv >>> device_specific::enqueue(database::get<T>(database::axpy), >>> kernels::vector<T>::program_name, scheduler::preset::axpy(&x, &y, >>> &alpha, reciprocal_a, flip_a, &z, &beta, reciprocal_b, flip_b)); >>> >> >> Thanks! >> >> >> >> The whole problem could be sorted out by adding a bool parameter to >>> opencl_sources() and enqueue(), (to tell the generator to ignore if x, y >>> and z are the same objects/point to the same memory handle in the same >>> statement). Initially, this feature was added to identify handles when >>> there are multiple statements, in order to load z only once in things >>> like: >>> >>> s = reduce<max>(z) - reduce<min>(z) >>> {x = y + z, y = x + z} >>> x = y + z + z >>> >>> It turns out that in the last case the interest is limited, but it's >>> just a special case! >>> >> >> Hmm, actually I find it better to provide the boolean flag to a generator >> object rather than passing it each time when I want to enqueue. The reason >> is that the generator knows which kernels have been compiled earlier, >> whereas that knowledge isn't usually available in the routines taking care >> of the enqueue(). >> >> Best regards, >> Karli >> >> > |