Optimise CPU implementation
C++ library for sorting and searching in OpenCL applications
Brought to you by: bmerry
For scanning in particular, a serial implementation may well be best, since the operation is limited by memory bandwidth rather than by compute. A further refinement could be to work in parallel, but only on small pieces at a time that fit in the L2 cache.
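A minimal sketch of both ideas, written in plain C++ with `std::thread` rather than anything from the CLOGS API. The function names `serial_exclusive_scan`, `blocked_parallel_exclusive_scan` and `run_on_threads`, the `block_elems` parameter, and spawning fresh threads per phase are all hypothetical choices made for illustration. The blocked version handles one block at a time. Each thread first reduces its slice, then the per-thread totals are scanned serially, and finally each thread scans its slice again, starting from its offset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Serial exclusive scan: a single streaming pass over memory.
// Each input is read before its output is written, so in == out works too.
template <typename T>
T serial_exclusive_scan(const T *in, T *out, std::size_t n, T init = T())
{
    T sum = init;
    for (std::size_t i = 0; i < n; i++)
    {
        T v = in[i];
        out[i] = sum;
        sum += v;
    }
    return sum; // total of init plus all inputs
}

// Runs f(t) for t in [0, num_threads) on separate threads and waits for all.
// (Hypothetical helper; a real implementation would reuse a thread pool.)
template <typename F>
void run_on_threads(unsigned num_threads, F f)
{
    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; t++)
        threads.emplace_back(f, t);
    for (auto &th : threads)
        th.join();
}

// Blocked parallel exclusive scan. block_elems is an assumed tuning knob,
// intended to be small enough that each thread's slice of a block stays in
// its core's L2 between the reduce pass and the scan pass.
template <typename T>
void blocked_parallel_exclusive_scan(const T *in, T *out, std::size_t n,
                                     std::size_t block_elems, unsigned num_threads)
{
    assert(block_elems > 0 && num_threads > 0);
    std::vector<T> partial(num_threads);
    T carry = T(); // sum of all elements in previous blocks
    for (std::size_t base = 0; base < n; base += block_elems)
    {
        std::size_t len = std::min(block_elems, n - base);
        std::size_t chunk = (len + num_threads - 1) / num_threads;
        // Start index of thread t's slice within the whole array.
        auto lo = [&](std::size_t t) { return base + std::min(t * chunk, len); };

        // Phase 1 (parallel): each thread reduces its slice of the block.
        run_on_threads(num_threads, [&](unsigned t) {
            T s = T();
            for (std::size_t i = lo(t); i < lo(t + 1); i++)
                s += in[i];
            partial[t] = s;
        });

        // Phase 2 (serial): exclusive scan of the per-thread totals.
        carry = serial_exclusive_scan(partial.data(), partial.data(),
                                      partial.size(), carry);

        // Phase 3 (parallel): each thread scans its slice from its offset.
        // This re-reads data touched in phase 1, ideally now from cache.
        run_on_threads(num_threads, [&](unsigned t) {
            serial_exclusive_scan(in + lo(t), out + lo(t), lo(t + 1) - lo(t),
                                  partial[t]);
        });
    }
}
```

Each block is read twice (once to reduce, once to scan), so the blocking only helps if the second read is served from cache. If `block_elems` is too large, the parallel version moves roughly twice as much data through DRAM as the serial one, which is why a purely serial scan is a reasonable baseline for a bandwidth-bound operation.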