libjpeg-turbo / Patches (closed - use GitHub) / #40 OpenCL decoding support

Peixuan Zhang - 2013-07-04

All,
The attachment is the patch file for OpenCL decoding version. The base version is 990 in truck (offical svn.) please use the following command to test it:

djpeg.exe -bmp -outfile out.bmp in.jpg

It may need some tiny change to fully support output formats other than bmp. This version is only tested on Windows, we also wrote an OpenCL loader code for Linux, but we haven't tested on Linux. I'm waiting for any advise for this patch.

opencl-decoding.patch
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DRC - 2013-07-04

Thanks for the submission. Upon cursory first glance, I have several comments:

(1) Is it necessary to include the CL/ directory? That seems to contain some common headers that should be provided with an external OpenCL toolkit.

(2) The header for the new C files refers one to joclext.inc for licensing conditions, but that file contains no licensing conditions. If you wish to use the same license as libjpeg, then you should include the text "For conditions of distribution and use, see the accompanying README file" in your .c files. See the existing source code for examples.

(3) libjpeg-turbo already has a mechanism for loading different algorithm implementations (such as SSE2 or MMX, for instance.) I'd really like the OpenCL stuff to be implemented similarly. I will look at the code more closely in the coming weeks and see if I can provide a more specific suggestion here.

(4) Pursuant to the above, the detection of OpenCL should occur behind the scenes, just like the SIMD technologies, so it should not be necessary to export jocl_cl_is_available() and jpeg_is_opencl_decompress(). Adding new functions to a well-defined API/ABI like libjpeg is a tricky proposition at best.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Peixuan Zhang - 2013-07-04

Thanks for the reply.
(1) This version use dynamic loading OpenCL dll/so file, so we don't use any lib/so file when building the code. I added CL directory is because all the developer,even who don't install any OpenCL sdk, could also build the CL version.
(2) We will wrote the license later, thanks a lot.
(3) The work flow of OpenCL version is different from CPU/SIMD version, CPU and GPU are working in parallel, so it's diffcult to fully copy the code style of SIMD version. But we could do as much as possilble, I will wait for you advise, and I think your advise will be very important.
(4) I agree, and we will try to fixed it.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Siarhei Siamashka - 2013-07-04

Regarding the work flow change, perhaps the CPU implementation could also make use of the pipelined huffman and SIMD processing in separate threads with relatively little effort (assuming that this already has proven to provide good results for OpenCL)? Almost every system has a multi-core processor nowadays, This seems to be a very promising optimization if done right.

Just curious, how much of the performance improvement are you typically getting with OpenCL?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Peixuan Zhang - 2013-07-04
  
  I think the SIMD is a type of parallel, and if we only use CPU, the pipeline may not good.
  About the performance, just use a 4096x3200 image:
  Intel i7 3520M SIMD: 85.3 ms
  AMD A10M 4600M SIMD: 148.8 ms
  AMD A10M 4600M GPU OpenCL no pipeline: 73.6 ms
  AMD A10M 4600M GPU OpenCL with pipeline: 53.2 ms.
  Intel i7 3520M CPU OpenCL with pipeline: 90.0 ms. (It shows if only use CPU, the pipeline's performance is not good)
  Intel i7 3520M GPU OpenCL with pipeline: 54.6582 ms.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Siarhei Siamashka - 2013-07-04

Thanks for the numbers! I'll also try to test the patch in Linux on my system a bit later.

If I understand the idea of your patch correctly, you keep the main CPU thread doing just huffman (which can't be efficiently parallelized) and offload the rest of processing to "something else". This "something else" in your case happens to be OpenCL running on the GPU. Or it can be CPU OpenCL like you did for testing/benchmarking purposes. However we don't really need to run OpenCL on the CPU (with the OpenCL framework overhead and less than perfect compiler generated code for IDCT). Instead we can use a lightweight worker thread, using the current SSE2 assembly optimized SIMD code.

I just see the following two logical parts in your patch:
1. Make offloading possible
2. Implement OpenCL backend for offloading

After this is done, somebody else may try to provide a way to replace the OpenCL backend with a CPU worker thread (may be useful for the multi-core systems without OpenCL), while reusing some of your work.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Peixuan Zhang - 2013-07-17

Is there any progress for check the code? We have also improved some code and will commit it soon.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DRC - 2013-07-19

Here are the things I need addressed before I can start integrating the code:

(1) Licensing and copyright notice cleanup.

(a) If MulticoreWare, Inc. actually owns your work (which is generally the case if all of the developers working on the project are salaried employees of the company), then the MulticoreWare copyright notice is the only one that needs to be in the new files. You can add additional lines to indicate the individual authors, but the additional authors' names should not be prefaced by "Copyright".

(b) The date of copyright for each file should be the year in which the file in question was modified (or a range of years if the file was modified in multiple years.)

(c) You seem to have copied/pasted some language from the copyright headers of the SIMD extensions that is not applicable to what you're doing. If you intend your work to be released under the same license as the rest of libjpeg, then you need to refer to that license in your headers by saying "For conditions of distribution and use, see the accompanying README file."

Getting this right is important, because we are only legally allowed to use your code under the terms of the open source license. The only way we know that you truly intend to release the code into the open source is that you properly document your intent in the copyright headers. Thus, I must have a new patch with these issues corrected before I can check any of the code into the repository.

(2) In previous comments, you list numbers obtained with and without the pipeline. How does one disable/enable that feature for testing?

(3) You have extended the jpeg_decompress_struct structure to accommodate some additional fields used only by the OpenCL code. Unfortunately, extending that structure breaks ABI compatibility, so those new fields need to be eliminated or moved into one of the existing opaque structures. Please feel free to open a discussion regarding issues that you feel might prevent this from happening.

(4) As previously mentioned, programs that use libjpeg-turbo should not be aware of OpenCL. Thus, the modifications to djpeg.c and wrbmp.c need to be eliminated, as well as the additional API functions that were specific to OpenCL. Please feel free to open a discussion regarding issues that you feel might prevent this from happening.

(5) Am I misreading, or does the code always create enough space for a 4096 x 4096 image? That seems wasteful and prone to errors (from running out of video memory.)

(6) Why does jocl_cl_init() set the thread priority of the calling thread to time critical?

(7) I can't personally make the code work. It segfaults pretty quickly after compiling the OpenCL kernels, in line 185 of jdcoefct.c. I also ported the code to Linux, and it segfaults in the same place on that platform. I'm not familiar enough with how the code is supposed to work to say what could be causing the segfault. Perhaps it failed to allocate memory?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Peixuan Zhang - 2013-07-21
  
  Thanks a lot for the reply, and I will fix these issue ASAP.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Peixuan Zhang - 2013-07-22
  
  Hello All,
  This is the newest patch base on libjpeg-turbo r991.
  The following changes have been submitted:
  
  1. For OpenCL version, we use a new jpeg_natural_order (transpose the results), so the OpenCL kernel function "IDCT" should be faster than before. 2. Improve all the kernel funciton, so this patch is faster then last patch. 3. Fixed some bugs and impove the code style. 4. Add license info to some files. 5. Remove all added API/ABI, so now the jpeglib.h files is the same as base version. 6. Add Linux dynamic loading library support.
  
  the things that need addressed from DRC:
  about 2,
  We defined a macro switch in jinclude.h (Line 25), if you want to use OpenCL version without the pipeline, please comment out it.
  about 5,
  The size of space we created can be defined by macro , and we think we could find a proper size for this space. But the space can not be created dynamicly according to images' size, that's because we have to write faked data to the zero copy buffers once, and also run kernels once before we start to do real decoding. By profiling, we found out that the time consumed by DMA (including zero copy) and kernel execution for the first time is extremely long (100ms for example for zero copy from CPU to GPU for 39 MB data), but the time consumed for executing the same operators on the same buffer or running the same kernel on the same buffer in the subsequent is much shorter. Besides, in some applications of libjpeg_turbo(such as chromium), we created the space at the time of decoding the first image, and the other images decoding doesn't need to create the space again, the performance will be improved.
  about 6,
  The thread priority of jocl_cl_init() is removed, it's to make sure the response speed of OpenCL API, but it's not very necessary.
  about 7,
  We are unable to reproduce this bugs, but the newest patch is committed, and if the code is still cannot work, could you give us more detail information?
  
  We expect more suggestions to further modify our code, thanks a lot.
  
  20130722OpenCL.patch
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DRC - 2013-07-29

The segfault seems to be due to a compilation error in the OpenCL kernels. I enabled debug output and got the following:

Platform:
profile: FULL_PROFILE
version: OpenCL 1.1 CUDA 4.2.1
name: NVIDIA CUDA
vendor: NVIDIA Corporation
extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll
Device:
Name: Quadro 600
OpenCL Build Error:CL_BUILD_PROGRAM_FAILURE
:98:30: error: can't convert between vector values of different size ('short8' and 'int')
dataShort = dataShort + 128;
~~~~~~~~~ ^ ~~~

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Peixuan Zhang - 2013-08-05

We have found this issue, and have fixed it, by using (short)128, but we also found other issues in NVidia platform, so we are still fixing bugs. We will give new pathc ASAP.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Hello All,

 The attachment is the new patch, including the following improvements:

 1. Added OpenCL support in configure.ac and makefile.am, now we can use the parameter –with-opencl-dec to enable OpenCL support on Linux/Mac OSX.
 But I don't have a iMac or a Mac Pro, so I didn't fully test on Mac OSX.

 2. Tested and fixed bugs for Intel OpenCL, AMD OpenCL and NVidia OpenCL. We tested on AMD GPU and CPU, Intel Ivy bridge CPU and GPU and NV GTX680, all the platform are supported.
 However, because of NVidia platform doesn't support OpenCL 1.2, we can't support CPU/GPU pipeline on NVidia platform.

 3. Adding new kernels: IDCT-isslow, IDCT-float, and Up-sample-RGB/RGBA.

 4. Writing kernels to binary files when the first running, then it can load binary kernel from files since the second running, so we don't need to build OpenCL kernels every time.

 5. Improved the code style and fixed some tiny bugs.

 If there is any question, please tell me, and we will modify the code ASAP.

opencl_20130806.patch

Peixuan Zhang - 2013-08-06

Sorry, the patch is base on R996. I forgot to say that.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DRC - 2013-08-07

Getting a different error now. I am building on CentOS 5 using the latest nVidia OpenCL SDK and running on a Quadro 600.

[~/src/libjpeg-turbo.opencl/linux64]> sh ../configure --with-opencl-dec CFLAGS='-O3 -g'

[~/src/libjpeg-turbo.opencl/linux64]> make

[~/src/libjpeg-turbo.opencl/linux64]> ulimit -c 10000000

[~/src/libjpeg-turbo.opencl/linux64]> ./tjunittest

Platform:
profile: FULL_PROFILE
version: OpenCL 1.1 CUDA 4.2.1
name: NVIDIA CUDA
vendor: NVIDIA Corporation
extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll
Device:
Name: Quadro 600
OpenCL is enabled.
Create kernel OK!
OpenCL error in err_code = jocl_clFinish(jocl_cl_get_command_queue()), Line 1034 in file ../jocldec.c
Error:CL_INVALID_COMMAND_QUEUE
RGB Top-Down -> 4:4:4 Q100 ... Done. 2.501965 ms
Result in test_enc_RGB_TD_444_Q100.jpg
Segmentation fault (core dumped)

[~/src/libjpeg-turbo.opencl/linux64]> gdb .libs/lt-tjunittest core.29757

Program terminated with signal 11, Segmentation fault.

0 decode_mcu_slow (cinfo=0x105f4d28, MCU_data=0x1080e9b8) at ../jdhuff.c:591

591 (*block)[0] = (JCOEF) s;
(gdb) print block
$1 = (JBLOCKROW) 0x0

Note that I disabled pipelining (by commenting out the "#define OPENCL_PIPELINE" line in jconfig.h.in) prior to building the code.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DRC - 2013-08-07

Also note:

To make things easier on us, I created a SVN branch with your latest patch under branches/opencl. Please base any future patches off of that branch. Once I can verify that this actually works on Linux, then I will do some performance assessments. If those reveal that the performance is interesting enough to make me want to release this as an officially-supported feature, then I will start a thorough code review, clean up some cosmetic things, and do some more thorough testing (including on Windows and using other OpenCL implementations.) There are some areas that still concern me, probably the biggest being the static allocation of a 16-megapixel buffer. That could ultimately prove to be a show-stopper for this feature.

nVidia GPU support is the most interesting for me personally, because that's what most of my customers use, and I am mainly interested in this in the context of improving client-side performance for TurboVNC.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Peixuan Zhang - 2013-08-08
  
  Attachment is a tiny patch, we fixed a bug's to support NVidia OpenCL on Ubuntu.
  We have test on GTX680 on Ubuntu x64, because I don't have a CentOS system which installed NV Card, so I haven't reproduced the bugs yet.
  
  I want to know more detial info: CentOS 5.9 or other version, x86 or x64? I will install an OS to do test.
  
  Thanks a lot.
  
  profile: FULL_PROFILE
  version: OpenCL 1.1 CUDA 4.2.1
  name: NVIDIA CUDA
  vendor: NVIDIA Corporation
  extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll
  Device:
  Name: GeForce GTX 680
  
  patch-openl-20130808.patch
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DRC - 2013-08-08

It no longer segfaults, but it still doesn't work:

Platform:
profile: FULL_PROFILE
version: OpenCL 1.1 CUDA 4.2.1
name: NVIDIA CUDA
vendor: NVIDIA Corporation
extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_com
piler_options cl_nv_device_attribute_query cl_nv_pragma_unroll
Device:
Name: Quadro 600
OpenCL is enabled.
Compiling OpenCL code successfully.
Create kernel OK!
OpenCL error in err_code = jocl_clFinish(jocl_cl_get_command_queue()), Line 1037
in file ../jocldec.c
Error:CL_INVALID_COMMAND_QUEUE
RGB Top-Down -> 4:4:4 Q100 ... Done. 2.468109 ms
Result in test_enc_RGB_TD_444_Q100.jpg
JPEG -> RGB Top-Down 2/1 ... OpenCL error in err_code = jocl_clEnqueueNDRangeKe
rnel(jocl_cl_get_command_queue(), jocldec_cl_kernel_use, 1, &offset_input, &glob
al_ws, &local_ws, 0, NULL, NULL), Line 1145 in file ../jocldec.c
Error:CL_OUT_OF_RESOURCES

Comp. r at 0,0 should be 255, not 0

I am running CentOS 5.9 64-bit, but I don't see how that would make any difference. The fact that I'm running a Quadro board might be the reason why we're observing different behavior.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Peixuan Zhang - 2013-08-08
  
  Yes, maybe the Quadro borad is the reason, but to get a Quadro card is difficult for me now, so today I will try to test CentOS 5.9 with GTX680, to make sure the reason is not the OS.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Peixuan Zhang - 2013-08-09
  
  I have tried on CentOS 5.9 x64 version with GTX680. The following is the step:
  1. Install the OS.
  2. Install the NVidia driver： NVIDIA-Linux-x86_64-319.32.run.
  3. Build it, but it failed when using SIMD, so I use --with-opencl-dec --without-simd.
  4. Test it, it works and is using OpenCL.
  
  I didn't install any other things.
  
  I don't think it is because I disable SIMD, but could you tell me how to fix it?
  
  checking whether the assembler (nasm -felf64 -DELF -Dx86_64) works... no
  configure: error: installation or configuration problem: assembler cannot create object files.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DRC - 2013-08-09

Probably your version of NASM is too old. Make sure it is 2.0 or later.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Peixuan Zhang - 2013-08-09
  
  Thanks a lot, and I have fixed it by using nasm-2.0.0.
  However, I still can't reproduce the bugs that DRC met.
  
  Currently we are only two differences in environment:
  1. The Graphics are different.
  2. My OS is only NV Drivers, DRC's may install other applications.
  
  Since I can't get a Quadro Card, does anyone can help me to test the code on your own machine? Thanks a lot!
  
  I also tested it on GTX460 on Windows, I catched the same bug, but this machine still don't have a clean OS installed. So I can't know what's the problem is.
  
  DRC,
  Could you help me to test this version on your machine using Windows or other Linux system? Thanks a lot.
  
  On CentOS:
  
  [root@test libjpeg-turbo-opencl]# ./djpeg -bmp -outfile re2.bmp h1v2_320x240.jpg
  Platform:
  profile: FULL_PROFILE
  version: OpenCL 1.1 CUDA 4.2.1
  name: NVIDIA CUDA
  vendor: NVIDIA Corporation
  extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll
  Device:
  Name: GeForce GTX 680
  OpenCL is enabled.
  Compiling OpenCL code successfully.
  Create kernel OK!
  [root@test libjpeg-turbo-opencl]#
  
  On Windows,
  
  H:\standalone\0809\Release>djpeg -bmp -outfile re.bmp 1.jpg
  Platform:
  profile: FULL_PROFILE
  version: OpenCL 1.1 CUDA 4.2.1
  name: NVIDIA CUDA
  vendor: NVIDIA Corporation
  extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d
  9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_com
  piler_options cl_nv_device_attribute_query cl_nv_pragma_unroll
  Device:
  Name: GeForce GTX 460
  OpenCL is enabled.
  Create kernel OK!
  OpenCL error in err_code = jocl_clFinish(jocl_cl_get_command_queue()), Line 1034
  in file ....\libjpeg_newest\jocldec.c
  Error:CL_INVALID_COMMAND_QUEUE
  OpenCL error in err_code = jocl_clEnqueueNDRangeKernel(jocl_cl_get_command_queue
  (), jocldec_cl_kernel_use, 1, &offset_input, &global_ws, &local_ws, 0, NULL, NUL
  L), Line 1146 in file ....\libjpeg_newest\jocldec.c
  Error:CL_MEM_OBJECT_ALLOCATION_FAILURE
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DRC - 2013-08-09

It appears that you are seeing the same issue on your Windows machine that I am seeing on my Linux machine, so I would suggest that you diagnose that issue. It seems to me that, if you fixed that issue, there's a good chance that it would fix the issue I'm seeing. I think this is probably a driver-specific thing, not an O/S-specific thing.

In general, I should also comment that the error checking in your code needs work. There are several OpenCL commands that don't have error checks at all (one of those was leading to the initial segfault I was seeing, because jocl_cl_compile_and_build() was returning NULL.) I would expect that the code would check for every possible failure condition during initialization and avoid using OpenCL if anything fails, then after initialization, if any OpenCL failures are encountered, those should be fatal.

There are more things I need to do to the code to generally improve its elegance, but I first need to verify that the performance of it is interesting before I can spend any significant time on it at all. Thus, I really don't have the cycles to diagnose issues like these yet.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Peixuan Zhang - 2013-08-13
  
  We are still solving this issue, and I will give you reply ASAP.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Peixuan Zhang - 2013-08-13
  
  A new patch, the issue has been fixed on GTX460. Please help me to test it.
  And if it works, please add this patch into branches, because it's a big patch, including a lot of changes.
  
  Thanks a lot!
  
  patch-opencl-20130813.patch
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

OpenCL decoding support

SIMD-accelerated libjpeg-compatible JPEG codec library

Searches

Help

#40 OpenCL decoding support

Related

Discussion

0 decode_mcu_slow (cinfo=0x105f4d28, MCU_data=0x1080e9b8) at ../jdhuff.c:591