jdcolor null_convert more expensive than rgb_rgb_convert path
At least on my system, the "fast-path" null color converter seems to do more harm than good.
Reading the JPEG with djpeg, I get this runtime and perf profile:
$ time ~/jpeg/bin/djpeg -outfile /dev/null test.jpg
real 0m2.951s
user 0m2.933s
sys 0m0.020s
$ perf record ~/jpeg/bin/djpeg -outfile /dev/null test.jpg
$ perf report -n --stdio --percent-limit 1
# Samples: 12K of event 'cycles'
# Event count (approx.): 8747033111
#
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. ....................................
#
32.89% 3937 djpeg libjpeg.so.62.1.0 [.] decode_mcu
27.71% 3329 djpeg libjpeg.so.62.1.0 [.] null_convert
19.23% 2311 djpeg libjpeg.so.62.1.0 [.] jsimd_idct_islow_sse2.column_end
8.79% 1057 djpeg libjpeg.so.62.1.0 [.] decompress_onepass
3.68% 438 djpeg libjpeg.so.62.1.0 [.] jsimd_idct_islow_sse2.columnDCT
3.19% 382 djpeg libjpeg.so.62.1.0 [.] jsimd_idct_islow_sse2
1.73% 209 djpeg libc-2.20.so [.] __memset_sse2
With this change:
--- jdcolor.c (revision 1519)
+++ jdcolor.c (working copy)
@@ -797,13 +797,7 @@
} else if (cinfo->jpeg_color_space == JCS_GRAYSCALE) {
cconvert->pub.color_convert = gray_rgb_convert;
} else if (cinfo->jpeg_color_space == JCS_RGB) {
- if (rgb_red[cinfo->out_color_space] == 0 &&
- rgb_green[cinfo->out_color_space] == 1 &&
- rgb_blue[cinfo->out_color_space] == 2 &&
- rgb_pixelsize[cinfo->out_color_space] == 3)
- cconvert->pub.color_convert = null_convert;
- else
- cconvert->pub.color_convert = rgb_rgb_convert;
+ cconvert->pub.color_convert = rgb_rgb_convert;
} else
ERREXIT(cinfo, JERR_CONVERSION_NOTIMPL);
break;
...djpeg produces identical output and improved performance:
$ time ~/jpeg/bin/djpeg -outfile /dev/null test.jpg
real 0m2.668s
user 0m2.643s
sys 0m0.026s
$ perf record ~/jpeg/bin/djpeg -outfile /dev/null test.jpg
$ perf report -n --stdio --percent-limit 1
# Samples: 10K of event 'cycles'
# Event count (approx.): 7842096154
#
# Overhead Samples Command Shared Object Symbol
# ........ ............ ....... ................. ....................................
#
36.18% 3898 djpeg libjpeg.so.62.1.0 [.] decode_mcu
21.43% 2317 djpeg libjpeg.so.62.1.0 [.] jsimd_idct_islow_sse2.column_end
19.28% 2087 djpeg libjpeg.so.62.1.0 [.] rgb_rgb_convert
9.63% 1044 djpeg libjpeg.so.62.1.0 [.] decompress_onepass
4.63% 496 djpeg libjpeg.so.62.1.0 [.] jsimd_idct_islow_sse2.columnDCT
3.93% 426 djpeg libjpeg.so.62.1.0 [.] jsimd_idct_islow_sse2
2.02% 217 djpeg libc-2.20.so [.] __memset_sse2
Confirmed that this is the case for 64-bit code, but the RGB-to-RGB conversion routine is slower with 32-bit code for some reason.
Thus, I adapted the basic algorithm used by the RGB-to-RGB conversion routine and created a simplified version of it for NULL conversion. This is now used as a "fast path" whenever the number of components is 3 or 4, and it proves to be significantly faster with both 64-bit and 32-bit code. The patch has been checked into trunk and branches/1.4.x. The overall speedup is about 5-20% for 64-bit compression, 10-30% for 64-bit decompression, 0-3% for 32-bit compression, and 3-12% for 32-bit decompression. These figures were measured by patching turbojpeg.c to generate RGB JPEGs instead of YCbCr JPEGs and running tjbench with these images:
http://www.libjpeg-turbo.org/About/Performance
Also confirmed that compressing/decompressing CMYK images is sped up by the same amount (assuming that the JPEG image uses the CMYK colorspace, not the YCCK colorspace).
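For readers who have not looked at jdcolor.c, the difference between a generic per-component NULL conversion and a 3-component fast path can be sketched roughly as follows. This is a minimal illustration, not the patch that was checked in: the type `sample_t` and the functions `null_convert_generic` and `null_convert_3` are hypothetical names, and the signatures are simplified to a single row of planar input. The generic version walks the output row once per component with strided writes, while the fast path makes a single pass and writes the output sequentially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for libjpeg's JSAMPLE (assumption: 8-bit samples). */
typedef unsigned char sample_t;

/* Generic sketch: one pass over the row per component.  Each pass writes
 * every ncomp-th byte of the interleaved output row (strided writes). */
static void null_convert_generic(const sample_t *const *planes, int ncomp,
                                 sample_t *out, size_t width)
{
  assert(planes != NULL && out != NULL && ncomp > 0);
  for (int ci = 0; ci < ncomp; ci++) {
    const sample_t *in = planes[ci];
    sample_t *o = out + ci;
    for (size_t col = 0; col < width; col++) {
      *o = in[col];
      o += ncomp;
    }
  }
}

/* Fast-path sketch for exactly 3 components: a single pass that reads one
 * sample from each plane and writes the output row sequentially. */
static void null_convert_3(const sample_t *const *planes, sample_t *out,
                           size_t width)
{
  assert(planes != NULL && out != NULL);
  const sample_t *in0 = planes[0];
  const sample_t *in1 = planes[1];
  const sample_t *in2 = planes[2];
  for (size_t col = 0; col < width; col++) {
    *out++ = in0[col];
    *out++ = in1[col];
    *out++ = in2[col];
  }
}
```

Both functions produce the same interleaved row. A 4-component variant (for example CMYK) would follow the same pattern with a fourth input pointer. Whether the single-pass form is actually faster depends on the compiler and architecture, so it should be measured rather than assumed.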