Thread: [Lcms-user] LittleCMS Performance and Non-Intel Processors
An ICC-based CMM for color management
Brought to you by:
mm2
|
From: Noel C. <NCa...@Pr...> - 2017-07-30 05:35:41
|
Hi folks of the Little CMS mailing list,

I'm just curious: Given that PCs and Macs are based on Intel chipsets nowadays... Do we have a feel for how much Little CMS is being used on other processor architectures?

I ask because I'm considering submitting some optimizations I've made to the interpolation routines that speed things up for Intel-based systems. They're not Intel-specific, but are just optimizations of the source code that pick up 5% to 20% in speed in the x64 testbed tests.

Git code as of yesterday, as measured on my dual-Xeon Westmere workstation:

P E R F O R M A N C E   T E S T S
=================================

16 bits on CLUT profiles                  : 34.4828 MPixel/sec.
8 bits on CLUT profiles                   : 32.3232 MPixel/sec.
8 bits on Matrix-Shaper profiles          : 66.6667 MPixel/sec.
8 bits on SAME Matrix-Shaper profiles     : 120.301 MPixel/sec.
8 bits on Matrix-Shaper profiles (AbsCol) : 66.6667 MPixel/sec.
16 bits on Matrix-Shaper profiles         : 34.4828 MPixel/sec.
16 bits on SAME Matrix-Shaper profiles    : 137.931 MPixel/sec.
16 bits on Matrix-Shaper profiles (AbsCol): 34.4828 MPixel/sec.
8 bits on curves                          : 88.8889 MPixel/sec.
16 bits on curves                         : 91.4286 MPixel/sec.
8 bits on CMYK profiles                   : 11.9314 MPixel/sec.
16 bits on CMYK profiles                  : 11.976 MPixel/sec.
8 bits on gray-to gray                    : 104.575 MPixel/sec.
8 bits on gray-to-lab gray                : 105.263 MPixel/sec.
8 bits on SAME gray-to-gray               : 105.263 MPixel/sec.

My current code:

P E R F O R M A N C E   T E S T S
=================================

16 bits on CLUT profiles                  : 38.5542 MPixel/sec.
8 bits on CLUT profiles                   : 33.0579 MPixel/sec.
8 bits on Matrix-Shaper profiles          : 66.1157 MPixel/sec.
8 bits on SAME Matrix-Shaper profiles     : 121.212 MPixel/sec.
8 bits on Matrix-Shaper profiles (AbsCol) : 66.9456 MPixel/sec.
16 bits on Matrix-Shaper profiles         : 38.5542 MPixel/sec.
16 bits on SAME Matrix-Shaper profiles    : 142.857 MPixel/sec.
16 bits on Matrix-Shaper profiles (AbsCol): 38.5542 MPixel/sec.
8 bits on curves                          : 89.3855 MPixel/sec.
16 bits on curves                         : 94.1176 MPixel/sec.
8 bits on CMYK profiles                   : 14.4796 MPixel/sec.
16 bits on CMYK profiles                  : 14.5587 MPixel/sec.
8 bits on gray-to gray                    : 125 MPixel/sec.
8 bits on gray-to-lab gray                : 124.031 MPixel/sec.
8 bits on SAME gray-to-gray               : 124.031 MPixel/sec.

These translate to real product gains... For example, with a 100 megapixel 32 bit grayscale image, our heavily multi-threaded transform time dropped from 1485 milliseconds to 968 milliseconds.

Source rearrangement notwithstanding, if one were to create routines that would make use of the vector instructions virtually every Intel system already has (e.g., SSE2), the results could be markedly better still. I've been through converting all my own software to use vectors, and the results were well worth the effort. We now run faster with 32 bit floating point than we used to with integer formats.

There is also the further possibility of extending the Little CMS algorithms into the GPU for huge gains. I suppose the trouble with that would be figuring out what subsystem to use (OpenCL programs... OpenGL shaders... Vulkan? Others?)

-Noel
|
|
From: Lorenzo R. <lo...@ma...> - 2017-07-30 14:43:59
|
Noel,

Did you try Intel C++ Compiler? (It's free for open source projects on Linux.) In some programs I got 2:1 performance improvements.

Best Regards,
Lorenzo

> On 30 Jul 2017, at 01:20, Noel Carboni <NCa...@Pr...> wrote:
>
> Hi folks of the Little CMS mailing list,
>
> I'm just curious: Given that PCs and Macs are based on Intel chipsets nowadays...
>
> Do we have a feel for how much Little CMS is being used on other processor architectures?
>
> [...]
|
|
From: Noel C. <NCa...@Pr...> - 2017-07-30 15:06:02
|
Hi Lorenzo,

> Did you try Intel C++ Compiler? (It's free for open source projects on Linux.)
> In some programs I got 2:1 performance improvements.

No, I haven't run that one. At the moment we have really only one practical choice here: the Microsoft Visual Studio 2017 C++ compiler for Windows, though we may be looking into alternatives in the future. I have heard good things about Intel's compiler elsewhere as well. Thanks for the data point.

For what it's worth, looking over Microsoft compiler-generated code, in some of the complex routines like multi-input/output interpolation the compiler is starved for registers and has to resort to storing intermediate compute products in RAM. Just small things like changing the way loops are managed to free up a register here and there make a noticeable difference in throughput, especially with, e.g., the 32 bit floating point routines, where each channel value is 4 bytes and the process is already quite RAM-bound when multi-threaded.

It's possible that using SSE instructions to do things like 4 calculations simultaneously could speed things up further. I'm looking into that now.

-Noel
|
|
From: Boudewijn R. <bo...@va...> - 2017-07-30 16:05:36
|
On Sun, 30 Jul 2017, Noel Carboni wrote:

> Source rearrangement notwithstanding, if one were to create routines
> that would make use of the vector instructions virtually every Intel
> system already has (e.g., SSE2) the results could be markedly better
> still. I've been through converting all my own software to use vectors
> and the results were well worth the effort. We now run faster with 32
> bit floating point than we used to with integer formats.

We use the Vc library for vectorization, and it's pretty amazing. Of course, it's C++, not C, but still... https://github.com/VcDevel

--
Boudewijn Rempt | http://www.krita.org, http://www.valdyas.org
|
|
From: Greg T. <gd...@le...> - 2017-07-31 01:12:25
Attachments:
signature.asc
|
"Noel Carboni" <NCa...@Pr...> writes:

> I'm just curious: Given that PCs and Macs are based on Intel chipsets
> nowadays...
>
> Do we have a feel for how much Little CMS is being used on other
> processor architectures?

I don't know, but I would guess that at least ARM is an important target.

Also, if you're looking for code that generates warnings, I would recommend building with clang, in addition to gcc and whatever MS compiler you have. It seems like each new compiler results in more warnings.
|
|
From: Lorenzo R. <lo...@ma...> - 2017-07-31 01:19:41
|
Hi Noel,

There are a lot of vector libraries that wrap the use of SSE instructions. I used one of those libraries in a project a long time ago, but I couldn't find it again. The performance improvement was great, though. Here's an example I found: http://fastcpp.blogspot.com.br/2011/12/simple-vector3-class-with-sse-support.html

Best Regards,
Lorenzo

> On 30 Jul 2017, at 12:05, Noel Carboni <NCa...@Pr...> wrote:
>
> Hi Lorenzo,
>
> [...]
|
|
From: Bob F. <bfr...@si...> - 2017-07-31 19:45:13
|
On Sun, 30 Jul 2017, Lorenzo Ridolfi wrote:

> There’s a lot of Vector Libraries that wrap the usage of SSE
> instructions. I used one of that libraries in a project long time a
> go and I could not find it. But the performance improvement was
> great.

As an lcms user, I would definitely prefer that lcms have no external dependencies.

The ideal situation is if the C code is written in such a way that modern optimizing compilers do the right thing by default and produce good code for any CPU. This should mean that the compilers automatically produce SSE code where they should, if it is enabled.

Bob
--
Bob Friesenhahn
bfr...@si..., http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
|
|
From: Lorenzo R. <lo...@ma...> - 2017-07-31 19:56:02
|
Hi Bob,

I totally agree with you. Any architecture-dependent code should always be optional. The vector library I used came in two versions, a generic one and an SSE version. The use of the SSE version, IMHO, must be explicitly requested by a flag during the build.

Best Regards,
Lorenzo

> On 31 Jul 2017, at 16:45, Bob Friesenhahn <bfr...@si...> wrote:
>
> As a lcms user, I would definitely prefer if lcms has no external dependencies.
>
> The ideal situation is if the C code is written in such a way that modern optimizing compilers do the right thing by default and produce good code for any CPU.
>
> [...]
|
|
From: Marti M. <mar...@li...> - 2017-07-31 20:02:13
|
A long time ago, the code had some inline assembly. I removed every trace of assembly about 15 years ago and took on the requirement of pure C99 code forever. It was a good idea: it worked great and has survived aging. I think optimizations have to be done by arranging the C code to help the compiler; assembly is anything but helpful in those cases.

Regards,
Marti

> On Jul 31, 2017 21:45, Bob Friesenhahn <bfr...@si...> wrote:
>
> The ideal situation is if the C code is written in such a way that
> modern optimizing compilers do the right thing by default and produce
> good code for any CPU.
>
> [...]
|
|
From: Noel C. <NCa...@Pr...> - 2017-07-31 21:13:18
|
> The ideal situation is if the C code is written in such a way that
> modern optimizing compilers do the right thing by default and produce
> good code for any CPU. This should mean that the compilers automatically
> produce SSE code where they should if it is enabled.

Yes, a good thought. Unfortunately the compilers are not NEARLY there yet with regard to using SSE instructions in the best way possible. And I'm not sure they're really going to get there... It would be difficult for C/C++, where the natural value is a single int, to take good advantage of the things SSE has to offer without very specific design considerations in the source code: primarily carrying multiple data items per register and parallelizing calculations on that data.

The challenge is to design an application to take advantage of being able to do multiple calculations at once, and to carry related chunks of data in a big (e.g., __m128) register. Pixel manipulations CAN map well into this sort of thing... My Photoshop plug-ins, for example, now use SSE2 throughout, and everything's stored chunky and in floating point (e.g., we put one RGBA pixel in an __m128). The floating point gives advantages in terms of overflow/underflow/loss-of-precision protection, and the parallel processing offsets the disadvantages of the increased memory bandwidth utilization for the longer values. But of course the source code is less maintainable and less portable because of the SSE usage; it's less like C and more like embedded assembly (we use the Intel intrinsics such as _mm_mul_ps). We went into this with our eyes open, and I'm glad we made the decisions we did.

The interesting thing is that while we were refactoring our plug-ins, the use of SSE really didn't pay off in performance until we had the code embracing the concepts throughout. Everything got faster all at once near the end of the project. I can say from that experience that the one thing you absolutely DON'T want to have is HALF an SSE implementation... Getting things into and out of XMM registers (i.e., during conversions) is inefficient. The application has to "think in parallel" overall and use a format (e.g., floating point) throughout that matches the SSE capabilities to work well.

By the way, as an exercise to reinforce the above, I re-coded the LittleCMS floating point trilinear interpolation algorithm using SSE2 intrinsics. It ended up delivering the same performance as the C-coded version. Why not better? Because the table-based design of the Little CMS library doesn't suit parallel calculations, so there were only limited things I could do. Let me be clear, I'm not suggesting redesigning Little CMS soup to nuts. Just throwing out a few thoughts and ideas. :)

Regarding Marti's comment:

> I think optimizations have to be done by arranging C code to help
> compiler

I agree, and I'm finding by doing so and testing the results that there is still some additional performance to be had from rearranging the code, e.g., placing less broad requirements on the compiler to keep intermediate data live across a long sequence of instructions (that is, reducing register starvation).

By the way, the performance appeared to have dropped a fair bit between release 2.8 and what I downloaded from Git just the other day. I think I've got it all back and a little more at this point.

-Noel
|
|
From: Graeme G. <gr...@ar...> - 2017-07-31 23:13:24
|
Noel Carboni wrote:

> By the way, as an exercise to reinforce the above, I re-coded the
> LittleCMS floating point trilinear interpolation algorithm using SSE2
> intrinsics. It ended up delivering the same performance as the C-coded
> version. Why not better? Because the table-based design of the Little
> CMS library doesn't suit parallel calculations so there were only limited
> things I could do.

Simplex interpolation is generally faster, since it touches fewer node points, something that increases in importance with higher input dimensions. But simplex isn't terribly parallelizable, since it involves a sort. Once the weighting of each node is known (whether via simplex or multi-linear), parallelizing the output-dimension calculations is a good speedup, though.

[ How much of a win vector CPU instructions would be is not something I've ever had time to explore in my color engine, and I've been content to stick to portable C code, while wringing what I can out of it. Exploiting GPU texture lookup hardware seems far simpler to code for, for maximum overall speed. ]

Cheers,
Graeme Gill.
|