Thread: [Lcms-user] LittleCMS Performance and Non-Intel Processors
An ICC-based CMM for color management
Brought to you by:
mm2
|
From: Noel C. <NCa...@Pr...> - 2017-07-30 05:35:41
|
Hi folks of the Little CMS mailing list,

I'm just curious: Given that PCs and Macs are based on Intel chipsets nowadays... Do we have a feel for how much Little CMS is being used on other processor architectures?

I ask because I'm considering submitting some optimizations I've made to the interpolation routines that speed things up for Intel-based systems. They're not Intel-specific, but are just optimizations of the source code that pick up 5% to 20% in speed in the x64 testbed tests.

Git code as of yesterday, as measured on my dual-Xeon Westmere workstation:

P E R F O R M A N C E   T E S T S
=================================

16 bits on CLUT profiles                  : 34.4828 MPixel/sec.
8 bits on CLUT profiles                   : 32.3232 MPixel/sec.
8 bits on Matrix-Shaper profiles          : 66.6667 MPixel/sec.
8 bits on SAME Matrix-Shaper profiles     : 120.301 MPixel/sec.
8 bits on Matrix-Shaper profiles (AbsCol) : 66.6667 MPixel/sec.
16 bits on Matrix-Shaper profiles         : 34.4828 MPixel/sec.
16 bits on SAME Matrix-Shaper profiles    : 137.931 MPixel/sec.
16 bits on Matrix-Shaper profiles (AbsCol): 34.4828 MPixel/sec.
8 bits on curves                          : 88.8889 MPixel/sec.
16 bits on curves                         : 91.4286 MPixel/sec.
8 bits on CMYK profiles                   : 11.9314 MPixel/sec.
16 bits on CMYK profiles                  : 11.976 MPixel/sec.
8 bits on gray-to gray                    : 104.575 MPixel/sec.
8 bits on gray-to-lab gray                : 105.263 MPixel/sec.
8 bits on SAME gray-to-gray               : 105.263 MPixel/sec.

My current code:

P E R F O R M A N C E   T E S T S
=================================

16 bits on CLUT profiles                  : 38.5542 MPixel/sec.
8 bits on CLUT profiles                   : 33.0579 MPixel/sec.
8 bits on Matrix-Shaper profiles          : 66.1157 MPixel/sec.
8 bits on SAME Matrix-Shaper profiles     : 121.212 MPixel/sec.
8 bits on Matrix-Shaper profiles (AbsCol) : 66.9456 MPixel/sec.
16 bits on Matrix-Shaper profiles         : 38.5542 MPixel/sec.
16 bits on SAME Matrix-Shaper profiles    : 142.857 MPixel/sec.
16 bits on Matrix-Shaper profiles (AbsCol): 38.5542 MPixel/sec.
8 bits on curves                          : 89.3855 MPixel/sec.
16 bits on curves                         : 94.1176 MPixel/sec.
8 bits on CMYK profiles                   : 14.4796 MPixel/sec.
16 bits on CMYK profiles                  : 14.5587 MPixel/sec.
8 bits on gray-to gray                    : 125 MPixel/sec.
8 bits on gray-to-lab gray                : 124.031 MPixel/sec.
8 bits on SAME gray-to-gray               : 124.031 MPixel/sec.

These translate to real product gains... For example, with a 100 megapixel 32 bit grayscale image, our heavily multi-threaded transform time dropped from 1485 milliseconds to 968 milliseconds.

Source rearrangement notwithstanding, if one were to create routines that would make use of the vector instructions virtually every Intel system already has (e.g., SSE2), the results could be markedly better still. I've been through converting all my own software to use vectors, and the results were well worth the effort. We now run faster with 32 bit floating point than we used to with integer formats.

There is also the further possibility of extending the Little CMS algorithms into the GPU for huge gains. I suppose the trouble with that would be figuring out what subsystem to use (OpenCL programs... OpenGL shaders... Vulkan? Others?)

-Noel
|
|
From: Lorenzo R. <lo...@ma...> - 2017-07-30 14:43:59
|
Noel,

Did you try Intel C++ Compiler? (It's free for open source projects on Linux.) In some programs I got 2:1 performance improvements.

Best Regards,
Lorenzo

> On 30 Jul 2017, at 01:20, Noel Carboni <NCa...@Pr...> wrote:
>
> Hi folks of the Little CMS mailing list,
>
> I'm just curious: Given that PCs and Macs are based on Intel chipsets nowadays...
>
> Do we have a feel for how much Little CMS is being used on other processor architectures?
>
> [...]
|
|
From: Noel C. <NCa...@Pr...> - 2017-07-30 15:06:02
|
Hi Lorenzo,

> Did you try Intel C++ Compiler? (It's free for open source projects on Linux.)
> In some programs I got 2:1 performance improvements.

No, I haven't run that one. At the moment we have really only one practical choice here: the Microsoft Visual Studio 2017 C++ compiler for Windows, though we may be looking into alternatives in the future. I have heard good things about Intel's compiler elsewhere as well. Thanks for the data point.

For what it's worth, looking over Microsoft compiler-generated code, in some of the complex routines like multi-input/output interpolation the compiler is starved for registers and has to resort to storing intermediate compute products in RAM. Just small things like changing the way loops are managed to free up a register here and there make a noticeable difference in throughput, especially with, e.g., the 32 bit floating point routines, where each channel value is 4 bytes and the process is already quite RAM-bound when multi-threaded.

It's possible that using SSE instructions to do things like 4 calculations simultaneously could speed things up further. I'm looking into that now.

-Noel
|
|
From: Boudewijn R. <bo...@va...> - 2017-07-30 16:05:36
|
On Sun, 30 Jul 2017, Noel Carboni wrote:

> Source rearrangement notwithstanding, if one were to create routines
> that would make use of the vector instructions virtually every Intel
> system already has (e.g., SSE2) the results could be markedly better
> still. I've been through converting all my own software to use vectors
> and the results were well worth the effort. We now run faster with 32
> bit floating point than we used to with integer formats.

We use the Vc library for vectorization, and it's pretty amazing. Of course, it's C++, not C, but still... https://github.com/VcDevel

--
Boudewijn Rempt | http://www.krita.org, http://www.valdyas.org
|
|
From: Greg T. <gd...@le...> - 2017-07-31 01:12:25
Attachments:
signature.asc
|
"Noel Carboni" <NCa...@Pr...> writes:

> I'm just curious: Given that PCs and Macs are based on Intel chipsets
> nowadays...
>
> Do we have a feel for how much Little CMS is being used on other
> processor architectures?

I don't know, but I would guess that at least ARM is an important target.

Also, if you're looking for code that generates warnings, I would recommend building with clang, in addition to gcc and whatever MS compiler you have. It seems like each new compiler results in more warnings.
|
|
From: Lorenzo R. <lo...@ma...> - 2017-07-31 01:19:41
|
Hi Noel,

There are a lot of vector libraries that wrap the use of SSE instructions. I used one of those libraries in a project a long time ago, but I couldn't find it again. The performance improvement was great, though. Here's an example I found: http://fastcpp.blogspot.com.br/2011/12/simple-vector3-class-with-sse-support.html

Best Regards,
Lorenzo

> On 30 Jul 2017, at 12:05, Noel Carboni <NCa...@Pr...> wrote:
>
> Hi Lorenzo,
>
> [...]
|
|
From: Bob F. <bfr...@si...> - 2017-07-31 19:45:13
|
On Sun, 30 Jul 2017, Lorenzo Ridolfi wrote:

> There’s a lot of Vector Libraries that wrap the usage of SSE
> instructions. I used one of that libraries in a project long time a
> go and I could not find it. But the performance improvement was
> great.

As an lcms user, I would definitely prefer that lcms have no external dependencies.

The ideal situation is if the C code is written in such a way that modern optimizing compilers do the right thing by default and produce good code for any CPU. This should mean that the compilers automatically produce SSE code where they should, if it is enabled.

Bob
--
Bob Friesenhahn
bfr...@si..., http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
|
|
From: Lorenzo R. <lo...@ma...> - 2017-07-31 19:56:02
|
Hi Bob,

I totally agree with you. Any architecture-dependent code should always be optional. The vector library I used came in two versions, a generic one and an SSE version. The use of the SSE version, IMHO, must be explicitly requested by a flag during the build.

Best Regards,
Lorenzo

> On 31 Jul 2017, at 16:45, Bob Friesenhahn <bfr...@si...> wrote:
>
> As a lcms user, I would definitely prefer if lcms has no external dependencies.
>
> The ideal situation is if the C code is written in such a way that modern optimizing compilers do the right thing by default and produce good code for any CPU.
>
> [...]
|
|
From: Marti M. <mar...@li...> - 2017-07-31 20:02:13
|
A long time ago, the code had some inline assembly. I removed every trace of assembly about 15 years ago and took on the requirement of pure C99 code forever. It was a good idea: it worked great and has survived aging. I think optimizations have to be done by arranging the C code to help the compiler; assembly is anything but helpful in those cases.

Regards,
Marti

> On Jul 31, 2017 21:45, Bob Friesenhahn <bfr...@si...> wrote:
>
> The ideal situation is if the C code is written in such a way that
> modern optimizing compilers do the right thing by default and produce
> good code for any CPU.
>
> [...]
|
|
From: Noel C. <NCa...@Pr...> - 2017-07-31 21:13:18
|
> The ideal situation is if the C code is written in such a way that
> modern optimizing compilers do the right thing by default and produce
> good code for any CPU. This should mean that the compilers automatically
> produce SSE code where they should if it is enabled.

Yes, a good thought. Unfortunately the compilers are not NEARLY there yet with regard to using SSE instructions in the best way possible. And I'm not sure they're really going to get there... It would be difficult for C/C++, where the natural value is a single int, to take good advantage of the things SSE has to offer without very specific design considerations in the source code: primarily carrying multiple data items per register and parallelizing calculations on that data.

The challenge is to design an application to take advantage of being able to do multiple calculations at once, and to carry related chunks of data in a big (e.g., __m128) register. Pixel manipulations CAN map well into this sort of thing... My Photoshop plug-ins, for example, now use SSE2 throughout, and everything's stored chunky and in floating point (e.g., we put one RGBA pixel in an __m128). The floating point gives advantages in terms of overflow/underflow/loss-of-precision protection, and the parallel processing offsets the disadvantages of the increased memory bandwidth utilization for the longer values. But of course the source code is less maintainable and less portable because of the SSE usage; it's less like C and more like embedded assembly (we use the Intel intrinsics such as _mm_mul_ps). We went into this with our eyes open, and I'm glad we made the decisions we did.

The interesting thing is that while we were refactoring our plug-ins, the use of SSE really didn't pay off in performance until we had the code embracing the concepts throughout. Everything got faster all at once near the end of the project. I can say from that experience that the one thing you absolutely DON'T want to have is HALF an SSE implementation... Getting things into and out of XMM registers (i.e., during conversions) is inefficient. The application has to "think in parallel" overall and use a format (e.g., floating point) throughout that matches the SSE capabilities to work well.

By the way, as an exercise to reinforce the above, I re-coded the LittleCMS floating point trilinear interpolation algorithm using SSE2 intrinsics. It ended up delivering the same performance as the C-coded version. Why not better? Because the table-based design of the Little CMS library doesn't suit parallel calculations, so there were only limited things I could do. Let me be clear, I'm not suggesting redesigning Little CMS soup to nuts. Just throwing out a few thoughts and ideas. :)

Regarding Marti's comment:

> I think optimizations have to be done by arranging C code to help
> compiler

I agree, and I'm finding by doing so and testing the results that there is still some additional performance to be had from rearranging the code, e.g., placing less broad requirements on the compiler to keep intermediate data live across a long sequence of instructions (that is, reducing register starvation).

By the way, the performance appeared to have dropped a fair bit between release 2.8 and what I downloaded from Git just the other day. I think I've got it all back and a little more at this point.

-Noel
|
|
From: Graeme G. <gr...@ar...> - 2017-07-31 23:13:24
|
Noel Carboni wrote:

> By the way, as an exercise to reinforce the above, I re-coded the
> LittleCMS floating point trilinear interpolation algorithm using SSE2
> intrinsics. It ended up delivering the same performance as the C-coded
> version. Why not better? Because the table-based design of the Little
> CMS library doesn't suit parallel calculations so there were only limited
> things I could do.

Simplex interpolation is generally faster, since it touches fewer node points, something that increases in importance with higher input dimensions. But simplex isn't terribly parallelizable, since it involves a sort. Once the weighting of each node is known (whether via simplex or multi-linear), parallelizing the output-dimension calculations is a good speedup, though.

[ How much of a win vector CPU instructions would be is not something I've ever had time to explore in my color engine, and I've been content to stick to portable C code, while wringing what I can out of it. Exploiting GPU texture lookup hardware seems far simpler to code for, for maximum overall speed. ]

Cheers,
Graeme Gill.
|