Thread: [Lcms-user] Support of PreMultiplied RGBA image format...
An ICC-based CMM for color management
From: Sebastien L. <sl...@po...> - 2012-12-18 16:05:48
Hi,

Back again :-)

This weekend I implemented some basic premultiplied alpha support.

I agree with Bob and John that premultiplied alpha doesn't "make a lot of sense with a color management system since color perception varies with intensity and the colorspace may not be linear across the color channels". But such color management does make sense for images with binary transparency (fully opaque or fully transparent) where only a few pixels have intermediate transparency (usually the anti-aliased edges).

LittleCMS 2.4 cannot handle this correctly, and the only way to manage such images was to:
- turn them into unpremultiplied alpha
- run cmsDoTransform (unoptimized, because even fully transparent pixels are processed)
- turn them back into premultiplied alpha

It was way too slow, so I coded some limited support for premultiplied alpha. It works if:
- the input format is TYPE_BGRA_8
- the output format is TYPE_BGRA_8 too
- (there is some internal matrix optimization, because my code (hack) uses MatShaper8Data)

A new flag has been added: PREMUL_SH. It works only with TYPE_BGRA_8.

But the code, right now, looks like a quick and dirty hack :-( First, because I'm not used to the internals of this library. And second, because it doesn't respect the current pipeline: premultiplied alpha means extra computation and I need very fast processing, so I had to merge the InputFormatters, xform and OutputFormatters into one routine.

In LittleCMS 2.4, ~100 CPU cycles were needed to process one BGRA pixel. The optimized code needs ~28 CPU cycles (unlinked alpha), and the premultiplied alpha optimized code requires ~65 CPU cycles in the worst case (no pixel with an alpha of 0 or 1.0).

I have done some benchmarking too, using an ECI tagged image + 2 alpha masks + a random screen profile. The goal was to process an ECI image and convert it to the screen color space. Right now, the code is fast enough for my needs.
Of course, if someone can help me integrate this code the clean way, I would be happy to contribute seriously. In the future, some real optimization could be done (there is no SSE or OpenCL vectorization code yet).

The modified code (derived from LittleCMS 2.4) and a test program can be found here:
http://sebastienleon.com/info/littleCMS/littleCMS_PreMulAlphaHack.zip
(I give all copyrights to Marti.)

Qt 4.x is required to build the test program. It works on Mac/Linux/Windows: run "qmake && make" and then "./test".

Best regards
Sebastien Léon

-----------------------------------------
LittleCMS Test/Hacks & simple benchmarking...
Init OK...
******* Start TEST : LittleCMS 2.4 Legacy *******
(test 0 lasts 683725 KCycles).
(test 1 lasts 686102 KCycles).
(test 2 lasts 686626 KCycles).
(test 3 lasts 683895 KCycles).
Average Test lasts 685087 KCycles.
Average CPU Cycle per pixel = 99.85.
-------------------------------------------
******* Start TEST : LittleCMS 2.4 + Unroll3BytesSkip1SwapExtFirst *******
(test 0 lasts 406031 KCycles).
(test 1 lasts 405165 KCycles).
(test 2 lasts 406182 KCycles).
(test 3 lasts 404195 KCycles).
Average Test lasts 405393 KCycles.
Average CPU Cycle per pixel = 59.09.
-------------------------------------------
******* Start TEST : RGBAEngineWithAlphaIgnored *******
(test 0 lasts 189697 KCycles).
(test 1 lasts 188906 KCycles).
(test 2 lasts 191982 KCycles).
(test 3 lasts 190995 KCycles).
Average Test lasts 190395 KCycles.
Average CPU Cycle per pixel = 27.75.
-------------------------------------------
******* Start TEST : PreMulEngineWithNoAlpha *******
(test 0 lasts 209824 KCycles).
(test 1 lasts 208821 KCycles).
(test 2 lasts 208711 KCycles).
(test 3 lasts 207210 KCycles).
Average Test lasts 208641 KCycles.
Average CPU Cycle per pixel = 30.41.
-------------------------------------------
******* Start TEST : PreMulEngineWithPreMulAlpha_WorstCase *******
(test 0 lasts 444388 KCycles).
(test 1 lasts 447862 KCycles).
(test 2 lasts 443876 KCycles).
(test 3 lasts 439628 KCycles).
Average Test lasts 443939 KCycles.
Average CPU Cycle per pixel = 64.70.
-------------------------------------------
******* Start TEST : PreMulEngineWithPreMulAlpha_SpriteCase *******
(test 0 lasts 132157 KCycles).
(test 1 lasts 130319 KCycles).
(test 2 lasts 131186 KCycles).
(test 3 lasts 150089 KCycles).
Average Test lasts 135938 KCycles.
Average CPU Cycle per pixel = 19.81.
-------------------------------------------
Work's done...
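A minimal sketch of the round trip described in the message above (unpremultiply, transform, premultiply again), with the two shortcuts the message hints at: fully transparent pixels are skipped and fully opaque pixels bypass the alpha arithmetic. This is not Sebastien's code and it does not call LittleCMS. The names `color_fn`, `demo_invert`, `unpremul8`, `premul8` and `transform_premul_bgra8` are hypothetical, the rounding rule is one common choice assumed here, and `demo_invert` is only a stand-in for a real color conversion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-pixel color conversion working on straight
   (non-premultiplied) 8-bit BGR values. */
typedef void (*color_fn)(const uint8_t in[3], uint8_t out[3]);

/* Stand-in conversion for demonstration only: inverts each channel. */
void demo_invert(const uint8_t in[3], uint8_t out[3])
{
    for (int k = 0; k < 3; ++k)
        out[k] = (uint8_t)(255 - in[k]);
}

/* Recover a straight channel value from a premultiplied one (a must be > 0). */
uint8_t unpremul8(uint8_t c, uint8_t a)
{
    unsigned v = (c * 255u + a / 2u) / a;   /* rounded division */
    return (uint8_t)(v > 255u ? 255u : v);  /* clamp invalid input where c > a */
}

/* Premultiply a straight channel value by alpha, with rounding. */
uint8_t premul8(uint8_t c, uint8_t a)
{
    return (uint8_t)((c * (unsigned)a + 127u) / 255u);
}

/* Convert npixels premultiplied BGRA8 pixels from src to dst. */
void transform_premul_bgra8(const uint8_t *src, uint8_t *dst,
                            size_t npixels, color_fn xform)
{
    for (size_t i = 0; i < npixels; ++i, src += 4, dst += 4) {
        uint8_t a = src[3];
        uint8_t straight[3], converted[3];

        if (a == 0) {                 /* fully transparent: nothing to convert */
            dst[0] = dst[1] = dst[2] = dst[3] = 0;
            continue;
        }
        for (int k = 0; k < 3; ++k)   /* opaque pixels need no division */
            straight[k] = (a == 255) ? src[k] : unpremul8(src[k], a);

        xform(straight, converted);

        for (int k = 0; k < 3; ++k)   /* opaque pixels need no multiplication */
            dst[k] = (a == 255) ? converted[k] : premul8(converted[k], a);
        dst[3] = a;                   /* alpha is passed through unchanged */
    }
}
```

In a sprite-like image most pixels take one of the two cheap branches, which is consistent with the SpriteCase run in the log being faster than the WorstCase run.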
From: Graeme G. <gr...@ar...> - 2012-12-18 22:43:03
Sebastien Leon wrote:
> Back again :-) This WE, I implemented some basic premultiplied alpha
> support.

Something that occurred to me rather a long time ago while reading one of Jim Blinn's articles in CG&A is that (at least in principle) the multi-dimensional nature of color lookups could be used to handle various matting operations.

For instance, in theory the RGBA case could be handled as a normal color lookup using a 4D color table, which would take care of the alpha de-multiplying & re-multiplying using the standard mechanisms.

In practice it may not be a good use of an extra dimension, since the tables blow up in size, or the resolution and speed drop rapidly, as you increase the number of input dimensions, and vector instructions are rather fast in modern CPUs.

Graeme Gill.
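To put a number on how the tables blow up, here is a small sketch that counts the values stored in a uniformly sampled lookup table. The function name `clut_entries` is hypothetical, and the grid sizes used below (33 and 17 points per axis) are illustrative assumptions, not figures from this thread.

```c
#include <stdint.h>

/* Number of stored values in a lookup table sampled with grid_points
   nodes along each of in_dims input axes, holding out_channels values
   per node: out_channels * grid_points^in_dims. */
uint64_t clut_entries(unsigned grid_points, unsigned in_dims,
                      unsigned out_channels)
{
    uint64_t n = out_channels;
    for (unsigned d = 0; d < in_dims; ++d)
        n *= grid_points;
    return n;
}
```

With 33 points per axis, a 3D table with 3 outputs holds 107811 values, while a 4D table with 4 outputs holds 4743684, which is 44 times more. Dropping the 4D table to 17 points per axis brings it down to 334084 values, at the cost of resolution.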
From: Sebastien L. <sl...@po...> - 2012-12-19 15:09:35
Hi Graeme,

> in theory the RGBA case could be handled as a normal color lookup using a 4D color table

Yes, you are right. That could be possible (even if it is beyond my skills, as it would mean digging deeper into the LittleCMS optimization engine), but as you pointed out, in practice it may not be a good idea. I'm thinking about the accuracy loss caused by alpha normalization (only in 8 bits/channel). For example, for an opacity of 20%, alpha would be 51 and all channels would be normalized into the [0..51] range. In such a discretized space, a conversion from sRGB to some screen color space could even be the identity (or would be far from accurate)... Having a huge 4D LUT and not using a big part of it would be frustrating ;-)

> and vector instructions are rather fast in modern CPU's.

My tests show that this extra work is as costly as the matrix calculation... So it still takes some time, but if I add an SSE version of this code, I'm pretty sure it won't be a problem anymore. Boudewijn Rempt recently gave me a link to a vectorization library (http://code.compeng.uni-frankfurt.de/projects/vc) and I doubt I can resist for long before I try vectorizing the premultiplied code ;-)

Best regards
Sebastien
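The accuracy loss mentioned above can be made concrete by counting how many distinct straight-alpha values an 8-bit premultiplied channel can represent at a given alpha. This is only an illustrative sketch: the function name `distinct_levels_after_unpremul` is hypothetical and the rounding rule is an assumption.

```c
#include <stdint.h>

/* Count the distinct 8-bit straight values reachable by unpremultiplying
   every valid premultiplied value c in [0..alpha]. */
unsigned distinct_levels_after_unpremul(uint8_t alpha)
{
    uint8_t seen[256] = {0};
    unsigned count = 0;

    if (alpha == 0)
        return 0;                       /* color is undefined at alpha 0 */

    for (unsigned c = 0; c <= alpha; ++c) {
        unsigned v = (c * 255u + alpha / 2u) / alpha;  /* rounded division */
        if (!seen[v]) {
            seen[v] = 1;
            ++count;
        }
    }
    return count;
}
```

At alpha 51 (20% opacity) only 52 levels survive per channel, spaced 5 apart, which is fewer than 6 bits of precision; at alpha 255 all 256 levels are available.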