From: Richard H. <hug...@gm...> - 2013-05-13 21:01:40
|
Hi all,

I'm trying to make my transform go fast. I've got a 1920x1080 RGB image being transformed from sRGB to the display profile. I've got a quad core processor on my development box, no shaders or GPU, and I'm trying to do the transform as quickly as possible.

I figured the fastest way to do this would be to set up a threadpool with max_threads = 4. Then I have a few choices:

* pop a thread from the pool for every line of the image, creating local state with p_in, p_out, width and stride
* pop a thread from the pool for every n lines of the image, creating local state with p_in, p_out, width, stride and rows_to_process (where n = height / max_threads)

I figured 4 threads should be ~4x faster than using 1 thread (in the second case we should only have 4 threads, so not much overhead), but no matter the value of max_threads or 'n' I can only achieve a ~1.9x speed-up. I've tried with and without cmsFLAGS_NOCACHE. Any pointers very welcome.

Thanks,

Richard
|
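The second option above (one band of n rows per worker) can be sketched as follows. This is a minimal illustration, not code from the thread: `row_band`, `band_for_thread` and `band_offset` are hypothetical names, and the remainder handling is an added assumption so that no rows are lost when height is not a multiple of max_threads.

```c
#include <assert.h>
#include <stddef.h>

/* One contiguous band of image rows handed to a single worker. */
struct row_band {
    size_t first_row;   /* index of the first row in the band */
    size_t n_rows;      /* rows_to_process for this worker */
};

/* Split `height` rows into `n_threads` contiguous bands.
 * The first (height % n_threads) bands get one extra row, so no row is
 * dropped when height is not a multiple of n_threads. */
static struct row_band band_for_thread(size_t height, size_t n_threads,
                                       size_t idx)
{
    assert(n_threads > 0 && idx < n_threads);
    size_t base  = height / n_threads;
    size_t extra = height % n_threads;
    struct row_band b;
    b.first_row = idx * base + (idx < extra ? idx : extra);
    b.n_rows    = base + (idx < extra ? 1 : 0);
    return b;
}

/* Byte offset of a band's first pixel, given the row stride in bytes.
 * Add this to the image base pointers to get p_in and p_out. */
static size_t band_offset(struct row_band b, size_t stride)
{
    return b.first_row * stride;
}
```

Each worker would then run the transform over its own band, for example one cmsDoTransform() call per row with the row width as the pixel count, which also copes with a stride larger than the packed row size.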
From: Bob F. <bfr...@si...> - 2013-05-13 21:18:09
|
On Mon, 13 May 2013, Richard Hughes wrote:

> I figured 4 threads should be ~4x faster than using 1 thread (in the
> second case we should only have 4 threads, so not much overhead), but
> no matter the value of max_threads or 'n' I can only achieve a ~1.9x
> speed-up. I've tried with and without cmsFLAGS_NOCACHE. Any pointers
> very welcome.

What specific CPU are you using?

It would be good to share the ICC profile you are using for testing since it can make a difference. If lcms is only doing indexed lookups for the profile, then memory accesses may be the bottleneck rather than CPU.

Are you sharing the same transform (created by one thread), or are you creating an independent transform for each thread (ideally created by the thread which uses it)? Creating the transform can consume considerable time so it can be useful to parallelize (even though it "wastes" CPU) and it helps things work better given whatever NUMA characteristics pertain to your hardware.

Cache-line effects can be significant if there is accidental cache-line sharing (two cores sharing data in the same cache line). Padding structures to prevent false sharing or using an aligned memory allocator can help surmount such problems. Cache line issues can be very hardware/OS specific and mysterious.

Bob
--
Bob Friesenhahn
bfr...@si..., http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
|
From: Gerhard F. <nos...@gm...> - 2013-05-14 20:56:59
|
On 13.05.2013 23:17, Bob Friesenhahn wrote:

> Are you sharing the same transform (created by one thread), or are you creating an independent transform for each thread (ideally created by the thread which uses it)? Creating the transform can consume considerable time so it can be useful to parallelize (even though it "wastes" CPU) and it helps things work better given whatever NUMA characteristics pertain to your hardware.

Separate ones would likely allocate memory in a more NUMA-friendly way. For optimal performance the whole transform should fit into the L2 cache though (and then the backing memory access time is no longer so important).

For instance, a 33 grid point 16-bit RGB -> RGB device link LUT needs 33^3*2*3=215622 bytes, which basically fits into the 256k L2 cache of a Core i7 (but leaves little L2 cache for other stuff). I.e. if the grid resolution is kept at moderate levels, there is a chance that the transform can be kept in L2 cache. L3 cache access time is about twice the L2 cache access time, AFAIK, and memory access time is about twice (or more) the L3 cache access time. Keeping the grid resolution moderate is of course a trade-off against transform quality...

> Cache-line effects can be significant if there is accidental cache-line sharing (two cores sharing data in the same cache line).
> Padding structures to prevent false-sharing or using an aligned memory allocator can help surmount such problems. Cache line issues can be very hardware/OS specific and mysterious.

Profiling with e.g. OProfile or Intel VTune Amplifier, making use of the various performance counters of modern CPUs, is IMO essential in order to locate such issues/bottlenecks and to optimize the code (granted that there is still room for improvement).

Best Regards,
Gerhard
|
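The LUT size arithmetic above generalises to other grid resolutions. A small sketch, assuming a plain 3D lattice with no extra per-table overhead; `clut_bytes`, `fits_in_l2` and `L2_BYTES` are hypothetical names, and the 256 KiB figure is the per-core L2 size quoted in the message:

```c
#include <stddef.h>

/* Bytes needed for a CLUT with `grid` points along each input axis. */
static size_t clut_bytes(size_t grid, unsigned in_channels,
                         size_t out_channels, size_t bytes_per_sample)
{
    size_t nodes = 1;
    for (unsigned i = 0; i < in_channels; i++)
        nodes *= grid;              /* grid^in_channels lattice nodes */
    return nodes * out_channels * bytes_per_sample;
}

/* Assumed per-core L2 size from the message above: 256 KiB. */
#define L2_BYTES ((size_t)256 * 1024)

/* Non-zero if a table of `bytes` bytes could fit entirely in L2. */
static int fits_in_l2(size_t bytes)
{
    return bytes <= L2_BYTES;
}
```

With these definitions, clut_bytes(33, 3, 3, 2) reproduces the 215622 bytes quoted above, while a 65-point grid at the same depth is several times larger than the assumed L2 size.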
From: Robin W. <rob...@ar...> - 2013-05-14 00:00:48
|
On 13/05/2013 22:01, Richard Hughes wrote:

> I'm trying to make my transform go fast. I've got a 1920x1080 RGB
> image being transformed from sRGB to the display profile. I've got a
> quad core processor on my development box, no shaders or GPU, and I'm
> trying to do the transform as quickly as possible.

Before you dive into the complexities of multithreading etc, it would seem sensible to ensure you are getting the best possible performance out of the transform routine in the first place.

LCMS has various different transform routines built in; by using a cunning scheme, it can pick the appropriate one at runtime. If there happens to be one in its repertoire that exactly fits your needs, it can run considerably faster than if it has to use a generic one.

To speed up Ghostscript's use of lcms2, I wanted to optimise the transforms that we use as much as possible. But we use quite a lot of them, and I didn't want to have to hand-write optimised ones for all of the different cases. So I implemented a system that uses a chameleonic header; set a few options with #defines, include the header, and it makes the optimised transform function for you.

You can find this as part of the copy of lcms2 in the Ghostscript source, or on my git repository:

https://github.com/robinwatts/Little-CMS/tree/artifex

I offered this code back to Marti for inclusion in stock lcms2, so that people could easily add their own optimised transforms, but he wasn't keen on taking it as is.

He did however point out that I could recast the code slightly as a plugin for lcms 2. This gives the same benefits without polluting the internals of the library itself. I plan to do this at some point, but I have not got round to it yet.

In the meantime, if anyone has any use for the existing code, please feel free.

Robin
|
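The idea of generating specialised transform functions from the preprocessor can be illustrated in miniature. This is not Robin's cmsxform.h: it is a hedged sketch that uses a single macro instead of a re-includable header, and a simple per-channel 8-bit table instead of a real colour transform. `DEFINE_XFORM`, `fill_inverting_lut` and the generated function names are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Generate a transform specialised for fixed channel counts. Because
 * IN_CH and OUT_CH are compile-time constants inside each generated
 * function, the compiler is free to unroll the inner loop. Input
 * channels beyond OUT_CH (for example alpha) are skipped.
 * `lut` holds OUT_CH consecutive 256-entry tables. */
#define DEFINE_XFORM(NAME, IN_CH, OUT_CH)                                 \
    static void NAME(const uint8_t *in, uint8_t *out, size_t n_pixels,    \
                     const uint8_t *lut)                                  \
    {                                                                     \
        for (size_t i = 0; i < n_pixels; i++) {                           \
            for (size_t c = 0; c < (OUT_CH); c++)                         \
                out[c] = lut[c * 256 + in[c]];                            \
            in  += (IN_CH);                                               \
            out += (OUT_CH);                                              \
        }                                                                 \
    }

/* Two specialisations produced from the same template. */
DEFINE_XFORM(xform_rgb_to_rgb, 3, 3)
DEFINE_XFORM(xform_rgba_to_rgb, 4, 3)

/* Stand-in table for demonstration only: every channel is inverted. */
static void fill_inverting_lut(uint8_t *lut, size_t channels)
{
    for (size_t c = 0; c < channels; c++)
        for (size_t v = 0; v < 256; v++)
            lut[c * 256 + v] = (uint8_t)(255 - v);
}
```

The trade-off is the one raised later in the thread: the macro body is harder to read and debug than a hand-written function, but each new pixel layout costs one line rather than one more copy of the loop.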
From: Richard H. <hug...@gm...> - 2013-05-14 08:12:44
|
On 13 May 2013 22:17, Bob Friesenhahn <bfr...@si...> wrote:

> What specific CPU are you using?

I'm profiling on Intel i7 M620 @ 2.67GHz

> It would be good to share the ICC profile you are using for testing since it
> can make a difference. If lcms is only doing indexed lookups for the
> profile, then memory accesses may be the bottleneck rather than CPU.

I'm using this 2009 test profile I generated with ArgyllCMS:
https://github.com/hughsie/colord/blob/master/data/tests/ibm-t61.icc?raw=true

> Are you sharing the same transform (created by one thread), or are you
> creating an independent transform for each thread (ideally created by the
> thread which uses it)?

One transform shared between threads. I can try to create multiple transforms (and also in each thread) if you think that will help things.

> Cache-line effects can be significant if there is accidental cache-line
> sharing (two cores sharing data in the same cache line). Padding structures
> to prevent false-sharing or using an aligned memory allocator can help
> surmount such problems. Cache line issues can be very hardware/OS specific
> and mysterious.

Which structure is sensitive to the padding? Thanks!

Richard.
|
From: <jc...@gm...> - 2013-05-14 08:18:04
|
On 14 May 2013 09:12, Richard Hughes <hug...@gm...> wrote:

> On 13 May 2013 22:17, Bob Friesenhahn <bfr...@si...> wrote:
>> What specific CPU are you using?
>
> I'm profiling on Intel i7 M620 @ 2.67GHz

That's only a two-core CPU, I think, though each core can have two threads. You may or may not get much benefit from hyperthreading, depending on a great many factors.

John
|
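This matters when choosing max_threads, because the usual way of asking the OS for a CPU count reports logical processors. A minimal sketch, assuming a Unix-like system where `_SC_NPROCESSORS_ONLN` is available (it is widely supported but not required by POSIX); `logical_cpus` is a hypothetical helper name:

```c
#include <unistd.h>

/* Number of logical processors currently online, or 1 if unknown.
 * With hyperthreading enabled this counts hardware threads, not
 * physical cores, so a two-core part can report four. */
static long logical_cpus(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}
```

A pool sized from this value may therefore scale closer to the physical core count than to the number it returns, which would be consistent with a speed-up just under 2x on a two-core machine.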
From: Bob F. <bfr...@si...> - 2013-05-14 17:30:53
|
On Tue, 14 May 2013, Richard Hughes wrote:

>> Are you sharing the same transform (created by one thread), or are you
>> creating an independent transform for each thread (ideally created by the
>> thread which uses it)?
>
> One transform shared between threads. I can try to create multiple
> transforms (and also in each thread) if you think that will help
> things.

With lcms2 it is reasonable (and safe) to create a transform for each thread. Just take care to use the APIs correctly.

>> Cache-line effects can be significant if there is accidental cache-line
>> sharing (two cores sharing data in the same cache line). Padding structures
>> to prevent false-sharing or using an aligned memory allocator can help
>> surmount such problems. Cache line issues can be very hardware/OS specific
>> and mysterious.
>
> Which structure is sensitive to the padding? Thanks!

Only structures which are updated by thread loops are a concern. I am not thinking of any structure in particular. Memory allocators which try to be memory efficient can get you in trouble if the allocator allocates several structures within the same cache line (this happened to me).

The cache line sharing issue occurs when several allocations are in the same cache line. One thread updates its data, which "dirties" the cache line, so that another thread using that line must re-fetch it before it can write to it. The fetching of cache lines is very expensive so it is best to make sure that they do not become invalidated due to the writes of some other thread.

Bob
--
Bob Friesenhahn
bfr...@si..., http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
|
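Padding and aligned allocation of per-thread state, as described above, might look like this in C11. It is a sketch under stated assumptions: a 64-byte cache line (typical for x86, but hardware specific), and a hypothetical `worker_slot` structure standing in for whatever the thread loop actually updates.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 64-byte cache lines; check this for the target hardware. */
#define CACHE_LINE 64

/* Per-thread state written inside the worker loop. Aligning the first
 * member to CACHE_LINE also rounds sizeof() up to a multiple of it, so
 * adjacent array elements never share a cache line. */
struct worker_slot {
    alignas(CACHE_LINE) unsigned long rows_done;
    unsigned long pixels_done;
};

/* Allocate one slot per thread on a cache-line boundary (C11).
 * sizeof(struct worker_slot) is a multiple of CACHE_LINE, which is what
 * aligned_alloc() requires of the total size. Release with free(). */
static struct worker_slot *alloc_slots(size_t n_threads)
{
    return (struct worker_slot *)aligned_alloc(
        CACHE_LINE, n_threads * sizeof(struct worker_slot));
}
```

The cost is memory: each slot occupies a full cache line even though it holds only two counters, which is usually acceptable for a handful of worker threads.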
From: Richard H. <hug...@gm...> - 2013-05-14 08:20:46
|
On 14 May 2013 00:32, Robin Watts <rob...@ar...> wrote:

> Before you dive into the complexities of multithreading etc, it would
> seem sensible to ensure you are getting the best possible performance
> out of the transform routine in the first place.

Makes sense.

> LCMS has various different transform routines built in; by using a
> cunning scheme, it can pick the appropriate one at runtime. If there
> happens to be one in its repertoire that exactly fits your needs, it can
> run considerably faster than if it has to use a generic one.

Other than profiling, how do we know if it's chosen a built-in version rather than the generic version?

> To speed ghostscripts use of lcms2, I wanted to optimise the transforms
> that we use as much as possible. But we use quite a lot of them, and I
> didn't want to have to hand write optimised ones for all of the
> different cases. So I implemented a system that uses a chameleonic
> header; set a few options with #defines, and include the header, and it
> makes the optimised transform function for you.

I see, https://github.com/robinwatts/Little-CMS/blob/artifex/src/cmsxform.h -- that looks pretty magic, is there no way to do that without the #preprocessor trickery?

> He did however point out that I could recast the code slightly as a
> plugin for lcms 2. This gives the same benefits without polluting the
> internals of the library itself. I plan to do this at some point, but I
> have not got round to it yet.

Yes, that would be awesome. Thanks,

Richard.
|
From: Robin W. <rob...@ar...> - 2013-05-14 17:54:22
|
On 14/05/2013 09:20, Richard Hughes wrote:

> Other than profiling, how do we know if it's chosen a built-in version
> rather than the generic version?

I initially used profiling, but then I added some code to capture what transforms were used. This is also on the same branch I pointed you at before. A direct link is:

https://github.com/robinwatts/Little-CMS/commit/585c6191363a0989cb80e628529f985bf298c95a

It should be straightforward: enable the 'GATHER_TRANSFORM_STATS' #define in the patch and run your code. On exit it will dump the stats to stderr.

> I see, https://github.com/robinwatts/Little-CMS/blob/artifex/src/cmsxform.h
> -- that looks pretty magic, is there no way to do that without the
> #preprocessor trickery?

Not without expanding out every case separately - exactly what I wanted to avoid. You shouldn't need to understand the implementation in cmsxform.h though - you just need to look at cmsxform_extras.c for how to use it.

>> He did however point out that I could recast the code slightly as a
>> plugin for lcms 2. This gives the same benefits without polluting the
>> internals of the library itself. I plan to do this at some point, but I
>> have not got round to it yet.
>
> Yes, that would be awesome.

The new lcms is due in June I think. We'll probably pull that into gs and recast our optimisations as a plugin lib then.

HTH,

Robin
|
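The general pattern of counting which code path ran and dumping the totals to stderr at exit can be sketched as below. This is not the code from the linked commit; the enum, the counters and both function names are hypothetical, and the plain counters are an assumption that is only safe when called from a single thread.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical classification of the routine that handled a call. */
enum xform_kind { XFORM_GENERIC, XFORM_SPECIALISED, XFORM_KIND_COUNT };

static unsigned long xform_calls[XFORM_KIND_COUNT];
static const char *const xform_names[XFORM_KIND_COUNT] = {
    "generic", "specialised"
};

/* Print one line per kind to stderr; registered to run at exit. */
static void dump_xform_stats(void)
{
    for (int k = 0; k < XFORM_KIND_COUNT; k++)
        fprintf(stderr, "%-12s %lu calls\n", xform_names[k], xform_calls[k]);
}

/* Count one use of `kind`, registering the exit-time dump on first use.
 * Not thread-safe: with worker threads, use atomics or per-thread
 * counters that are summed before dumping. */
static void record_xform(enum xform_kind kind)
{
    static int registered;
    if (!registered)
        registered = (atexit(dump_xform_stats) == 0);
    xform_calls[kind]++;
}
```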
From: Sebastien L. <sl...@po...> - 2013-05-30 20:32:04
|
Hi Richard,

Sorry to reply so late to this thread, but I was away for a few weeks.

> I'm trying to make my transform go fast. I've got a 1920x1080
> RGB image being transformed from sRGB to the display profile.

OK, that sounds very similar to what I have to do in my application. (I also had to manage premultiplied alpha, but just ignore this in my source.)

I'm not an lcms expert, so I simply optimized the code path exercised by the transform that my application always uses, by unrolling some critical loops. In other words, I made the code less generic but it is several times faster for my particular need.

Warning: this is not suitable for everyone and I consider my modification a dirty hack, but it may be of some help to you...

Note that only the TYPE_BGRA_8 format has been optimized (normal transform + soft proofing transform). If you are using another format, you'll need to modify the source slightly to get the same performance, otherwise you'll see no difference from legacy lcms 2.4.

I made a little benchmark to measure the improvements (done on my Core2Duo). Here are the results:

- littleCMS legacy code : 92.12 CPU cycles per pixel transformed
- littleCMS hacked : 27.75 CPU cycles per pixel transformed

So I get a x3.3 boost.

I also sliced the image and ran several threads, thanks to the Qt threading model. The final improvement was ~x3.4 with 4 physical CPUs (I guess you could get a x1.8 boost with 2 physical CPUs).

As overall performance was ~x10, I stopped digging further and I use this code daily.

Note that I also rewrote the critical loop in SSE4 assembly code (just for fun). I found no real improvement because most of the work is memory traffic... So I kept the basic C code...

Modified code can be downloaded here:
http://sebastienleon.com/info/littleCMS/littleCMS_PreMulAlphaHack.zip
(as you may not use premultiplied alpha support, I suggest you undefine the flag I added: CMS_PREMUL_ALPHA_SUPPORT)

Hope it helps...
(you can use the code I added without any restrictions).

Best regards
Sebastien

Hughes wrote:

> Hi all,
>
> I'm trying to make my transform go fast. I've got a 1920x1080 RGB
> image being transformed from sRGB to the display profile. I've got a
> quad core processor on my development box, no shaders or GPU, and I'm
> trying to do the transform as quickly as possible.
>
> I figured the fastest way to do this would be to set up a threadpool
> with max_threads = 4. Then I have a few choices:
>
> * pop a thread from the pool for every line of the image, creating
> local state with p_in, p_out, width and stride
> * pop a thread from the pool for every n lines of the image, creating
> local state with p_in, p_out, width, stride and rows_to_process (where
> n = height / max_threads)
>
> I figured 4 threads should be ~4x faster than using 1 thread (in the
> second case we should only have 4 threads, so not much overhead), but
> no matter the value of max_threads or 'n' I can only achieve a ~1.9x
> speed-up. I've tried with and without cmsFLAGS_NOCACHE. Any pointers
> very welcome.
>
> Thanks,
>
> Richard
|
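The benchmark figures in Sebastien's message can be turned into per-frame costs with a little arithmetic. A sketch only: `cycles_per_frame` and `speedup` are hypothetical helpers, and the result is in CPU cycles because the clock speed of the Core2Duo used for the measurement is not given.

```c
#include <stddef.h>

/* CPU cycles needed for one frame at a given cost per pixel. */
static double cycles_per_frame(size_t width, size_t height,
                               double cycles_per_pixel)
{
    return (double)width * (double)height * cycles_per_pixel;
}

/* Speed-up of `after` relative to `before`, both in cycles per pixel. */
static double speedup(double before, double after)
{
    return before / after;
}
```

For a 1920x1080 frame, speedup(92.12, 27.75) gives the x3.3 single-thread figure, and multiplying it by the ~x3.4 threading gain yields a little over x11, in line with the "~x10" overall estimate.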