Learn how easy it is to sync an existing GitHub or Google Code repo to a SourceForge project! See Demo

Close

Linear time AdaptiveThreshold

2012-08-24
2013-03-27
1 2 3 > >> (Page 1 of 3)
  • What about a Linear time AdaptiveThreshold?

     
  • I published recently an original implementation of AdaptiveThreshold method  as patch for graphicsmagick 1.3.15, I think the patch can be applied to newer versions with very little modification.
    It is an original and complete rewrite of the GM function using dynamic programming on the lat method. It has other features and a bugfix of the original function… the goal is linear time in function of window size.
    I'd be glad to see some feedback at the patch tracking, we are using it in production with great results at the company I work.

     
  • I noticed that the patch was added to the tracker but have not yet looked at it. Improving LAT performance is of course very useful since the input images typically have large pixel dimensions, and LAT does quite a lot of computations.

    Bob

     
  • Humberto Pereira did some fine tuning in the code and removed a set of branches and mistakes I did while writing the code, He got 2x boost over the patch. :-D
    Need some tests and generalization, but soon there will be new version of the patch.

    Roberto

     
  • With the already posted patch I am not seeing a speed improvement for an input image of dimensions 1800x1185 and a -lat argument of '10x10-5%' (which works well for that particular image).  Instead I am seeing a performance loss.

    With the existing implementation, and with a Q16 build, I am getting 6.100 iter/s, 6.138 iter/cpu, but with the proposed implementation I am getting 3.456 iter/s 3.470 iter/cpu.  This is about the 2X difference you say that the updated patch will offer.  If your updated patch does offer the 2X then it should achieve similar performance to the existing implementation with just one thread.

    However, on a system with 16 cores (and 32 threads), I get 40.983 iter/s 3.457 iter/cpu with the existing implementation.

    The benchmark approach I am using is:

    gm convert somefile.tiff somefile.mpc
      gm benchmark -duration 30 convert somefile.mpc -lat '10x10-5%' null:

    In your own testing, what CPU type are you using, what quantum depth, what image dimensions, and what -lat argument?

    This illustrates how the current algorithm speeds up via OpenMP:

    gm benchmark -duration 5 -stepthreads 4 convert  somefile.mpc -lat '10x10-5%' null:
      Results: 1 threads 31 iter 5.05s user 5.07s total 6.114 iter/s 6.139 iter/cpu 1.00 speedup 1.000 karp-flatt
      Results: 4 threads 82 iter 15.51s user 5.00s total 16.400 iter/s 5.287 iter/cpu 2.68 speedup 0.164 karp-flatt
      Results: 8 threads 127 iter 25.96s user 5.03s total 25.249 iter/s 4.892 iter/cpu 4.13 speedup 0.134 karp-flatt
      Results: 12 threads 154 iter 32.75s user 5.01s total 30.739 iter/s 4.702 iter/cpu 5.03 speedup 0.126 karp-flatt
      Results: 16 threads 184 iter 38.81s user 5.00s total 36.800 iter/s 4.741 iter/cpu 6.02 speedup 0.111 karp-flatt
      Results: 20 threads 190 iter 43.36s user 5.02s total 37.849 iter/s 4.382 iter/cpu 6.19 speedup 0.117 karp-flatt
      Results: 24 threads 191 iter 49.42s user 5.00s total 38.200 iter/s 3.865 iter/cpu 6.25 speedup 0.124 karp-flatt
      Results: 28 threads 195 iter 54.45s user 5.00s total 39.000 iter/s 3.581 iter/cpu 6.38 speedup 0.126 karp-flatt
      Results: 32 threads 214 iter 60.44s user 5.01s total 42.715 iter/s 3.541 iter/cpu 6.99 speedup 0.116 karp-flatt

     
  • I do see that there is considerable speed-up with the new algorithm if I dramatically enlarge the -lat arguments to '100x100-5%'.  The difference is 0.090 iter/s for existing  vs 3.192 iter/s for new.  However, the choice of the best arguments is very image/appplication dependent.  The '10x10-5%' worked best for converting a color scan of an old parchment document (a birth certificate) to good readable text.  It would be good to improve the performance for small area dimensions as well.

     
  • There is a mistaken placed loading of q pointer at that patch, with re-loads it at each pixel of a line, my fault…
    Even with that, the improvment is perceptible with a image of approx 2000x3000 px, with -lat 35x35-9%

    The new implementation uses only one thread, and should show little gain with threads, the original uses thread parallelization features and still slower for large window >25px. We use 35px window for results we need.
    The issue with original implementation is the quadratic algorithm. Please look at user time…

    At my machine (I5-2450M at 2.5Ghz 4G ram) the results (withouting fixing loading of q pointer issue) are:

    000003 $ /home/roberto/bin/gm benchmark convert 001698.jpg -verbose -lat 35x35-9% test/001698.png
    001698.jpg JPEG 2324x3332+0+0 DirectClass 8-bit 2.200u 0:03 (3.4M pixels/s)
    001698.jpg=>test/9_001698.png PNG 2324x3332+0+0 DirectClass 8-bit 52.0K 0.040u 0:01
    Results: 1 threads 1 iter 2.24s user 2.25s total 0.444 iter/s (0.446 iter/s cpu)

    000003 $ /usr/bin/gm benchmark convert 001698.jpg -verbose -lat 35x35-9% test/001698.png
    001698.jpg JPEG 2324x3332+0+0 DirectClass 8-bit 42.350u 0:12 (683.7K pixels/s)
    001698.jpg=>test/001698.png PNG 2324x3332+0+0 DirectClass 8-bit 52.0K 0.230u 0:01 (123.1M pixels/s)
    Results: 1 iter 42.61s user 11.13s total 0.090 iter/s (0.023 iter/s cpu)

     
  • Sorry, I was using debian pre-compiled default… compiled both with same CC options and the results:

    000003 $ /home/roberto/bin/gm benchmark convert 001698.jpg -verbose -lat 35x35-9% test/001698.png

    001698.jpg JPEG 2324x3332+0+0 DirectClass 8-bit 2.160u 0:03 (3.4M pixels/s)
    001698.jpg=>test/9_001698.png PNG 2324x3332+0+0 DirectClass 8-bit 52.0K 0.040u 0:01
    Results: 1 threads 1 iter 2.20s user 2.21s total 0.452 iter/s (0.455 iter/s cpu)

    000003 $ /usr/bin/gm benchmark convert 001698.jpg -verbose -lat 35x35-9% test/001698.png

    001698.jpg JPEG 2324x3332+0+0 DirectClass 8-bit 10.500u 0:03 (2.7M pixels/s)
    001698.jpg=>test/9_001698.png PNG 2324x3332+0+0 DirectClass 8-bit 52.0K 0.210u 0:01
    Results: 4 threads 1 iter 10.73s user 2.83s total 0.353 iter/s (0.093 iter/s cpu)

    The original used 4 threads and with the patch I got faster results with only one thread, fixing the patch we get half the time for these image, and here at production server, dealing with million image files, we can paralelize process for images.

     
  • It is possible to obtain more speedup eliminating some ifs at the code, and it would be useful generalizing some -lat procedures to work per channel, just like hull function, we could use to implement variations of lat or other filters… It would require some deeper software engineering.

     
  • No doubt, anything is possible.  In the past I have discovered that it is sometimes faster (or as fast) to process all of the channels and then restore the original channels which should not have been modified.  This is because conditional code can severely interfere with the compiler's optimizer.

    GraphicsMagick lacks a generic way to specify the channels to be modified.

    Have you uploaded your improved version to the bug tracker yet?

    The new algorithm seems like a huge improvement over the simplistic approach which was used before.

     
  • The speedup at very small image files and small windowing is insignificant, but the speed loss for small files, if it exists are also so small and related mostly to memory alocation, with bigger windowing for large images, we get a considerable quality and speed gain, quadratic order.
    The openmp speedup are related with the quadratic algorithm used. It ends up locking other cores.

     
  • new version of patch sent…
    Had these results for small color image:

    000003 $ /home/roberto/bin/gm benchmark convert 0681_fig1.jpg -verbose -lat 35x35-9% test/0681_fig1.png
    0681_fig1.jpg JPEG 500x295+0+0 DirectClass 8-bit 0.020u 0:01
    writing raw profile: type=EXIF, length=5260
    writing raw profile: type=iptc, length=6456
    writing raw profile: type=APP1, length=4678
    0681_fig1.jpg=>000003/test/m_0681_fig1.png PNG 500x295+0+0 DirectClass 8-bit 24.0K 0.010u 0:01
    Results: 4 threads 1 iter 0.03s user 0.05s total 20.000 iter/s (33.333 iter/s cpu)

    000003 $ /usr/bin/gm benchmark convert 0681_fig1.jpg -verbose -lat 35x35-9% test/0681_fig1.png
    0681_fig1.jpg JPEG 500x295+0+0 DirectClass 8-bit 0.370u 0:01 (1.4M pixels/s)
    writing raw profile: type=EXIF, length=5260
    writing raw profile: type=iptc, length=6456
    writing raw profile: type=APP1, length=4678
    0681_fig1.jpg=>000003/test/t_0681_fig1.jpg.png PNG 500x295+0+0 DirectClass 8-bit 24.0K 0.080u 0:01
    Results: 4 threads 1 iter 0.45s user 0.13s total 7.692 iter/s (2.222 iter/s cpu)

    and for big grayscale image:
    /home/roberto/bin/gm benchmark convert 001698.jpg -verbose -lat 35x35-9% test/m_001698.png
    001698.jpg JPEG 2324x3332+0+0 DirectClass 8-bit 0.430u 0:01 (17.2M pixels/s)
    001698.jpg=>test/m_001698.png PNG 2324x3332+0+0 DirectClass 8-bit 52.0K 0.030u 0:01
    Results: 4 threads 1 iter 0.46s user 0.48s total 2.083 iter/s (2.174 iter/s cpu)

    /usr/local/bin/gm benchmark convert 001698.jpg -verbose -lat 35x35-9% test/t_001698.png
    001698.jpg JPEG 2324x3332+0+0 DirectClass 8-bit 10.440u 0:03 (2.6M pixels/s)
    001698.jpg=>test/t_001698.png PNG 2324x3332+0+0 DirectClass 8-bit 52.0K 0.220u 0:01 (123.1M pixels/s)
    Results: 4 threads 1 iter 10.68s user 2.87s total 0.348 iter/s (0.094 iter/s cpu)

     
  • One thing that I did not test is if the new implementation works for a "Q32" build.  In a Q32 build, pixel quantums are 32-bit integers.  This leads to interesting issues for overflow since an integer addition could overflow on the very first add.  It would be good if the algorithm could work completely for Q32 so that it can satisfy scientific applications.  Otherwise it would be necessary to reduce all of the input data to 16-bit sizes.

     
  • Yeah, have not tested neither for 16 or 32-bit quantum, it would be best way to verify if overflow avoidance works.
    Interesting things should happen. How can I try that?

     
  • This past weekend I tested via a Q16 build but did not try to break it.  To test, use this argument to configure

    -with-quantum-depth=16

      or

    -with-quantum-depth=32

     
  • Reducing input data should be easy with a macro, but it would sacrifice quality, 128-bit long long, would be ideal.
    We were considering use sse instructions and some loop unroling for processing, but do not know if it would be worth

     
  • At least one GraphicsMagick -lat user is using it inside a cell phone.  128 bit long long would be wonderful if it was universally supported, and fast.

    For my own builds of GraphicsMagick, I include these GCC warnings options in CFLAGS:

    -Wall -Winline -W -Wformat-security -Wpointer-arith -Wdisabled-optimization -Wmissing-noreturn -Wno-unknown-pragmas -Wdeclaration-after-statement

    These will definitely produce warnings with your code and I would need to figure out how to modify the code to eliminate them.  In particular, use of mixed sign in comparisons will produce warnings.  Besides issues with obscure bugs, I have noticed that GCC produces slower code if there is mixed sign in comparisons.

     
  • With latest patch, I see warnings like this:

    /home/bfriesen/src/graphics/GM/magick/effect.c: In function 'AdaptiveThresholdImage':
    /home/bfriesen/src/graphics/GM/magick/effect.c:158:5: warning: ISO C90 forbids mixed declarations and code
    /home/bfriesen/src/graphics/GM/magick/effect.c:171:19: warning: comparison between signed and unsigned integer expressions
    /home/bfriesen/src/graphics/GM/magick/effect.c:189:15: warning: comparison between signed and unsigned integer expressions
    /home/bfriesen/src/graphics/GM/magick/effect.c:216:15: warning: comparison between signed and unsigned integer expressions
    /home/bfriesen/src/graphics/GM/magick/effect.c:227:23: warning: comparison between signed and unsigned integer expressions
    /home/bfriesen/src/graphics/GM/magick/effect.c:270:17: warning: comparison between signed and unsigned integer expressions
    /home/bfriesen/src/graphics/GM/magick/effect.c:271:20: warning: comparison between signed and unsigned integer expressions
    /home/bfriesen/src/graphics/GM/magick/effect.c:271:34: warning: comparison between signed and unsigned integer expressions
    /home/bfriesen/src/graphics/GM/magick/effect.c:282:13: warning: ISO C90 forbids mixed declarations and code

     
  • With the patch, the Q32 build is definitely overflowing quite badly.  The result is mostly white and only a little bit of the original image remains.

     
  • I'll handle the warnings, and try to fix the 32-bit quantum issue… it should be possible to handle 32-bit quanto with 64-bit long.

     
  • 64-bit long is likely not good for some 32-bit systems.  Can you use a 'double' type instead for the Q32 build?

     
  • Which parameters you used to build with last patch?
    The warnings I got, with gnu98/gnu99, for ./configure defaults and ./configure -with-quantum-depth 32 were:

    Makefile:11220: aviso: impondo comandos para o alvo `PerlMagick/Magick.pm'
    Makefile:5529: aviso: ignorando comandos antigos para o alvo `PerlMagick/Magick.pm'
    Makefile:11220: aviso: impondo comandos para o alvo `PerlMagick/Magick.pm'
    Makefile:5529: aviso: ignorando comandos antigos para o alvo `PerlMagick/Magick.pm'
    magick/floats.c: In function '_Gm_convert_fp16_to_fp32':
    magick/floats.c:61: warning: cast from pointer to integer of different size
    magick/floats.c: In function '_Gm_convert_fp24_to_fp32':
    magick/floats.c:507: warning: cast from pointer to integer of different size

    With ansi/iso c90/iso c90:1994/ the warnings/errors in effects.c were related with comments, already fixed…
    but at the end I get:

    Makefile:11220: aviso: impondo comandos para o alvo `PerlMagick/Magick.pm'
    Makefile:5529: aviso: ignorando comandos antigos para o alvo `PerlMagick/Magick.pm'
    Makefile:11220: aviso: impondo comandos para o alvo `PerlMagick/Magick.pm'
    Makefile:5529: aviso: ignorando comandos antigos para o alvo `PerlMagick/Magick.pm'
    magick/floats.c: In function '_Gm_convert_fp16_to_fp32':
    magick/floats.c:61: warning: cast from pointer to integer of different size
    magick/floats.c: In function '_Gm_convert_fp24_to_fp32':
    magick/floats.c:507: warning: cast from pointer to integer of different size
    coders/mac.c: In function 'ReadMACImage':
    coders/mac.c:103: error: expected expression before '/' token
    coders/mac.c:103:27: warning: missing terminating ' character
    coders/mac.c:103: error: missing terminating ' character
    make: **  Erro 1

    What parameters did you use in ./configure?

     
  • I see that a C++-style comment crept in.  Not good.

    This is what I use for a typical build with GCC:

    ../GraphicsMagick/configure 'CFLAGS=-march=native -O2 -frename-registers -g -Wall -Winline -W -Wformat-security -Wpointer-arith -Wdisabled-optimization -Wmissing-noreturn -Wno-unknown-pragmas' 'CXXFLAGS=-march=native -O -g -Wall -Winline -W -Wextra -Wno-unknown-pragmas' 'LDFLAGS=-L/usr/local/lib -R/usr/local/lib' '-with-quantum-depth=32' '-disable-shared' '-enable-static' '-enable-openmp-slow' '-enable-silent-rules'

    I always configure and build from outside of the source tree so that several builds (e.g. Q8, Q16, Q32, 32-bit, 64-bit)  can be active at once.

     
  • I'm using the same structure used in original -lat for computing the sum of average… which is LongPixelPacket…

     
1 2 3 > >> (Page 1 of 3)