#57 Use intrinsic functions for bit counting to reduce footprint

closed-integrated
DRC
None
1
2016-01-14
2014-03-20
No

The patch will use clz/bsr instructions instead of a lookup table to reduce footprint on mobile devices.

The clz/bsr instructions produce the same result as the lookup table "jpeg_nbits_table" with equivalent performance. Removing the table saves 64 kB. It is not much, but we need to save all the memory we can.

This patch contains macros for gcc/clang. Windows has _BitScanReverse, but I have no such machine to test on.
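To illustrate the idea, here is a minimal sketch (not the actual patch; the function names are made up, and it assumes a GCC/Clang-style __builtin_clz) of how an intrinsic can replace the table lookup:

```c
/* Hypothetical sketch of the technique: count the significant bits of a
 * Huffman-coded magnitude with a compiler intrinsic instead of a 64 kB
 * lookup table.  __builtin_clz(x) returns the number of leading zero bits
 * in a 32-bit word and is undefined for x == 0, so zero is special-cased. */
static int nbits_intrinsic(unsigned int x)
{
  return x ? 32 - __builtin_clz(x) : 0;
}

/* Portable reference loop -- what the lookup table effectively encodes. */
static int nbits_loop(unsigned int x)
{
  int n = 0;
  while (x) {
    n++;
    x >>= 1;
  }
  return n;
}
```

On hardware with a native clz/bsr instruction, the intrinsic compiles to one or two instructions, which is why it can match the table's performance while eliminating the table's footprint.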

Discussion

  • DRC

    DRC - 2014-03-21

    After extensive benchmarking, I find that this patch unfortunately decreases performance in some cases.

    The results are mixed. On my Dell/Intel quad-core Xeon test machine, which is still running CentOS 5 and GCC 4.1.2, I observe that the patch has no measurable performance impact. However, on both of my AMD machines (an older dual-core Athlon64 machine and a brand new A10 APU-equipped quad-core machine), both of which are running CentOS 6 and GCC 4.4.7, I observe that the patch slows performance generally by 3-7% on average, and by as much as 7-12% in extreme cases. At least part of this seems to be due to the compiler and not the CPU. Using the older GCC 4.1.2 compiler on the same AMD A10-equipped machine produces better results, but there is still a measurable slow-down of up to 5%. This should be easily reproducible by running tjbench with artificial.ppm and nightshot_iso_100.ppm, both of which are available here: http://www.imagecompression.info/test_images/rgb8bit.zip

    I am not comfortable including this patch as-is. I would really prefer that something like this "just work" out of the box with no performance impact. Since it does have a measurable performance impact, then the only way I could include it right now would be via an optional configure switch, which is messy.

    Can you explain more about why 64k is so important? I mean, we have plenty of iPhone developers who have no problem with the code the way it is.

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-21

    I will have to build the old gcc versions to see what crap they output.

    But in the end it is on ARM that we need the patch. Chromium's multiprocess architecture multiplies all memory usage by a factor of ten, so 64k becomes 640k, which matters on low-end devices.

     
  • DRC

    DRC - 2014-03-21

    OK, that makes sense. Let me test it on my Panda board. It may be that this is something we only enable by default on ARM for now, or perhaps I can introduce a configure switch for it. It might even make sense to introduce a more generic configure switch such as --with-lowmem that can be used to include this and other low-memory options in the future. There is a low-memory progressive decode patch, for instance, that I would like to include at some point if it could be modified in such a way as to avoid backward ABI incompatibility (I know how to do that-- it would just take some time, so I would need financial sponsorship for the project.)

    It would definitely be cleaner, though, if this could be enabled out of the box without introducing new configure switches.

    Another option might be to include Mozilla's patch that pre-allocates the table and moves it into a static const so it can be shared by multiple processes.

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-24

    When reading the code, I can't see why the table needs to be as big as it is now. With BITS_IN_JSAMPLE=8 we will only use the first 2k of the table, or am I missing something? So if we add a precompiled table of the proper size, it will only add 2kB to the binary, and we can skip the intrinsic instructions that appear to be software-implemented on some chips.

    I attached a new patch here. The tests pass with BITS_IN_JSAMPLE=8, but it doesn't compile with BITS_IN_JSAMPLE=12 because of other errors.

     
  • DRC

    DRC - 2014-03-24

    Where are you getting that only the first 2k of the table is used? Looking at the ranges of the datatypes involved, block[0] and last_dc_val can be in the range of [-32768, 32767], so their difference can be in the range of [0, 65535].

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-25

    jchuff.h:
    /* The legal range of a DCT coefficient is
     * -1024 .. +1023 for 8-bit data;
     * -16384 .. +16383 for 12-bit data.
     * Hence the magnitude should always fit in 10 or 14 bits respectively.
     */

    Is this comment wrong or not applicable?

     
  • DRC

    DRC - 2014-03-25

    Aha. OK, in the original libjpeg code, it actually performs a check to ensure this:

    if (nbits > MAX_COEF_BITS+1)
      ERREXIT(state->cinfo, JERR_BAD_DCT_COEF);

    where MAX_COEF_BITS=10 for 8-bit coefficients, so you are correct that a 2k-deep table should be sufficient.
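    Spelling out the arithmetic (a sketch; the helper function is illustrative, and the ranges and MAX_COEF_BITS value come from the jchuff.h comment and check quoted above):

    ```c
    /* For 8-bit data, DC coefficients lie in [-1024, 1023], so the DC
     * difference block[0] - last_dc_val lies in [-2047, 2047].  Its
     * magnitude therefore needs at most MAX_COEF_BITS + 1 = 11 bits,
     * i.e. lookup indices 0..2047 -- a 2048-entry (2 kB) table covers
     * every value that passes the ERREXIT check. */
    #define MAX_COEF_BITS 10  /* for 8-bit coefficients */

    /* Portable bit count, for illustration only. */
    static int nbits(unsigned int x)
    {
      int n = 0;
      while (x) {
        n++;
        x >>= 1;
      }
      return n;
    }

    /* Worst case: |(-1024) - 1023| = 2047, which needs exactly 11 bits. */
    ```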

    Let me run this by Mozilla. They have a similar patch that pre-allocates a 64k-deep table, and I'm sure they would be interested in optimizing that. I also want their peer review on it.

     
  • DRC

    DRC - 2014-03-25

    In Firefox, they implemented a similar patch, except that theirs is still allocating the full 64k-deep table:

    https://bugzilla.mozilla.org/show_bug.cgi?id=815473

    However, they apparently have some way of sharing the static table among multiple instances. It would make me feel more comfortable to adopt their patch, since it has been thoroughly tested.

    Is there any way you can adopt Mozilla's approach of sharing the table? It makes me uncomfortable to reduce the table to the bare minimum size, when there is no bounds checking on nbits. It may be the case that an array out-of-bounds is not possible because of the way the algorithm works, but I'd need to convince myself of that, or someone would need to convince me.

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-26

    Android maps text and read-only data segments into shared memory in the same way as most other modern operating systems. So yes, with the Mozilla patch we will only get one copy of the table, and the unused 62k will never be paged in at all. But there are other parameters to consider in performance tuning, and increasing the binary size is also bad, since more bytes will have to be downloaded and there will be less room for other apps in flash memory.

    We are currently using the bsr/clz patch in our Android browsers. Performance is good enough, and no table is needed. So the Mozilla patch would just add another 64k to the binary size, and we would not apply it, since it is not an improvement for us.

    My suggestion is "#ifdef __arm__" followed by clz/bsr.

     
  • DRC

    DRC - 2014-03-26

    Well, forgive me if, as the owner of a phone that has 64 gigabytes of memory and some apps that are literally hundreds of megabytes, I think worrying over 64k is splitting hairs a bit. I am not getting paid for my time dealing with this problem, so frankly, I want it to go away as quickly as possible, but I also want to make sure everyone is happy-- or at least as happy as possible.

    Using bsr/clz on ARM only seems like a reasonable approach. But does GCC always generate good intrinsic code for ARM? You had indicated above that sometimes GCC intrinsics are implemented in software.

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-26

    I meant microcoded, not software-implemented. gcc will generate bsr for x86 and clz for ARM, but some chips implement the machine instruction in microcode (which is slow), in contrast to hardwired implementations that run in 1-3 cycles.

     
  • DRC

    DRC - 2014-03-26

    Are you aware of this being a problem for any ARM processors? Maybe I need to get Linaro in on this.

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-26

    First I have to say that we compile with '-march=armv7 -mthumb'. ARMv6 has clz, but not in combination with thumb. Thus 'gcc -march=armv6 -mthumb' would generate library calls for clz which would be really slow.

    I tried looping over clz on four devices, listed from low-end to high-end:
    1. Sony Xperia [ST21i] - fast clz
    2. Samsung Galaxy S3 Mini [GT-I8190N] - slow clz
    3. Samsung Galaxy S4 Mini [GT-I9195] - fast clz
    4. Samsung Galaxy S4 [GT-I9505] - fast clz

    On the first two devices, memory pressure is in general a much larger problem than run time. Background processes get killed by the OOM killer when foreground processes allocate memory. And starting new processes from scratch when the user switches tabs is slower than anything else.

     
  • DRC

    DRC - 2014-03-26

    Benchmarks would be very helpful here. The only thing I can test is a Panda board, which may or may not have any relevance performance-wise to the actual devices. If you can build tjbench and run it against artificial.ppm and nightshot_iso_100.ppm from http://www.imagecompression.info/test_images/rgb8bit.zip, comparing "before" and "after" performance on the various Android devices, that would make me feel much better about implementing this. In this project, I treat performance like I treat quality, so any decrease in performance is treated as a bug.

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-26

    When I run tjbench on my desktop I get small random variations in all figures. Is it the total run time or is it one particular figure that is most relevant for block encoding performance?

     
  • DRC

    DRC - 2014-03-26

    It will vary typically by as much as 2%. You can decrease that by increasing the benchmark time (tjbench -benchtime 10, for instance.) I'm not worried about small variations. I'm looking for whether the compress performance increases consistently and significantly relative to the decompress performance (decompress is the "control group" here, since we haven't modified anything that would affect that performance.)

     
  • DRC

    DRC - 2014-03-26

    Unfortunately, my Panda board isn't booting properly right now, and I don't have time to tinker with it. Thus, I'm going to have to rely on you and others to give me feedback as to the performance of this proposed modification.

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-27

    I have run tjbench on the four devices, first on artificial and then on nightshot_iso_100, with 95 as the quality parameter. I compiled two versions, one with CFLAGS="-march=armv6 -O3" and one with CFLAGS="-march=armv7-a -mthumb -O3". The latter target matches what we actually use for release builds.

    On the first two devices (low-end), for target armv6 the lookup table is slightly faster than clz. For target armv7+thumb the intrinsic clz is slightly faster than the lookup table. Thus a draw.

    On the other two devices (high-end) the intrinsic clz is significantly faster (15-20%) both with armv6 and armv7+thumb as target.

    So for the production target, the patched tjbench performs better on all tested devices. Attaching log files from the tests: the "tbl" suffix is for the original code with the lookup table in bss, and the "clz" suffix is for runs with the clz patch applied.

     
  • DRC

    DRC - 2014-03-27

    Great! That's good enough for me. Can you resubmit the original patch with the appropriate #ifdef so that it is only used on ARM? I will check that into trunk, and then I'm also going to go ahead and check in Mozilla's patch, even though it may only be used on x86. It'll give them and other mobile developers the option to easily evaluate both approaches.

     
  • DRC

    DRC - 2014-03-27
    • status: open --> open-accepted
     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-28

    New patch with defines for arm.

     
  • DRC

    DRC - 2014-03-28

    Can you explain what this does?

    #if !defined __thumb__ || defined __thumb2__

    We don't generally use -mthumb when building, for instance, iOS binaries or ARM Linux binaries.

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-28

    It is to avoid calls to __builtin_clz on targets without a matching instruction, since that would cause gcc to generate slow library calls.

    If compiling without -mthumb, then __thumb__ is undefined and __builtin_clz is used.

    If compiling with -mthumb and -march=armv6, then __thumb__ is defined, __thumb2__ is undefined and __builtin_clz is not used.

    If compiling with -mthumb and -march=armv7 or later, then __thumb2__ is defined and __builtin_clz is used.
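    Put together, the guard described above might look like this (a sketch; the macro name JPEG_NBITS and the fallback helper are illustrative, not the patch's actual identifiers):

    ```c
    /* Use __builtin_clz only where the target encodes a native CLZ
     * instruction -- ARM mode, or Thumb-2 (armv7+) -- so that GCC never
     * falls back to a slow library call.  Everywhere else, use a portable
     * loop (the real code would use the lookup table instead). */
    #if defined(__GNUC__) && defined(__arm__) && \
        (!defined(__thumb__) || defined(__thumb2__))
    #define JPEG_NBITS(x)  ((x) ? 32 - __builtin_clz(x) : 0)
    #else
    static int jpeg_nbits_fallback(unsigned int x)
    {
      int n = 0;
      while (x) {
        n++;
        x >>= 1;
      }
      return n;
    }
    #define JPEG_NBITS(x)  jpeg_nbits_fallback(x)
    #endif
    ```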

     
  • DRC

    DRC - 2014-03-28

    Checked in with minor wordsmithing modifications. Thanks!

     
  • DRC

    DRC - 2014-03-28
    • status: open-accepted --> closed-integrated
     