#57 Use intrinsic functions for bit counting to reduce footprint

closed-integrated
DRC
None
1
2016-01-14
2014-03-20
No

The patch will use clz/bsr instructions instead of a lookup table to reduce footprint on mobile devices.

The clz/bsr instructions produce the same result as the lookup table "jpeg_nbits_table" with equivalent performance. Removing the table saves 64 kB. It is not much, but we need to save all the memory we can.

This patch contains macros for gcc/clang. Windows has _BitScanReverse, but I have no such machine to test on.
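To illustrate the idea, here is a minimal sketch (not the actual patch; the function names are made up, and it assumes a GCC/Clang-style __builtin_clz) of how an intrinsic can replace the table lookup:

```c
/* Hypothetical sketch of the technique: count the significant bits of a
 * Huffman-coded magnitude with a compiler intrinsic instead of a 64 kB
 * lookup table.  __builtin_clz(x) returns the number of leading zero bits
 * in a 32-bit word and is undefined for x == 0, so zero is special-cased. */
static int nbits_intrinsic(unsigned int x)
{
  return x ? 32 - __builtin_clz(x) : 0;
}

/* Portable reference loop -- what the lookup table effectively encodes. */
static int nbits_loop(unsigned int x)
{
  int n = 0;
  while (x) {
    n++;
    x >>= 1;
  }
  return n;
}
```

On hardware with a native clz/bsr instruction, the intrinsic compiles to one or two instructions, which is why it can match the table's performance while eliminating the table's footprint.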

Discussion

  • DRC

    DRC - 2014-03-21

    After extensive benchmarking, I find that this patch unfortunately decreases performance in some cases.

    The results are mixed. On my Dell/Intel quad-core Xeon test machine, which is still running CentOS 5 and GCC 4.1.2, I observe that the patch has no measurable performance impact. However, on both of my AMD machines (an older dual-core Athlon64 machine and a brand new A10 APU-equipped quad-core machine), both of which are running CentOS 6 and GCC 4.4.7, I observe that the patch slows performance generally by 3-7% on average, and by as much as 7-12% in extreme cases. At least part of this seems to be due to the compiler and not the CPU. Using the older GCC 4.1.2 compiler on the same AMD A10-equipped machine produces better results, but there is still a measurable slow-down of up to 5%. This should be easily reproducible by running tjbench with artificial.ppm and nightshot_iso_100.ppm, both of which are available here: http://www.imagecompression.info/test_images/rgb8bit.zip

    I am not comfortable including this patch as-is. I would really prefer that something like this "just work" out of the box with no performance impact. Since it does have a measurable performance impact, then the only way I could include it right now would be via an optional configure switch, which is messy.

    Can you explain more about why 64k is so important? I mean, we have plenty of iPhone developers who have no problem with the code the way it is.

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-21

    I will have to build the old gcc versions to see what crap they output.

    But in the end it is on ARM that we need the patch. Chromium's multiprocess architecture multiplies all memory usage by a factor of ten, so 64k becomes 640k, which matters on low-end devices.

     
  • DRC

    DRC - 2014-03-21

    OK, that makes sense. Let me test it on my Panda board. It may be that this is something we only enable by default on ARM for now, or perhaps I can introduce a configure switch for it. It might even make sense to introduce a more generic configure switch such as --with-lowmem that can be used to include this and other low-memory options in the future. There is a low-memory progressive decode patch, for instance, that I would like to include at some point if it could be modified in such a way as to avoid backward ABI incompatibility (I know how to do that-- it would just take some time, so I would need financial sponsorship for the project.)

    It would definitely be cleaner, though, if this could be enabled out of the box without introducing new configure switches.

    Another option might be to include Mozilla's patch that pre-allocates the table and moves it into a static const so it can be shared by multiple processes.

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-24

    When reading the code, I can't see why the table needs to be as big as it is now. With BITS_IN_JSAMPLE=8 we will only use the first 2k of the table, or am I missing something? So if we add a precompiled table of the proper size, it will only add 2kB to the binary, and we can skip the intrinsic instructions that appear to be software-implemented on some chips.

    I attached a new patch here. The tests pass with BITS_IN_JSAMPLE=8, but it doesn't compile with BITS_IN_JSAMPLE=12 because of other errors.

     
  • DRC

    DRC - 2014-03-24

    Where are you getting that only the first 2k of the table is used? Looking at the ranges of the datatypes involved, block[0] and last_dc_val can be in the range of [-32768, 32767], so their difference can be in the range of [0, 65535].

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-25

    jchuff.h:
    /* The legal range of a DCT coefficient is
     * -1024 .. +1023 for 8-bit data;
     * -16384 .. +16383 for 12-bit data.
     * Hence the magnitude should always fit in 10 or 14 bits respectively.
     */

    Is this comment wrong or not applicable?

     
  • DRC

    DRC - 2014-03-25

    Aha. OK, in the original libjpeg code, it actually performs a check to ensure this:

    if (nbits > MAX_COEF_BITS+1)
      ERREXIT(state->cinfo, JERR_BAD_DCT_COEF);

    where MAX_COEF_BITS=10 for 8-bit coefficients, so you are correct that a 2k-deep table should be sufficient.
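    Spelling out the arithmetic (a sketch; the helper function is illustrative, and the ranges and MAX_COEF_BITS value come from the jchuff.h comment and check quoted above):

    ```c
    /* For 8-bit data, DC coefficients lie in [-1024, 1023], so the DC
     * difference block[0] - last_dc_val lies in [-2047, 2047].  Its
     * magnitude therefore needs at most MAX_COEF_BITS + 1 = 11 bits,
     * i.e. lookup indices 0..2047 -- a 2048-entry (2 kB) table covers
     * every value that passes the ERREXIT check. */
    #define MAX_COEF_BITS 10  /* for 8-bit coefficients */

    /* Portable bit count, for illustration only. */
    static int nbits(unsigned int x)
    {
      int n = 0;
      while (x) {
        n++;
        x >>= 1;
      }
      return n;
    }

    /* Worst case: |(-1024) - 1023| = 2047, which needs exactly 11 bits. */
    ```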

    Let me run this by Mozilla. They have a similar patch that pre-allocates a 64k-deep table, and I'm sure they would be interested in optimizing that. I also want their peer review on it.

     
  • DRC

    DRC - 2014-03-25

    In Firefox, they implemented a similar patch, except that theirs is still allocating the full 64k-deep table:

    https://bugzilla.mozilla.org/show_bug.cgi?id=815473

    However, they apparently have some way of sharing the static table among multiple instances. It would make me feel more comfortable to adopt their patch, since it has been thoroughly tested.

    Is there any way you can adopt Mozilla's approach of sharing the table? It makes me uncomfortable to reduce the table to the bare minimum size, when there is no bounds checking on nbits. It may be the case that an array out-of-bounds is not possible because of the way the algorithm works, but I'd need to convince myself of that, or someone would need to convince me.

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-26

    Android maps text and read-only data segments into shared memory in the same way as most other modern operating systems. So yes, with the Mozilla patch we will only get one copy of the table, and the unused 62k will never be paged in at all. But there are other parameters to consider in performance tuning, and increasing the binary size is also bad, since more bytes will have to be downloaded and there will be less room for other apps in flash memory.

    We are currently using the bsr/clz patch in our Android browsers. Performance is good enough, and no table is needed. So the Mozilla patch would just add another 64k to the binary size, and we would not apply it, since it is not an improvement for us.

    My suggestion is "#ifdef __arm__" followed by clz/bsr.

     
  • DRC

    DRC - 2014-03-26

    Well, forgive me if, as the owner of a phone that has 64 gigabytes of memory and some apps that are literally hundreds of megabytes, I think worrying over 64k is splitting hairs a bit. I am not getting paid for my time dealing with this problem, so frankly, I want it to go away as quickly as possible, but I also want to make sure everyone is happy-- or at least as happy as possible.

    Using bsr/clz on ARM only seems like a reasonable approach. But does GCC always generate good intrinsic code for ARM? You had indicated above that sometimes GCC intrinsics are implemented in software.

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-26

    I meant microcoded, not software-implemented. gcc will generate bsr for x86 and clz for ARM, but some chips implement the machine instruction in microcode (which is slow), in contrast to hardwired implementations that run in 1-3 cycles.

     
  • DRC

    DRC - 2014-03-26

    Are you aware of this being a problem for any ARM processors? Maybe I need to get Linaro in on this.

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-26

    First I have to say that we compile with '-march=armv7 -mthumb'. ARMv6 has clz, but not in combination with thumb. Thus 'gcc -march=armv6 -mthumb' would generate library calls for clz which would be really slow.

    I tried looping over clz on four devices, listed from low-end to high-end:
    1. Sony Xperia [ST21i] - fast clz
    2. Samsung Galaxy S3 Mini [GT-I8190N] - slow clz
    3. Samsung Galaxy S4 Mini [GT-I9195] - fast clz
    4. Samsung Galaxy S4 [GT-I9505] - fast clz

    On the first two devices, memory pressure is in general a much larger problem than run time. Background processes get killed by the OOM killer when foreground processes allocate memory. And starting new processes from scratch when the user switches tabs is slower than anything else.

     
  • DRC

    DRC - 2014-03-26

    Benchmarks would be very helpful here. The only thing I can test is a Panda board, which may or may not have any relevance performance-wise to the actual devices. If you can build tjbench and run it against artificial.ppm and nightshot_iso_100.ppm from http://www.imagecompression.info/test_images/rgb8bit.zip, comparing "before" and "after" performance on the various Android devices, that would make me feel much better about implementing this. In this project, I treat performance like I treat quality, so any decrease in performance is treated as a bug.

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-26

    When I run tjbench on my desktop I get small random variations in all figures. Is it the total run time or is it one particular figure that is most relevant for block encoding performance?

     
  • DRC

    DRC - 2014-03-26

    It will vary typically by as much as 2%. You can decrease that by increasing the benchmark time (tjbench -benchtime 10, for instance.) I'm not worried about small variations. I'm looking for whether the compress performance increases consistently and significantly relative to the decompress performance (decompress is the "control group" here, since we haven't modified anything that would affect that performance.)

     
  • DRC

    DRC - 2014-03-26

    Unfortunately, my Panda board isn't booting properly right now, and I don't have time to tinker with it. Thus, I'm going to have to rely on you and others to give me feedback as to the performance of this proposed modification.

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-27

    I have run tjbench on the four devices, first on artificial and then on nightshot_iso_100, with 95 as the quality parameter. I compiled two versions, one with CFLAGS="-march=armv6 -O3" and one with CFLAGS="-march=armv7-a -mthumb -O3". The latter target matches what we actually use for release builds.

    On the first two devices (low-end), for target armv6 the lookup table is slightly faster than clz. For target armv7+thumb the intrinsic clz is slightly faster than the lookup table. Thus a draw.

    On the other two devices (high-end) the intrinsic clz is significantly faster (15-20%) both with armv6 and armv7+thumb as target.

    So for the production target, the patched tjbench performs better on all tested devices. Attaching log files from the tests: the "tbl" suffix is for the original code with the lookup table in bss, and the "clz" suffix is for runs with the clz patch applied.

     
  • DRC

    DRC - 2014-03-27

    Great! That's good enough for me. Can you resubmit the original patch with the appropriate #ifdef so that it is only used on ARM? I will check that into trunk, and then I'm also going to go ahead and check in Mozilla's patch, even though it may only be used on x86. It'll give them and other mobile developers the option to easily evaluate both approaches.

     
  • DRC

    DRC - 2014-03-27
    • status: open --> open-accepted
     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-28

    New patch with defines for arm.

     
  • DRC

    DRC - 2014-03-28

    Can you explain what this does?

    #if !defined __thumb__ || defined __thumb2__

    We don't generally use -mthumb when building, for instance, iOS binaries or ARM Linux binaries.

     
  • Olle Liljenzin

    Olle Liljenzin - 2014-03-28

    It is to avoid calls to __builtin_clz on targets without a matching instruction, since that would cause gcc to generate slow library calls.

    If compiling without -mthumb, then __thumb__ is undefined and __builtin_clz is used.

    If compiling with -mthumb and -march=armv6, then __thumb__ is defined, __thumb2__ is undefined and __builtin_clz is not used.

    If compiling with -mthumb and -march=armv7 or later, then __thumb2__ is defined and __builtin_clz is used.
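    Put together, the guard described above might look like this (a sketch; the macro name JPEG_NBITS and the fallback helper are illustrative, not the patch's actual identifiers):

    ```c
    /* Use __builtin_clz only where the target encodes a native CLZ
     * instruction -- ARM mode, or Thumb-2 (armv7+) -- so that GCC never
     * falls back to a slow library call.  Everywhere else, use a portable
     * loop (the real code would use the lookup table instead). */
    #if defined(__GNUC__) && defined(__arm__) && \
        (!defined(__thumb__) || defined(__thumb2__))
    #define JPEG_NBITS(x)  ((x) ? 32 - __builtin_clz(x) : 0)
    #else
    static int jpeg_nbits_fallback(unsigned int x)
    {
      int n = 0;
      while (x) {
        n++;
        x >>= 1;
      }
      return n;
    }
    #define JPEG_NBITS(x)  jpeg_nbits_fallback(x)
    #endif
    ```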

     
  • DRC

    DRC - 2014-03-28

    Checked in with minor wordsmithing modifications. Thanks!

     
  • DRC

    DRC - 2014-03-28
    • status: open-accepted --> closed-integrated
     