Some initial ARM NEON optimizations for libjpeg-turbo
SIMD-accelerated libjpeg-compatible JPEG codec library
Includes the code for ARM NEON runtime detection and initial optimizations
for 'jpeg_idct_ifast' and YCbCr->RGB colorspace conversion.
A reasonably up-to-date version of the GNU assembler is needed to compile
the ARM assembly code, which is fine for Linux and Android. Making NEON assembly
code compile for iPhone may be a bit tricky, which is a rather common issue:
http://comments.gmane.org/gmane.comp.graphics.pixman/1052?set_lines=100000
The same patches are also available at:
http://cgit.freedesktop.org/~siamashka/libjpeg-turbo/log/?h=sent/20110422-arm-neon-try1
Forgot to add a summary. These ARM NEON patches provide more than a 2x overall speedup on ARM Cortex-A9 when decoding images with djpeg using the '-dct fast' option. There is also a significant performance improvement with the default DCT, because of the colorspace conversion optimization.
Questions:
Is there a way to do runtime detection of NEON without relying on /proc/cpuinfo? Relying on a platform-specific trick like that makes me nervous.
The inclusion of AM_PROG_AS in configure.ac needs to be conditional on the platform being == "arm", if possible.
To elaborate on the first question, with SSE2 and other x86 SIMD features, there is a special assembly instruction that can be issued to detect their presence in the CPU. Surely such a thing exists for ARM as well, otherwise how would the Linux kernel know how to populate /proc/cpuinfo?
See simd/jsimdcpu.asm.
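For comparison, here is a minimal sketch of what the x86 side looks like when written in C rather than assembly. It is not the code from simd/jsimdcpu.asm. It assumes a GCC-compatible compiler that ships <cpuid.h> with the __get_cpuid() helper, and the function name x86_has_sse2 is made up for illustration. The point is only that CPUID is an unprivileged instruction, so userspace can query the feature bits directly.

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang helper header (assumption: GCC-compatible compiler) */
#endif

/* Hypothetical helper: returns 1 if the CPU reports SSE2, 0 otherwise.
 * On non-x86 targets it always returns 0. */
int x86_has_sse2(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 1 returns the standard feature flags in ECX/EDX. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    /* SSE2 is reported in bit 26 of EDX. */
    return (int)((edx >> 26) & 1u);
#else
    return 0;
#endif
}
```

No equivalent of this is shown for ARM because the question here is precisely whether one exists.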
It's not so easy. There is a set of coprocessor registers which can be read to obtain such information, but they are not accessible from userspace. A quote from: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0406b/index.html
"B5.1.2 General features of the CPUID registers
All of the CPUID registers are:
- 32-bit read-only registers
- accessible only in privileged modes"
This all makes runtime detection of CPU features really ugly and system dependent on ARM, because the OS kernel has to somehow provide this information to userspace applications. I have summarized the current situation here, and it looks like using /proc/cpuinfo is the least problematic method at the moment: http://lists.arm.linux.org.uk/lurker/message/20110426.215344.2ccef634.en.html
I have not received any reply yet, but I will keep pushing the ARM Linux guys to at least have /proc/cpuinfo properly documented. Russell King is the maintainer of ARM support in Linux.
There are other methods. But all of them are system dependent and have their own problems.
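To make the /proc/cpuinfo approach concrete, here is a minimal sketch of the idea. It is not the code from the patch. It assumes the usual Linux/ARM layout where a line starting with "Features" lists space-separated capability tokens such as "neon". The function names features_line_has and cpuinfo_has_neon are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `line` is a "Features" line from
 * /proc/cpuinfo that contains `feature` as a whole whitespace-separated
 * token, 0 otherwise. */
int features_line_has(const char *line, const char *feature)
{
    const char *p;
    size_t len = strlen(feature);

    if (strncmp(line, "Features", 8) != 0)
        return 0;
    p = strchr(line, ':');
    if (p == NULL)
        return 0;
    p++;

    while (*p != '\0') {
        size_t n;

        p += strspn(p, " \t\r\n");   /* skip separators */
        n = strcspn(p, " \t\r\n");   /* length of the next token */
        if (n > 0 && n == len && strncmp(p, feature, len) == 0)
            return 1;
        p += n;
    }
    return 0;
}

/* Hypothetical helper: scans the file at `path` (normally "/proc/cpuinfo")
 * and returns 1 if NEON is advertised. Returns 0 if it is not, or if the
 * file cannot be opened, so the caller falls back to plain C code. */
int cpuinfo_has_neon(const char *path)
{
    char line[1024];
    int found = 0;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return 0;
    while (fgets(line, sizeof(line), f) != NULL) {
        if (features_line_has(line, "neon")) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

A caller would run cpuinfo_has_neon("/proc/cpuinfo") once at init time and cache the result. Matching whole tokens rather than using a plain substring search avoids false positives from longer flag names, and treating an unreadable file as "no NEON" keeps the failure mode safe.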
It should be safe on this particular platform (linux/android). The other platforms will not get ARM NEON support until somebody provides the needed system dependent code.
Done
libjpeg-turbo-20110502-arm-neon-try2.tar.gz
So, if there is no way to probe this information at the assembly level, then how does the Linux kernel obtain the information which it puts into /proc/cpuinfo?
What concerns me is your statement, "the other platforms will not get ARM NEON support until somebody provides the needed system dependent code." So that means if someone else comes along and wants to put libjpeg-turbo on a different ARM-based O/S, then they have to introduce their own special sauce for detecting the processor. That leads to some pretty messy code.
It's better for everyone if a more generic solution is introduced first, because that means that the code probably doesn't have to be modified later on, which means the chances of breaking it on the platform you care about are reduced.
It does this in the following way: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=arch/arm/include/asm/cputype.h;h=20ae96cc0020eeead866c3a2f983c3297a668034;hb=521cb40b0c44418a4fd36dc633f575813d59a43d#l27
But in userspace you are going to get an undefined instruction exception. It's the same on x86: some instructions are privileged and cannot be used by userspace code (such as modifying page tables, managing caches, interrupt descriptor tables, accessing ports, etc.). On ARM, somebody just "cleverly" decided that reading the CPU feature information is a privileged operation, so direct access is blocked for userspace code.
Yes, unfortunately that's the case. Fortunately there are not so many relevant operating systems to support (Linux/Android, iOS, ARM Windows), so it's not going to be too messy. Also I suspect that iOS does not need any runtime detection at all and the availability of NEON can be assumed by default for ARMv7.
Sure it's better for everyone. But what if a generic solution just does not exist? If it did, I guess everyone would have started using it by now.
After applying these patches, I can no longer configure on x86 systems using recent versions of autotools. I get the following error:
configure: error: conditional "am__fastdepCCAS" was never defined.
In a nutshell, autotools doesn't like the fact that AM_PROG_AS is invoked conditionally. Automake is apparently detecting that one of the Makefile.am files (simd/Makefile.am, specifically) specifies an assembly language file as source code, and it apparently isn't smart enough to also detect that the assembly file is included only when WITH_ARM is true.
Here's the problem: libjpeg-turbo can't assume a GNU compiler environment. On Solaris, for instance, we build with Sun Studio to get the best performance. I don't think AM_PROG_AS can safely be included in all cases, but it also seems to break the build if included conditionally.
http://www.gnu.org/s/hello/manual/automake/Public-Macros.html
"AM_PROG_AS Use this macro when you have assembly code in your project. This will choose the assembler for you (by default the C compiler) and set CCAS, and will also set CCASFLAGS if required. "
This does not say anything about requiring a GNU compiler environment. Also, pixman uses AM_PROG_AS (unconditionally) and still builds fine on Solaris and on a variety of other systems.
OK, patches have been checked in with minor modifications. Please test the trunk and make sure it builds and runs 'make test' successfully on ARM. If so, I will do a build on all of the supported x86 platforms to make sure nothing is broken there.
Thanks. SVN trunk builds and passes tests on ARM (with --without-simd too). The NEON optimizations are also used and provide a decoding speedup.
Maybe it's too early to ask, but what are the plans on the way to 1.2? Will there be a new unstable libjpeg-turbo release soon which could be tested by Linux distro maintainers on ARM?
And there are still many more ARM NEON optimizations which could be added.
LJT 1.2 is slated for early 2012, but the beta may land this Fall if I can finish the re-licensing work in a shorter timeframe.