#7 Some initial ARM NEON optimizations for libjpeg-turbo

closed-integrated
nobody
None
5
2014-03-27
2011-04-22
No

Includes the code for ARM NEON runtime detection and initial optimizations
for 'jpeg_idct_ifast' and YCbCr->RGB colorspace conversion.

A reasonably up to date version of GNU assembler is needed to compile
ARM assembly code, which is fine for linux and android. Making NEON assembly
code compile for iPhone may be a bit tricky, which is a rather common issue:
http://comments.gmane.org/gmane.comp.graphics.pixman/1052?set_lines=100000

The same patches are also available at:
http://cgit.freedesktop.org/~siamashka/libjpeg-turbo/log/?h=sent/20110422-arm-neon-try1

Discussion

  • Siarhei Siamashka

    Forgot to add a summary. These ARM NEON patches provide more than a 2x overall speedup on ARM Cortex-A9 when decoding images with djpeg using the '-dct fast' option. There is also a significant performance improvement even with the default DCT, thanks to the colorspace conversion optimization.

     
  • DRC

    DRC - 2011-04-26

    Questions:

    Is there a way to do runtime detection of Neon without relying on /proc/cpuinfo? Relying on a platform-specific trick like that makes me nervous.

    The inclusion of AM_PROG_AS in configure.ac needs to be conditional on the platform being == "arm", if possible.

     
  • DRC

    DRC - 2011-04-26

    To elaborate on the first question, with SSE2 and other x86 SIMD features, there is a special assembly instruction that can be issued to detect their presence in the CPU. Surely such a thing exists for ARM as well, otherwise how would the Linux kernel know how to populate /proc/cpuinfo?

    See simd/jsimdcpu.asm.
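
For readers unfamiliar with the x86 side, here is a minimal C sketch of that kind of unprivileged probe. It is not taken from simd/jsimdcpu.asm. It assumes a GCC or Clang toolchain, whose `<cpuid.h>` provides `__get_cpuid`, and the helper names `edx_has_sse2` and `cpu_has_sse2` are made up for illustration.

```c
#include <stdio.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* CPUID leaf 1 reports SSE2 support in bit 26 of EDX. */
#define CPUID_LEAF1_EDX_SSE2 (1u << 26)

/* Hypothetical helper: decode the SSE2 bit from a leaf-1 EDX value. */
static int edx_has_sse2(unsigned int edx)
{
    return (edx & CPUID_LEAF1_EDX_SSE2) != 0;
}

/* Hypothetical helper: returns 1 if the running CPU reports SSE2, else 0.
 * On non-x86 builds it simply returns 0. */
static int cpu_has_sse2(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 when the requested leaf is not supported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return edx_has_sse2(edx);
#else
    return 0;
#endif
}
```

Calling `cpu_has_sse2()` from ordinary userspace code works on x86 because CPUID is not a privileged instruction, which is the property being asked about for ARM.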

     
  • Siarhei Siamashka

    To elaborate on the first question, with SSE2 and other x86 SIMD features,
    there is a special assembly instruction that can be issued to detect their
    presence in the CPU. Surely such a thing exists for ARM as well, otherwise
    how would the Linux kernel know how to populate /proc/cpuinfo?

    It's not so easy. There is a set of coprocessor registers which can be read to obtain such information, but they are not accessible from userspace. A quote from: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0406b/index.html
    "B5.1.2 General features of the CPUID registers
    All of the CPUID registers are:
    - 32-bit read-only registers
    - accessible only in privileged modes"

    This all makes runtime CPU feature detection really ugly and system dependent on ARM, because the OS kernel has to somehow provide such information to userspace applications. I have summarized the current situation here, and it looks like using /proc/cpuinfo is the least problematic method at the moment: http://lists.arm.linux.org.uk/lurker/message/20110426.215344.2ccef634.en.html

    I did not get any reply yet, but I will try to keep pushing the arm linux guys to have /proc/cpuinfo at least properly documented. Russell King is the maintainer of arm support in linux.
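
As a rough illustration of the /proc/cpuinfo approach on Linux/Android, the sketch below scans for a "Features" line and looks for a whole-word "neon" token. It is not the code from the patches. The function names `line_has_neon` and `cpuinfo_has_neon` are hypothetical, and the "Features : ... neon ..." layout is an assumption about what 32-bit ARM Linux kernels print, which (as noted above) is not formally documented.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `line` is a "Features" line that lists
 * "neon" as a whole whitespace-separated token, 0 otherwise. */
static int line_has_neon(const char *line)
{
    const char *p;

    assert(line != NULL);
    if (strncmp(line, "Features", 8) != 0)
        return 0;
    p = strchr(line, ':');
    if (p == NULL)
        return 0;
    p++;                                  /* skip the ':' separator */
    while (*p != '\0') {
        size_t len;

        p += strspn(p, " \t\r\n");        /* skip leading whitespace */
        len = strcspn(p, " \t\r\n");      /* length of the next token */
        if (len == 4 && strncmp(p, "neon", 4) == 0)
            return 1;
        p += len;
    }
    return 0;
}

/* Hypothetical helper: returns 1 if the file at `path` (normally
 * "/proc/cpuinfo") advertises NEON, 0 if not or if it cannot be read.
 * Lines longer than the buffer are read in pieces, so a very long
 * Features line could be missed; good enough for a sketch. */
static int cpuinfo_has_neon(const char *path)
{
    char line[1024];
    int found = 0;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return 0;                         /* no information: assume no NEON */
    while (!found && fgets(line, sizeof(line), f) != NULL)
        found = line_has_neon(line);
    fclose(f);
    return found;
}
```

A caller would use `cpuinfo_has_neon("/proc/cpuinfo")` once at startup and cache the result. Treating an unreadable file as "no NEON" keeps the fallback on the plain C code path.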

     
  • Siarhei Siamashka

    Is there a way to do runtime detection of Neon without
    relying on /proc/cpuinfo?

    There are other methods. But all of them are system dependent and have their own problems.

    Relying on a platform-specific trick like that makes me
    nervous.

    It should be safe on this particular platform (linux/android). The other platforms will not get ARM NEON support until somebody provides the needed system dependent code.

    The inclusion of AM_PROG_AS in configure.ac needs to be
    conditional on the platform being == "arm", if possible.

    Done

     
  • Siarhei Siamashka

    libjpeg-turbo-20110502-arm-neon-try2.tar.gz

     
  • DRC

    DRC - 2011-05-02

    So, if there is no way to probe this information at the assembly level, then how does the Linux kernel obtain the information which it puts into /proc/cpuinfo?

    What concerns me is your statement, "the other platforms will not get ARM NEON support until somebody provides the needed system dependent code." So that means if someone else comes along and wants to put libjpeg-turbo on a different ARM-based O/S, then they have to introduce their own special sauce for detecting the processor. That leads to some pretty messy code.

    It's better for everyone if a more generic solution is introduced first, because that means that the code probably doesn't have to be modified later on, which means the chances of breaking it on the platform you care about are reduced.

     
  • Siarhei Siamashka

    So, if there is no way to probe this information at
    the assembly level, then how does the Linux kernel
    obtain the information which it puts into /proc/cpuinfo?

    It does this in the following way: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=arch/arm/include/asm/cputype.h;h=20ae96cc0020eeead866c3a2f983c3297a668034;hb=521cb40b0c44418a4fd36dc633f575813d59a43d#l27

    But in userspace you are going to get an undefined instruction exception. It's the same on x86: some instructions are privileged and cannot be used by userspace code (modifying page tables, managing the cache, interrupt descriptor tables, accessing ports, etc.). On ARM, somebody "cleverly" decided that reading the CPU feature information is a privileged operation too, so userspace code has no direct access to it.

    So that means if someone else comes along and
    wants to put libjpeg-turbo on a different ARM-based
    O/S, then they have to introduce their own special
    sauce for detecting the processor. That leads to
    some pretty messy code.

    Yes, unfortunately that's the case. Fortunately there are not so many relevant operating systems to support (Linux/Android, iOS, ARM Windows), so it's not going to be too messy. Also I suspect that iOS does not need any runtime detection at all and the availability of NEON can be assumed by default for ARMv7.
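
One way to keep the per-OS code contained is to hide each system-dependent check behind a common probe signature and try the probes in order. The sketch below only illustrates that idea and is not how the patches are organized. All names (`neon_probe_fn`, `run_probes`, `probe_compile_time` and the two stand-in probes) are hypothetical, and the iOS branch encodes the assumption stated above that NEON can be taken for granted when the compiler targets it.

```c
#include <stddef.h>

/* A probe returns 1 (NEON present), 0 (NEON absent) or -1 (cannot tell). */
typedef int (*neon_probe_fn)(void);

/* Compile-time probe: on Apple targets built with NEON enabled, assume the
 * hardware has it (an assumption, not a documented guarantee). */
static int probe_compile_time(void)
{
#if defined(__APPLE__) && defined(__ARM_NEON__)
    return 1;
#else
    return -1;
#endif
}

/* Stand-in probes, used here only to exercise the dispatcher. */
static int probe_always_unknown(void) { return -1; }
static int probe_always_yes(void)     { return 1; }

/* Try each probe in order; the first definite answer wins. If no probe
 * can tell, report "absent" so the caller stays on the plain C path. */
static int run_probes(const neon_probe_fn *probes, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        int result = probes[i]();

        if (result >= 0)
            return result;
    }
    return 0;
}
```

A port to a new OS would then add one probe function and one table entry, leaving the existing Linux/Android path untouched.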

    It's better for everyone if a more generic
    solution is introduced first, because that
    means that the code probably doesn't have
    to be modified later on, which means the
    chances of breaking it on the platform you
    care about are reduced.

    Sure it's better for everyone. But what if a generic solution just does not exist? If it did, I guess everyone would have started using it by now.

     
  • DRC

    DRC - 2011-05-03

    After applying these patches, I can no longer configure on x86 systems using recent versions of autotools. I get the following error:

    configure: error: conditional "am__fastdepCCAS" was never defined.

    In a nutshell, autotools doesn't like the fact that AM_PROG_AS is invoked conditionally. automake is apparently detecting that one of the Makefile.am files (simd/Makefile.am, specifically) specifies an assembly language file as source code, but it apparently isn't smart enough to also detect that the assembly file is included only when WITH_ARM is true.

    Here's the problem-- libjpeg-turbo can't assume a GNU compiler environment. On Solaris, for instance, we build with Sun Studio to get the best performance. I don't think AM_PROG_AS can safely be included in all cases, but it also seems to break the build if included conditionally.

     
  • Siarhei Siamashka

    I don't think AM_PROG_AS can safely be included
    in all cases

    http://www.gnu.org/s/hello/manual/automake/Public-Macros.html
    "AM_PROG_AS Use this macro when you have assembly code in your project. This will choose the assembler for you (by default the C compiler) and set CCAS, and will also set CCASFLAGS if required. "

    This does not say anything about requiring GNU compiler environment. Also pixman uses AM_PROG_AS (unconditionally) and still builds fine in Solaris and in a variety of other systems.

     
  • DRC

    DRC - 2011-05-03

    OK, patches have been checked in with minor modifications. Please test the trunk and make sure it builds and runs 'make test' successfully on ARM. If so, I will do a build on all of the supported x86 platforms to make sure nothing is broken there.

     
  • Siarhei Siamashka

    Thanks. SVN trunk builds and passes tests on ARM (with --without-simd too). The NEON optimizations are also used and provide a decoding speedup.

     
  • Siarhei Siamashka

    Maybe it's too early to ask, but what are the plans on the way to 1.2? Will there be a new unstable libjpeg-turbo release soon which could be tested by linux distro maintainers on ARM?

    And there are still many ARM NEON optimizations which can be additionally applied.

     
  • DRC

    DRC - 2011-05-10

    LJT 1.2 is slated for early 2012, but the beta may land this Fall if I can finish the re-licensing work in a shorter timeframe.

     
  • DRC

    DRC - 2014-03-27
    • Status: closed --> closed-integrated
     
