Menu

#54 Armv8 neon port for libjpeg-turbo decoder

closed-integrated
DRC
None
1
2014-08-14
2013-10-29
Ragesh
No

Attached patches contains Armv8 neon Port for following functionalities and changes to configuraton and Makefiles to integrate Armv8 neon port to libjpeg-turbo

  1. Idct Slow.
  2. Idct Fast.
  3. Idct 2x2
  4. Idct 4x4
  5. Yuv to RGB conversion
1 Attachments

Related

Patches (closed - use GitHub): #54

Discussion

  • Ragesh

    Ragesh - 2013-10-29

    Missing attachments adding all 9 patches in a single archive file

     
  • DRC

    DRC - 2013-11-01

    Couple of questions/comments:

    -- Are there further optimizations planned, or will these be the only ones? The reason I ask is more for strategic purposes in planning the release of libjpeg-turbo 1.4. If a full ARM 64-bit implementation is planned, then I would like to hold off the release until it is ready. There is plenty of time, since there are currently other features in the works for 1.4 that will take months to implement.

    -- I assume that 'make test' passes?

    -- I will take care of it this time, but if you have any future patches, please submit them against trunk, which is where the new features are landing.

    -- I am attaching two test scripts which can be used along with the canonical libjpeg-turbo test images (see http://www.libjpeg-turbo.org/About/Performance) to generate numbers that can be copied & pasted into a spreadsheet template (also attached.) If you could run these numbers both with and without the ARM 64-bit SIMD code, that would help me greatly. It gives me a feel for how well the new SIMD code is accelerating things, and it allows me to make valid performance claims about it (such as "it's X amount faster, etc.")

     
  • Ragesh

    Ragesh - 2013-11-07

    Thank you for your comments.

    Further optimizations are not yet planned as such and I believe that it
    is possible to optimize further in terms of data packing but skeptical
    if it can really make a big difference.

    'make Test' all test cases passes.

    Regarding performance, as previously discussed the whole development and
    testing was on an ARM simulator and unfortunately we don't have any
    64bit ARM h/w boards still to test performance. I guess the perf figures
    will not make any sense on simulators.

     

    Last edit: Ragesh 2013-11-07
  • DRC

    DRC - 2013-11-07

    OK, thanks for the info. I guess the burning question in my mind is with regards to the iPhone 5S, which is where the demand for this feature primarily lives at the moment. It seems like it wouldn't be beneficial in a performance sense for an iPhone developer to switch from 32-bit ARM v7 code to 64-bit ARM v8 code until the v8 code has full SIMD acceleration. Otherwise, since the v8 version of the code would have to fall back to C for doing convsamp, quantize, fast DCT, and upsample operations, it seems like it wouldn't be as fast as the existing 32-bit v7 code. Correct me if I'm wrong, though.

    It seems to me that it might be better to sit on this until someone in the community can fill in the missing pieces, which will probably come after there is actual hardware available to test. I'd also really like to see side-by-side benchmarks relative to v7, mainly so I can validate or adjust the performance claims I make in the documentation.

     
  • Siarhei Siamashka

    As an experiment, I tried to import the code to github (in linaro-armv8-20131014 branch) and add some comments for the first patch: https://github.com/ssvb/libjpeg-turbo-linaro-armv8/commit/4951d43388a6b37ce3b9aaa7bef0fa9b82c69abd

    I'm most worried about the simulator issues and the RTSM_SQSHRN_SIM_ISSUE ifdef. Can we really trust this tool? Does the code under RTSM_SQSHRN_SIM_ISSUE ifdef (which is currently defined by default in this patch) work correctly on real 64-bit ARM hardware?

    Also do we know anything about the 64-bit iOS assembler syntax? The 32-bit ARM code had to use https://github.com/yuvi/gas-preprocessor

     
  • Ragesh

    Ragesh - 2013-11-08

    RTSM_SQSHRN_SIM_ISSUE, I am not really sure if this is the issue with simulator or it can be the way the saturation instructions work in armv8. In Armv7 the saturation boundary is always 8bit even if you specify boundary to 16bit so I am skeptical how this instruciton will finally work on 64bit h/w. On simulator disabling this macro image reconstruction will have some noise around saturation color.

     
  • Thomas Gall

    Thomas Gall - 2013-11-08

    Hi DRC,

    On Thu, Nov 7, 2013 at 11:28 AM, DRC dcommander@users.sf.net wrote:

    OK, thanks for the info. I guess the burning question in my mind is with
    regards to the iPhone 5S, which is where the demand for this feature
    primarily lives at the moment. It seems like it wouldn't be beneficial in a
    performance sense for an iPhone developer to switch from 32-bit ARM v7 code
    to 64-bit ARM v8 code until the v8 code has full SIMD acceleration.
    Otherwise, since the v8 version of the code would have to fall back to C for
    doing convsamp, quantize, fast DCT, and upsample operations, it seems like
    it wouldn't be as fast as the existing 32-bit v7 code. Correct me if I'm
    wrong, though.

    While I'm sure the A7 based iPads and iPhones are interesting, from a
    Linaro point of view, that's not something that we work with. The
    patches that we've posted are for armv8 and of no surprise for Linux
    to start and will be of use someday on Android.

    It seems to me that it might be better to sit on this until someone in the
    community can fill in the missing pieces, which will probably come after
    there is actual hardware available to test.

    I'm a bit taken aback that you would seem to suggest that these
    patches can't go forward until the iOS side of things are worked out
    for armv8. I appreciate the interest in iOS but support for iOS &
    armv8 should be left to someone who is maintaining libjpeg-turbo on
    iOS. We, Linaro, do not want to advance patches that break iOS but
    likewise I don't think it's reasonable to expect us to take on iOS.
    (at a personal level my arm could be easily twisted since I work with
    iOS but that's a different topic)

    I'd also really like to see
    side-by-side benchmarks relative to v7, mainly so I can validate or adjust
    the performance claims I make in the documentation.

    When it comes to armv8 hardware or emulators, it comes down to the following.

    Simulators: ARM models, qemu
    Server etc hardware: It's coming, 'nuff said.
    Apple's A7 based hardware iOS only and requires a full apple developer
    license to deploy to hardware

    For linux and distro support, already debian, ubuntu, open embedded,
    fedora have builds for armv8 that runs on the various models. Across a
    wide community we all want to see past optimized packages like
    libjpeg-turbo be ready when the real hardware shows up.

    It makes little sense to do performance measurements on any of the
    Simulators for ARMv8. That day will come. The models available aren't
    cycle accurate. Comparing to armv7 on say an A15 to a simulator
    running on an intel box makes no sense. While the ported optimizations
    SHOULD yield good performance, we all know and accept that some
    tweaking might have to be done when the real hardware starts to show
    up.

    Waiting for real hardware just motivates people to get excited when
    their bright shiny hardware shows up and the performance sucks rocks.
    We're working on this and other packages before real hardware so that
    from the get go performance should be reasonable. We accept there
    might be a rework or two. I'd rather have a 90% solution on Day 1 then
    a 0% solution and a long wait because this and other packages haven't
    even been touched.

    Hopefully that sounds reasonable.

    Thanks!
    Tom


    [patches:#54] Armv8 neon port for libjpeg-turbo decoder

    Status: open
    Created: Tue Oct 29, 2013 10:23 PM UTC by Ragesh
    Last Updated: Fri Nov 01, 2013 05:09 AM UTC
    Owner: DRC

    Attached patches contains Armv8 neon Port for following functionalities and
    changes to configuraton and Makefiles to integrate Armv8 neon port to
    libjpeg-turbo

    Idct Slow.
    Idct Fast.
    Idct 2x2
    Idct 4x4
    Yuv to RGB conversion


    Sent from sourceforge.net because you indicated interest in
    https://sourceforge.net/p/libjpeg-turbo/patches/54/

    To unsubscribe from further messages, please visit
    https://sourceforge.net/auth/subscriptions/

    --
    Regards,
    Tom

    "Where's the kaboom!? There was supposed to be an earth-shattering
    kaboom!" Marvin Martian
    Tech Lead, Graphics Working Group | Linaro.org │ Open source software
    for ARM SoCs
    w) tom.gall att linaro.org
    h) tom_gall att mac.com

     

    Related

    Patches (closed - use GitHub): #54

  • DRC

    DRC - 2013-11-09

    Tom, I said nothing of the sort. Please bear in mind that I am not an expert on this. I'm doing my dead-level best to understand it, but seeing as how I don't get paid for any work I do with ARM code, I cannot really dig into it or fully engage without starving. I am feeling my way through as best I can and putting out ideas.

    No one is asking Linaro to take on iOS support. I take on that support myself and provide iOS binaries because they are easy for me to build (I am primarily a Mac user), and it's the fastest way to get the code in front of developers and testers (since iOS developers link statically with libjpeg-turbo rather than relying on a system distribution of it.) The reality of the situation is that we have iPhone developers who can test the ARM v8 code right now, on real hardware, and give valuable feedback as to its performance and stability.

    I never proposed waiting for iOS to work out ARM v8. There's no need. They've already worked it out. My concern is more that I do not want to ship an incomplete SIMD implementation for ARM v8, because there's a very good chance that it will be slower than ARM v7, and that makes both us and the ARM v8 platform look bad. I would gladly check this in today if it were a complete implementation, which would give me some reassurance that at least it would not regress relative to ARM v7 in terms of performance. I will gladly put this in front of developers as soon as it's complete. Before I am comfortable releasing it into beta, though, I need to be able to put my finger on exactly how much speedup this feature gives relative to plain C code, and how much speedup it gives relative to ARM v7 code.

    It has always been the position of this project to treat performance as a measure of quality, so I regression test every single release to make sure it has not lost any performance relative to the previous one. When I introduce a new feature, I make it a point to fully understand its performance advantages, because that is equal in my mind to understanding the feature's quality. So, basically, without the ability to benchmark this code, I do not know if there are performance "bugs" lurking within it. I also generally try to avoid releasing features that I have to apologize for in some way. It makes things difficult when you have to say, "well, this feature is better than nothing if you insist on using 64-bit code, but you're still better off on the whole using 32-bit code." It's just easier to say "we don't support ARM 64-bit yet. Stand by." I am all about getting people the fastest possible JPEG performance they can achieve on their particular platform.

     
  • Thomas Gall

    Thomas Gall - 2013-11-09

    Hi DRC

    On Fri, Nov 8, 2013 at 6:10 PM, DRC dcommander@users.sf.net wrote:

    Tom, I said nothing of the sort.

    My fault for the crossed wires. Apologies on my part.

    Please bear in mind that I am not an expert
    on this. I'm doing my dead-level best to understand it, but seeing as how I
    don't get paid for any work I do with ARM code, I cannot really dig into it
    or fully engage without starving. I am feeling my way through as best I can
    and putting out ideas.

    Don't want you starving!

    No one is asking Linaro to take on iOS support. I take on that support
    myself and provide iOS binaries because they are easy for me to build (I am
    primarily a Mac user),

    Me too.

    and it's the fastest way to get the code in front of
    developers and testers (since iOS developers link statically with
    libjpeg-turbo rather than relying on a system distribution of it.) The
    reality of the situation is that we have iPhone developers who can test the
    ARM v8 code right now, on real hardware, and give valuable feedback as to
    its performance and stability.

    Yes tho I'd be concerned with the differences between gcc and llvm.
    Apple's obviously put a lot of time into llvm so I'm sure their llvm
    is much better than the community llvm yet.

    I never proposed waiting for iOS to work out ARM v8. There's no need.
    They've already worked it out. My concern is more that I do not want to ship
    an incomplete SIMD implementation for ARM v8, because there's a very good
    chance that it will be slower than ARM v7, and that makes both us and the
    ARM v8 platform look bad. I would gladly check this in today if it were a
    complete implementation, which would give me some reassurance that at least
    it would not regress relative to ARM v7 in terms of performance. I will
    gladly put this in front of developers as soon as it's complete. Before I am
    comfortable releasing it into beta, though, I need to be able to put my
    finger on exactly how much speedup this feature gives relative to plain C
    code, and how much speedup it gives relative to ARM v7 code.

    I wouldn't recommend putting this into a stable libjpeg-turbo yet.
    However getting performance comparisons is only possible on Apple
    hardware. A53, A57 processors just aren't released in real hardware.
    Like I said, we have simulators that's it.

    Still! NEON on v7 and v8 didn't dramatically change. It did change in
    ways that don't allow v7 NEON to "just work" with a recompile on v8.

    It has always been the position of this project to treat performance as a
    measure of quality, so I regression test every single release to make sure
    it has not lost any performance relative to the previous one. When I
    introduce a new feature, I make it a point to fully understand its
    performance advantages, because that is equal in my mind to understanding
    the feature's quality. So, basically, without the ability to benchmark this
    code, I do not know if there are performance "bugs" lurking within it.

    I don't think we can establish a performance baseline as of yet for
    which any performance "bug" would be or could be considered valid for
    the time being. When we've real hardware in hand, you can rest assured
    we'll be all over it and issue patches. Thing is, I want to avoid a
    chicken and egg problem. Holding back on performance updates until
    real hardware basically will put libjpeg-turbo to the back of the line
    for armv8.

    I also generally try to avoid releasing features that I have to apologize for
    in some way. It makes things difficult when you have to say, "well, this
    feature is better than nothing if you insist on using 64-bit code, but
    you're still better off on the whole using 32-bit code."

    I don't think that is the case. It's just not measurable in a
    meaningful way at this point in time.

    It's just easier to
    say "we don't support ARM 64-bit yet. Stand by."

    I don't think that's viable. We do support it. Lots of distros do too.
    libjpeg-turbo is far from the first or last package that is being
    optimized for armv8.

    I am all about getting
    people the fastest possible JPEG performance they can achieve on their
    particular platform.

    For armv8 linux, that's a simulator for now and hardware in the not
    too distant future.

    Regards,
    Tom


    [patches:#54] Armv8 neon port for libjpeg-turbo decoder

    Status: open
    Created: Tue Oct 29, 2013 10:23 PM UTC by Ragesh
    Last Updated: Fri Nov 08, 2013 04:36 PM UTC
    Owner: DRC

    Attached patches contains Armv8 neon Port for following functionalities and
    changes to configuraton and Makefiles to integrate Armv8 neon port to
    libjpeg-turbo

    Idct Slow.
    Idct Fast.
    Idct 2x2
    Idct 4x4
    Yuv to RGB conversion


    Sent from sourceforge.net because you indicated interest in
    https://sourceforge.net/p/libjpeg-turbo/patches/54/

    To unsubscribe from further messages, please visit
    https://sourceforge.net/auth/subscriptions/

    --
    Regards,
    Tom

    "Where's the kaboom!? There was supposed to be an earth-shattering
    kaboom!" Marvin Martian
    Tech Lead, Graphics Working Group | Linaro.org │ Open source software
    for ARM SoCs
    w) tom.gall att linaro.org
    h) tom_gall att mac.com

     

    Related

    Patches (closed - use GitHub): #54

    • DRC

      DRC - 2013-11-26

      Hopefully SF has finally fixed the bug that was preventing me from replying, so here goes:

      Fair enough, but can you expand on why you believe that the partial ARM v8 implementation will be as fast as or faster than the full ARM v7 implementation? I don't see how that's possible, unless there is a performance penalty for running v7 code on v8 devices.

      In any sense, it seems like the easiest way to make everyone happy is to just finish the implementation.

       
  • Siarhei Siamashka

    @Ragesh

    "RTSM_SQSHRN_SIM_ISSUE, I am not really sure if this is the issue with simulator or it can be the way the saturation instructions work in armv8."

    It is always a good idea to explicitly make clear what are the speculations and what are the facts (presenting speculations as facts can be confusing). My concern is that we want this NEON code to work correctly on the armv8 hardware when such hardware becomes available. Now you have a selection between two conditionally compiled chunks of code (selected by RTSM_SQSHRN_SIM_ISSUE define) and no real confidence in either of them.

    In Armv7 the saturation boundary is always 8bit even if you specify boundary to 16bit so I am skeptical how this instruciton will finally work on 64bit h/w. On simulator disabling this macro image reconstruction will have some noise around saturation color.

    Do you have the "ARM Architecture Reference Manual, ARMv8, for ARMv8-A architecture profile" ( http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487a.a_errata1/index.html )? It can be actually used to check if the behaviour of the simulator makes sense for the questionable instructions.

    @Tom

    Waiting for real hardware just motivates people to get excited when
    their bright shiny hardware shows up and the performance sucks rocks.

    These motivated people also would not like to have buggy/broken software on their shiny new hardware (just because the developers of the armv8 optimizations did not have any possibility to test it properly prior to rolling the code out on the customers).

    We're working on this and other packages before real hardware so that
    from the get go performance should be reasonable.

    Is there no real way for Linaro developers to get early access to some 64-bit ARM hardware? Maybe negotiate this with ARM somehow? The current situation is less than perfect.

    The iOS compatibility is only mentioned here because the 64-bit iPhone is already out there. And might or might not be usable for confirming the 64-bit ARM assembly code performance & correctness. If iOS is not practically usable for this purpose, then we are better to find some other way.

     
  • DRC

    DRC - 2014-02-05

    I have provisionally checked the patch into SVN trunk, with some extensive formatting changes to the assembly file to make it more closely match the existing NEON assembly file and to improve general readability. I also modified the patch to configure.ac slightly.

    I can confirm that this code does not compile for iOS at all. iOS developers who may be reading this will have to tell me why. Per previous comments, I mention this only because that platform is the only way we might be able to currently verify the performance of this code.

    I normally don't check experimental code into the libjpeg-turbo code base unless it's in a branch, but in this case, the code was isolated enough that it should not affect stability of the overall project. However, I will not document it as a feature until it has been peer reviewed for stability and performance.

    I consider the code to be incomplete currently, because it lacks full coverage of the SIMD functions that are covered in the ARMv7 code. Further, it seems to use multiple instructions in several cases to take the place of a single instruction in the ARMv7 code. Both of those aspects of the code make me suspect that it might not perform as well as an ARMv7 binary running on the same platform (if such is even possible.)

    Please clarify the purpose of the changes you made to jsimd_arm.c. They seem to leave out significant portions of the SIMD code when running in the simulator. I did not check in those changes yet.

    I will leave this issue open until the feature has been fully accepted, reviewed, and documented.

     
    • Ragesh

      Ragesh - 2014-02-05

      I have only submitted below functions for armv8 neon accelerations

      1. Idct Slow.
      2. Idct Fast.
      3. Idct 2x2
      4. Idct 4x4
      5. Yuv to RGB conversion

      These are mostly decoder functions and the reason to this is when I
      initially started armv8 porting I could find neon armv7 accelerations to
      decoder functions only.
      Now I can see more functions related to encoder, which I have to start
      porting to armv8 and I will try to complete this soon.

      On 5 February 2014 14:18, DRC dcommander@users.sf.net wrote:

      I have provisionally checked the patch into SVN trunk, with some extensive
      formatting changes to the assembly file to make it more closely match the
      existing NEON assembly file and to improve general readability. I also
      modified the patch to configure.ac slightly.

      I can confirm that this code does not compile for iOS at all. iOS
      developers who may be reading this will have to tell me why. Per previous
      comments, I mention this only because that platform is the only way we
      might be able to currently verify the performance of this code.

      I normally don't check experimental code into the libjpeg-turbo code base
      unless it's in a branch, but in this case, the code was isolated enough
      that it should not affect stability of the overall project. However, I will
      not document it as a feature until it has been peer reviewed for stability
      and performance.

      I consider the code to be incomplete currently, because it lacks full
      coverage of the SIMD functions that are covered in the ARMv7 code. Further,
      it seems to use multiple instructions in several cases to take the place of
      a single instruction in the ARMv7 code. Both of those aspects of the code
      make me suspect that it might not perform as well as an ARMv7 binary
      running on the same platform (if such is even possible.)

      Please clarify the purpose of the changes you made to jsimd_arm.c. They
      seem to leave out significant portions of the SIMD code when running in the
      simulator. I did not check in those changes yet.

      The changes in jsimd_arm.c will mask all the neon encoder functions on
      armv8 as there is no neon implementation for these routines in armv8.
      ARMV8_NEON_SIMULATION macro is hard coded to mask the encoder
      functions.

      I will leave this issue open until the feature has been fully accepted,

      reviewed, and documented.

      Status: open
      Created: Tue Oct 29, 2013 10:23 PM UTC by Ragesh
      Last Updated: Thu Jan 30, 2014 08:33 PM UTC
      Owner: DRC

      Attached patches contains Armv8 neon Port for following functionalities
      and changes to configuraton and Makefiles to integrate Armv8 neon port to
      libjpeg-turbo

      1. Idct Slow.
      2. Idct Fast.
      3. Idct 2x2
      4. Idct 4x4
      5. Yuv to RGB conversion

      Sent from sourceforge.net because you indicated interest in
      https://sourceforge.net/p/libjpeg-turbo/patches/54/

      To unsubscribe from further messages, please visit
      https://sourceforge.net/auth/subscriptions/

       

      Related

      Patches (closed - use GitHub): #54

  • DRC

    DRC - 2014-02-05

    OK, the correct approach is actually to create a separate file-- jsimd_arm64.c-- for ARMv8 and stub out the unimplemented routines. I have done this and checked the new file into trunk, but you will have to verify that I haven't broken anything.

    Can you answer my question regarding whether ARMv7 binaries will run on an ARMv8 platform?

    Siarhei, your comments on this new code would be much appreciated.

     
    • Siarhei Siamashka

      Siarhei, your comments on this new code would be much appreciated.

      Just based on how it looks, there are many places in the code that could be implemented somewhat better. My older comments for the armv8 idct slow function are here: https://github.com/ssvb/libjpeg-turbo-linaro-armv8/commit/4951d43388a6b37c

      But I understand that without having real armv8 hardware readily available, proper performance tuning is hard and impractical. Asking to fix all of these potential cases of suboptimal code would be unreasonable. We just have to revisit this when linux capable armv8 hardware becomes more common.

      However I still don't like the RTSM_SQSHRN_SIM_ISSUE define very much. I would surely feel less paranoid if we had some information about the results of libjpeg-turbo 'make test' in the following cases:
      a) with RTSM_SQSHRN_SIM_ISSUE define in the simulator
      b) without RTSM_SQSHRN_SIM_ISSUE define in the simulator
      c) with RTSM_SQSHRN_SIM_ISSUE define on real hardware
      d) without RTSM_SQSHRN_SIM_ISSUE define on real hardware

      There should be no problem to run a) and b) tests even now to have a definite confirmation. Also when I was visiting FOSDEM2014 on the last weekend, I had a chance to talk with some Linaro people. And unless I misunderstood something, appears that Linaro currently has at least a few X-GENE boxes with shared ssh access. Was there any attempt made to get access to this hardware for libjpeg-turbo testing? I would guess that something like half an hour would be enough for quickly running 'make test'.

       
  • DRC

    DRC - 2014-03-27
    • status: open --> open-pending-external
     
  • Ragesh

    Ragesh - 2014-07-22

    Recently I tested 64 bit neon code on real hardware and some of the assumptions which I had while testing on simulator got resolved. I am attaching 4 patches along with performance figures after testing it on aarch64 hardware.

    The test was conducted on a low profile board with gcc-4.9 toolchain.

     
  • Siarhei Siamashka

    Thanks for the update.

    Subject: [PATCH 1/4] Armv8 specific runtime check changes

    Aarch64 by default supports neon and runtime check is
    no more required.

    Looks OK to me. Just the comment

    "ARMV8 architectures supports neon extenstion by default"

    has a typo and handling the JSIMD_FORCENEON environment variable now becomes unnecessary.

    Subject: [PATCH 2/4] RTSM macro removed

    RTSM macro removed after testing the code on aarch64 hardware

    So the inactive code path gets removed and this macro was never really necessary (no simulator issue in the first place)?

    Subject: [PATCH 3/4] IDCT slow zero coeffecient check changes

    Zero coeffecient check section of code was not working properly
    through 64 bit registers, now coeffecient fetch is through 32 bit
    registers

    "Was not working" in what way? Was it failing "make test"? And can we actually rewrite this code to do correct 64-bit fetch? This should require less instructions for the same job.

    Subject: [PATCH 4/4] Save all neon register getting used in the routine

    GCC aarch64 4.9 tool chain generated code uses extension register
    including neon through out instead of stack. This makes
    gcc 4.9 much faster in performance and the only side effect of this
    hand coded neon assembly has to preserve all neon register.

    This description does not make much sense (sounds like you are putting blame on GCC 4.9). In other words, are you saying that the current code in SVN is just violating the AArch64 calling conventions by failing to properly save/restore NEON registers? The ABI must be always respected no matter what (and if there are any other violations, they need to be fixed too).

    Please explicitly confirm that the "make test" succeeds after all these changes are applied.

    Also it would be very interesting to compare the performance between the same libjpeg-turbo compiled in 32-bit mode vs. 64-bit mode on the same hardware.

     
    • Ragesh

      Ragesh - 2014-07-23

      Thank for the valuable review comments,

      1.> Aarch64 by default supports neon and runtime check is no more required.

      Looks OK to me. Just the comment

      "ARMV8 architectures supports neon extenstion by default"

      has a typo and handling the JSIMD_FORCENEON environment variable now
      becomes unnecessary.

      ----Ans. Even I thought of this but it may help to test/compare neon and
      non-neon versions.

      2.> RTSM macro removed after testing the code on aarch64 hardware

      So the inactive code path gets removed and this macro was never really
      necessary (no simulator issue in the first place)?

      --Ans. Particularly the saturation instructions behave differently in Armv8
      and this was verified on real h/w. I was actually suspecting the simulator
      for the behavior but now things are settled with h/w.

      3> Zero coeffecient check section of code was not working properly
      through 64 bit registers, now coeffecient fetch is through 32 bit
      registers.
      "Was not working" in what way? Was it failing "make test"?

      --The earlier aarch64 zero coeff block code functionality was not verfied
      properly. Make test works well with the earlier code as well but the
      earlier version used to fetch 2 adjacent coeffecient and compare due to
      8byte fetch( 64 bit registers), now its reduced to 4 byte fetch using 32
      bit registers and behaves alike armv7 zero coeff check logic.

      --Ans. And can we actually rewrite this code to do correct 64-bit fetch?
      This should require less instructions for the same job.
      I am not very sure but i hope ldp is doing the 64 bit fetch and writing to
      two consecutive 32 bit registers. so overall I guess we are not losing any
      cycles here. Kindly suggest if you have any better idea.

      4> Save all neon register getting used in the routine
      GCC aarch64 4.9 tool chain generated code uses extension register including
      neon through out instead of stack. This makes gcc 4.9 much faster in
      performance and the only side effect of this hand coded neon assembly has
      to preserve all neon register.

      This description does not make much sense (sounds like you are putting
      blame on GCC 4.9). In other words, are you saying that the current code in
      SVN is just violating the AArch64 calling conventions by failing to
      properly save/restore NEON registers? The ABI must be always respected no
      matter what (and if there are any other violations, they need to be fixed
      too).

      --Ans. Sorry for the confusion on this I will explain it in detail.

      The same code in the svn works well on GCC 4.8 but fails on GCC 4.9 and i
      had spent considerable amount of time suspecting many areas including the
      tjbench code but finally found in the dis-assembly that gcc 4.9 just uses
      extension registers thought out. Interestingly these extension registers
      helps in store/restore of General purpose registers and replaces the
      storage of General purpose registers on stack and I guess this makes gcc
      4.9 generated code perform much faster than the previous version of gcc.

      Please explicitly confirm that the "make test" succeeds after all these
      changes are applied.

      Also it would be very interesting to compare the performance between the
      same libjpeg-turbo compiled in 32-bit mode vs. 64-bit mode on the same
      hardware.

      On 23 July 2014 06:04, Siarhei Siamashka ssvb@users.sf.net wrote:

      Thanks for the update.

      Subject: [PATCH 1/4] Armv8 specific runtime check changes

      Aarch64 by default supports neon and runtime check is
      no more required.

      Looks OK to me. Just the comment

      "ARMV8 architectures supports neon extenstion by default"

      has a typo and handling the JSIMD_FORCENEON environment variable now
      becomes unnecessary.

      Subject: [PATCH 2/4] RTSM macro removed

      RTSM macro removed after testing the code on aarch64 hardware

      So the inactive code path gets removed and this macro was never really
      necessary (no simulator issue in the first place)?

      Subject: [PATCH 3/4] IDCT slow zero coeffecient check changes

      Zero coeffecient check section of code was not working properly
      through 64 bit registers, now coeffecient fetch is through 32 bit
      registers

      "Was not working" in what way? Was it failing "make test"? And can we
      actually rewrite this code to do correct 64-bit fetch? This should require
      less instructions for the same job.

      Subject: [PATCH 4/4] Save all neon register getting used in the routine

      GCC aarch64 4.9 tool chain generated code uses extension register
      including neon through out instead of stack. This makes
      gcc 4.9 much faster in performance and the only side effect of this
      hand coded neon assembly has to preserve all neon register.

      This description does not make much sense (sounds like you are putting
      blame on GCC 4.9). In other words, are you saying that the current code in
      SVN is just violating the AArch64 calling conventions by failing to
      properly save/restore NEON registers? The ABI must be always respected no
      matter what (and if there are any other violations, they need to be fixed
      too).

      Please explicitly confirm that the "make test" succeeds after all these
      changes are applied.

      Also it would be very interesting to compare the performance between the
      same libjpeg-turbo compiled in 32-bit mode vs. 64-bit mode on the same
      hardware.


      Status: open-pending-external

      Created: Tue Oct 29, 2013 10:23 PM UTC by Ragesh
      Last Updated: Tue Jul 22, 2014 11:04 AM UTC
      Owner: DRC

      Attached patches contains Armv8 neon Port for following functionalities
      and changes to configuraton and Makefiles to integrate Armv8 neon port to
      libjpeg-turbo

      1. Idct Slow.
      2. Idct Fast.
      3. Idct 2x2
      4. Idct 4x4
      5. Yuv to RGB conversion

      Sent from sourceforge.net because you indicated interest in
      https://sourceforge.net/p/libjpeg-turbo/patches/54/

      To unsubscribe from further messages, please visit
      https://sourceforge.net/auth/subscriptions/

       

      Related

      Patches (closed - use GitHub): #54

  • DRC

    DRC - 2014-07-23

    Patches have been checked in, with some formatting tweaks.

     
  • DRC

    DRC - 2014-08-13

    Closing this as implemented. Please open a new patch if you have any modifications or new accelerations. In general, I am still soft-pedaling this feature until it accelerates the same set of functions that are accelerated with ARMv7 and until we can successfully build the code for iOS.

     
  • DRC

    DRC - 2014-08-13
    • status: open-pending-external --> closed-integrated
     
Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.