libjpeg-turbo / Patches (closed - use GitHub) / #54 Armv8 neon port for libjpeg-turbo decoder

Ragesh - 2013-10-29

Missing attachments adding all 9 patches in a single archive file

Armv8-port-patches.tar.gz

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DRC - 2013-11-01

Couple of questions/comments:

-- Are there further optimizations planned, or will these be the only ones? The reason I ask is more for strategic purposes in planning the release of libjpeg-turbo 1.4. If a full ARM 64-bit implementation is planned, then I would like to hold off the release until it is ready. There is plenty of time, since there are currently other features in the works for 1.4 that will take months to implement.

-- I assume that 'make test' passes?

-- I will take care of it this time, but if you have any future patches, please submit them against trunk, which is where the new features are landing.

-- I am attaching two test scripts which can be used along with the canonical libjpeg-turbo test images (see http://www.libjpeg-turbo.org/About/Performance) to generate numbers that can be copied & pasted into a spreadsheet template (also attached.) If you could run these numbers both with and without the ARM 64-bit SIMD code, that would help me greatly. It gives me a feel for how well the new SIMD code is accelerating things, and it allows me to make valid performance claims about it (such as "it's X amount faster, etc.")

dotest.accuratedct

dotest.fastdct

libjpegturbo-template.ods

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ragesh - 2013-11-07

Thank you for your comments.

Further optimizations are not yet planned as such and I believe that it
is possible to optimize further in terms of data packing but skeptical
if it can really make a big difference.

'make Test' all test cases passes.

Regarding performance, as previously discussed the whole development and
testing was on an ARM simulator and unfortunately we don't have any
64bit ARM h/w boards still to test performance. I guess the perf figures
will not make any sense on simulators.

Last edit: Ragesh 2013-11-07

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DRC - 2013-11-07

OK, thanks for the info. I guess the burning question in my mind is with regards to the iPhone 5S, which is where the demand for this feature primarily lives at the moment. It seems like it wouldn't be beneficial in a performance sense for an iPhone developer to switch from 32-bit ARM v7 code to 64-bit ARM v8 code until the v8 code has full SIMD acceleration. Otherwise, since the v8 version of the code would have to fall back to C for doing convsamp, quantize, fast DCT, and upsample operations, it seems like it wouldn't be as fast as the existing 32-bit v7 code. Correct me if I'm wrong, though.

It seems to me that it might be better to sit on this until someone in the community can fill in the missing pieces, which will probably come after there is actual hardware available to test. I'd also really like to see side-by-side benchmarks relative to v7, mainly so I can validate or adjust the performance claims I make in the documentation.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Siarhei Siamashka - 2013-11-08

As an experiment, I tried to import the code to github (in linaro-armv8-20131014 branch) and add some comments for the first patch: https://github.com/ssvb/libjpeg-turbo-linaro-armv8/commit/4951d43388a6b37ce3b9aaa7bef0fa9b82c69abd

I'm most worried about the simulator issues and the RTSM_SQSHRN_SIM_ISSUE ifdef. Can we really trust this tool? Does the code under RTSM_SQSHRN_SIM_ISSUE ifdef (which is currently defined by default in this patch) work correctly on real 64-bit ARM hardware?

Also do we know anything about the 64-bit iOS assembler syntax? The 32-bit ARM code had to use https://github.com/yuvi/gas-preprocessor

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ragesh - 2013-11-08

RTSM_SQSHRN_SIM_ISSUE, I am not really sure if this is the issue with simulator or it can be the way the saturation instructions work in armv8. In Armv7 the saturation boundary is always 8bit even if you specify boundary to 16bit so I am skeptical how this instruciton will finally work on 64bit h/w. On simulator disabling this macro image reconstruction will have some noise around saturation color.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Thomas Gall - 2013-11-08

Hi DRC,

On Thu, Nov 7, 2013 at 11:28 AM, DRC dcommander@users.sf.net wrote:

OK, thanks for the info. I guess the burning question in my mind is with
regards to the iPhone 5S, which is where the demand for this feature
primarily lives at the moment. It seems like it wouldn't be beneficial in a
performance sense for an iPhone developer to switch from 32-bit ARM v7 code
to 64-bit ARM v8 code until the v8 code has full SIMD acceleration.
Otherwise, since the v8 version of the code would have to fall back to C for
doing convsamp, quantize, fast DCT, and upsample operations, it seems like
it wouldn't be as fast as the existing 32-bit v7 code. Correct me if I'm
wrong, though.

While I'm sure the A7 based iPads and iPhones are interesting, from a
Linaro point of view, that's not something that we work with. The
patches that we've posted are for armv8 and of no surprise for Linux
to start and will be of use someday on Android.

It seems to me that it might be better to sit on this until someone in the
community can fill in the missing pieces, which will probably come after
there is actual hardware available to test.

I'm a bit taken aback that you would seem to suggest that these
patches can't go forward until the iOS side of things are worked out
for armv8. I appreciate the interest in iOS but support for iOS &
armv8 should be left to someone who is maintaining libjpeg-turbo on
iOS. We, Linaro, do not want to advance patches that break iOS but
likewise I don't think it's reasonable to expect us to take on iOS.
(at a personal level my arm could be easily twisted since I work with
iOS but that's a different topic)

I'd also really like to see
side-by-side benchmarks relative to v7, mainly so I can validate or adjust
the performance claims I make in the documentation.

When it comes to armv8 hardware or emulators, it comes down to the following.

Simulators: ARM models, qemu
Server etc hardware: It's coming, 'nuff said.
Apple's A7 based hardware iOS only and requires a full apple developer
license to deploy to hardware

For linux and distro support, already debian, ubuntu, open embedded,
fedora have builds for armv8 that runs on the various models. Across a
wide community we all want to see past optimized packages like
libjpeg-turbo be ready when the real hardware shows up.

It makes little sense to do performance measurements on any of the
Simulators for ARMv8. That day will come. The models available aren't
cycle accurate. Comparing to armv7 on say an A15 to a simulator
running on an intel box makes no sense. While the ported optimizations
SHOULD yield good performance, we all know and accept that some
tweaking might have to be done when the real hardware starts to show
up.

Waiting for real hardware just motivates people to get excited when
their bright shiny hardware shows up and the performance sucks rocks.
We're working on this and other packages before real hardware so that
from the get go performance should be reasonable. We accept there
might be a rework or two. I'd rather have a 90% solution on Day 1 then
a 0% solution and a long wait because this and other packages haven't
even been touched.

Hopefully that sounds reasonable.

Thanks!
Tom

[patches:#54] Armv8 neon port for libjpeg-turbo decoder

Status: open
Created: Tue Oct 29, 2013 10:23 PM UTC by Ragesh
Last Updated: Fri Nov 01, 2013 05:09 AM UTC
Owner: DRC

Attached patches contains Armv8 neon Port for following functionalities and
changes to configuraton and Makefiles to integrate Armv8 neon port to
libjpeg-turbo

Idct Slow.
Idct Fast.
Idct 2x2
Idct 4x4
Yuv to RGB conversion

Sent from sourceforge.net because you indicated interest in
https://sourceforge.net/p/libjpeg-turbo/patches/54/

To unsubscribe from further messages, please visit
https://sourceforge.net/auth/subscriptions/

--
Regards,
Tom

"Where's the kaboom!? There was supposed to be an earth-shattering
kaboom!" Marvin Martian
Tech Lead, Graphics Working Group | Linaro.org │ Open source software
for ARM SoCs
w) tom.gall att linaro.org
h) tom_gall att mac.com

Related

Patches (closed - use GitHub): ~~#54~~

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DRC - 2013-11-09

Tom, I said nothing of the sort. Please bear in mind that I am not an expert on this. I'm doing my dead-level best to understand it, but seeing as how I don't get paid for any work I do with ARM code, I cannot really dig into it or fully engage without starving. I am feeling my way through as best I can and putting out ideas.

No one is asking Linaro to take on iOS support. I take on that support myself and provide iOS binaries because they are easy for me to build (I am primarily a Mac user), and it's the fastest way to get the code in front of developers and testers (since iOS developers link statically with libjpeg-turbo rather than relying on a system distribution of it.) The reality of the situation is that we have iPhone developers who can test the ARM v8 code right now, on real hardware, and give valuable feedback as to its performance and stability.

I never proposed waiting for iOS to work out ARM v8. There's no need. They've already worked it out. My concern is more that I do not want to ship an incomplete SIMD implementation for ARM v8, because there's a very good chance that it will be slower than ARM v7, and that makes both us and the ARM v8 platform look bad. I would gladly check this in today if it were a complete implementation, which would give me some reassurance that at least it would not regress relative to ARM v7 in terms of performance. I will gladly put this in front of developers as soon as it's complete. Before I am comfortable releasing it into beta, though, I need to be able to put my finger on exactly how much speedup this feature gives relative to plain C code, and how much speedup it gives relative to ARM v7 code.

It has always been the position of this project to treat performance as a measure of quality, so I regression test every single release to make sure it has not lost any performance relative to the previous one. When I introduce a new feature, I make it a point to fully understand its performance advantages, because that is equal in my mind to understanding the feature's quality. So, basically, without the ability to benchmark this code, I do not know if there are performance "bugs" lurking within it. I also generally try to avoid releasing features that I have to apologize for in some way. It makes things difficult when you have to say, "well, this feature is better than nothing if you insist on using 64-bit code, but you're still better off on the whole using 32-bit code." It's just easier to say "we don't support ARM 64-bit yet. Stand by." I am all about getting people the fastest possible JPEG performance they can achieve on their particular platform.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Thomas Gall - 2013-11-09

Hi DRC

On Fri, Nov 8, 2013 at 6:10 PM, DRC dcommander@users.sf.net wrote:

Tom, I said nothing of the sort.

My fault for the crossed wires. Apologies on my part.

Please bear in mind that I am not an expert
on this. I'm doing my dead-level best to understand it, but seeing as how I
don't get paid for any work I do with ARM code, I cannot really dig into it
or fully engage without starving. I am feeling my way through as best I can
and putting out ideas.

Don't want you starving!

No one is asking Linaro to take on iOS support. I take on that support
myself and provide iOS binaries because they are easy for me to build (I am
primarily a Mac user),

Me too.

and it's the fastest way to get the code in front of
developers and testers (since iOS developers link statically with
libjpeg-turbo rather than relying on a system distribution of it.) The
reality of the situation is that we have iPhone developers who can test the
ARM v8 code right now, on real hardware, and give valuable feedback as to
its performance and stability.

Yes tho I'd be concerned with the differences between gcc and llvm.
Apple's obviously put a lot of time into llvm so I'm sure their llvm
is much better than the community llvm yet.

I never proposed waiting for iOS to work out ARM v8. There's no need.
They've already worked it out. My concern is more that I do not want to ship
an incomplete SIMD implementation for ARM v8, because there's a very good
chance that it will be slower than ARM v7, and that makes both us and the
ARM v8 platform look bad. I would gladly check this in today if it were a
complete implementation, which would give me some reassurance that at least
it would not regress relative to ARM v7 in terms of performance. I will
gladly put this in front of developers as soon as it's complete. Before I am
comfortable releasing it into beta, though, I need to be able to put my
finger on exactly how much speedup this feature gives relative to plain C
code, and how much speedup it gives relative to ARM v7 code.

I wouldn't recommend putting this into a stable libjpeg-turbo yet.
However getting performance comparisons is only possible on Apple
hardware. A53, A57 processors just aren't released in real hardware.
Like I said, we have simulators that's it.

Still! NEON on v7 and v8 didn't dramatically change. It did change in
ways that don't allow v7 NEON to "just work" with a recompile on v8.

It has always been the position of this project to treat performance as a
measure of quality, so I regression test every single release to make sure
it has not lost any performance relative to the previous one. When I
introduce a new feature, I make it a point to fully understand its
performance advantages, because that is equal in my mind to understanding
the feature's quality. So, basically, without the ability to benchmark this
code, I do not know if there are performance "bugs" lurking within it.

I don't think we can establish a performance baseline as of yet for
which any performance "bug" would be or could be considered valid for
the time being. When we've real hardware in hand, you can rest assured
we'll be all over it and issue patches. Thing is, I want to avoid a
chicken and egg problem. Holding back on performance updates until
real hardware basically will put libjpeg-turbo to the back of the line
for armv8.

I also generally try to avoid releasing features that I have to apologize for
in some way. It makes things difficult when you have to say, "well, this
feature is better than nothing if you insist on using 64-bit code, but
you're still better off on the whole using 32-bit code."

I don't think that is the case. It's just not measurable in a
meaningful way at this point in time.

It's just easier to
say "we don't support ARM 64-bit yet. Stand by."

I don't think that's viable. We do support it. Lots of distros do too.
libjpeg-turbo is far from the first or last package that is being
optimized for armv8.

I am all about getting
people the fastest possible JPEG performance they can achieve on their
particular platform.

For armv8 linux, that's a simulator for now and hardware in the not
too distant future.

Regards,
Tom

[patches:#54] Armv8 neon port for libjpeg-turbo decoder

Status: open
Created: Tue Oct 29, 2013 10:23 PM UTC by Ragesh
Last Updated: Fri Nov 08, 2013 04:36 PM UTC
Owner: DRC

Attached patches contains Armv8 neon Port for following functionalities and
changes to configuraton and Makefiles to integrate Armv8 neon port to
libjpeg-turbo

Idct Slow.
Idct Fast.
Idct 2x2
Idct 4x4
Yuv to RGB conversion

Sent from sourceforge.net because you indicated interest in
https://sourceforge.net/p/libjpeg-turbo/patches/54/

To unsubscribe from further messages, please visit
https://sourceforge.net/auth/subscriptions/

--
Regards,
Tom

"Where's the kaboom!? There was supposed to be an earth-shattering
kaboom!" Marvin Martian
Tech Lead, Graphics Working Group | Linaro.org │ Open source software
for ARM SoCs
w) tom.gall att linaro.org
h) tom_gall att mac.com

Related

Patches (closed - use GitHub): ~~#54~~

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- DRC - 2013-11-26
  
  Hopefully SF has finally fixed the bug that was preventing me from replying, so here goes:
  
  Fair enough, but can you expand on why you believe that the partial ARM v8 implementation will be as fast as or faster than the full ARM v7 implementation? I don't see how that's possible, unless there is a performance penalty for running v7 code on v8 devices.
  
  In any sense, it seems like the easiest way to make everyone happy is to just finish the implementation.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Siarhei Siamashka - 2013-11-09

@Ragesh

"RTSM_SQSHRN_SIM_ISSUE, I am not really sure if this is the issue with simulator or it can be the way the saturation instructions work in armv8."

It is always a good idea to explicitly make clear what are the speculations and what are the facts (presenting speculations as facts can be confusing). My concern is that we want this NEON code to work correctly on the armv8 hardware when such hardware becomes available. Now you have a selection between two conditionally compiled chunks of code (selected by RTSM_SQSHRN_SIM_ISSUE define) and no real confidence in either of them.

In Armv7 the saturation boundary is always 8bit even if you specify boundary to 16bit so I am skeptical how this instruciton will finally work on 64bit h/w. On simulator disabling this macro image reconstruction will have some noise around saturation color.

Do you have the "ARM Architecture Reference Manual, ARMv8, for ARMv8-A architecture profile" ( http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487a.a_errata1/index.html )? It can be actually used to check if the behaviour of the simulator makes sense for the questionable instructions.

@Tom

Waiting for real hardware just motivates people to get excited when
their bright shiny hardware shows up and the performance sucks rocks.

These motivated people also would not like to have buggy/broken software on their shiny new hardware (just because the developers of the armv8 optimizations did not have any possibility to test it properly prior to rolling the code out on the customers).

We're working on this and other packages before real hardware so that
from the get go performance should be reasonable.

Is there no real way for Linaro developers to get early access to some 64-bit ARM hardware? Maybe negotiate this with ARM somehow? The current situation is less than perfect.

The iOS compatibility is only mentioned here because the 64-bit iPhone is already out there. And might or might not be usable for confirming the 64-bit ARM assembly code performance & correctness. If iOS is not practically usable for this purpose, then we are better to find some other way.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DRC - 2014-02-05

I have provisionally checked the patch into SVN trunk, with some extensive formatting changes to the assembly file to make it more closely match the existing NEON assembly file and to improve general readability. I also modified the patch to configure.ac slightly.

I can confirm that this code does not compile for iOS at all. iOS developers who may be reading this will have to tell me why. Per previous comments, I mention this only because that platform is the only way we might be able to currently verify the performance of this code.

I normally don't check experimental code into the libjpeg-turbo code base unless it's in a branch, but in this case, the code was isolated enough that it should not affect stability of the overall project. However, I will not document it as a feature until it has been peer reviewed for stability and performance.

I consider the code to be incomplete currently, because it lacks full coverage of the SIMD functions that are covered in the ARMv7 code. Further, it seems to use multiple instructions in several cases to take the place of a single instruction in the ARMv7 code. Both of those aspects of the code make me suspect that it might not perform as well as an ARMv7 binary running on the same platform (if such is even possible.)

Please clarify the purpose of the changes you made to jsimd_arm.c. They seem to leave out significant portions of the SIMD code when running in the simulator. I did not check in those changes yet.

I will leave this issue open until the feature has been fully accepted, reviewed, and documented.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Ragesh - 2014-02-05
  
  I have only submitted below functions for armv8 neon accelerations
  
  Idct Slow.
  
  Idct Fast.
  
  Idct 2x2
  
  Idct 4x4
  
  Yuv to RGB conversion
  
  These are mostly decoder functions and the reason to this is when I
  initially started armv8 porting I could find neon armv7 accelerations to
  decoder functions only.
  Now I can see more functions related to encoder, which I have to start
  porting to armv8 and I will try to complete this soon.
  
  On 5 February 2014 14:18, DRC dcommander@users.sf.net wrote:
  
  I have provisionally checked the patch into SVN trunk, with some extensive
  formatting changes to the assembly file to make it more closely match the
  existing NEON assembly file and to improve general readability. I also
  modified the patch to configure.ac slightly.
  
  I can confirm that this code does not compile for iOS at all. iOS
  developers who may be reading this will have to tell me why. Per previous
  comments, I mention this only because that platform is the only way we
  might be able to currently verify the performance of this code.
  
  I normally don't check experimental code into the libjpeg-turbo code base
  unless it's in a branch, but in this case, the code was isolated enough
  that it should not affect stability of the overall project. However, I will
  not document it as a feature until it has been peer reviewed for stability
  and performance.
  
  I consider the code to be incomplete currently, because it lacks full
  coverage of the SIMD functions that are covered in the ARMv7 code. Further,
  it seems to use multiple instructions in several cases to take the place of
  a single instruction in the ARMv7 code. Both of those aspects of the code
  make me suspect that it might not perform as well as an ARMv7 binary
  running on the same platform (if such is even possible.)
  
  Please clarify the purpose of the changes you made to jsimd_arm.c. They
  seem to leave out significant portions of the SIMD code when running in the
  simulator. I did not check in those changes yet.
  
  The changes in jsimd_arm.c will mask all the neon encoder functions on
  armv8 as there is no neon implementation for these routines in armv8.
  ARMV8_NEON_SIMULATION macro is hard coded to mask the encoder
  functions.
  
  I will leave this issue open until the feature has been fully accepted,
  
  reviewed, and documented.
  
  [patches:#54] http://sourceforge.net/p/libjpeg-turbo/patches/54/ Armv8
  neon port for libjpeg-turbo decoder*
  
  Status: open
  Created: Tue Oct 29, 2013 10:23 PM UTC by Ragesh
  Last Updated: Thu Jan 30, 2014 08:33 PM UTC
  Owner: DRC
  
  Attached patches contains Armv8 neon Port for following functionalities
  and changes to configuraton and Makefiles to integrate Armv8 neon port to
  libjpeg-turbo
  
  Idct Slow.
  
  Idct Fast.
  
  Idct 2x2
  
  Idct 4x4
  
  Yuv to RGB conversion
  
  Sent from sourceforge.net because you indicated interest in
  https://sourceforge.net/p/libjpeg-turbo/patches/54/
  
  To unsubscribe from further messages, please visit
  https://sourceforge.net/auth/subscriptions/
  
  Related
  
  Patches (closed - use GitHub): ~~#54~~
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DRC - 2014-02-05

OK, the correct approach is actually to create a separate file-- jsimd_arm64.c-- for ARMv8 and stub out the unimplemented routines. I have done this and checked the new file into trunk, but you will have to verify that I haven't broken anything.

Can you answer my question regarding whether ARMv7 binaries will run on an ARMv8 platform?

Siarhei, your comments on this new code would be much appreciated.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Siarhei Siamashka - 2014-02-05
  
  Siarhei, your comments on this new code would be much appreciated.
  
  Just based on how it looks, there are many places in the code that could be implemented somewhat better. My older comments for the armv8 idct slow function are here: https://github.com/ssvb/libjpeg-turbo-linaro-armv8/commit/4951d43388a6b37c
  
  But I understand that without having real armv8 hardware readily available, proper performance tuning is hard and impractical. Asking to fix all of these potential cases of suboptimal code would be unreasonable. We just have to revisit this when linux capable armv8 hardware becomes more common.
  
  However I still don't like the RTSM_SQSHRN_SIM_ISSUE define very much. I would surely feel less paranoid if we had some information about the results of libjpeg-turbo 'make test' in the following cases:
  a) with RTSM_SQSHRN_SIM_ISSUE define in the simulator
  b) without RTSM_SQSHRN_SIM_ISSUE define in the simulator
  c) with RTSM_SQSHRN_SIM_ISSUE define on real hardware
  d) without RTSM_SQSHRN_SIM_ISSUE define on real hardware
  
  There should be no problem to run a) and b) tests even now to have a definite confirmation. Also when I was visiting FOSDEM2014 on the last weekend, I had a chance to talk with some Linaro people. And unless I misunderstood something, appears that Linaro currently has at least a few X-GENE boxes with shared ssh access. Was there any attempt made to get access to this hardware for libjpeg-turbo testing? I would guess that something like half an hour would be enough for quickly running 'make test'.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DRC - 2014-03-27

status: open --> open-pending-external
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Ragesh - 2014-07-22

Recently I tested 64 bit neon code on real hardware and some of the assumptions which I had while testing on simulator got resolved. I am attaching 4 patches along with performance figures after testing it on aarch64 hardware.

The test was conducted on a low profile board with gcc-4.9 toolchain.

Patches_perf_data_arm64.zip

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Siarhei Siamashka - 2014-07-23

Thanks for the update.

Subject: [PATCH 1/4] Armv8 specific runtime check changes

Aarch64 by default supports neon and runtime check is
no more required.

Looks OK to me. Just the comment

"ARMV8 architectures supports neon extenstion by default"

has a typo and handling the JSIMD_FORCENEON environment variable now becomes unnecessary.

Subject: [PATCH 2/4] RTSM macro removed

RTSM macro removed after testing the code on aarch64 hardware

So the inactive code path gets removed and this macro was never really necessary (no simulator issue in the first place)?

Subject: [PATCH 3/4] IDCT slow zero coeffecient check changes

Zero coeffecient check section of code was not working properly
through 64 bit registers, now coeffecient fetch is through 32 bit
registers

"Was not working" in what way? Was it failing "make test"? And can we actually rewrite this code to do correct 64-bit fetch? This should require less instructions for the same job.

Subject: [PATCH 4/4] Save all neon register getting used in the routine

GCC aarch64 4.9 tool chain generated code uses extension register
including neon through out instead of stack. This makes
gcc 4.9 much faster in performance and the only side effect of this
hand coded neon assembly has to preserve all neon register.

This description does not make much sense (sounds like you are putting blame on GCC 4.9). In other words, are you saying that the current code in SVN is just violating the AArch64 calling conventions by failing to properly save/restore NEON registers? The ABI must be always respected no matter what (and if there are any other violations, they need to be fixed too).

Please explicitly confirm that the "make test" succeeds after all these changes are applied.

Also it would be very interesting to compare the performance between the same libjpeg-turbo compiled in 32-bit mode vs. 64-bit mode on the same hardware.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Ragesh - 2014-07-23
  
  Thank for the valuable review comments,
  
  1.> Aarch64 by default supports neon and runtime check is no more required.
  
  Looks OK to me. Just the comment
  
  "ARMV8 architectures supports neon extenstion by default"
  
  has a typo and handling the JSIMD_FORCENEON environment variable now
  becomes unnecessary.
  
  ----Ans. Even I thought of this but it may help to test/compare neon and
  non-neon versions.
  
  2.> RTSM macro removed after testing the code on aarch64 hardware
  
  So the inactive code path gets removed and this macro was never really
  necessary (no simulator issue in the first place)?
  
  --Ans. Particularly the saturation instructions behave differently in Armv8
  and this was verified on real h/w. I was actually suspecting the simulator
  for the behavior but now things are settled with h/w.
  
  3> Zero coeffecient check section of code was not working properly
  through 64 bit registers, now coeffecient fetch is through 32 bit
  registers.
  "Was not working" in what way? Was it failing "make test"?
  
  --The earlier aarch64 zero coeff block code functionality was not verfied
  properly. Make test works well with the earlier code as well but the
  earlier version used to fetch 2 adjacent coeffecient and compare due to
  8byte fetch( 64 bit registers), now its reduced to 4 byte fetch using 32
  bit registers and behaves alike armv7 zero coeff check logic.
  
  --Ans. And can we actually rewrite this code to do correct 64-bit fetch?
  This should require less instructions for the same job.
  I am not very sure but i hope ldp is doing the 64 bit fetch and writing to
  two consecutive 32 bit registers. so overall I guess we are not losing any
  cycles here. Kindly suggest if you have any better idea.
  
  4> Save all neon register getting used in the routine
  GCC aarch64 4.9 tool chain generated code uses extension register including
  neon through out instead of stack. This makes gcc 4.9 much faster in
  performance and the only side effect of this hand coded neon assembly has
  to preserve all neon register.
  
  This description does not make much sense (sounds like you are putting
  blame on GCC 4.9). In other words, are you saying that the current code in
  SVN is just violating the AArch64 calling conventions by failing to
  properly save/restore NEON registers? The ABI must be always respected no
  matter what (and if there are any other violations, they need to be fixed
  too).
  
  --Ans. Sorry for the confusion on this I will explain it in detail.
  
  The same code in the svn works well on GCC 4.8 but fails on GCC 4.9 and i
  had spent considerable amount of time suspecting many areas including the
  tjbench code but finally found in the dis-assembly that gcc 4.9 just uses
  extension registers thought out. Interestingly these extension registers
  helps in store/restore of General purpose registers and replaces the
  storage of General purpose registers on stack and I guess this makes gcc
  4.9 generated code perform much faster than the previous version of gcc.
  
  Please explicitly confirm that the "make test" succeeds after all these
  changes are applied.
  
  Also it would be very interesting to compare the performance between the
  same libjpeg-turbo compiled in 32-bit mode vs. 64-bit mode on the same
  hardware.
  
  On 23 July 2014 06:04, Siarhei Siamashka ssvb@users.sf.net wrote:
  
  Thanks for the update.
  
  Subject: [PATCH 1/4] Armv8 specific runtime check changes
  
  Aarch64 by default supports neon and runtime check is
  no more required.
  
  Looks OK to me. Just the comment
  
  "ARMV8 architectures supports neon extenstion by default"
  
  has a typo and handling the JSIMD_FORCENEON environment variable now
  becomes unnecessary.
  
  Subject: [PATCH 2/4] RTSM macro removed
  
  RTSM macro removed after testing the code on aarch64 hardware
  
  So the inactive code path gets removed and this macro was never really
  necessary (no simulator issue in the first place)?
  
  Subject: [PATCH 3/4] IDCT slow zero coeffecient check changes
  
  Zero coeffecient check section of code was not working properly
  through 64 bit registers, now coeffecient fetch is through 32 bit
  registers
  
  "Was not working" in what way? Was it failing "make test"? And can we
  actually rewrite this code to do correct 64-bit fetch? This should require
  less instructions for the same job.
  
  Subject: [PATCH 4/4] Save all neon register getting used in the routine
  
  GCC aarch64 4.9 tool chain generated code uses extension register
  including neon through out instead of stack. This makes
  gcc 4.9 much faster in performance and the only side effect of this
  hand coded neon assembly has to preserve all neon register.
  
  This description does not make much sense (sounds like you are putting
  blame on GCC 4.9). In other words, are you saying that the current code in
  SVN is just violating the AArch64 calling conventions by failing to
  properly save/restore NEON registers? The ABI must be always respected no
  matter what (and if there are any other violations, they need to be fixed
  too).
  
  Please explicitly confirm that the "make test" succeeds after all these
  changes are applied.
  
  Also it would be very interesting to compare the performance between the
  same libjpeg-turbo compiled in 32-bit mode vs. 64-bit mode on the same
  hardware.
  
  [patches:#54] http://sourceforge.net/p/libjpeg-turbo/patches/54 Armv8
  neon port for libjpeg-turbo decoder*
  
  Status: open-pending-external
  
  Created: Tue Oct 29, 2013 10:23 PM UTC by Ragesh
  Last Updated: Tue Jul 22, 2014 11:04 AM UTC
  Owner: DRC
  
  Attached patches contains Armv8 neon Port for following functionalities
  and changes to configuraton and Makefiles to integrate Armv8 neon port to
  libjpeg-turbo
  
  Idct Slow.
  
  Idct Fast.
  
  Idct 2x2
  
  Idct 4x4
  
  Yuv to RGB conversion
  
  Sent from sourceforge.net because you indicated interest in
  https://sourceforge.net/p/libjpeg-turbo/patches/54/
  
  To unsubscribe from further messages, please visit
  https://sourceforge.net/auth/subscriptions/
  
  Related
  
  Patches (closed - use GitHub): ~~#54~~
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DRC - 2014-07-23

Patches have been checked in, with some formatting tweaks.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DRC - 2014-08-13

Closing this as implemented. Please open a new patch if you have any modifications or new accelerations. In general, I am still soft-pedaling this feature until it accelerates the same set of functions that are accelerated with ARMv7 and until we can successfully build the code for iOS.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

DRC - 2014-08-13

status: open-pending-external --> closed-integrated
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Armv8 neon port for libjpeg-turbo decoder

SIMD-accelerated libjpeg-compatible JPEG codec library

Searches

Help

#54 Armv8 neon port for libjpeg-turbo decoder

Related

Discussion

Related

Related

reviewed, and documented.

Related

Related