Armv8 neon port for libjpeg-turbo decoder
SIMD-accelerated libjpeg-compatible JPEG codec library
Brought to you by:
dcommander
Attached patches contains Armv8 neon Port for following functionalities and changes to configuraton and Makefiles to integrate Armv8 neon port to libjpeg-turbo
Patches (closed - use GitHub): #54
Missing attachments adding all 9 patches in a single archive file
Couple of questions/comments:
-- Are there further optimizations planned, or will these be the only ones? The reason I ask is more for strategic purposes in planning the release of libjpeg-turbo 1.4. If a full ARM 64-bit implementation is planned, then I would like to hold off the release until it is ready. There is plenty of time, since there are currently other features in the works for 1.4 that will take months to implement.
-- I assume that 'make test' passes?
-- I will take care of it this time, but if you have any future patches, please submit them against trunk, which is where the new features are landing.
-- I am attaching two test scripts which can be used along with the canonical libjpeg-turbo test images (see http://www.libjpeg-turbo.org/About/Performance) to generate numbers that can be copied & pasted into a spreadsheet template (also attached.) If you could run these numbers both with and without the ARM 64-bit SIMD code, that would help me greatly. It gives me a feel for how well the new SIMD code is accelerating things, and it allows me to make valid performance claims about it (such as "it's X amount faster, etc.")
Thank you for your comments.
Further optimizations are not yet planned as such and I believe that it
is possible to optimize further in terms of data packing but skeptical
if it can really make a big difference.
'make Test' all test cases passes.
Regarding performance, as previously discussed the whole development and
testing was on an ARM simulator and unfortunately we don't have any
64bit ARM h/w boards still to test performance. I guess the perf figures
will not make any sense on simulators.
Last edit: Ragesh 2013-11-07
OK, thanks for the info. I guess the burning question in my mind is with regards to the iPhone 5S, which is where the demand for this feature primarily lives at the moment. It seems like it wouldn't be beneficial in a performance sense for an iPhone developer to switch from 32-bit ARM v7 code to 64-bit ARM v8 code until the v8 code has full SIMD acceleration. Otherwise, since the v8 version of the code would have to fall back to C for doing convsamp, quantize, fast DCT, and upsample operations, it seems like it wouldn't be as fast as the existing 32-bit v7 code. Correct me if I'm wrong, though.
It seems to me that it might be better to sit on this until someone in the community can fill in the missing pieces, which will probably come after there is actual hardware available to test. I'd also really like to see side-by-side benchmarks relative to v7, mainly so I can validate or adjust the performance claims I make in the documentation.
As an experiment, I tried to import the code to github (in linaro-armv8-20131014 branch) and add some comments for the first patch: https://github.com/ssvb/libjpeg-turbo-linaro-armv8/commit/4951d43388a6b37ce3b9aaa7bef0fa9b82c69abd
I'm most worried about the simulator issues and the RTSM_SQSHRN_SIM_ISSUE ifdef. Can we really trust this tool? Does the code under RTSM_SQSHRN_SIM_ISSUE ifdef (which is currently defined by default in this patch) work correctly on real 64-bit ARM hardware?
Also do we know anything about the 64-bit iOS assembler syntax? The 32-bit ARM code had to use https://github.com/yuvi/gas-preprocessor
RTSM_SQSHRN_SIM_ISSUE, I am not really sure if this is the issue with simulator or it can be the way the saturation instructions work in armv8. In Armv7 the saturation boundary is always 8bit even if you specify boundary to 16bit so I am skeptical how this instruciton will finally work on 64bit h/w. On simulator disabling this macro image reconstruction will have some noise around saturation color.
Hi DRC,
On Thu, Nov 7, 2013 at 11:28 AM, DRC dcommander@users.sf.net wrote:
While I'm sure the A7 based iPads and iPhones are interesting, from a
Linaro point of view, that's not something that we work with. The
patches that we've posted are for armv8 and of no surprise for Linux
to start and will be of use someday on Android.
I'm a bit taken aback that you would seem to suggest that these
patches can't go forward until the iOS side of things are worked out
for armv8. I appreciate the interest in iOS but support for iOS &
armv8 should be left to someone who is maintaining libjpeg-turbo on
iOS. We, Linaro, do not want to advance patches that break iOS but
likewise I don't think it's reasonable to expect us to take on iOS.
(at a personal level my arm could be easily twisted since I work with
iOS but that's a different topic)
When it comes to armv8 hardware or emulators, it comes down to the following.
Simulators: ARM models, qemu
Server etc hardware: It's coming, 'nuff said.
Apple's A7 based hardware iOS only and requires a full apple developer
license to deploy to hardware
For linux and distro support, already debian, ubuntu, open embedded,
fedora have builds for armv8 that runs on the various models. Across a
wide community we all want to see past optimized packages like
libjpeg-turbo be ready when the real hardware shows up.
It makes little sense to do performance measurements on any of the
Simulators for ARMv8. That day will come. The models available aren't
cycle accurate. Comparing to armv7 on say an A15 to a simulator
running on an intel box makes no sense. While the ported optimizations
SHOULD yield good performance, we all know and accept that some
tweaking might have to be done when the real hardware starts to show
up.
Waiting for real hardware just motivates people to get excited when
their bright shiny hardware shows up and the performance sucks rocks.
We're working on this and other packages before real hardware so that
from the get go performance should be reasonable. We accept there
might be a rework or two. I'd rather have a 90% solution on Day 1 then
a 0% solution and a long wait because this and other packages haven't
even been touched.
Hopefully that sounds reasonable.
Thanks!
Tom
--
Regards,
Tom
"Where's the kaboom!? There was supposed to be an earth-shattering
kaboom!" Marvin Martian
Tech Lead, Graphics Working Group | Linaro.org │ Open source software
for ARM SoCs
w) tom.gall att linaro.org
h) tom_gall att mac.com
Related
Patches (closed - use GitHub):
#54Tom, I said nothing of the sort. Please bear in mind that I am not an expert on this. I'm doing my dead-level best to understand it, but seeing as how I don't get paid for any work I do with ARM code, I cannot really dig into it or fully engage without starving. I am feeling my way through as best I can and putting out ideas.
No one is asking Linaro to take on iOS support. I take on that support myself and provide iOS binaries because they are easy for me to build (I am primarily a Mac user), and it's the fastest way to get the code in front of developers and testers (since iOS developers link statically with libjpeg-turbo rather than relying on a system distribution of it.) The reality of the situation is that we have iPhone developers who can test the ARM v8 code right now, on real hardware, and give valuable feedback as to its performance and stability.
I never proposed waiting for iOS to work out ARM v8. There's no need. They've already worked it out. My concern is more that I do not want to ship an incomplete SIMD implementation for ARM v8, because there's a very good chance that it will be slower than ARM v7, and that makes both us and the ARM v8 platform look bad. I would gladly check this in today if it were a complete implementation, which would give me some reassurance that at least it would not regress relative to ARM v7 in terms of performance. I will gladly put this in front of developers as soon as it's complete. Before I am comfortable releasing it into beta, though, I need to be able to put my finger on exactly how much speedup this feature gives relative to plain C code, and how much speedup it gives relative to ARM v7 code.
It has always been the position of this project to treat performance as a measure of quality, so I regression test every single release to make sure it has not lost any performance relative to the previous one. When I introduce a new feature, I make it a point to fully understand its performance advantages, because that is equal in my mind to understanding the feature's quality. So, basically, without the ability to benchmark this code, I do not know if there are performance "bugs" lurking within it. I also generally try to avoid releasing features that I have to apologize for in some way. It makes things difficult when you have to say, "well, this feature is better than nothing if you insist on using 64-bit code, but you're still better off on the whole using 32-bit code." It's just easier to say "we don't support ARM 64-bit yet. Stand by." I am all about getting people the fastest possible JPEG performance they can achieve on their particular platform.
Hi DRC
On Fri, Nov 8, 2013 at 6:10 PM, DRC dcommander@users.sf.net wrote:
My fault for the crossed wires. Apologies on my part.
Don't want you starving!
Me too.
Yes tho I'd be concerned with the differences between gcc and llvm.
Apple's obviously put a lot of time into llvm so I'm sure their llvm
is much better than the community llvm yet.
I wouldn't recommend putting this into a stable libjpeg-turbo yet.
However getting performance comparisons is only possible on Apple
hardware. A53, A57 processors just aren't released in real hardware.
Like I said, we have simulators that's it.
Still! NEON on v7 and v8 didn't dramatically change. It did change in
ways that don't allow v7 NEON to "just work" with a recompile on v8.
I don't think we can establish a performance baseline as of yet for
which any performance "bug" would be or could be considered valid for
the time being. When we've real hardware in hand, you can rest assured
we'll be all over it and issue patches. Thing is, I want to avoid a
chicken and egg problem. Holding back on performance updates until
real hardware basically will put libjpeg-turbo to the back of the line
for armv8.
I don't think that is the case. It's just not measurable in a
meaningful way at this point in time.
I don't think that's viable. We do support it. Lots of distros do too.
libjpeg-turbo is far from the first or last package that is being
optimized for armv8.
For armv8 linux, that's a simulator for now and hardware in the not
too distant future.
Regards,
Tom
--
Regards,
Tom
"Where's the kaboom!? There was supposed to be an earth-shattering
kaboom!" Marvin Martian
Tech Lead, Graphics Working Group | Linaro.org │ Open source software
for ARM SoCs
w) tom.gall att linaro.org
h) tom_gall att mac.com
Related
Patches (closed - use GitHub):
#54Hopefully SF has finally fixed the bug that was preventing me from replying, so here goes:
Fair enough, but can you expand on why you believe that the partial ARM v8 implementation will be as fast as or faster than the full ARM v7 implementation? I don't see how that's possible, unless there is a performance penalty for running v7 code on v8 devices.
In any sense, it seems like the easiest way to make everyone happy is to just finish the implementation.
@Ragesh
It is always a good idea to explicitly make clear what are the speculations and what are the facts (presenting speculations as facts can be confusing). My concern is that we want this NEON code to work correctly on the armv8 hardware when such hardware becomes available. Now you have a selection between two conditionally compiled chunks of code (selected by RTSM_SQSHRN_SIM_ISSUE define) and no real confidence in either of them.
Do you have the "ARM Architecture Reference Manual, ARMv8, for ARMv8-A architecture profile" ( http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487a.a_errata1/index.html )? It can be actually used to check if the behaviour of the simulator makes sense for the questionable instructions.
@Tom
These motivated people also would not like to have buggy/broken software on their shiny new hardware (just because the developers of the armv8 optimizations did not have any possibility to test it properly prior to rolling the code out on the customers).
Is there no real way for Linaro developers to get early access to some 64-bit ARM hardware? Maybe negotiate this with ARM somehow? The current situation is less than perfect.
The iOS compatibility is only mentioned here because the 64-bit iPhone is already out there. And might or might not be usable for confirming the 64-bit ARM assembly code performance & correctness. If iOS is not practically usable for this purpose, then we are better to find some other way.
I have provisionally checked the patch into SVN trunk, with some extensive formatting changes to the assembly file to make it more closely match the existing NEON assembly file and to improve general readability. I also modified the patch to configure.ac slightly.
I can confirm that this code does not compile for iOS at all. iOS developers who may be reading this will have to tell me why. Per previous comments, I mention this only because that platform is the only way we might be able to currently verify the performance of this code.
I normally don't check experimental code into the libjpeg-turbo code base unless it's in a branch, but in this case, the code was isolated enough that it should not affect stability of the overall project. However, I will not document it as a feature until it has been peer reviewed for stability and performance.
I consider the code to be incomplete currently, because it lacks full coverage of the SIMD functions that are covered in the ARMv7 code. Further, it seems to use multiple instructions in several cases to take the place of a single instruction in the ARMv7 code. Both of those aspects of the code make me suspect that it might not perform as well as an ARMv7 binary running on the same platform (if such is even possible.)
Please clarify the purpose of the changes you made to jsimd_arm.c. They seem to leave out significant portions of the SIMD code when running in the simulator. I did not check in those changes yet.
I will leave this issue open until the feature has been fully accepted, reviewed, and documented.
I have only submitted below functions for armv8 neon accelerations
These are mostly decoder functions and the reason to this is when I
initially started armv8 porting I could find neon armv7 accelerations to
decoder functions only.
Now I can see more functions related to encoder, which I have to start
porting to armv8 and I will try to complete this soon.
On 5 February 2014 14:18, DRC dcommander@users.sf.net wrote:
I will leave this issue open until the feature has been fully accepted,
Related
Patches (closed - use GitHub):
#54OK, the correct approach is actually to create a separate file-- jsimd_arm64.c-- for ARMv8 and stub out the unimplemented routines. I have done this and checked the new file into trunk, but you will have to verify that I haven't broken anything.
Can you answer my question regarding whether ARMv7 binaries will run on an ARMv8 platform?
Siarhei, your comments on this new code would be much appreciated.
Just based on how it looks, there are many places in the code that could be implemented somewhat better. My older comments for the armv8 idct slow function are here: https://github.com/ssvb/libjpeg-turbo-linaro-armv8/commit/4951d43388a6b37c
But I understand that without having real armv8 hardware readily available, proper performance tuning is hard and impractical. Asking to fix all of these potential cases of suboptimal code would be unreasonable. We just have to revisit this when linux capable armv8 hardware becomes more common.
However I still don't like the RTSM_SQSHRN_SIM_ISSUE define very much. I would surely feel less paranoid if we had some information about the results of libjpeg-turbo 'make test' in the following cases:
a) with RTSM_SQSHRN_SIM_ISSUE define in the simulator
b) without RTSM_SQSHRN_SIM_ISSUE define in the simulator
c) with RTSM_SQSHRN_SIM_ISSUE define on real hardware
d) without RTSM_SQSHRN_SIM_ISSUE define on real hardware
There should be no problem to run a) and b) tests even now to have a definite confirmation. Also when I was visiting FOSDEM2014 on the last weekend, I had a chance to talk with some Linaro people. And unless I misunderstood something, appears that Linaro currently has at least a few X-GENE boxes with shared ssh access. Was there any attempt made to get access to this hardware for libjpeg-turbo testing? I would guess that something like half an hour would be enough for quickly running 'make test'.
Recently I tested 64 bit neon code on real hardware and some of the assumptions which I had while testing on simulator got resolved. I am attaching 4 patches along with performance figures after testing it on aarch64 hardware.
The test was conducted on a low profile board with gcc-4.9 toolchain.
Thanks for the update.
Looks OK to me. Just the comment
"ARMV8 architectures supports neon extenstion by default"
has a typo and handling the JSIMD_FORCENEON environment variable now becomes unnecessary.
So the inactive code path gets removed and this macro was never really necessary (no simulator issue in the first place)?
"Was not working" in what way? Was it failing "make test"? And can we actually rewrite this code to do correct 64-bit fetch? This should require less instructions for the same job.
This description does not make much sense (sounds like you are putting blame on GCC 4.9). In other words, are you saying that the current code in SVN is just violating the AArch64 calling conventions by failing to properly save/restore NEON registers? The ABI must be always respected no matter what (and if there are any other violations, they need to be fixed too).
Please explicitly confirm that the "make test" succeeds after all these changes are applied.
Also it would be very interesting to compare the performance between the same libjpeg-turbo compiled in 32-bit mode vs. 64-bit mode on the same hardware.
Thank for the valuable review comments,
1.> Aarch64 by default supports neon and runtime check is no more required.
Looks OK to me. Just the comment
"ARMV8 architectures supports neon extenstion by default"
has a typo and handling the JSIMD_FORCENEON environment variable now
becomes unnecessary.
----Ans. Even I thought of this but it may help to test/compare neon and
non-neon versions.
2.> RTSM macro removed after testing the code on aarch64 hardware
So the inactive code path gets removed and this macro was never really
necessary (no simulator issue in the first place)?
--Ans. Particularly the saturation instructions behave differently in Armv8
and this was verified on real h/w. I was actually suspecting the simulator
for the behavior but now things are settled with h/w.
3> Zero coeffecient check section of code was not working properly
through 64 bit registers, now coeffecient fetch is through 32 bit
registers.
"Was not working" in what way? Was it failing "make test"?
--The earlier aarch64 zero coeff block code functionality was not verfied
properly. Make test works well with the earlier code as well but the
earlier version used to fetch 2 adjacent coeffecient and compare due to
8byte fetch( 64 bit registers), now its reduced to 4 byte fetch using 32
bit registers and behaves alike armv7 zero coeff check logic.
--Ans. And can we actually rewrite this code to do correct 64-bit fetch?
This should require less instructions for the same job.
I am not very sure but i hope ldp is doing the 64 bit fetch and writing to
two consecutive 32 bit registers. so overall I guess we are not losing any
cycles here. Kindly suggest if you have any better idea.
4> Save all neon register getting used in the routine
GCC aarch64 4.9 tool chain generated code uses extension register including
neon through out instead of stack. This makes gcc 4.9 much faster in
performance and the only side effect of this hand coded neon assembly has
to preserve all neon register.
This description does not make much sense (sounds like you are putting
blame on GCC 4.9). In other words, are you saying that the current code in
SVN is just violating the AArch64 calling conventions by failing to
properly save/restore NEON registers? The ABI must be always respected no
matter what (and if there are any other violations, they need to be fixed
too).
--Ans. Sorry for the confusion on this I will explain it in detail.
The same code in the svn works well on GCC 4.8 but fails on GCC 4.9 and i
had spent considerable amount of time suspecting many areas including the
tjbench code but finally found in the dis-assembly that gcc 4.9 just uses
extension registers thought out. Interestingly these extension registers
helps in store/restore of General purpose registers and replaces the
storage of General purpose registers on stack and I guess this makes gcc
4.9 generated code perform much faster than the previous version of gcc.
Please explicitly confirm that the "make test" succeeds after all these
changes are applied.
Also it would be very interesting to compare the performance between the
same libjpeg-turbo compiled in 32-bit mode vs. 64-bit mode on the same
hardware.
On 23 July 2014 06:04, Siarhei Siamashka ssvb@users.sf.net wrote:
Related
Patches (closed - use GitHub):
#54Patches have been checked in, with some formatting tweaks.
Closing this as implemented. Please open a new patch if you have any modifications or new accelerations. In general, I am still soft-pedaling this feature until it accelerates the same set of functions that are accelerated with ARMv7 and until we can successfully build the code for iOS.