Thread: [mpeg2-dev] [RFC] [PATCH] ARM Advanced SIMD motion compensation

[mpeg2-dev] [RFC] [PATCH] ARM Advanced SIMD motion compensation

From: R. Denis-C. <re...@vi...> - 2009-09-15 20:31:02

Attachments: libmpeg2-mc-neon-v0.patch

	Hello all,

ARMv7 includes an optional "Advanced SIMD" instruction set, commercially
known as NEON. This is included in the recent Cortex line of ARM processors. 
In particular, Cortex-A8 is found on TI-OMAP3xxx boards such as BeagleBoard, 
or the Nokia N900.

Attached is an initial patch against libmpeg2 trunk to use NEON for motion
compensation. This is preliminary. There are a bunch of known CPU stalls. 
Those could probably be fixed using plain assembly and interleaving subsequent 
loads. Also, iDCT is not optimized. Anyway, here are my results with an 
OMAP3430 board:

With C, no acceleration:
7305 frames in 19.87 sec (367.64 fps), 155 last 0.50 sec (310.00 fps)
7308 frames decoded in 19.88 seconds (367.61 fps)                    
7288 frames in 19.88 sec (366.60 fps), 170 last 0.50 sec (340.00 fps)
7308 frames decoded in 19.95 seconds (366.32 fps)                    

With ARM acceleration (current libmpeg2):
7254 frames in 18.88 sec (384.22 fps), 180 last 0.50 sec (360.00 fps)
7308 frames decoded in 19.04 seconds (383.82 fps)
7263 frames in 18.88 sec (384.69 fps), 175 last 0.50 sec (350.00 fps)
7308 frames decoded in 19.02 seconds (384.23 fps)

With NEON acceleration (this patch):
7129 frames in 15.39 sec (463.22 fps), 245 last 0.50 sec (490.00 fps)
7308 frames decoded in 15.85 seconds (461.07 fps)
7127 frames in 15.38 sec (463.39 fps), 245 last 0.50 sec (490.00 fps)
7308 frames decoded in 15.85 seconds (461.07 fps)

So, there is already quite a big improvement!
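
To put rough numbers on that improvement, here is a minimal sketch in C. The helper name `speedup` and the three constants are made up for illustration; the values are the final "frames decoded" figures from the first run of each build listed above.

```c
#include <assert.h>

/* Relative speedup of one throughput figure over another.
 * Both arguments are in frames per second; the baseline must be positive. */
static double speedup(double fps, double baseline_fps)
{
    assert(baseline_fps > 0.0);
    return fps / baseline_fps;
}

/* Final "frames decoded" figures, first run of each build (from the log). */
static const double FPS_C    = 367.61;  /* plain C, no acceleration  */
static const double FPS_ARM  = 383.82;  /* existing ARM acceleration */
static const double FPS_NEON = 461.07;  /* NEON motion compensation  */
```

`speedup(FPS_NEON, FPS_C)` comes out at roughly 1.25 and `speedup(FPS_NEON, FPS_ARM)` at roughly 1.20, so the NEON build decodes about 25% faster than plain C and about 20% faster than the existing ARM code on this stream.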

I wonder if there is any guarantee on the memory alignment of some of the
buffers? NEON can save one cycle per load/store if we use alignment-specific
opcodes. Currently, the code assumes no alignment.
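
What such a guarantee would have to cover can be sketched in C. This is only an illustration: `is_aligned` and `rows_aligned` are hypothetical helpers, not libmpeg2 functions. The point is that an aligned base pointer is not enough; the stride must also be a multiple of the alignment, or every row after the first loses it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if ptr is a multiple of `alignment` bytes, 0 otherwise.
 * `alignment` must be a non-zero power of two (8 bytes for a 64-bit
 * qualifier, 16 bytes for a 128-bit one). */
static int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Returns 1 if every row of a plane keeps the alignment of its first row:
 * both the base pointer and the stride must be multiples of `alignment`. */
static int rows_aligned(const void *base, size_t stride, size_t alignment)
{
    return is_aligned(base, alignment) && (stride % alignment) == 0;
}
```

Note that in motion compensation the reference pointer is offset by the motion vector, so it will generally be unaligned whatever the allocator does; an alignment guarantee would mostly help on the destination side.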

Comments welcome!

-- 
Rémi Denis-Courmont
http://git.remlab.net/cgi-bin/gitweb.cgi?p=vlc-courmisch.git;a=summary

Re: [mpeg2-dev] [RFC] [PATCH] ARM Advanced SIMD motion compensation

From: Rémi Denis-C. <re...@vi...> - 2009-09-16 06:45:52

On Tue, 15 Sep 2009 22:18:22 +0100, Måns Rullgård <ma...@ma...> wrote:
>> Attached is an initial patch against libmpeg2 trunk to use NEON for
>> motion compensation. This is preliminary. There are a bunch of known
>> CPU stalls. Those could probably be fixed using plain assembly and
>> interleaving subsequent loads. Also, iDCT is not optimized.
> 
> Why don't you steal the functions from FFmpeg instead?  They are much
> better optimised than this patch.

FFmpeg-based MPEG2 decoding was so slow on the target that I did not even
consider the possibility that it might have been optimized. For whatever
reason, it is (eye-)noticeably slower than current libmpeg2 with the
non-SIMD ARM optimizations. Why is this so? I do not know. There could be
something wrong with VLC, but then again the FFmpeg H.264 decoding is
accelerated fine. Or it could be a build problem, or it could be that I did
not test properly, or it could be an FFmpeg problem in some other part.

-- 
Rémi Denis-Courmont

Re: [mpeg2-dev] [RFC] [PATCH] ARM Advanced SIMD motion compensation

From: Måns R. <ma...@ma...> - 2009-09-16 10:16:40

Rémi Denis-Courmont <re...@vi...> writes:

> On Tue, 15 Sep 2009 22:18:22 +0100, Måns Rullgård <ma...@ma...> wrote:
>>> Attached is an initial patch against libmpeg2 trunk to use NEON for
>>> motion compensation. This is preliminary. There are a bunch of known
>>> CPU stalls. Those could probably be fixed using plain assembly and
>>> interleaving subsequent loads. Also, iDCT is not optimized.
>> 
>> Why don't you steal the functions from FFmpeg instead?  They are much
>> better optimised than this patch.
>
> FFmpeg-based MPEG2 decoding was so slow on the target that I did not even
> consider the possibility that it might have been optimized. For whatever
> reason, it is (eye-)noticeably slower than current libmpeg2 with the
> non-SIMD ARM optimizations. Why is this so? I do not know. There could be

Even if the FFmpeg mpeg2 decoder is slow, the NEON MC and IDCT
functions should be fast.

> something wrong with VLC, but then again the FFmpeg H.264 decoding is
> accelerated fine. Or it could be a build problem, or it could be that I did
> not test properly, or it could be an FFmpeg problem in some other part.

Can you reproduce the difference with ffmpeg and mpeg2dec called
directly, not from vlc?  Could you do a quick oprofile run?

-- 
Måns Rullgård
ma...@ma...

Re: [mpeg2-dev] [RFC] [PATCH] ARM Advanced SIMD motion compensation

From: Måns R. <ma...@ma...> - 2009-09-18 14:25:23

Rémi Denis-Courmont <re...@vi...> writes:

> On Tue, 15 Sep 2009 22:18:22 +0100, Måns Rullgård <ma...@ma...> wrote:
>>> Attached is an initial patch against libmpeg2 trunk to use NEON for
>>> motion compensation. This is preliminary. There are a bunch of known
>>> CPU stalls. Those could probably be fixed using plain assembly and
>>> interleaving subsequent loads. Also, iDCT is not optimized.
>> 
>> Why don't you steal the functions from FFmpeg instead?  They are much
>> better optimised than this patch.
>
> FFmpeg-based MPEG2 decoding was so slow on the target that I did not even
> consider the possibility that it might have been optimized. For whatever
> reason, it is (eye-)noticeably slower than current libmpeg2 with the
> non-SIMD ARM optimizations. Why is this so? I do not know. There could be
> something wrong with VLC, but then again the FFmpeg H.264 decoding is
> accelerated fine. Or it could be a build problem, or it could be that I did
> not test properly, or it could be an FFmpeg problem in some other part.

I have compared FFmpeg against libmpeg2 myself, and FFmpeg on ARMv7 is
about 1.5x faster than libmpeg2.  If your experience with VLC is to
the contrary, there must be a problem in VLC.

-- 
Måns Rullgård
ma...@ma...

Re: [mpeg2-dev] Fwd: Re: [RFC] [PATCH] ARM Advanced SIMD motion compensation

From: Diego B. <di...@bi...> - 2009-09-18 15:39:30

On Fri, Sep 18, 2009 at 04:32:55PM +0200, Rémi Denis-Courmont wrote:
> 
> Anyway that's orthogonal to optimizing libmpeg2.

No offense to anybody, but optimizing libmpeg2 sounds like a waste of
time.  It's unmaintained and slower to begin with.

Diego

Re: [mpeg2-dev] [RFC] [PATCH] ARM Advanced SIMD motion compensation

From: Giridhar T. <gir...@ya...> - 2009-09-16 13:17:56

On NEON alignment:
There can be a lot of improvement in performance if buffers are aligned. Also, interleaving stores would avoid throttling the store buffer.
At most 8 D registers' worth of stores can be outstanding.

Instructions on the A8 are statically scheduled, and hence there are two varieties of loads and stores: one without an alignment restriction and one with an alignment restriction (@64, @128 or @256 bits). There are no intrinsics for specifying alignment.

So to get maximum memory bandwidth one has to align the buffers and also use the special instruction specifiers.

Example:
VLD1    {d0},[pSrc]        ;// takes 2 cycles
VLD1    {d0},[pSrc@64]     ;// takes 1 cycle
VST1    {d0},[pDst]        ;// takes 2 cycles
VST1    {d0},[pDst@64]     ;// takes 1 cycle

VLD1    {d0,d1},[pSrc]     ;// takes 2 cycles
VLD1    {d0,d1},[pSrc@128] ;// takes 1 cycle
VST1    {d0,d1},[pDst]     ;// takes 2 cycles
VST1    {d0,d1},[pDst@128] ;// takes 1 cycle
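
For comparison, here is roughly what the intrinsics route looks like in C. This is a sketch only, not the code from the attached patch: the function name `avg_8xh` is made up, and a scalar fallback is included so that it also builds on targets without NEON. The load and store intrinsics take a plain pointer, with no place to state an alignment, which is the limitation mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* dst[x] = (dst[x] + ref[x] + 1) >> 1 over `height` rows of 8 pixels,
 * i.e. a rounding average of the prediction into the destination.
 * Both planes are assumed to share the same stride (in bytes). */
static void avg_8xh(uint8_t *dst, const uint8_t *ref, size_t stride, int height)
{
    assert(height >= 0);
    for (int y = 0; y < height; y++) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
        /* vld1_u8/vst1_u8 take a bare pointer: no alignment qualifier. */
        uint8x8_t d = vld1_u8(dst);
        uint8x8_t r = vld1_u8(ref);
        vst1_u8(dst, vrhadd_u8(d, r));  /* VRHADD.U8: (a + b + 1) >> 1 */
#else
        /* Scalar fallback with the same rounding behaviour. */
        for (int x = 0; x < 8; x++)
            dst[x] = (uint8_t)((dst[x] + ref[x] + 1) >> 1);
#endif
        dst += stride;
        ref += stride;
    }
}
```

Getting the qualified forms shown above would mean writing those loads and stores in assembly instead.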

For more information on code examples one can check 
http://www.arm.com/products/multimedia/openmax/index.html

Regards,
/G

