From: R. Denis-C. <re...@vi...> - 2009-09-15 20:31:02
Attachments:
libmpeg2-mc-neon-v0.patch
Hello all,

ARMv7 includes an optional "Advanced SIMD" instruction set, commercially known as NEON. It is included in the recent Cortex line of ARM processors. In particular, the Cortex-A8 is found on TI OMAP3xxx boards such as the BeagleBoard, and in the Nokia N900.

Attached is an initial patch against libmpeg2 trunk to use NEON for motion compensation. This is preliminary. There are a bunch of known CPU stalls. Those could probably be fixed using plain assembly and interleaving subsequent loads. Also, the iDCT is not optimized.

Anyway, here are my results with an OMAP3430 board:

With C, no acceleration:
7305 frames in 19.87 sec (367.64 fps), 155 last 0.50 sec (310.00 fps)
7308 frames decoded in 19.88 seconds (367.61 fps)
7288 frames in 19.88 sec (366.60 fps), 170 last 0.50 sec (340.00 fps)
7308 frames decoded in 19.95 seconds (366.32 fps)

With ARM acceleration (current libmpeg2):
7254 frames in 18.88 sec (384.22 fps), 180 last 0.50 sec (360.00 fps)
7308 frames decoded in 19.04 seconds (383.82 fps)
7263 frames in 18.88 sec (384.69 fps), 175 last 0.50 sec (350.00 fps)
7308 frames decoded in 19.02 seconds (384.23 fps)

With NEON acceleration (this patch):
7129 frames in 15.39 sec (463.22 fps), 245 last 0.50 sec (490.00 fps)
7308 frames decoded in 15.85 seconds (461.07 fps)
7127 frames in 15.38 sec (463.39 fps), 245 last 0.50 sec (490.00 fps)
7308 frames decoded in 15.85 seconds (461.07 fps)

So, there is already quite a big improvement!

I wonder if there is any guarantee on the memory alignment of some of the buffers? NEON can save one cycle per load/store if we use alignment-specific opcodes. Currently, the code assumes no alignment.

Comments welcome!

--
Rémi Denis-Courmont
http://git.remlab.net/cgi-bin/gitweb.cgi?p=vlc-courmisch.git;a=summary
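For readers comparing the three configurations, the relative speedups follow directly from the "frames decoded" totals quoted in the message. The following is only a minimal sketch in C of that arithmetic; the helper name `speedup` is introduced here for illustration and is not part of libmpeg2 or the patch.

```c
#include <assert.h>

/* Hypothetical helper (not from libmpeg2): ratio of two wall-clock
 * times measured for the same workload, here 7308 decoded frames.
 * A result above 1.0 means the second configuration is faster. */
static double speedup(double baseline_sec, double optimized_sec)
{
    assert(optimized_sec > 0.0); /* avoid dividing by zero */
    return baseline_sec / optimized_sec;
}
```

With the first run of each configuration, `speedup(19.88, 15.85)` is about 1.25 and `speedup(19.04, 15.85)` is about 1.20, so the NEON patch is roughly 25% faster than plain C and roughly 20% faster than the existing ARM code on this clip.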
From: Rémi Denis-C. <re...@vi...> - 2009-09-16 06:45:52
On Tue, 15 Sep 2009 22:18:22 +0100, Måns Rullgård <ma...@ma...> wrote:
>> Attached is an initial patch against libmpeg2 trunk to use NEON for
>> motion compensation. This is preliminary. There are a bunch of known
>> CPU stalls. Those could probably be fixed using plain assembly and
>> interleaving subsequent loads. Also, iDCT is not optimized.
>
> Why don't you steal the functions from FFmpeg instead? They are much
> better optimised than this patch.

FFmpeg-based MPEG2 decoding was so slow on the target that I did not even consider the possibility that it might have been optimized. For whatever reason, it is (eye-)noticeably slower than current libmpeg2 with the non-SIMD ARM optimizations.

Why is this so? I do not know. There could be something wrong with VLC, but then again the FFmpeg h.264 decoding is accelerated fine. Or it could be a build problem, or it could be that I did not test properly, or it could be an FFmpeg problem in some other part.

--
Rémi Denis-Courmont
From: Måns R. <ma...@ma...> - 2009-09-16 10:16:40
Rémi Denis-Courmont <re...@vi...> writes:

> On Tue, 15 Sep 2009 22:18:22 +0100, Måns Rullgård <ma...@ma...> wrote:
>>> Attached is an initial patch against libmpeg2 trunk to use NEON for
>>> motion compensation. This is preliminary. There are a bunch of known
>>> CPU stalls. Those could probably be fixed using plain assembly and
>>> interleaving subsequent loads. Also, iDCT is not optimized.
>>
>> Why don't you steal the functions from FFmpeg instead? They are much
>> better optimised than this patch.
>
> FFmpeg-based MPEG2 decoding was so slow on the target, that I did not even
> consider the possibility that it might have been optimized. For whatever
> reason, it is (eye-)noticeably slower than current libmpeg2 with the
> non-SIMD ARM optimizations. Why is this so? I do not know. There could be

Even if the FFmpeg mpeg2 decoder is slow, the NEON MC and IDCT functions should be fast.

> something wrong with VLC, but then again the FFmpeg h.264 decoding is
> accelerated fine. Or it could be a build problem, or it could be that I did
> not test properly, or it could be a FFmpeg problem in other part.

Can you reproduce the difference with ffmpeg and mpeg2dec called directly, not from vlc? Could you do a quick oprofile run?

--
Måns Rullgård
ma...@ma...
From: Måns R. <ma...@ma...> - 2009-09-18 14:25:23
Rémi Denis-Courmont <re...@vi...> writes:

> On Tue, 15 Sep 2009 22:18:22 +0100, Måns Rullgård <ma...@ma...> wrote:
>>> Attached is an initial patch against libmpeg2 trunk to use NEON for
>>> motion compensation. This is preliminary. There are a bunch of known
>>> CPU stalls. Those could probably be fixed using plain assembly and
>>> interleaving subsequent loads. Also, iDCT is not optimized.
>>
>> Why don't you steal the functions from FFmpeg instead? They are much
>> better optimised than this patch.
>
> FFmpeg-based MPEG2 decoding was so slow on the target, that I did not even
> consider the possibility that it might have been optimized. For whatever
> reason, it is (eye-)noticeably slower than current libmpeg2 with the
> non-SIMD ARM optimizations. Why is this so? I do not know. There could be
> something wrong with VLC, but then again the FFmpeg h.264 decoding is
> accelerated fine. Or it could be a build problem, or it could be that I did
> not test properly, or it could be a FFmpeg problem in other part.

I have compared FFmpeg against libmpeg2 myself, and FFmpeg on ARMv7 is about 1.5x faster than libmpeg2. If your experience with VLC is to the contrary, there must be a problem in VLC.

--
Måns Rullgård
ma...@ma...
From: Diego B. <di...@bi...> - 2009-09-18 15:39:30
On Fri, Sep 18, 2009 at 04:32:55PM +0200, Rémi Denis-Courmont wrote:
>
> Anyway that's orthogonal to optimizing libmpeg2.

No offense to anybody, but optimizing libmpeg2 sounds like a waste of time. It's unmaintained and slower to begin with.

Diego
From: Giridhar T. <gir...@ya...> - 2009-09-16 13:17:56
On NEON alignment:

There can be a lot of improvement in performance if buffers are aligned. Also, interleaving stores avoids throttling the store buffer; at most 8 D registers' worth of stores can be outstanding.

Instructions on the A8 are statically scheduled, and hence there are two varieties of loads and stores: one without an alignment restriction and one with an alignment restriction (@64, @128 or @256 bits). There are no intrinsics for specifying alignment. So to get maximum memory bandwidth one has to align the buffers and also use the special instruction specifiers.

Example:

    VLD1 {d0},[pSrc]          ;// takes 2 cycles
    VLD1 {d0,d1},[pSrc@64]    ;// takes 1 cycle

    VST1 {d0},[pDst]          ;// takes 2 cycles
    VST1 {d0,d1},[pDst@64]    ;// takes 1 cycle

    VLD1 {d0,d1},[pSrc]       ;// takes 2 cycles
    VLD1 {d0,d1},[pSrc@128]   ;// takes 1 cycle

    VST1 {d0,d1},[pDst]       ;// takes 2 cycles
    VST1 {d0,d1},[pDst@128]   ;// takes 1 cycle

For more information and code examples one can check
http://www.arm.com/products/multimedia/openmax/index.html

Regards,
/G

--- On Wed, 9/16/09, Rémi Denis-Courmont <re...@vi...> wrote:

> From: Rémi Denis-Courmont <re...@vi...>
> Subject: [mpeg2-dev] [RFC] [PATCH] ARM Advanced SIMD motion compensation
> To: lib...@li...
> Date: Wednesday, September 16, 2009, 1:42 AM
>
> I wonder if there is any warranty on the memory alignment of some of the
> buffers? NEON can save one cycle per load/store if we use aligned-specific
> opcodes. Currently, the code assumes no alignment.
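As a rough illustration of the buffer side of the alignment advice in the message above, the sketch below allocates a buffer aligned to 128 bits and checks a pointer against the 64-bit and 128-bit boundaries that the `@64` and `@128` qualifiers require. It assumes C11 `aligned_alloc` is available on the target; the helper names `is_aligned_bits` and `alloc_neon_buffer` are hypothetical and not part of libmpeg2 or the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns nonzero if ptr sits on a `bits`-bit boundary (e.g. 64 or 128),
 * i.e. whether an @64 / @128 qualified access would be legal for it. */
static int is_aligned_bits(const void *ptr, unsigned bits)
{
    return ((uintptr_t)ptr % (bits / 8u)) == 0;
}

/* Hypothetical helper: allocate at least `size` bytes aligned to
 * 128 bits (16 bytes). C11 aligned_alloc requires the size to be a
 * multiple of the alignment, so the request is rounded up.
 * Returns NULL on failure; release the buffer with free(). */
static uint8_t *alloc_neon_buffer(size_t size)
{
    const size_t align = 16;
    size_t rounded;

    if (size == 0)
        size = 1;                     /* avoid a zero-sized request */
    rounded = (size + align - 1) / align * align;
    if (rounded < size)
        return NULL;                  /* rounding overflowed */
    return aligned_alloc(align, rounded);
}
```

A decoder would still have to ensure that row strides and per-block offsets preserve the alignment before selecting the aligned load/store variants; a check such as `is_aligned_bits(p, 128)` only covers the base pointer.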