From: Giridhar T. <gir...@ya...> - 2009-09-16 13:17:56
On NEON alignment: performance can improve considerably if the buffers
are aligned. Interleaving stores also keeps the store buffer from being
throttled; the maximum number of outstanding stores is 8 D registers.

Instructions on the A8 are statically scheduled, hence there are two
varieties of loads and stores: one without an alignment restriction and
one with an alignment restriction (@64, @128 or @256 bits). There is no
intrinsic for specifying alignment. So to get maximum memory bandwidth,
one has to align the buffers and also use the alignment specifiers on
the instructions.

Example:

    VLD1 {d0},[pSrc]          ;// takes 2 cycles
    VLD1 {d0,d1},[pSrc@64]    ;// takes 1 cycle
    VST1 {d0},[pDst]          ;// takes 2 cycles
    VST1 {d0,d1},[pDst@64]    ;// takes 1 cycle

    VLD1 {d0,d1},[pSrc]       ;// takes 2 cycles
    VLD1 {d0,d1},[pSrc@128]   ;// takes 1 cycle
    VST1 {d0,d1},[pDst]       ;// takes 2 cycles
    VST1 {d0,d1},[pDst@128]   ;// takes 1 cycle

For more code examples, one can check
http://www.arm.com/products/multimedia/openmax/index.html

Regards,
/G

--- On Wed, 9/16/09, Rémi Denis-Courmont <re...@vi...> wrote:

> From: Rémi Denis-Courmont <re...@vi...>
> Subject: [mpeg2-dev] [RFC] [PATCH] ARM Advanced SIMD motion compensation
> To: lib...@li...
> Date: Wednesday, September 16, 2009, 1:42 AM
>
> Hello all,
>
> ARMv7 includes an optional "Advanced SIMD" instruction set,
> commercially known as NEON. This is included in the recent Cortex line
> of ARM processors. In particular, Cortex-A8 is found on TI OMAP3xxx
> boards such as the BeagleBoard, or the Nokia N900.
>
> Attached is an initial patch against libmpeg2 trunk to use NEON for
> motion compensation. This is preliminary. There are a bunch of known
> CPU stalls. Those could probably be fixed using plain assembly and
> interleaving subsequent loads. Also, iDCT is not optimized.
> Anyway, here are my results with an OMAP3430 board:
>
> With C, no acceleration:
>   7305 frames in 19.87 sec (367.64 fps), 155 last 0.50 sec (310.00 fps)
>   7308 frames decoded in 19.88 seconds (367.61 fps)
>
>   7288 frames in 19.88 sec (366.60 fps), 170 last 0.50 sec (340.00 fps)
>   7308 frames decoded in 19.95 seconds (366.32 fps)
>
> With ARM acceleration (current libmpeg2):
>   7254 frames in 18.88 sec (384.22 fps), 180 last 0.50 sec (360.00 fps)
>   7308 frames decoded in 19.04 seconds (383.82 fps)
>
>   7263 frames in 18.88 sec (384.69 fps), 175 last 0.50 sec (350.00 fps)
>   7308 frames decoded in 19.02 seconds (384.23 fps)
>
> With NEON acceleration (this patch):
>   7129 frames in 15.39 sec (463.22 fps), 245 last 0.50 sec (490.00 fps)
>   7308 frames decoded in 15.85 seconds (461.07 fps)
>
>   7127 frames in 15.38 sec (463.39 fps), 245 last 0.50 sec (490.00 fps)
>   7308 frames decoded in 15.85 seconds (461.07 fps)
>
> So, there is already quite a big improvement!
>
> I wonder if there is any guarantee on the memory alignment of some of
> the buffers? NEON can save one cycle per load/store if we use
> alignment-specific opcodes. Currently, the code assumes no alignment.
>
> Comments welcome!
>
> --
> Rémi Denis-Courmont
> http://git.remlab.net/cgi-bin/gitweb.cgi?p=vlc-courmisch.git;a=summary
>
> _______________________________________________
> Libmpeg2-devel mailing list
> Lib...@li...
> https://lists.sourceforge.net/lists/listinfo/libmpeg2-devel
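
As a rough illustration of the "align the buffers" step from the alignment discussion above, here is a minimal C sketch, assuming a POSIX system that provides posix_memalign. The helper names alloc_aligned and is_aligned are made up for this example; they are not part of libmpeg2 or of any NEON API. The sketch only shows how a caller could obtain a buffer aligned to 64, 128 or 256 bits and check that alignment before relying on an @64/@128/@256 specifier.

```c
#define _POSIX_C_SOURCE 200112L  /* request posix_memalign from <stdlib.h> */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: nonzero if p is aligned to `bits` bits
 * (e.g. 64, 128 or 256, matching the @64/@128/@256 specifiers). */
static int is_aligned(const void *p, unsigned bits)
{
    return ((uintptr_t)p % (bits / 8u)) == 0;
}

/* Hypothetical helper: allocate `size` bytes aligned to `bits` bits.
 * Returns NULL on failure; release the buffer with free(). */
static void *alloc_aligned(size_t size, unsigned bits)
{
    void *p = NULL;

    /* posix_memalign needs a power-of-two alignment that is a multiple
     * of sizeof(void *); 64 bits and above satisfies this on common ABIs. */
    assert(bits >= 64 && (bits & (bits - 1u)) == 0);

    if (posix_memalign(&p, bits / 8u, size) != 0)
        return NULL;
    return p;
}
```

A caller would allocate with, say, alloc_aligned(size, 128) and pass the result as pSrc or pDst. Whether libmpeg2 itself guarantees any alignment for its buffers is the open question in the quoted message, and this sketch does not answer it.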