
Essential reading for anyone playing with codecs and assembler

birdwes
2012-09-19
2012-12-06
  • birdwes

    birdwes - 2012-09-23

    I'm down to 9 seconds now for my modded source for the ITU g729a test suite on the Pi. It still sounds as expected when I play my Pi encoded "Mary had a little lamb" sample after encoding and decoding. I will publish patches for educational purposes in maybe a couple of weeks if I can squeeze some more out. There is still some mileage, but it's getting more difficult.

     
  • Gernot

    Gernot - 2012-09-24

    Amazing job so far! I'm really looking forward to seeing your results.

     
  • birdwes

    birdwes - 2012-09-25

    Down to 7 seconds now (the speech.in file is missing from my T*1 test folder for some reason), but we can now encode 11s of my speech on the Pi in about 1.5s. I'm not worried about decoding, as that is so much faster and takes care of itself as you optimize the encoder. It certainly looks as if we will be able to run a few channels in Asterisk.

    To test the playback quality of the test vector files, simply ftp them back to your workstation and import them as raw files into Audacity at 8000 Hz.
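
If Audacity's raw import is inconvenient, the same file can be wrapped in a WAV header instead. The following is only a minimal sketch, assuming the test vector holds 16-bit signed little-endian mono PCM at 8000 Hz; `make_wav_header` and `put_le` are hypothetical helper names, not part of the ITU sources.

```c
#include <stdint.h>
#include <string.h>

/* Write `bytes` bytes of `value` into dst, least significant byte first. */
static void put_le(uint8_t *dst, uint32_t value, int bytes)
{
    for (int i = 0; i < bytes; i++)
        dst[i] = (uint8_t)(value >> (8 * i));
}

/* Fill a 44-byte canonical PCM WAV header for mono 16-bit audio.
 * data_bytes is the size of the raw sample data that will follow it.
 * Assumption: the raw file really is 16-bit little-endian PCM. */
void make_wav_header(uint8_t hdr[44], uint32_t data_bytes, uint32_t sample_rate)
{
    const uint32_t channels = 1, bits = 16;
    const uint32_t block_align = channels * bits / 8;

    memcpy(hdr, "RIFF", 4);
    put_le(hdr + 4, 36 + data_bytes, 4);            /* RIFF chunk size */
    memcpy(hdr + 8, "WAVEfmt ", 8);
    put_le(hdr + 16, 16, 4);                        /* fmt chunk size */
    put_le(hdr + 20, 1, 2);                         /* format tag 1 = PCM */
    put_le(hdr + 22, channels, 2);
    put_le(hdr + 24, sample_rate, 4);
    put_le(hdr + 28, sample_rate * block_align, 4); /* bytes per second */
    put_le(hdr + 32, block_align, 2);
    put_le(hdr + 34, bits, 2);
    memcpy(hdr + 36, "data", 4);
    put_le(hdr + 40, data_bytes, 4);
}
```

Write the 44 header bytes to a new file, then append the raw samples unchanged; most players should then open it directly.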

    There is still some way to go though. I'm starting to get to functions they don't mention in the research papers, except for one: D4i40_17_fast (in acelp_ca.c), which keeps sticking out like a sore thumb in the profiler; some researchers did mention this to be a big problem area. If anyone has ideas for this then please let me know.

    More PDFs will follow soon.

    Thanks for setting up this space Gernot ;-)

    The journey is not over yet though...

     
  • birdwes

    birdwes - 2012-09-25

    Another very useful link - a clock cycle calculator for another ARM processor. It makes it a lot easier to understand how the pipeline works for instructions that take more than one cycle. The trick is to get clock cycles for "free" whilst you are waiting for a multiply result to be ready:

    See http://pulsar.webshaker.net/ccc/index.php?lng=us

     
  • birdwes

    birdwes - 2012-10-05

    I'm still here - recoding the inner loops of D4i40_17_fast (and some others perhaps) in assembler now. This will take some time (read: a few days or weeks as it is my spare time).

     
    Last edit: birdwes 2012-10-05
  • birdwes

    birdwes - 2012-10-08

    Early tests on the hand-coded i2 and i3 loops of D4i40_17_fast (not yet pipelined, still in a test harness and still a little buggy; it needs to be bit perfect) indicate that the native ARM version takes only about 30% of the execution time of the C version. I think we may be able to get over 10 channels? That is my target. This is going to take more time. Please be patient.

    The problem now is that it compiles with MS-VC 2008 (Win7-x64) and on Raspbian, but stack dumps on Cygwin on Win7-i386!

     
  • birdwes

    birdwes - 2012-10-13

    Roughly 4 seconds now for SPEECH.IN in the T*2 folder, just from more C optimisations. That's 35 seconds of speech. I haven't got the D4i40_17_fast stuff working yet.

     
  • birdwes

    birdwes - 2012-10-14

    Rats - I've just done something like the DPF optimisation in section 3.3.4 of http://www.cecs.uci.edu/technical_report/TR03-09.pdf. Lost "bit perfectness" (but it sounds fine). Need to backtrack and figure out where I lost a "bit".

    10 Research more
    20 Analyse the code (again)
    30 Optimise the code (again)
    40 Recompile the code on MSVC, then Cygwin, then the Pi
    50 Run the test vectors on all 3 platforms again
    60 if ( results == OK ) { Run gprof on Cygwin and Pi; if ( benchmark == BAD ) goto 10; }
    70 else { Analyse the errors; }
    80 Fix the C errors on MSVC
    90 ftp back to the Pi
    100 Fix the assembly errors on the Pi
    110 if ( code_OK && fast_enough ) goto 130;
    # INLINE SHOULD SAY "BEQ DONE"
    120 else goto 10
    asm( "DONE:" )
    130 Exit and go for a beer

     
    Last edit: birdwes 2012-10-14
  • birdwes

    birdwes - 2012-10-16

    Most annoying. The version from 2 days ago is "bit perfect". I can't see the error yet in the diffs.....

     
  • birdwes

    birdwes - 2012-10-18

    Bit perfect again (including the DPF optimisation above) in C on Visual Studio 2008, gcc Cygwin and the Pi. I still have some non-bit-perfect assembler to fix on the Pi though. I'm getting some very good benchmarks so far. I'm going to need a tester/reviewer in maybe 3-10 weeks? I think it's great that we've got to the stage (in just maybe 8 hrs per week for the last month) of exceeding optimisations that took 14 man months at Texas Instruments, Motorola and others; thanks to them publishing how they did it!

    Even the C-only version massively outperforms other publicly available sources for ARM. I still have a lot to do though. The patch is not yet fit for public release.

    This has only ever been benchmarked as a comparison against the ITU test suite, and nothing else. It is an educational experiment.

     
  • birdwes

    birdwes - 2012-10-24

    WOWOWOW!

        alp2 = L_mac (alp1, *p3++, _1_8);
        alp2 = L_mac (alp2, *p4++, _1_2);

        sq2 = mult (ps2, ps2);
        alp_16 = wround (alp2);

        tmp = L_mult (alp, sq2);

        /* not included: */
        s = L_msu (tmp, sq, alp_16);

    FROM acelp_ca.c - D4i40_17_fast()

    13.5 clock cycles for the bulk of an inner loop!
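
For readers who don't have the ITU basic operators to hand, here is a rough portable-C sketch of what the calls in the snippet above compute. The `sk_` names are hypothetical stand-ins rather than the ITU implementations (which also maintain a global overflow flag), `sk_round` stands in for `wround`, and the final comparison against zero is an assumption about how `s` is used: it would test sq2/alp_16 > sq/alp by cross-multiplication, without a division.

```c
#include <stdint.h>

typedef int16_t Word16;
typedef int32_t Word32;

/* Clamp a 64-bit intermediate to the 32-bit range. */
static Word32 sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (Word32)v;
}

/* 32-bit saturated result of 2*a*b. */
Word32 sk_L_mult(Word16 a, Word16 b) { return sat32(2 * (int64_t)a * b); }

/* Saturating multiply-accumulate: acc + 2*a*b. */
Word32 sk_L_mac(Word32 acc, Word16 a, Word16 b)
{
    return sat32((int64_t)acc + sk_L_mult(a, b));
}

/* Saturating multiply-subtract: acc - 2*a*b. */
Word32 sk_L_msu(Word32 acc, Word16 a, Word16 b)
{
    return sat32((int64_t)acc - sk_L_mult(a, b));
}

/* 16-bit fractional multiply: (a*b) >> 15, saturated.
 * Assumes >> on a negative int is an arithmetic shift (true for gcc/MSVC). */
Word16 sk_mult(Word16 a, Word16 b)
{
    int32_t p = ((int32_t)a * b) >> 15;
    return p > INT16_MAX ? INT16_MAX : (Word16)p;
}

/* Round a 32-bit value to its upper 16 bits. */
Word16 sk_round(Word32 x)
{
    return (Word16)(sat32((int64_t)x + 0x8000) >> 16);
}

/* Assumed use of s: non-zero when sq2/alp_16 beats sq/alp,
 * i.e. when alp*sq2 - sq*alp_16 > 0. */
int candidate_is_better(Word16 sq, Word16 alp, Word16 sq2, Word16 alp_16)
{
    Word32 s = sk_L_msu(sk_L_mult(alp, sq2), sq, alp_16);
    return s > 0;
}
```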

     
  • birdwes

    birdwes - 2012-10-29

    Still on the case - it's going to be some time yet though... (acelp_ca.c)

    Just 26 clock cycles for this gem (per inner loop) - it's all about stuffing the pipeline: http://pulsar.webshaker.net/ccc/sample-ff09ec64

        for (i = k + (Word16) 1; i < NB_POS; i++) {
    #ifndef DDD
            Word32 a,b,x,y;
    
            __asm __volatile("\n" \
    "@ label                                        \n"     \
    "       ldrsh   %[a], [%[ptr_h1]], #2           \n"     \
    "       ldrsh   %[b], [%[ptr_h2]], #2           \n"     \
    "@ Store p0 in *p2, p1 in *p3 as we need it for temp\n" \
    "       str     %[p0], [%[p2]]                  \n"     \
    "       str     %[p1], [%[p3]]                  \n"     \
    "       smulbb  %[x], %[a], %[b]                \n"     \
    "       ldrsh   %[p0], [%[ptr_h1]], #2          \n"     \
    "       ldrsh   %[p1], [%[ptr_h2]], #2          \n"     \
    "       smulbb  %[y], %[p0], %[p1]              \n"     \
    "       ldrsh   %[a], [%[ptr_h1]], #2           \n"     \
    "       ldrsh   %[b], [%[ptr_h2]], #2           \n"     \
    "       qdadd   %[cor], %[cor], %[x]            \n"     \
    "       ldrsh   %[p0], [%[ptr_h1]], #2          \n"     \
    "       ldrsh   %[p1], [%[ptr_h2]], #2          \n"     \
    "       smulbb  %[x], %[a], %[b]                \n"     \
    "       qdadd   %[cor], %[cor], %[y]            \n"     \
    "       ldrsh   %[a], [%[ptr_h1]], #2           \n"     \
    "       ldrsh   %[b], [%[ptr_h2]], #2           \n"     \
    "       smulbb  %[y], %[p0], %[p1]              \n"     \
    "@put p1 back                                   \n"     \
    "       ldr     %[p1], [%[p3]]                  \n"     \
    "       mov     %[p0], %[cor], ASR #16          \n"     \
    "       qdadd   %[cor], %[cor], %[x]            \n"     \
    "       strh    %[p0], [%[p3]]                  \n"     \
    "       smulbb  %[x], %[a], %[b]                \n"     \
    "@Done with p0 as a and b                       \n"     \
    "       ldr     %[p0], [%[p2]]                  \n"     \
    "       mov     %[a], %[cor], ASR #16           \n"     \
    "       qdadd   %[cor], %[cor], %[y]            \n"     \
    "       strh    %[a], [%[p2]]                   \n"     \
    "       sub     %[p3], %[p3], #18               \n"     \
    "       sub     %[p2], %[p2], #18               \n"     \
    "       mov     %[b], %[cor], ASR #16           \n"     \
    "       qdadd   %[cor], %[cor], %[x]            \n"     \
    "       strh    %[b], [%[p1]]                   \n"     \
    "       sub     %[p1], %[p1], #18               \n"     \
    "       mov     %[a], %[cor], ASR #16           \n"     \
    "       strh    %[a], [%[p0]]                   \n"     \
    "       sub     %[p0], %[p0], #18               \n"     \
    "@ label                                        \n"     \
    "                                               \n"     \
            : [cor] "+r" (cor), [ptr_h1] "+r" (ptr_h1), [ptr_h2] "+r" (ptr_h2), \
                    [p0] "+r" (p0), [p1] "+r" (p1), [p2] "+r" (p2), \
                    [p3] "+r" (p3), [a] "=&r" (a), [b] "=&r" (b), [x] "=&r" (x), [y] "=&r" (y) \
            : \
            : "memory" /* the str/strh instructions write through p0..p3 */ );
    #else
    
          cor = L_mac (cor, *ptr_h1, *ptr_h2);
          ptr_h1++;
          ptr_h2++;
          cor = L_mac (cor, *ptr_h1, *ptr_h2);
          ptr_h1++;
          ptr_h2++;
          *p3 = extract_h (cor);
    
          cor = L_mac (cor, *ptr_h1, *ptr_h2);
          ptr_h1++;
          ptr_h2++;
          *p2 = extract_h (cor);
    
          cor = L_mac (cor, *ptr_h1, *ptr_h2);
          ptr_h1++;
          ptr_h2++;
          *p1 = extract_h (cor);
    
          cor = L_mac (cor, *ptr_h1, *ptr_h2);
          ptr_h1++;
          ptr_h2++;
          *p0 = extract_h (cor);
    
          p3 -= ldec;
          p2 -= ldec;
          p1 -= ldec;
          p0 -= ldec;
    #endif
        }
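
The assembler above relies on `smulbb` followed by `qdadd` producing the same result as one `L_mac`. A small C model of the two instructions, as a sketch, makes that easier to check on a workstation. The `model_` and `ref_` function names are hypothetical, and the instruction semantics are written from the ARM architecture description rather than taken from the original patch.

```c
#include <stdint.h>

/* Clamp a 64-bit intermediate to the 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Sign-extend the bottom halfword of a 32-bit register value. */
static int32_t bottom_half(int32_t r)
{
    int32_t lo = r & 0xFFFF;
    return (lo & 0x8000) ? lo - 0x10000 : lo;
}

/* SMULBB: signed 16x16 product of the two bottom halfwords. */
int32_t model_smulbb(int32_t rn, int32_t rm)
{
    return bottom_half(rn) * bottom_half(rm);
}

/* QDADD acc, acc, x: acc = sat(acc + sat(2 * x)). */
int32_t model_qdadd(int32_t acc, int32_t x)
{
    return sat32((int64_t)acc + sat32(2 * (int64_t)x));
}

/* Reference behaviour: saturating acc + 2*a*b, with the doubling
 * saturated first and the addition saturated second. */
int32_t ref_L_mac(int32_t acc, int16_t a, int16_t b)
{
    int32_t doubled = sat32(2 * (int64_t)a * b);
    return sat32((int64_t)acc + doubled);
}
```

The corner case worth checking is a = b = -32768, where the doubling itself saturates; both paths agree there, which is what lets the loop stay bit perfect.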
    
     
    Last edit: birdwes 2012-10-29
  • birdwes

    birdwes - 2012-10-29

    Look at the timestamps on these tests! Under 3 seconds to encode 37 seconds of speech - bit perfect. The code still needs some tidying up though.

    root@raspberrypi:~/g729a_h/TV729A_2# date; ../coder speech.in speech.bit; date
    Mon Oct 29 21:31:25 UTC 2012

    ** ITU G.729A 8 KBIT/S SPEECH CODER **

    ------------------- Fixed point C simulation -----------------

    ------------------- Version 1.1 -----------------

    Input speech file : speech.in
    Output bitstream file: speech.bit
    Mon Oct 29 21:31:29 UTC 2012
    root@raspberrypi:~/g729a_h/TV729A_2# date; ../coder speech.in speech.bit; date
    Mon Oct 29 21:31:30 UTC 2012

    ** ITU G.729A 8 KBIT/S SPEECH CODER **

    ------------------- Fixed point C simulation -----------------

    ------------------- Version 1.1 -----------------

    Input speech file : speech.in
    Output bitstream file: speech.bit
    Mon Oct 29 21:31:33 UTC 2012
    root@raspberrypi:~/g729a_h/TV729A_2# date; ../coder speech.in speech.bit; date
    Mon Oct 29 21:31:34 UTC 2012

    ** ITU G.729A 8 KBIT/S SPEECH CODER **

    ------------------- Fixed point C simulation -----------------

    ------------------- Version 1.1 -----------------

    Input speech file : speech.in
    Output bitstream file: speech.bit
    Mon Oct 29 21:31:38 UTC 2012
    root@raspberrypi:~/g729a_h/TV729A_2# date; ../coder speech.in speech.bit; date
    Mon Oct 29 21:31:40 UTC 2012

    ** ITU G.729A 8 KBIT/S SPEECH CODER **

    ------------------- Fixed point C simulation -----------------

    ------------------- Version 1.1 -----------------

    Input speech file : speech.in
    Output bitstream file: speech.bit
    Mon Oct 29 21:31:43 UTC 2012
    root@raspberrypi:~/g729a_h/TV729A_2# diff speech.bit speech.bit.ref
    root@raspberrypi:~/g729a_h/TV729A_2#
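
The empty `diff` output above is the bit-perfect check. When the files do differ, it helps to know which frame went wrong first. A minimal sketch, assuming the ITU serial bitstream layout of 82 16-bit words per 10 ms frame (a sync word, a size word and 80 bit words); `first_mismatch` and `mismatch_frame` are hypothetical helpers operating on buffers already read into memory:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 2 header words + 80 bit words per encoded frame. */
#define WORDS_PER_FRAME 82

/* Return the index of the first differing 16-bit word, or -1 if the two
 * buffers are identical. If one buffer is a prefix of the other, the
 * length of the shorter one is returned. */
long first_mismatch(const uint16_t *a, size_t na,
                    const uint16_t *b, size_t nb)
{
    size_t n = na < nb ? na : nb;
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i])
            return (long)i;
    }
    return na == nb ? -1 : (long)n;
}

/* Map a word index to a zero-based frame number (-1 stays -1). */
long mismatch_frame(long word_index)
{
    return word_index < 0 ? -1 : word_index / WORDS_PER_FRAME;
}
```

With 10 ms frames, the frame number divided by 100 gives the approximate position in seconds to listen to.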

     
  • birdwes

    birdwes - 2012-11-02

    My optimisation work is nearly complete. According to the GNU profiler (gprof) the encoder is now taking around 2.75 seconds and the decoder around 0.75 seconds for the 37.5 second test vector. Add those together and you get 3.5 seconds, less than 10% of the realtime rate for duplex audio. So, we're probably looking at 10% CPU per full duplex channel.

    Excluding any other software running, that does mean that on G729a alone a Raspberry Pi IS just powerful enough to encode/decode 10 full duplex channels!
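
The arithmetic behind the 10-channel estimate can be written out as a small sketch. The function names are hypothetical, and the figures are the gprof timings quoted above; it ignores Asterisk and everything else running on the Pi.

```c
/* Fraction of one CPU used by one full-duplex channel:
 * (encode time + decode time) / duration of the audio processed. */
double cpu_fraction_per_channel(double encode_s, double decode_s, double audio_s)
{
    return (encode_s + decode_s) / audio_s;
}

/* Whole channels that fit into a CPU budget (1.0 = the entire CPU). */
int max_channels(double fraction_per_channel, double budget)
{
    if (fraction_per_channel <= 0.0)
        return 0;
    return (int)(budget / fraction_per_channel);
}
```

With 2.75 s of encoding and 0.75 s of decoding for 37.5 s of audio, the fraction is 3.5 / 37.5, a little over 9%, and `max_channels(..., 1.0)` comes out at 10.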

    I still have lots of tidying up to do, and there are a few more areas left to optimise, but there is not much fat left to trim. It's nearly at the limit of diminishing returns.

     
  • TirsoJRP

    TirsoJRP - 2012-11-06

    Any chance a mere mortal can test this?

     
  • Chris Lucksted

    Chris Lucksted - 2012-11-08

    Ditto here.. if you are looking for beta testing, we're looking for a PI 729 solution to deal with SIP trunk providers in South America!

     
  • birdwes

    birdwes - 2012-11-15

    I've found more fat to trim. Encoder down to 2.39s and decoder down to 0.51s for the 37.5s test vector. Bit Perfect!

     
  • birdwes

    birdwes - 2012-11-16

    There is still more to go. Anyone interested in the source code and how I'm doing this needs to know that the ITU primitives L_mac and L_msu are the enemy, especially in loops (even non-nested ones) like this one in (modified) pitch_a.c - Pitch_ol_fast():

    ~~~~

    /*--------------------------------------------------------*
     * Verification for risk of overflow.                     *
     *--------------------------------------------------------*/

    Overflow = 0;
    sum = 0;

    /* TODO ARM */
    for (i = (0 - (PIT_MAX)); i < L_FRAME; i++) {
        sum = L_mac_o (sum, signal[i], signal[i], &Overflow);
    }

    ~~~~

    This loop has hundreds of thousands of iterations. I'm going to use the Q flag for overflow detection, instead of passing a C pointer reference.

    ARM instructions SMLAD and SMUAD are really useful here. LDR is faster than LDMIA if you interleave the access.
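
As a portable point of comparison (not the Q-flag approach, which needs ARM inline assembly), the overflow check can also be hoisted out of the loop by accumulating in 64 bits and testing once at the end. Because every term 2*x*x is non-negative, the running 32-bit sum saturates if and only if the exact total exceeds INT32_MAX. `energy_with_overflow` is a hypothetical name, and this is only a sketch of the idea, not the code used in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum 2*x[i]*x[i] over n samples.
 * Returns the total saturated to INT32_MAX and sets *overflow to 1 if
 * saturation happened, 0 otherwise. The 64-bit accumulator cannot wrap
 * for any realistic frame length, since each term is at most 2^31. */
int32_t energy_with_overflow(const int16_t *x, size_t n, int *overflow)
{
    int64_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += 2 * (int64_t)x[i] * x[i];

    *overflow = total > INT32_MAX;
    return *overflow ? INT32_MAX : (int32_t)total;
}
```

For the loop above, the call would look something like `energy_with_overflow(signal - PIT_MAX, PIT_MAX + L_FRAME, &Overflow)`, since the original indexes `signal` from `-PIT_MAX`.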

     
  • birdwes

    birdwes - 2012-11-19

    2.07s for the encoder (I did see 1.99s pop up on gprof once). The decoder has a 1-bit problem with a 1min 17s extract from Winston Churchill's speech "Never, in the field of human conflict...", even though it is bit perfect on the ITU test vector "speech.in".

    It's not quite bit perfect on my new test vector winston.in; one click 45s in, but we can fix that. I'm close to the limit now.

    Now that it's fast, I'm using bigger test vectors.

     
    Last edit: birdwes 2012-11-19