Optimisation of DPTR usage

  • Robert Gush

    Robert Gush - 2010-01-30

    The usage of external memory seems use excessive reloading of DPTR even when it is clear that DPTR has not changed.
    Sample code from list file:
                                  2 ; File Created by SDCC : free open source ANSI-C Compiler
                                  3 ; Version 2.9.4 #5595 (Dec 15 2009) (MINGW32)

                              2406 ; Hardware.c:1101: ShadPorts |= 0x01;
       0608 90s00r03           2407 mov dptr,#(_ShadPorts + 0x0003)
       060B E0                 2408 movx a,@dptr
       060C FA                 2409 mov r2,a
       060D 43 02 01           2410 orl ar2,#0x01
       0610 90s00r03           2411 mov dptr,#(_ShadPorts + 0x0003)
       0613 EA                 2412 mov a,r2
       0614 F0                 2413 movx @dptr,a
       0615 80 0D              2414 sjmp 00103$

    Any suggestions on how to improve this?


  • Frieder Ferlemann

    Hi Robert,

    there are several approaches to this:

    a) write a peephole that fixes this special case

    b) get someone to investigate in why different code is generated for
        "my_xdata_char_array |= 0x01;"   and
        "my_xdata_char |= 0x01;"
       If the same code would be generated then one of the
       existing peepholes 248.a to 248.h would apply.
       This would be preferred over a set of peephole rules like a)

    c) extend sdcc/src/mcs51/rtrack.c so that symbolic values assigned to dptr (and possibly R0,R1 for variables in idata/pdata memory) are handled as well. This would be preferred over b) because it would work for other operations as well.

    d) have tracking of register values on a slightly higher level (that knows about basic block dominators and predecessors (manual 9.2))
      Having d) would be most preferred:)

    Anyway probably starting with a) makes sense (but only fixes your specific example).

    Save the following lines into file "mypeeph.def" and append
    the options "-fverbose-asm -peep-file mypeeph.def" to the command line.

    replace {
            mov     dptr,%1
            movx    a,@dptr
            mov     %2,a
            orl     a%2,%3
            mov     dptr,%1
            mov     a,%2
            movx    @dptr,a
    } by {
            mov     dptr,%1
            movx    a,@dptr
            ;       Peephole my248.x        optimized or to xdata array
            orl     a,%3
            mov     %2,a
            movx    @dptr,a
    } if operandsNotRelated( %3 %2)

    There is an old (2001) related feature request "Optimization: Track State of ACC & DPTR" at:

  • Robert Gush

    Robert Gush - 2010-01-31

    Hi Frieder,

    thanks for your prompt response, I will think about a), but as you indicate it only handles the example I posted. The is the '&=' case too and the repeated loading of dptr with the same value, eg:

                               2683 ; Hardware.c:1231: WRtoBPnoT1(C);
       0728 90 89 00           2684 mov dptr,#_HardWarePort
       072B EC                 2685 mov a,r4
       072C F0                 2686 movx @dptr,a
       072D C2 97              2687 clr _P1_7
       072F 00                 2688 nop
       0730 D2 97              2689 setb _P1_7
                               2690 ; Hardware.c:1232: WRtoBPnoT1(c+1); //clock hi
       0732 90 89 00           2691 mov dptr,#_HardWarePort
       0735 ED                 2692 mov a,r5
       0736 F0                 2693 movx @dptr,a
       0737 C2 97              2694 clr _P1_7
       0739 00                 2695 nop
       073A D2 97              2696 setb _P1_7

    WRtoBPnoT1 is a macro.
    What is a little frustrating is that there are some absolutely brilliant bits in the generated code - one shown above is R5 is set to R4 +1 outside the loop - and then the repeated loading of dptr.
    On the plus side, the codedoes work.
    Thanks again for the feedback, and to all the developers for making SDCC possible.


  • dfulab

    dfulab - 2010-02-07

    Speaking of inefficient code, which one do you think is the fastest for MCS51?

    The good old 8051 has a hardware multiplier that does a 8 bit by 8-bit multiply in 4 cycles.
    Thankfully, the compile doesn't try to "optimize" multiply by 2^n by shifts.
    Unless you use inline assembly code for special cases,  you are almost better off in SDCC using a multiply.

    unsigned short Mult16(unsigned char Block)

    unsigned short MyShift4(unsigned char Block) __naked
    { Block;

       mov  a,dpl
       swap a
       mov _DPH,a // DPH = (Block & 0xf0)>>4
       mov  _DPL,a // DPL = (Block & 0x0f)<<4
       anl _DPL,#0xF0
       anl _DPH,#0x0F  

    unsigned short Shift4(unsigned char Block)

    The above code is compiled to:

    102 _Mult16:
    111 ; test.c:3: return(Block*16);
    112 mov a,dpl
    113 mov b,#0x10
    114 mul ab
    115 mov dpl,a
    116 mov dph,b
    117 ret

    127 _MyShift4:
    128 ; naked function: no prologue.
    129 ; test.c:18: _endasm;
    131    mov a,dpl
    132    swap a
    133    mov _DPH,a
    134    mov _DPL,a
    135    anl _DPL,#0xF0
    136    anl _DPH,#0x0F
    137    ret

    148 _Shift4:
    149 ; test.c:23: return(Block<<4);
    150 clr a
    151 swap a
    152 anl a,#0xf0
    153 xch a,dpl
    154 swap a
    155 xch a,dpl
    156 xrl a,dpl
    157 xch a,dpl
    158 anl a,#0xf0
    159 xch a,dpl
    160 xrl a,dpl
    161 mov dph,a
    162 ret

  • Frieder Ferlemann

    Hi Robert,

    (you might have wanted to open a separate topic for the shifting stuff - I guess no-one can digg this out again?^)

    How about:

    unsigned short MyShift4(unsigned char Block) __naked

          mov     a,dpl
          swap    a
          mov     dph,a
          anl     a,#0xf0
          mov     dpl,a
          xrl     dph,a  // undoing setting of upper 4bit

    2 bytes shorter and thus equally short as the multiply by 16, somewhat better with respect to speed.

    The problem with the code generated for:

    unsigned short Shift4(unsigned char Block)

    is that a (very efficient) 16 bit to 16 bit shift is used when an 8 bit shift to a 16 bit result would have done.

    Within the compiler you'll find the code generation in src/mcs51/gen.c in function AccAXLsh "left shift a:x by known count (0..7)" currently at line 8544. ( http://sdcc.svn.sourceforge.net/viewvc/sdcc/trunk/sdcc/src/mcs51/gen.c?view=markup )


    (If you did not receive this by email then the white space arrangement probably is (C) 2010 sourceforge)

  • dfulab

    dfulab - 2010-02-07


    Thanks for the insight.

  • Frieder Ferlemann

    SDCC #5678 and later should be slightly better for your Shift4() function.
    (redundant "swap a" and "anl a,#0xf0" removed)

  • Frieder Ferlemann

    Copy and pasting from the related feature request "Optimization: Track State of ACC & DPTR" at:

    Implemented in SDCC 2.9.7 #5697 (Feb 19 2010).

    Set an environment variable SDCC_REGTRACK if you want
    to enable register tracking.
    If you want diagnostic output use option -fverbose-asm
    and additionally set SDCC_REGTRACK_VERBOSE

    The plan would be to remove the environment variable
    SDCC_REGTRACK later, make register tracking the
    default behaviour and have a compiler option
    -no-regtrack then.
    "Then" probably means after the next release:)

    (note, I renamed the environment variable to the
    slightly more descriptive SDCC_REGTRACK)

  • Robert Gush

    Robert Gush - 2010-02-20

    Hi Frieder,
    Thanks for the update for register tracking. I will try it out shortly.
    BTW it was "dfulab" that put in the 'shift 4' topic. :)


Log in to post a comment.