Menu

#592 Improve stack allocation

None
open
None
5
2019-01-26
2018-11-12
No

SDCC is not that good at allocating variables on the stack. It makes rather wasteful use of stack space. E.g. "SDCC generates very bad stack-overflow-prone code for long functions […] effectively eliminates possibilities for usage of SDCC with any RTOS on STM8." from https://github.com/shkolnick-kun/bugurtos/issues/13 (though stm8 is worse than the other backends).

In the long term, it might be possible to come up with an advanced scheme based on the works of Thorup in regoster allocation.

For now, a big step forward could be done by using the state of the art, which, AFAIK, is based on Chaitin's work in register allocation. A few years ago I did some experiments showing that an unaligned variant of this could save even more stack space compared to the aligned variant. However, the unaligned variant would require changes in code generation.

As a first step, I want to change to aligned Chaitin for the z80-based backends; I have been working on this most of yesterday and today, and hope to be able to commit my work later today.

Next steps:

  • Aligned Chaitin for z80-based.
  • Use Chaitin in some other backends, too.
  • Somehow consider coalescing (RFE #401).
  • Deal with the overallocation for partially-spilt variables in stm8 (ir currently uses space for the whole variable, even if parts of the variable have been allocated to registers).

Philipp

P.S.: I had already worked on this in 2011 / 2012 in the stack-compact branch. Now I basically just fix parts of that and port them to current SDCC.

Discussion

  • Philipp Klaus Krause

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -7,6 +7,7 @@
     As a first step, I want to change to aligned Chaitin for hte z80-based backends; I have been working on this most of yesterday and today, and hope to be able to commit my work later today.
    
     Next steps:
    +
    
     * Aligned Chaitin for z80-based.
     * Use Chaitin in some other backends, too.
     * Somehow consider coalescing (RFE #401).
    
    • Group: -->
     
  • Philipp Klaus Krause

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -4,7 +4,7 @@
    
     For now, a big step forward could be done by using the state of the art, which, AFAIK, is based on Chaitin's work in register allocation. A few years ago I did some experiments showing that an unaligned variant of this could save even more stack space compared to the aligned variant. However, the unaligned variant would require changes in code generation.
    
    -As a first step, I want to change to aligned Chaitin for hte z80-based backends; I have been working on this most of yesterday and today, and hope to be able to commit my work later today.
    +As a first step, I want to change to aligned Chaitin for the z80-based backends; I have been working on this most of yesterday and today, and hope to be able to commit my work later today.
    
     Next steps:
    
     
  • Philipp Klaus Krause

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -14,3 +14,5 @@
    
     * Deal with the overallocation for partially-spilt variables in stm8 (ir currently uses space for the whole variable, even if parts of the variable have been allocated to registers).
    
     Philipp
    +
    +P.S.: I had already worked on this in 2011 / 2012 in the stack-compact branch. Now I basically just fix parts of that and port them to current SDCC.
    
     
  • Philipp Klaus Krause

    In [r10654], aligned Chaitin is the new defautl allocator for z80-related backends (the old one can be chosen by changing the line

    #define USE_OLDSALLOC 0 // Change to 1 to use old stack allocator
    

    in ralloc.h.

    Overall, it tends to result in lower stack usage (a typical example would be print_format() from the printf() implementation now using 52 instead of 67 bytes for variables on the stack), though there are a few stack size regression, too (mostly in functions with few, but large variables) - the regressions seem to be due to stricter alignment.

    Philipp

     
  • Philipp Klaus Krause

    In [r10659], aigned Chaitin is used as the allcoator for stm8. Since there were serious issues in the old one, no option to go back to the old one is provided.

    Again, there are some stack size regressions for some functions with few, large variables. For most functions there is a reduction in stack usage (e.g. print_format() went from 109 to 52 bytes).

    For some functions that were affected by bugs in teh old allocator, there is a huge reduction (e.g. a function in the regression test gcc-torture-execute-ashldi-1 went from 504 bytes down to 8).

    Philipp

     
  • Philipp Klaus Krause

    In revision [r10666], with a bit of fine-tuning the heuristis in the aligned Chaitin allocator, the few regressions got fixed; overall, stack usage went down a bit further.

    SDCC now has a state-of-the-art stack allocator (for the stm8, z80, z180, gbz80, tlcs90, r2k and r3ka backends).

    There probably is further potential in:

    • Coalescing¹
    • Unaligned Chaitin²
    • Better handling of partially spilt variables for stm8
    • Porting to more backends
    • Research into even better stack-allocation algorithms

    Philipp

    ¹ This will involve a trade-off: Code size and speed vs. stack size.
    ² Needs some modifications in the backends.

     

    Last edit: Philipp Klaus Krause 2018-11-20
  • Philipp Klaus Krause

    In revision [r10675], the aligned Chaitin allocator has been replaced by unaligned Chaitin¹. The result is a further, but small reduction in stack usage.

    Philipp

    ¹ So far relative alignment restristions are enforced on operands that are used in the same instructions. Pendingn changes in the backends, this could be relaxed a bit further, to gain a little bit more stack space.

     
  • Nikolay

    Nikolay - 2019-01-26

    Sometimes the compiler creates too many auxiliary locals for the same local struct variable (Version 3.8.4 #10807 (MINGW32), compiler options: -mstm8 --std-c11 --opt-code-speed --max-allocs-per-node 100000).

    E.g., we have the next types:

    typedef struct
    {
        uint8_t f1;
        uint8_t f2;
        uint8_t f3;
        uint8_t f4;
        uint8_t f5;
        uint8_t f6;
    } str1;
    
    typedef struct
    {
        uint8_t f;
        str1 str;
    } str2;
    

    Assignment of every structure field to a local variable causes creating a separate local pointer to nested struct for every opreation:

            f1 = str.str.f1;
            f2 = str.str.f2;
            f3 = str.str.f3;
            f4 = str.str.f4;
            f5 = str.str.f5;
            f6 = str.str.f6;
    

    produces the next code:

        sub sp, #25
    ;   1.c: 30: while(1)
        ldw x, sp
        incw    x
        ldw (0x08, sp), x
        ldw (0x0a, sp), x
        ldw (0x0c, sp), x
        ldw (0x0e, sp), x
        ldw (0x10, sp), x
        ldw (0x12, sp), x
        .......
    

    Full test program text:

    #include <stdint.h>
    
    typedef struct
    {
        uint8_t f1;
        uint8_t f2;
        uint8_t f3;
        uint8_t f4;
        uint8_t f5;
        uint8_t f6;
    } str1;
    
    typedef struct
    {
        uint8_t f;
        str1 str;
    } str2;
    
    void test1(uint8_t f)
    {
    }
    
    void test2()
    {
        uint8_t i, f1, f2, f3, f4, f5, f6;
        str2 str;
    
        i = 0;
    
        while(1)
        {
            f1 = str.str.f1;
            f2 = str.str.f2;
            f3 = str.str.f3;
            f4 = str.str.f4;
            f5 = str.str.f5;
            f6 = str.str.f6;
    
            test1(f1);
            test1(f2);
            test1(f3);
            test1(f4);
            test1(f5);
            test1(f6);
    
            if(i) break;
        }
    }
    

    Full compiler output for test2 function:

    _test2:
        sub sp, #25
    ;   1.c: 30: while(1)
        ldw x, sp
        incw    x
        ldw (0x08, sp), x
        ldw (0x0a, sp), x
        ldw (0x0c, sp), x
        ldw (0x0e, sp), x
        ldw (0x10, sp), x
        ldw (0x12, sp), x
    00104$:
    ;   1.c: 33: f1 = str.str.f1;
        ldw x, (0x08, sp)
        ld  a, (0x1, x)
        ld  (0x14, sp), a
    ;   1.c: 34: f2 = str.str.f2;
        ldw x, (0x0a, sp)
        ld  a, (0x2, x)
        ld  (0x15, sp), a
    ;   1.c: 35: f3 = str.str.f3;
        ldw x, (0x0c, sp)
        ld  a, (0x3, x)
        ld  (0x16, sp), a
    ;   1.c: 36: f4 = str.str.f4;
        ldw x, (0x0e, sp)
        ld  a, (0x4, x)
        ld  (0x17, sp), a
    ;   1.c: 37: f5 = str.str.f5;
        ldw x, (0x10, sp)
        ld  a, (0x5, x)
        ld  (0x18, sp), a
    ;   1.c: 38: f6 = str.str.f6;
        ldw x, (0x12, sp)
        ld  a, (0x6, x)
        ld  (0x19, sp), a
    ;   1.c: 40: test1(f1);
        ld  a, (0x14, sp)
        push    a
        call    _test1
        pop a
    ;   1.c: 41: test1(f2);
        ld  a, (0x15, sp)
        push    a
        call    _test1
        pop a
    ;   1.c: 42: test1(f3);
        ld  a, (0x16, sp)
        push    a
        call    _test1
        pop a
    ;   1.c: 43: test1(f4);
        ld  a, (0x17, sp)
        push    a
        call    _test1
        pop a
    ;   1.c: 44: test1(f5);
        ld  a, (0x18, sp)
        push    a
        call    _test1
        pop a
    ;   1.c: 45: test1(f6);
        ld  a, (0x19, sp)
        push    a
        call    _test1
        pop a
    ;   1.c: 47: if(i) break;
        jra 00104$
    ;   1.c: 49: }
        addw    sp, #25
        ret
    

    It's interesting that removing either while loop or "if(i) break;" line makes compiler producing better output (without these weird locals, taking 13 bytes of stack memory instead of 25).

     

Log in to post a comment.

MongoDB Logo MongoDB