Small Device C Compiler (SDCC) / Feature Requests / #592 Improve stack allocation

Description has changed:

Diff:

--- old
+++ new
@@ -7,6 +7,7 @@
 As a first step, I want to change to aligned Chaitin for hte z80-based backends; I have been working on this most of yesterday and today, and hope to be able to commit my work later today.

 Next steps:
+

 * Aligned Chaitin for z80-based.
 * Use Chaitin in some other backends, too.
 * Somehow consider coalescing (RFE #401).

Group: -->

Description has changed:

Diff:

--- old
+++ new
@@ -4,7 +4,7 @@

 For now, a big step forward could be done by using the state of the art, which, AFAIK, is based on Chaitin's work in register allocation. A few years ago I did some experiments showing that an unaligned variant of this could save even more stack space compared to the aligned variant. However, the unaligned variant would require changes in code generation.

-As a first step, I want to change to aligned Chaitin for hte z80-based backends; I have been working on this most of yesterday and today, and hope to be able to commit my work later today.
+As a first step, I want to change to aligned Chaitin for the z80-based backends; I have been working on this most of yesterday and today, and hope to be able to commit my work later today.

 Next steps:

Description has changed:

Diff:

--- old
+++ new
@@ -14,3 +14,5 @@

 * Deal with the overallocation for partially-spilt variables in stm8 (ir currently uses space for the whole variable, even if parts of the variable have been allocated to registers).

 Philipp
+
+P.S.: I had already worked on this in 2011 / 2012 in the stack-compact branch. Now I basically just fix parts of that and port them to current SDCC.

Philipp Klaus Krause - 2018-11-12

In [r10654], aligned Chaitin is the new defautl allocator for z80-related backends (the old one can be chosen by changing the line

#define USE_OLDSALLOC 0 // Change to 1 to use old stack allocator

in ralloc.h.

Overall, it tends to result in lower stack usage (a typical example would be print_format() from the printf() implementation now using 52 instead of 67 bytes for variables on the stack), though there are a few stack size regression, too (mostly in functions with few, but large variables) - the regressions seem to be due to stricter alignment.

Philipp
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2018-11-13

In [r10659], aigned Chaitin is used as the allcoator for stm8. Since there were serious issues in the old one, no option to go back to the old one is provided.

Again, there are some stack size regressions for some functions with few, large variables. For most functions there is a reduction in stack usage (e.g. print_format() went from 109 to 52 bytes).

For some functions that were affected by bugs in teh old allocator, there is a huge reduction (e.g. a function in the regression test gcc-torture-execute-ashldi-1 went from 504 bytes down to 8).

Philipp

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2018-11-14

In revision [r10666], with a bit of fine-tuning the heuristis in the aligned Chaitin allocator, the few regressions got fixed; overall, stack usage went down a bit further.

SDCC now has a state-of-the-art stack allocator (for the stm8, z80, z180, gbz80, tlcs90, r2k and r3ka backends).

There probably is further potential in:

Coalescing¹

Unaligned Chaitin²

Better handling of partially spilt variables for stm8

Porting to more backends

Research into even better stack-allocation algorithms

Philipp

¹ This will involve a trade-off: Code size and speed vs. stack size.
² Needs some modifications in the backends.

Last edit: Philipp Klaus Krause 2018-11-20
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2018-11-20

In revision [r10675], the aligned Chaitin allocator has been replaced by unaligned Chaitin¹. The result is a further, but small reduction in stack usage.

Philipp

¹ So far relative alignment restristions are enforced on operands that are used in the same instructions. Pendingn changes in the backends, this could be relaxed a bit further, to gain a little bit more stack space.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sometimes the compiler creates too many auxiliary locals for the same local struct variable (Version 3.8.4 #10807 (MINGW32), compiler options: -mstm8 --std-c11 --opt-code-speed --max-allocs-per-node 100000).

E.g., we have the next types:

typedef struct
{
    uint8_t f1;
    uint8_t f2;
    uint8_t f3;
    uint8_t f4;
    uint8_t f5;
    uint8_t f6;
} str1;

typedef struct
{
    uint8_t f;
    str1 str;
} str2;

Assignment of every structure field to a local variable causes creating a separate local pointer to nested struct for every opreation:

        f1 = str.str.f1;
        f2 = str.str.f2;
        f3 = str.str.f3;
        f4 = str.str.f4;
        f5 = str.str.f5;
        f6 = str.str.f6;

produces the next code:

    sub sp, #25
;   1.c: 30: while(1)
    ldw x, sp
    incw    x
    ldw (0x08, sp), x
    ldw (0x0a, sp), x
    ldw (0x0c, sp), x
    ldw (0x0e, sp), x
    ldw (0x10, sp), x
    ldw (0x12, sp), x
    .......

Full test program text:

#include <stdint.h>

typedef struct
{
    uint8_t f1;
    uint8_t f2;
    uint8_t f3;
    uint8_t f4;
    uint8_t f5;
    uint8_t f6;
} str1;

typedef struct
{
    uint8_t f;
    str1 str;
} str2;

void test1(uint8_t f)
{
}

void test2()
{
    uint8_t i, f1, f2, f3, f4, f5, f6;
    str2 str;

    i = 0;

    while(1)
    {
        f1 = str.str.f1;
        f2 = str.str.f2;
        f3 = str.str.f3;
        f4 = str.str.f4;
        f5 = str.str.f5;
        f6 = str.str.f6;

        test1(f1);
        test1(f2);
        test1(f3);
        test1(f4);
        test1(f5);
        test1(f6);

        if(i) break;
    }
}

Full compiler output for test2 function:

_test2:
    sub sp, #25
;   1.c: 30: while(1)
    ldw x, sp
    incw    x
    ldw (0x08, sp), x
    ldw (0x0a, sp), x
    ldw (0x0c, sp), x
    ldw (0x0e, sp), x
    ldw (0x10, sp), x
    ldw (0x12, sp), x
00104$:
;   1.c: 33: f1 = str.str.f1;
    ldw x, (0x08, sp)
    ld  a, (0x1, x)
    ld  (0x14, sp), a
;   1.c: 34: f2 = str.str.f2;
    ldw x, (0x0a, sp)
    ld  a, (0x2, x)
    ld  (0x15, sp), a
;   1.c: 35: f3 = str.str.f3;
    ldw x, (0x0c, sp)
    ld  a, (0x3, x)
    ld  (0x16, sp), a
;   1.c: 36: f4 = str.str.f4;
    ldw x, (0x0e, sp)
    ld  a, (0x4, x)
    ld  (0x17, sp), a
;   1.c: 37: f5 = str.str.f5;
    ldw x, (0x10, sp)
    ld  a, (0x5, x)
    ld  (0x18, sp), a
;   1.c: 38: f6 = str.str.f6;
    ldw x, (0x12, sp)
    ld  a, (0x6, x)
    ld  (0x19, sp), a
;   1.c: 40: test1(f1);
    ld  a, (0x14, sp)
    push    a
    call    _test1
    pop a
;   1.c: 41: test1(f2);
    ld  a, (0x15, sp)
    push    a
    call    _test1
    pop a
;   1.c: 42: test1(f3);
    ld  a, (0x16, sp)
    push    a
    call    _test1
    pop a
;   1.c: 43: test1(f4);
    ld  a, (0x17, sp)
    push    a
    call    _test1
    pop a
;   1.c: 44: test1(f5);
    ld  a, (0x18, sp)
    push    a
    call    _test1
    pop a
;   1.c: 45: test1(f6);
    ld  a, (0x19, sp)
    push    a
    call    _test1
    pop a
;   1.c: 47: if(i) break;
    jra 00104$
;   1.c: 49: }
    addw    sp, #25
    ret

It's interesting that removing either while loop or "if(i) break;" line makes compiler producing better output (without these weird locals, taking 13 bytes of stack memory instead of 25).

test.c

Improve stack allocation

The Small Device C Compiler (SDCC), targeting 8-bit architectures

Group

Searches

Help

#592 Improve stack allocation

Discussion