Menu

#362 Peephole rules pack - savings for bitfields operations

None
closed-accepted
Visenri
None
5
2021-11-07
2021-01-25
Visenri
No

I've been working again with STM8 SDCC compiler and I found that many things could be improved with more peephole rules.
I have prepared a patch with many new rules and some changes to some code to get better results.
I put bitfields in the tittle, but the savings are in many more places.

For the medium memory model this is the comparison pre-post patch:

Summary for 'stm8': 0 failures, 12335 tests, 2385 test cases, 2532879 bytes, 78765777 tick
Summary for 'stm8': 0 failures, 12335 tests, 2385 test cases, 2532143 bytes, 78691548 ticks

Bitfields is where the improvement is more noticeable:
bitfields (f: 0, t: 131, c: 8, b: 8003, T: 11182)
bitfields (f: 0, t: 131, c: 8, b: 7720, T: 10870)

-3.5% in size.
But the number is not what i think is important, now all bitfield operations are optimized as they should, so programs handling bitfileds in regs in a lot of different ways will benefit a lot. This also makes a lot of operations atomic.

The rules are also splitted to be as small as possible, so a minimal set of rules performs a lot of optimizations, this is needed because a lot of bitfield operations are done in a similar way, but with small variations (extra srl, or swap). Take a look at the rules to see what I mean.

Related

Patches: #361
Patches: #362
Patches: #373

Discussion

1 2 3 > >> (Page 1 of 3)
  • Visenri

    Visenri - 2021-01-25

    stm8/gen.c:
    Minor change to homogenise generated assembly, to allow better/simpler pattern matching for rules
    See also ticket by Basil Hussain [feature-requests:#728]
    Attached, the new file and the diff with current version.

     

    Related

    Feature Requests: #728


    Last edit: Maarten Brock 2021-03-06
    • Philipp Klaus Krause

      If I apply this part by itself, I get a regression:

      Summary for 'stm8': 0 failures, 12336 tests, 2386 test cases, 2530819 bytes, 79389154 ticks
      Summary for 'stm8-large': 0 failures, 14317 tests, 2386 test cases, 2663826 bytes, 82506774 ticks
      

      Previously:

      Summary for 'stm8': 0 failures, 12336 tests, 2386 test cases, 2530742 bytes, 78855709 ticks
      Summary for 'stm8-large': 0 failures, 14317 tests, 2386 test cases, 2663750 bytes, 81973330 ticks
      
       
      • Visenri

        Visenri - 2021-01-25

        Of course, have you tested it after applying patches to rules 10b & 10c?
        I mean, change (#(%2 + 0) -> #(_%2)) see peeph.def.diff.
        After being applied you shouldn't see regressions, but a net gain.

         
        • Philipp Klaus Krause

          After a bit more testing, I have applied this change, along with the changes and new rules 10b to 10e in [r12021].

          However, I have omitted 10f, as it looks wrong to me: The sp-relative adressing mode of the STM8 only allows offsets up to 255, so the rule would create broken code for %2 > 255.
          The regression tests pass, but few regression tests use local objects of total size > 255, which would probably be needed to trigger the issue.

          P.S.: Technically, there is a similar problem with the 10a to 10e rules: addw wraps around at 0x10000, while (., x) doesn't. However, AFAIK no code generated by SDCC relies on this wraparound for pointer arithmetic. Maybe it is possible to write C code using uintptr_t, that gets broken by one of 10 a to 10e; if we find such a case, we'll have to deal with it then.

           

          Last edit: Philipp Klaus Krause 2021-01-26
          • Visenri

            Visenri - 2021-01-26

            About 10f, my fault, this should fix it (added checks for range):

            replace restart {
                ldw %1, sp
                addw    %1, #%2
                ld  a, (%1)
            } by {
                ld  a, (%2, sp)
                ; peephole 10f moved addition to sp to storage instruction.
            } if notUsed(%1), notUsed('c'), operandsLiteral(%2), immdInRange(0 255 '+' 0 %2  %9)
            

            P.S.: Post edited by spth (Philipp Klaus Krause) to fix code formatting.

             

            Last edit: Philipp Klaus Krause 2021-01-27
            • Philipp Klaus Krause

              Thanks. Looks good , applied in [r12022].

               
  • Visenri

    Visenri - 2021-01-25

    SDCCpeeph.c:
    Added new operator 'bp' for immdInRange.
    See comments in ticket [patches:#361]
    Added new function "optimizeCodeSize" to be used in peephole rules.
    Attached, the new file and the diff with current version.

     

    Related

    Patches: #361


    Last edit: Maarten Brock 2021-03-06
    • Philipp Klaus Krause

      Could we have a more generic alternative to optimizeCodeSize? I'd like to also be able to check for other targets and for the opposite, e.g. applying a rule only if not optimizing for code speed. Maybe something like optimizeFor('code size'), optimizeNotFor('code speed')? Or maybe something like optimizationGoal('code size' '1'), optimizationGoal('code speed' '0')?

       

      Last edit: Maarten Brock 2021-03-06
      • Visenri

        Visenri - 2021-03-03

        Any comments about the proposed "optimizeGoal" function?
        See last post for details: [patches:#362]
        https://sourceforge.net/p/sdcc/patches/362/#56f4

         

        Related

        Patches: #362


        Last edit: Maarten Brock 2021-03-06
  • Visenri

    Visenri - 2021-01-25

    Prerequisites:
    Some PACK1 & PACK2 rules make use of:
    New operator 'bp' for immdInRange (updated SDCCpeeph.c)
    New function "optimizeCodeSize" (updated SDCCpeeph.c)
    Some rules need also the modified stm8/gen.c to work properly

    Rules 8, 9 , 9a & 10:
    Added missing notUsed('c').

    Rules 10b & 10c:
    Changed to be more generic, to match more cases (changes needed to stm8/gen.c)
    See also ticket by Basil Hussain [feature-requests:#728]
    #(%2 + 0) ->#(_%2)
    Added missing notUsed('c').

    Rules 10d, 10e & 10f:
    Added new rules similar to 10b and 10c covering:
    addw - ldw
    addw - ld(%), a
    ldw %,sp - addw - ld

    Rules 18-X, 19-X, 20-X, 21-X (total 32):
    Removed and replaced by new ones (PACK1).

    Rules PACK 1
    10 new rules to prepare different cases to match same rules
    12 rules to handle bset, bres, bcpl in all its variants.
    1 rule to optimize bitfield complement.
    2 more rules to convert ldw-ld-op-ld where op is and, or & xor to ld-op-ld

    Rule 22a:
    Moved comment into corect place, was not matching because of that.

    Rule 52 & 53:
    Renamed to 52a & 53a respectively.

    Rules 52b & 53b
    Added similar rules to 52 & 53 for large memory model.

    Rules j7-eq-ne & j7-ne-eq
    Changed spaces to tabs like all other rules.

    Rules PACK 2
    2 rules to save size when restoring stack addw sp -> pop, popw (requires "optimizeCodeSize" function in SDCCpeeph.c)
    2 rules for unneeded sub or cp with 0
    4 rules to convert bitfield operations (and-or) to direct load and store
    2 rule to convert ldw-ld into ld direct
    1 rule to convert and to bcp when possible (enables/simplifies other rules)
    2 rules for bitfield test (1 bit), changing and-dec to bcp
    2 rules to convert shifts/swap + bcp to bcp
    5 rules to simplify bit reversal and copy to carry
    2 rules to change ld-and-sub-bccm to bcpl
    2 rules to change ld-bcp-jreq to btjf
    2 rules to chagne ld-bcp-jrne to btjt
    1 rule to change ld-bcp-jreq to ld-jrpl
    1 rule to change ld-bcpjrne to ld-jrmi
    2 rules to change ld-srl-jrc to btjt (taken from ticket by Basil Hussain
    https://sourceforge.net/p/sdcc/feature-requests/690/ )
    2 rules to change ld-srl-jrnc to btjf (taken from ticket by Basil Hussain
    https://sourceforge.net/p/sdcc/feature-requests/690/ )
    1 rule to move ld a outside loop
    3 rules for swap optimizations
    1 rule to do operations directly on stack for: 'swap' 'clr' 'dec' 'inc' 'neg' 'cpl'
    4 rules to remove ld into stack that will be immediately out of scope
    3 rules to remove dead addw/incw
    1 rule to remove redundant jr

    Fixed comment about emulated two bytes skip instruction using cp/bcp
    Attached, the new file and the diff with current version.

     

    Related

    Feature Requests: #728


    Last edit: Maarten Brock 2021-03-06
    • Philipp Klaus Krause

      The fixes for existing rules (8 to 10c, 22a), the port of ret rules to retf (52a, 53a) and the space harmonization (j7-ne-eq, j7-eq-ne) have been applied in [r12020]. Thanks.

       
  • Visenri

    Visenri - 2021-01-25

    Regression tests have been done using my updated stm8/peep.c:
    [patches:#290]

    Without that patch, the improvements may be a little lower.

     

    Related

    Patches: #290


    Last edit: Maarten Brock 2021-03-06
  • Visenri

    Visenri - 2021-01-25

    An example in action:

    typedef struct
    {
      unsigned char bit0    : 1;
      unsigned char bit1    : 1;
      unsigned char bit2    : 1;
      unsigned char bit3    : 1;
      unsigned char bit4    : 1;
      unsigned char bit5    : 1;
      unsigned char bit6    : 1;
      unsigned char bit7    : 1;
    } struct_8bits;
    volatile static struct_8bits testst;
    
    uint8_t test_tail_opt_return2()
    {
        return PA_ODR;
    }
    
    uint8_t test_tail_opt_return()
    {
        testst.bit_in_test ^= 1;
        return test_tail_opt_return2();
    }
    

    Without optimization:

    _test_tail_opt_return2:
    ;   ../src/test4.c: 425: return PA_ODR;
        ld  a, _PA_ODR+0
    00101$:
    ;   ../src/test4.c: 426: }
        ret
    ;   ../src/test4.c: 429: uint8_t test_tail_opt_return()
    ;   -----------------------------------------
    ;    function test_tail_opt_return
    ;   -----------------------------------------
    _test_tail_opt_return:
    ;   ../src/test4.c: 431: testst.bit_in_test ^= 1;
        ldw x, #(_testst+0)
        ld  a, (x)
        swap    a
        srl a
        srl a
        srl a
        and a, #0x01
        clrw    x
        xor a, #0x01
        srl a
        bccm    _testst+0, #7
    ;   ../src/test4.c: 433: return test_tail_opt_return2();
        jp  _test_tail_opt_return2
    

    This is how it handles such a monstruosity step by step:

    ;    function test_tail_opt_return
    ;   -----------------------------------------
    ;   Register assignment might be sub-optimal.
    ;   Stack space usage: 0 bytes.
    _test_tail_opt_return:
    ;   ../src/test4.c: 431: testst.bit_in_test ^= 1;
    ; skipping iCode since result will be rematerialized
    ; skipping iCode since result will be rematerialized
    ; skipping iCode since result will be rematerialized
    ; skipping iCode since result will be rematerialized
    ; genPointerGet
    ; peephole replaced 'ldw-ld' by 'ld direct' v1.
    ; genCast
    ; genAssign
    ; peephole 3 removed dead clrw of x.
    ; genXor
    ; peephole replaced 'xor-and' by 'cpl-and'.
    ; peephole exchanged 'and-xor' order.
    ; genCast
    ; genAssign
    ; genPointerSet
    ; peephole replaced 'swap-and-sub' (8) by 'and-sub' (128).
    ; peephole replaced 'srl-and-sub' by 'and-sub'.
    ; peephole replaced 'srl-and-sub' by 'and-sub'.
    ; peephole replaced 'srl-and-sub' by 'and-sub'.
    ; peephole replaced 'cpl-and-srl' by 'and-sub'.
        bcpl    _testst+0, #7
    ; peephole replaced 'and-sub-bccm' by 'bcpl' (compiler symbol).
    ;   ../src/test4.c: 433: return test_tail_opt_return2();
    ; genCall
    ; genReturn
    ; genLabel
    ; peephole j30 removed unused label 00101$.
    ;   ../src/test4.c: 434: }
    ; genEndFunction
        jp  _test_tail_opt_return2
    ; peephole 52a removed unused ret.
    

    Lots of rules have been applied until reaching the final bcpl, this allows many variations of bits and compiler outputs.

    Notice also, that tail call optimization normally disabled the whole optimization, because target label of the last "jp" was not found, resulting in "S4O_ABORT" in "scan4op" inside stm8/peep.c.
    With the new stm8/peep.c this is treated as a return, returning S4O_TERM and allowing optimizations.

     

    Last edit: Visenri 2021-01-25
  • Philipp Klaus Krause

    Regarding the rules that replace load-shift-jrnc by btj?: I think these need an additional notUsedFrom condition to cover the case of a being used in the code at label %2.
    Anyway, moslty due to me partially applying the patch, the situation has gotten a bit messy.
    Would you mind preparing separate patches against current for your peephole packs?
    After testing, I think most rules can go in, but with the 4.1.0 release coming soon, and me not having much time these days, I'll try to be very careful, and rather do it bit-by-bit insetad of applying a big patch at once.

     
    • Visenri

      Visenri - 2021-02-03

      When is the release scheduled?
      I didn't find time to test all properly with the current version.
      Maybe this weekend I'll have some time separate the patches and test them properly.

      Would that fit the release schedule?

       
      • Philipp Klaus Krause

        See https://sourceforge.net/p/sdcc/wiki/SDCC%204.1.0%20Release/.
        Soft freeze (i.e. no more new features in trunk) is scheduled for Thursday next week. Even before that I'd be reluctant about big changes, unless necessary for bugfixes.
        But work can always continue in branches.

         
  • Visenri

    Visenri - 2021-01-27

    You're right about the rules, but I think there are more rules affected and more conditions must be applied to be strictly correct:
    notUsedFrom(%x 'a'), notUsedFrom(%x 'z'), notUsedFrom(%x 'n'), notUsedFrom(%x 'c')
    for:
    'and-dec-jrne' by 'bcp-jreq'
    'and-dec-jreq' by 'bcp-jrne'
    'ld-bcp-jrne' by 'btjf'
    'ld-bcp-jrne' by 'btjt'
    'ld-srl-jrc' by 'btjt'
    'ld-srl-jrnc' by 'btjf'

    notUsedFrom(%x 'z'), notUsedFrom(%x 'n')
    for:
    'ld-bcp-jreq' by 'ld-jrpl'
    'ld-bcp-jrne' by 'ld-jrmi'

    %x means the jump variable, of course.

    About the separate patches:
    Maybe I can prepare it, but it should be pretty straightforward, I would leave pack1 as one indivisible pack (because they are closely related) and pack2 could be splitted in several patches.

     
  • Visenri

    Visenri - 2021-01-27

    I've never used the condition NotUsedFrom, my fault, but I have to say that documentation is, to say the least, scarce .

    So, without documentation, I have to look at source code or other examples.

    After looking at the source code of SDCCpeeph.c, it seems that notUsed can accept multiple variables in one call (but no one used this functionality at least with STM8 rules).
    Surprisingly, notUsedFrom has no loop to handle this condition.
    So:
    notUsed('a' 'c')
    should work, but
    notUsedFrom(%X 'a' 'c')
    will show a "malformed" error.

    I think it would be good to handle both cases the same way.
    And if we use this feature would make rules much more readable.

     
  • Visenri

    Visenri - 2021-03-01

    I have updated the files (vs rev 12070) with all the changes requested and some more:

    SDCCpeeph.c:
    Modified "notUsedFrom": now can be used the same way as "notUsed" (multiple elements to check).
    New function "optimizeGoal", accepts arguments like 'size', 'speed', '!size', '!speed' or combinations (except both 'speed' and 'size' for obvious reasons). I hope you like it!
    New operator 'bp' for immdInRange as described previously.

    Rules:
    Pack1 & Pack2 remain almost as they were in my initial patch with some minor changes:
    - Added fixes suggested using notUsedFrom as required (only 12 Pack2 rules affected).
    - Changed all rules to use notUsedFrom & notUsed with multiple elements to check, much shorter and readable.
    - Changed rules to use new function "optimizeGoal" (2 rules in Pack2).

    Swap optimizations in pack2 may be useless if you merge the patches from: [patches:#363]
    But until then, they're in.

    I'll upload the files once I finish testing (I hope next week).

     

    Related

    Patches: #363


    Last edit: Maarten Brock 2021-03-06
    • Philipp Klaus Krause

      I can't find those updated files you mention. Did you forget an attachment?

       
      • Visenri

        Visenri - 2021-03-04

        As I said:

        I'll upload the files once I finish testing (I hope next week).

        That means, this week, if I can't find more time, I will upload what I have right now.
        But before that I just wanted to know your opinion about the new optimizeGoal as described:

        accepts arguments like :
        'size'
        'speed'
        '!size'
        '!speed'
        or combinations
        '!size' '!speed' (that would mean balanced).
        (except both 'speed' and 'size' for obvious reasons). I hope you like it!

         
        • Philipp Klaus Krause

          Yes, I do like this design.
          Can we put the word "code" in there, so it isn't just "size" and "speed"? After all we are optimizing for code size (and the option is called --opt-code-size.

          We'll probably have one more optimizations option in the future, to optimize for RAM usage, possibly called --opt-data-size, so we'll want to avoid confusion between optimizing for code size vs. data size.

           
          • Visenri

            Visenri - 2021-03-08

            Finally, here you have the updated files.
            All remains as stated in https://sourceforge.net/p/sdcc/patches/362/#56f4.

            Additionally:
            The updated version of OptimizeGoal function uses these strings:
            'code-speed'
            'code-size'
            Added some more comments to the code.
            Changed the NotUsedFrom function description to reflect the new functionality.

            Regarding the rules packs:
            Pack1 should be kept together.
            Pack2 can be applied in steps if you want, just search for //---- and you'll see 4 sub-packs. It can be split even more, because some of those sub-packs have unrelated rules, but for me, as long as tests are passed and I see no obvious flaw, all should be good to go in.

             

            Last edit: Visenri 2021-03-08
            • Philipp Klaus Krause

              In [r12089] I've applied the notUsedFrom improvement and the optimizeGoal (As optimizeFor) to trunk.
              For the bit position I want to have a look at both proposed approaches again before applying one.
              For the peephole rules: They need to be numbered / named (like the existing ones). The peephole rule numbers help in two ways:
              1) When a bug is reported for which --no-peep is a workaround, compiling with --fverbose-asm and looking at the number helps find the broken rule.
              2) Running the regression tests and doing a grep for each peephole rule number helps to find rules that are not tested in regression tests (e.g. rules that have become redundant due to improvements in other parts of SDCC).

              Over time, peephole numbers get a bit messy, so once in a while they all get renumbered. Still, any time a new peephole rule is introduced it should have a number.

               
              • Visenri

                Visenri - 2021-03-08

                About the peephole rules numbering. I was avoiding it on purpose for 3 reasons.

                1) I didn't know if you wanted to renumber them yourself, maybe my numbering does not fit your taste.
                2) I really don't feel comfortable with it to begin with, as you said, it gets messy over time, because you run out of numbers or want to add another related rule in between 10a and 10b.
                3) I am already using --fverbose-asm and doing search over the asm files to find where the rules are being applied, but I use the whole rule text, for example: "peephole replaced 'xor-and' by 'cpl-and'."

                If you want I can give them numbers, but for example, with PACK1, I get into trouble immediately, I have to start using numbers and letters because they are between rule number 17 and 20a, and there are 25 rules in total, how am I expected to assign numbers?.

                Here you have various proposals:
                1) Keep the long name to search for the rules, but make sure they are unique

                replace restart {
                    ld a, _%1
                    and a, #%3
                    ld _%1, a
                } by {
                    bres _%1, #%4
                    ; peephole replaced 'and' by 'bres' (compiler symbol).
                

                This is what would identify the rule (I should change some of them to be unique):

                ; peephole replaced 'and' by 'bres' (compiler symbol).
                

                2) Use short prefixes and suffixes related to the operation and then numbers, for example:

                Same rule would turn into:

                ; peephole LAL-b1s replaced 'and' by 'bres' (compiler symbol).
                

                L= LOAD
                A= AND
                L= LOAD
                b= bit operation
                1= version 1 of the rule (in case more match the same pattern).
                s= symbol (symbol or literal with the same rule repeat a lot).

                To me this is the best one, it is much easier to avoid conflicts / renumbering in the future.

                3) Use short prefixes related to rule groups or more generic concepts:

                ; peephole BOP-1 replaced 'and' by 'bres' (compiler symbol).
                

                BOP = bit operation
                1= like current numbering schema.

                4) Just keep using the messy numbers as up to now :D.

                Maybe I'm making a mountain out of a molehill!

                 
1 2 3 > >> (Page 1 of 3)

Log in to post a comment.