Menu

#362 Peephole rules pack - savings for bitfields operations

None
closed-accepted
Visenri
None
5
2021-11-07
2021-01-25
Visenri
No

I've been working again with STM8 SDCC compiler and I found that many things could be improved with more peephole rules.
I have prepared a patch with many new rules and some changes to some code to get better results.
I put bitfields in the tittle, but the savings are in many more places.

For the medium memory model this is the comparison pre-post patch:

Summary for 'stm8': 0 failures, 12335 tests, 2385 test cases, 2532879 bytes, 78765777 tick
Summary for 'stm8': 0 failures, 12335 tests, 2385 test cases, 2532143 bytes, 78691548 ticks

Bitfields is where the improvement is more noticeable:
bitfields (f: 0, t: 131, c: 8, b: 8003, T: 11182)
bitfields (f: 0, t: 131, c: 8, b: 7720, T: 10870)

-3.5% in size.
But the number is not what i think is important, now all bitfield operations are optimized as they should, so programs handling bitfileds in regs in a lot of different ways will benefit a lot. This also makes a lot of operations atomic.

The rules are also splitted to be as small as possible, so a minimal set of rules performs a lot of optimizations, this is needed because a lot of bitfield operations are done in a similar way, but with small variations (extra srl, or swap). Take a look at the rules to see what I mean.

Related

Patches: #361
Patches: #362
Patches: #373

Discussion

<< < 1 2 3 (Page 3 of 3)
  • Daniel Drotos

    Daniel Drotos - 2021-06-10

    On Thu, 10 Jun 2021, Philipp Klaus Krause wrote:

    However, I have to admit that I don't really know what the number
    called "cycles" in section 7 of the STM8 manual means.

    Dear Philipp and Visenri,

    uCsim is an ISS (Instruction Set Simulator) where cycle accuracy is
    not a key point. As you noted, it is hard to correctly implement it
    for pipeline CPUs. Cycle count in uCsim is just an aproximation. I'm
    afraid it will never match to real hw exactly.

    Some time ago I introduced the "so called" virtual ticks. Result of
    them is printed on output of "conf" command:

    Simulation started, PC=0x008000
    Loading from gen/stm8/abs/abs.ihx
    1169 words read from gen/stm8/abs/abs.ihx
    --- Running: gen/stm8/abs/abs
    Running testAbs
    --- Summary: 0/6/1: 0 failed of 6 tests in 1 cases.
    Stop at 0x00831e: (110) Program stopped itself
    F 0x00831e
    Simulated 3891 ticks (4.864e-04 sec)
    Host usage: 0.002176 sec, rate=0.223513
    CPU state= OK PC= 0x00831e XTAL= 8e+06
    Operation since last reset= (6475 vclks)
    Inst= 2309 Fetch= 4019 Read= 1436 Write= 1020
    

    "Operation since last reset" here is sum of fetches, mem reads and
    writes. In my optinion it measures sw efficiency somehow. And makes
    different CPUs comparable.

    Daniel

     
    • Visenri

      Visenri - 2021-06-10

      I've opened a ticket: [#3250]
      Please, follow any further discussion there.
      I have summarized your comments there.

       

      Related

      Bugs: #3250

  • Philipp Klaus Krause

    Pack 3 is added in [r12418], with some whitespace fixes.

     
  • Visenri

    Visenri - 2021-06-03

    Nice to see most of the rules added.

    Regarding rules 600a & 600b:

    I checked them and I think it was just a typo and those rules were meant to check 'c' not 'n':

    // Unneeded sub-cp
    replace restart {
        ld  a, %1
        %2  a, #%3
    } by {
        ld  a, %1
        ; peephole 600a removed unneeded %2 v1
    } if same(%2 'cp' 'sub'), operandsLiteral(%3), immdInRange(0 0 '+' 0 %3 %4), notUsed('c')
    
    replace restart {
        ld  %1, a
        %2  a, #%3
    } by {
        ld  %1, a
        ; peephole 600b removed unneeded %2 v2
    } if same(%2 'cp' 'sub'), operandsLiteral(%3), immdInRange(0 0 '+' 0 %3 %4), notUsed('c')
    

    If 0 is subtracted from any number the result is the same number, so flags 'n' and 'z' will be the same that after the 'ld'.
    Only 'c' may be different after removing the 'cp' / 'sub', because it's updated with the carry/borrow of the operation.

    Anyway, I'm just shocked that you get some regression failures and I didn't get any. Could you tell me which test(s) failed?
    I will run now the tests with the last commits (12419). Later I will add these rules and I'll repeat the tests.

     

    Last edit: Visenri 2021-06-03
  • Visenri

    Visenri - 2021-06-04

    With this changes it also fails.

    constantRange.asm fails:

    This:

        ld  a, (0x01, sp)
        cp  a, #0x00
    ; peephole j5 changed absolute to relative unconditional jump.
        jrsle   00112$
    

    Translates into:

    ; peephole 0a removed dead load into a from 0x01.
    ; peephole 600a removed unneeded cp v1
    ; peephole j5 changed absolute to relative unconditional jump.
        jrsle   00112$
    

    When it should not.

    Maybe I'm not seeing something, but to me, Rule '0a' is being applied when it should not ('jrsle' uses 'n' & 'z').
    notUsed('v') may be needed and this is not implemented!

     
    • Philipp Klaus Krause

      By changing the macro D_ in peep.c, one can enable debug output. That could tell where a check for n being read goes wrong.
      I guess we'll also want to add support for notUsed('v') sometime, but that's probably beyond the scope of this patch.

      P.S.: Could you contact me via email? My address is in sdcc/Changelog.

       
      • Visenri

        Visenri - 2021-06-04

        I am well aware of the D_ macro and other debugging techniques.

        I needed to reduce the test to the minimum code triggering the problem and later start a debug session, yesterday it was too late for this even for me (I'm a night owl).

        I just wanted you or anyone else to know about the wrong behavior, Just in case anyone wanted to have a look.

        About the email. Of course, I'll send you an email right now.

         
        • Visenri

          Visenri - 2021-06-04

          After testing it with a minimal example, I think I found the problem.

          There is a typo in stm8/peep.c in function "stm8MightReadFlag":

          "jrsgte"
          

          should be

          "jrsgt"
          

          The other similar instruction is "jrsge" and is checked correctly.

          I'll pass all the test before celebrating it too much :D.

           
          • Visenri

            Visenri - 2021-06-05

            Success!
            All tests pass, and I checked the modified rules are being applied 37 times.

            Here you have the patches.

             
            • Philipp Klaus Krause

              Looks good to me. I suggest you go ahead and apply it.
              You might also want to still look into rule pack 2, and if it should replace some of the current rules.

               
              • Visenri

                Visenri - 2021-06-06

                Applied in [r12425]
                I took my time to avoid screwing up things in my first commit :D.

                I'll check rule pack 2 next week.

                 
                • Visenri

                  Visenri - 2021-06-12

                  I have applied rule pack 2 and other improvements in [r12449].

                  • Modified code generation (genPointerSet) to allow optimizations without * reordering rules for ld-and (120) from pack 1 (#362).
                  • Implemented some improvements in peep.c (argCont) to avoid false 'x' positives, a tiny step towards #290.
                  • Fixed double spaces in rules 10f & 10h (that were messing notUsed logic).
                  • Replaced rules 18, 19, 20 & 21 by 202x, 203x, 204x 205x, 210x &220x making use of new "operandsLitOrSym".
                  • Expanded rule 131 scope.
                  • Modified rules (140, 141, 150, 160, 161) to work without needing rule 121 from pack 1 (#362).
                  • Added new rules 515 & 516, similar to 512 & 513 when comparison is done with 'dec' instead of 'cp'.

                  With the changes I've done rules like 120 & 121 are no longer necessary, so now the state of this patch is:

                  1. Done, pure reordering rules not applied and no longer needed (120 & 121):
                  2. Done.
                  3. Done
                  4. Not applied because it is pure reordering.
                  5. Done
                  6. Done
                   

                  Last edit: Visenri 2021-06-12
                  • Visenri

                    Visenri - 2021-06-12

                    Pay attention to the list of packs done, I changed it, because pack numbers were wrong.

                     
  • Visenri

    Visenri - 2021-06-11
    • assigned_to: Visenri
    • Group: -->
     
  • Visenri

    Visenri - 2021-06-12

    Regargind rules not applied from pack 4.
    There is still room for improvement by applying rules from that pack, you can see it in examples found in several tests.

    Here an example from bitfields-bits1_preBits_0_pattern_0_varType_3.c:

    ;   gen/stm8/bitfields-bits1/bitfields-bits1_preBits_0_pattern_0_varType_3.c: 313: volatileBits.bitToTest ^= 1;
    ; skipping iCode since result will be rematerialized
    ; skipping iCode since result will be rematerialized
    ; skipping iCode since result will be rematerialized
    ; genAssign
    ; genPointerGet
        ld  a, (1, sp)
    ; peephole 10g moved increment to sp into storage instruction.
        srl a
        srl a
        srl a
        and a, #0x01
    ; genCast
    ; genAssign
    ; peephole 3 removed dead clrw of x.
    ; genXor
        xor a, #0x01
    ; genCast
    ; genAssign
    ; genPointerSet
        ldw x, sp
        incw    x
        sll a
        sll a
        sll a
        and a, #0x08
        push    a
        ld  a, (x)
        and a, #0xf7
        or  a, (1, sp)
        ld  (x), a
        pop a
    

    This "monstruosity" is simplified gracefully with rules from pack 4, here you can see an example from previous tests I did:

    ;   gen/stm8/bitfields-bits1/bitfields-bits1_preBits_0_pattern_0_varType_3.c: 297: volatileBits.bitToTest ^= 1;
    ; skipping iCode since result will be rematerialized
    ; skipping iCode since result will be rematerialized
    ; skipping iCode since result will be rematerialized
    ; genAssign
    ; genPointerGet
    ; peephole 10g moved increment to sp into storage instruction.
    ; genCast
    ; genAssign
    ; peephole 3 removed dead clrw of x.
    ; genXor
    ; peephole 110 replaced 'xor-and' by 'cpl-and'.
    ; peephole 121 exchanged 'and-xor' order.
    ; genCast
    ; genAssign
    ; genPointerSet
    ; peephole 402 changed order of 'srl-ldw' to 'ldw-srl'.
    ; peephole 402 changed order of 'srl-ldw' to 'ldw-srl'.
    ; peephole 402 changed order of 'srl-ldw' to 'ldw-srl'.
    ; peephole 402 changed order of 'cpl-ldw' to 'ldw-cpl'.
    ; peephole 401 changed order of 'and-ldw' to 'ldw-and'.
    ; peephole 407 changed order of 'srl-incw' to 'incw-srl'.
    ; peephole 407 changed order of 'srl-incw' to 'incw-srl'.
    ; peephole 407 changed order of 'srl-incw' to 'incw-srl'.
    ; peephole 407 changed order of 'cpl-incw' to 'incw-cpl'.
    ; peephole 406 changed order of 'and-incw' to 'incw-and'.
    ; peephole 161 replaced 'srl-cpl-and-sll' (0x01) by 'xor-and' (2, 3).
    ; peephole 160 replaced 'srl-xor-and-sll' (2, 3) by 'xor-and' (4, 7).
    ; peephole 160 replaced 'srl-xor-and-sll' (4, 7) by 'xor-and' (8, 15).
    ; peephole 110 replaced 'xor-and' by 'cpl-and'.
    ; peephole 130 combined 'and-and' (15, 0x08) into 'and' (8).
    ; peephole 120 exchanged src operands in 'ld-and'.
    ; peephole 0w removed dead load into x from sp.
    ; peephole 642 removed dead incw x.
        ld a, (1, sp)
        xor a, #8
        ld  (1, sp), a
    ; peephole 300c replaced 'cpl-and-and-or' by 'xor' v3, with incw.
    ;   gen/stm8/bitfields-bits1/bitfields-bits1_preBits_0_pattern_0_varType_3.c: 298: dummy_f();
    

    4 different rules are applied to finally being able to apply other rules.
    The main problem are these two instructions:

        ldw x, sp
        incw    x
    

    That get in the way of applying rules 160, 161 and later all the others.

    The job of rules 4xx, as clearly illustrated by this example, was to allow ldw and inw present many times in similar situations to go up until other rules can be applied to 'a' instructions.

    Getting the same kind of optimization with specific rules for every case would require an insane number of extra rules.

     

    Last edit: Visenri 2021-06-13
    • Visenri

      Visenri - 2021-06-13

      Fixed comment of rule 406, was showing 'addw' because of a copy-paste error.

       
  • Philipp Klaus Krause

    What is the current situation? Has everything that makes sense been applied (so we could close the ticket)? Are there parts left, that we might to apply, for which I should do some testing?

     
  • Visenri

    Visenri - 2021-11-05

    As I said some months ago, after [r12449]:
    ...the state of this patch is:

    1. Done, pure reordering rules not applied and no longer needed (120 & 121):
    2. Done.
    3. Done
    4. Not applied because it is pure reordering.
    5. Done
    6. Done

    As I said, there is margin for improvement, but I think all that could be applied safely has been applied.

    Even thought there are potential problems (endless loops)) with with pure reordering rules (pack 4). I've never seen them in all my testing, because no other rule seems to undo the changes done by rules of pack 4.

    If we put pack 4 aside, this ticket should be closed.

     
    • Philipp Klaus Krause

      Then I suggest to close this ticket. Since you're the owner, that would be done by you. If we want to brink back an equivalent of pack4, we could do so in a new ticket. Maybe we can even achieve the effect without having rules that do not reduce code size.

       
      • Visenri

        Visenri - 2021-11-07

        Ok, I'm closing it.

         
  • Visenri

    Visenri - 2021-11-07
    • status: open --> closed-accepted
     
<< < 1 2 3 (Page 3 of 3)

Log in to post a comment.