Small Device C Compiler (SDCC) / Patches / #362 Peephole rules pack

Visenri - 2021-01-25

stm8/gen.c:
Minor change to homogenise generated assembly, to allow better/simpler pattern matching for rules
See also ticket by Basil Hussain [feature-requests:#728]
Attached, the new file and the diff with current version.

Related

Feature Requests: #728

Last edit: Maarten Brock 2021-03-06

gen.c

gen.c.diff

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Philipp Klaus Krause - 2021-01-25
  
  If I apply this part by itself, I get a regression:
  
  Summary for 'stm8': 0 failures, 12336 tests, 2386 test cases, 2530819 bytes, 79389154 ticks Summary for 'stm8-large': 0 failures, 14317 tests, 2386 test cases, 2663826 bytes, 82506774 ticks
  
  Previously:
  
  Summary for 'stm8': 0 failures, 12336 tests, 2386 test cases, 2530742 bytes, 78855709 ticks Summary for 'stm8-large': 0 failures, 14317 tests, 2386 test cases, 2663750 bytes, 81973330 ticks
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Visenri - 2021-01-25
    
    Of course, have you tested it after applying patches to rules 10b & 10c?
    I mean, change (#(%2 + 0) -> #(_%2)) see peeph.def.diff.
    After being applied you shouldn't see regressions, but a net gain.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Philipp Klaus Krause - 2021-01-26
      
      After a bit more testing, I have applied this change, along with the changes and new rules 10b to 10e in [r12021].
      
      However, I have omitted 10f, as it looks wrong to me: The sp-relative adressing mode of the STM8 only allows offsets up to 255, so the rule would create broken code for %2 > 255.
      The regression tests pass, but few regression tests use local objects of total size > 255, which would probably be needed to trigger the issue.
      
      P.S.: Technically, there is a similar problem with the 10a to 10e rules: addw wraps around at 0x10000, while (., x) doesn't. However, AFAIK no code generated by SDCC relies on this wraparound for pointer arithmetic. Maybe it is possible to write C code using uintptr_t, that gets broken by one of 10 a to 10e; if we find such a case, we'll have to deal with it then.
      
      Last edit: Philipp Klaus Krause 2021-01-26
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Visenri - 2021-01-26
        
        About 10f, my fault, this should fix it (added checks for range):
        
        replace restart { ldw %1, sp addw %1, #%2 ld a, (%1) } by { ld a, (%2, sp) ; peephole 10f moved addition to sp to storage instruction. } if notUsed(%1), notUsed('c'), operandsLiteral(%2), immdInRange(0 255 '+' 0 %2 %9)
        
        P.S.: Post edited by spth (Philipp Klaus Krause) to fix code formatting.
        
        Last edit: Philipp Klaus Krause 2021-01-27
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Philipp Klaus Krause - 2021-01-27
        
        Thanks. Looks good , applied in [r12022].
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Visenri - 2021-01-25

SDCCpeeph.c:
Added new operator 'bp' for immdInRange.
See comments in ticket [patches:#361]
Added new function "optimizeCodeSize" to be used in peephole rules.
Attached, the new file and the diff with current version.

Related

Patches: ~~#361~~

Last edit: Maarten Brock 2021-03-06

SDCCpeeph.c

SDCCpeeph.c.diff

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Philipp Klaus Krause - 2021-01-25
  
  Could we have a more generic alternative to optimizeCodeSize? I'd like to also be able to check for other targets and for the opposite, e.g. applying a rule only if not optimizing for code speed. Maybe something like optimizeFor('code size'), optimizeNotFor('code speed')? Or maybe something like optimizationGoal('code size' '1'), optimizationGoal('code speed' '0')?
  
  Last edit: Maarten Brock 2021-03-06
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Visenri - 2021-03-03
    
    Any comments about the proposed "optimizeGoal" function?
    See last post for details: [patches:#362]
    https://sourceforge.net/p/sdcc/patches/362/#56f4
    
    Related
    
    Patches: ~~#362~~
    
    Last edit: Maarten Brock 2021-03-06
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Visenri - 2021-01-25

Prerequisites:
Some PACK1 & PACK2 rules make use of:
New operator 'bp' for immdInRange (updated SDCCpeeph.c)
New function "optimizeCodeSize" (updated SDCCpeeph.c)
Some rules need also the modified stm8/gen.c to work properly

Rules 8, 9 , 9a & 10:
Added missing notUsed('c').

Rules 10b & 10c:
Changed to be more generic, to match more cases (changes needed to stm8/gen.c)
See also ticket by Basil Hussain [feature-requests:#728]
#(%2 + 0) ->#(_%2)
Added missing notUsed('c').

Rules 10d, 10e & 10f:
Added new rules similar to 10b and 10c covering:
addw - ldw
addw - ld(%), a
ldw %,sp - addw - ld

Rules 18-X, 19-X, 20-X, 21-X (total 32):
Removed and replaced by new ones (PACK1).

Rules PACK 1
10 new rules to prepare different cases to match same rules
12 rules to handle bset, bres, bcpl in all its variants.
1 rule to optimize bitfield complement.
2 more rules to convert ldw-ld-op-ld where op is and, or & xor to ld-op-ld

Rule 22a:
Moved comment into corect place, was not matching because of that.

Rule 52 & 53:
Renamed to 52a & 53a respectively.

Rules 52b & 53b
Added similar rules to 52 & 53 for large memory model.

Rules j7-eq-ne & j7-ne-eq
Changed spaces to tabs like all other rules.

Rules PACK 2
2 rules to save size when restoring stack addw sp -> pop, popw (requires "optimizeCodeSize" function in SDCCpeeph.c)
2 rules for unneeded sub or cp with 0
4 rules to convert bitfield operations (and-or) to direct load and store
2 rule to convert ldw-ld into ld direct
1 rule to convert and to bcp when possible (enables/simplifies other rules)
2 rules for bitfield test (1 bit), changing and-dec to bcp
2 rules to convert shifts/swap + bcp to bcp
5 rules to simplify bit reversal and copy to carry
2 rules to change ld-and-sub-bccm to bcpl
2 rules to change ld-bcp-jreq to btjf
2 rules to chagne ld-bcp-jrne to btjt
1 rule to change ld-bcp-jreq to ld-jrpl
1 rule to change ld-bcpjrne to ld-jrmi
2 rules to change ld-srl-jrc to btjt (taken from ticket by Basil Hussain
https://sourceforge.net/p/sdcc/feature-requests/690/ )
2 rules to change ld-srl-jrnc to btjf (taken from ticket by Basil Hussain
https://sourceforge.net/p/sdcc/feature-requests/690/ )
1 rule to move ld a outside loop
3 rules for swap optimizations
1 rule to do operations directly on stack for: 'swap' 'clr' 'dec' 'inc' 'neg' 'cpl'
4 rules to remove ld into stack that will be immediately out of scope
3 rules to remove dead addw/incw
1 rule to remove redundant jr

Fixed comment about emulated two bytes skip instruction using cp/bcp
Attached, the new file and the diff with current version.

Related

Feature Requests: #728

Last edit: Maarten Brock 2021-03-06

peeph.def

peeph.def.diff

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Philipp Klaus Krause - 2021-01-25
  
  The fixes for existing rules (8 to 10c, 22a), the port of ret rules to retf (52a, 53a) and the space harmonization (j7-ne-eq, j7-eq-ne) have been applied in [r12020]. Thanks.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Visenri - 2021-01-25

Regression tests have been done using my updated stm8/peep.c:
[patches:#290]

Without that patch, the improvements may be a little lower.

Related

Patches: #290

Last edit: Maarten Brock 2021-03-06

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

An example in action:

typedef struct
{
  unsigned char bit0    : 1;
  unsigned char bit1    : 1;
  unsigned char bit2    : 1;
  unsigned char bit3    : 1;
  unsigned char bit4    : 1;
  unsigned char bit5    : 1;
  unsigned char bit6    : 1;
  unsigned char bit7    : 1;
} struct_8bits;
volatile static struct_8bits testst;

uint8_t test_tail_opt_return2()
{
    return PA_ODR;
}

uint8_t test_tail_opt_return()
{
    testst.bit_in_test ^= 1;
    return test_tail_opt_return2();
}

Without optimization:

_test_tail_opt_return2:
;   ../src/test4.c: 425: return PA_ODR;
    ld  a, _PA_ODR+0
00101$:
;   ../src/test4.c: 426: }
    ret
;   ../src/test4.c: 429: uint8_t test_tail_opt_return()
;   -----------------------------------------
;    function test_tail_opt_return
;   -----------------------------------------
_test_tail_opt_return:
;   ../src/test4.c: 431: testst.bit_in_test ^= 1;
    ldw x, #(_testst+0)
    ld  a, (x)
    swap    a
    srl a
    srl a
    srl a
    and a, #0x01
    clrw    x
    xor a, #0x01
    srl a
    bccm    _testst+0, #7
;   ../src/test4.c: 433: return test_tail_opt_return2();
    jp  _test_tail_opt_return2

This is how it handles such a monstruosity step by step:

;    function test_tail_opt_return
;   -----------------------------------------
;   Register assignment might be sub-optimal.
;   Stack space usage: 0 bytes.
_test_tail_opt_return:
;   ../src/test4.c: 431: testst.bit_in_test ^= 1;
; skipping iCode since result will be rematerialized
; skipping iCode since result will be rematerialized
; skipping iCode since result will be rematerialized
; skipping iCode since result will be rematerialized
; genPointerGet
; peephole replaced 'ldw-ld' by 'ld direct' v1.
; genCast
; genAssign
; peephole 3 removed dead clrw of x.
; genXor
; peephole replaced 'xor-and' by 'cpl-and'.
; peephole exchanged 'and-xor' order.
; genCast
; genAssign
; genPointerSet
; peephole replaced 'swap-and-sub' (8) by 'and-sub' (128).
; peephole replaced 'srl-and-sub' by 'and-sub'.
; peephole replaced 'srl-and-sub' by 'and-sub'.
; peephole replaced 'srl-and-sub' by 'and-sub'.
; peephole replaced 'cpl-and-srl' by 'and-sub'.
    bcpl    _testst+0, #7
; peephole replaced 'and-sub-bccm' by 'bcpl' (compiler symbol).
;   ../src/test4.c: 433: return test_tail_opt_return2();
; genCall
; genReturn
; genLabel
; peephole j30 removed unused label 00101$.
;   ../src/test4.c: 434: }
; genEndFunction
    jp  _test_tail_opt_return2
; peephole 52a removed unused ret.

Lots of rules have been applied until reaching the final bcpl, this allows many variations of bits and compiler outputs.

Notice also, that tail call optimization normally disabled the whole optimization, because target label of the last "jp" was not found, resulting in "S4O_ABORT" in "scan4op" inside stm8/peep.c.
With the new stm8/peep.c this is treated as a return, returning S4O_TERM and allowing optimizations.

Last edit: Visenri 2021-01-25

Philipp Klaus Krause - 2021-01-27

Regarding the rules that replace load-shift-jrnc by btj?: I think these need an additional notUsedFrom condition to cover the case of a being used in the code at label %2.
Anyway, moslty due to me partially applying the patch, the situation has gotten a bit messy.
Would you mind preparing separate patches against current for your peephole packs?
After testing, I think most rules can go in, but with the 4.1.0 release coming soon, and me not having much time these days, I'll try to be very careful, and rather do it bit-by-bit insetad of applying a big patch at once.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Visenri - 2021-02-03
  
  When is the release scheduled?
  I didn't find time to test all properly with the current version.
  Maybe this weekend I'll have some time separate the patches and test them properly.
  
  Would that fit the release schedule?
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Philipp Klaus Krause - 2021-02-05
    
    See https://sourceforge.net/p/sdcc/wiki/SDCC%204.1.0%20Release/.
    Soft freeze (i.e. no more new features in trunk) is scheduled for Thursday next week. Even before that I'd be reluctant about big changes, unless necessary for bugfixes.
    But work can always continue in branches.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Visenri - 2021-01-27

You're right about the rules, but I think there are more rules affected and more conditions must be applied to be strictly correct:
notUsedFrom(%x 'a'), notUsedFrom(%x 'z'), notUsedFrom(%x 'n'), notUsedFrom(%x 'c')
for:
'and-dec-jrne' by 'bcp-jreq'
'and-dec-jreq' by 'bcp-jrne'
'ld-bcp-jrne' by 'btjf'
'ld-bcp-jrne' by 'btjt'
'ld-srl-jrc' by 'btjt'
'ld-srl-jrnc' by 'btjf'

notUsedFrom(%x 'z'), notUsedFrom(%x 'n')
for:
'ld-bcp-jreq' by 'ld-jrpl'
'ld-bcp-jrne' by 'ld-jrmi'

%x means the jump variable, of course.

About the separate patches:
Maybe I can prepare it, but it should be pretty straightforward, I would leave pack1 as one indivisible pack (because they are closely related) and pack2 could be splitted in several patches.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Visenri - 2021-01-27

I've never used the condition NotUsedFrom, my fault, but I have to say that documentation is, to say the least, scarce .

So, without documentation, I have to look at source code or other examples.

After looking at the source code of SDCCpeeph.c, it seems that notUsed can accept multiple variables in one call (but no one used this functionality at least with STM8 rules).
Surprisingly, notUsedFrom has no loop to handle this condition.
So:
notUsed('a' 'c')
should work, but
notUsedFrom(%X 'a' 'c')
will show a "malformed" error.

I think it would be good to handle both cases the same way.
And if we use this feature would make rules much more readable.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Visenri - 2021-03-01

I have updated the files (vs rev 12070) with all the changes requested and some more:

SDCCpeeph.c:
Modified "notUsedFrom": now can be used the same way as "notUsed" (multiple elements to check).
New function "optimizeGoal", accepts arguments like 'size', 'speed', '!size', '!speed' or combinations (except both 'speed' and 'size' for obvious reasons). I hope you like it!
New operator 'bp' for immdInRange as described previously.

Rules:
Pack1 & Pack2 remain almost as they were in my initial patch with some minor changes:
- Added fixes suggested using notUsedFrom as required (only 12 Pack2 rules affected).
- Changed all rules to use notUsedFrom & notUsed with multiple elements to check, much shorter and readable.
- Changed rules to use new function "optimizeGoal" (2 rules in Pack2).

Swap optimizations in pack2 may be useless if you merge the patches from: [patches:#363]
But until then, they're in.

I'll upload the files once I finish testing (I hope next week).

Related

Patches: #363

Last edit: Maarten Brock 2021-03-06

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Philipp Klaus Krause - 2021-03-04
  
  I can't find those updated files you mention. Did you forget an attachment?
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Visenri - 2021-03-04
    
    As I said:
    
    I'll upload the files once I finish testing (I hope next week).
    
    That means, this week, if I can't find more time, I will upload what I have right now.
    But before that I just wanted to know your opinion about the new optimizeGoal as described:
    
    accepts arguments like :
    'size'
    'speed'
    '!size'
    '!speed'
    or combinations
    '!size' '!speed' (that would mean balanced).
    (except both 'speed' and 'size' for obvious reasons). I hope you like it!
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Philipp Klaus Krause - 2021-03-04
      
      Yes, I do like this design.
      Can we put the word "code" in there, so it isn't just "size" and "speed"? After all we are optimizing for code size (and the option is called --opt-code-size.
      
      We'll probably have one more optimizations option in the future, to optimize for RAM usage, possibly called --opt-data-size, so we'll want to avoid confusion between optimizing for code size vs. data size.
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Visenri - 2021-03-08
        
        Finally, here you have the updated files.
        All remains as stated in https://sourceforge.net/p/sdcc/patches/362/#56f4.
        
        Additionally:
        The updated version of OptimizeGoal function uses these strings:
        'code-speed'
        'code-size'
        Added some more comments to the code.
        Changed the NotUsedFrom function description to reflect the new functionality.
        
        Regarding the rules packs:
        Pack1 should be kept together.
        Pack2 can be applied in steps if you want, just search for //---- and you'll see 4 sub-packs. It can be split even more, because some of those sub-packs have unrelated rules, but for me, as long as tests are passed and I see no obvious flaw, all should be good to go in.
        
        Last edit: Visenri 2021-03-08
        
        SDCCpeeph.c
        
        SDCCpeeph.c.diff
        
        peeph.def
        
        peeph.def.diff
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Philipp Klaus Krause - 2021-03-08
        
        In [r12089] I've applied the notUsedFrom improvement and the optimizeGoal (As optimizeFor) to trunk.
        For the bit position I want to have a look at both proposed approaches again before applying one.
        For the peephole rules: They need to be numbered / named (like the existing ones). The peephole rule numbers help in two ways:
        1) When a bug is reported for which --no-peep is a workaround, compiling with --fverbose-asm and looking at the number helps find the broken rule.
        2) Running the regression tests and doing a grep for each peephole rule number helps to find rules that are not tested in regression tests (e.g. rules that have become redundant due to improvements in other parts of SDCC).
        
        Over time, peephole numbers get a bit messy, so once in a while they all get renumbered. Still, any time a new peephole rule is introduced it should have a number.
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
        
        Visenri - 2021-03-08
        
        About the peephole rules numbering. I was avoiding it on purpose for 3 reasons.
        
        1) I didn't know if you wanted to renumber them yourself, maybe my numbering does not fit your taste.
        2) I really don't feel comfortable with it to begin with, as you said, it gets messy over time, because you run out of numbers or want to add another related rule in between 10a and 10b.
        3) I am already using --fverbose-asm and doing search over the asm files to find where the rules are being applied, but I use the whole rule text, for example: "peephole replaced 'xor-and' by 'cpl-and'."
        
        If you want I can give them numbers, but for example, with PACK1, I get into trouble immediately, I have to start using numbers and letters because they are between rule number 17 and 20a, and there are 25 rules in total, how am I expected to assign numbers?.
        
        Here you have various proposals:
        1) Keep the long name to search for the rules, but make sure they are unique
        
        replace restart { ld a, _%1 and a, #%3 ld _%1, a } by { bres _%1, #%4 ; peephole replaced 'and' by 'bres' (compiler symbol).
        
        This is what would identify the rule (I should change some of them to be unique):
        
        ; peephole replaced 'and' by 'bres' (compiler symbol).
        
        2) Use short prefixes and suffixes related to the operation and then numbers, for example:
        
        Same rule would turn into:
        
        ; peephole LAL-b1s replaced 'and' by 'bres' (compiler symbol).
        
        L= LOAD
        A= AND
        L= LOAD
        b= bit operation
        1= version 1 of the rule (in case more match the same pattern).
        s= symbol (symbol or literal with the same rule repeat a lot).
        
        To me this is the best one, it is much easier to avoid conflicts / renumbering in the future.
        
        3) Use short prefixes related to rule groups or more generic concepts:
        
        ; peephole BOP-1 replaced 'and' by 'bres' (compiler symbol).
        
        BOP = bit operation
        1= like current numbering schema.
        
        4) Just keep using the messy numbers as up to now :D.
        
        Maybe I'm making a mountain out of a molehill!
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Peephole rules pack - savings for bitfields operations

The Small Device C Compiler (SDCC), targeting 8-bit architectures

Group

Searches

Help

#362 Peephole rules pack - savings for bitfields operations

Related

Discussion

Related

Related

Related

Related

Related

Related