I've been working again with STM8 SDCC compiler and I found that many things could be improved with more peephole rules.
I have prepared a patch with many new rules and some changes to some code to get better results.
I put bitfields in the tittle, but the savings are in many more places.
For the medium memory model this is the comparison pre-post patch:
Summary for 'stm8': 0 failures, 12335 tests, 2385 test cases, 2532879 bytes, 78765777 tick
Summary for 'stm8': 0 failures, 12335 tests, 2385 test cases, 2532143 bytes, 78691548 ticks
Bitfields is where the improvement is more noticeable:
bitfields (f: 0, t: 131, c: 8, b: 8003, T: 11182)
bitfields (f: 0, t: 131, c: 8, b: 7720, T: 10870)
-3.5% in size.
But the number is not what i think is important, now all bitfield operations are optimized as they should, so programs handling bitfileds in regs in a lot of different ways will benefit a lot. This also makes a lot of operations atomic.
The rules are also splitted to be as small as possible, so a minimal set of rules performs a lot of optimizations, this is needed because a lot of bitfield operations are done in a similar way, but with small variations (extra srl, or swap). Take a look at the rules to see what I mean.
stm8/gen.c:
Minor change to homogenise generated assembly, to allow better/simpler pattern matching for rules
See also ticket by Basil Hussain [feature-requests:#728]
Attached, the new file and the diff with current version.
Related
Feature Requests: #728
Last edit: Maarten Brock 2021-03-06
If I apply this part by itself, I get a regression:
Previously:
Of course, have you tested it after applying patches to rules 10b & 10c?
I mean, change (#(%2 + 0) -> #(_%2)) see peeph.def.diff.
After being applied you shouldn't see regressions, but a net gain.
After a bit more testing, I have applied this change, along with the changes and new rules 10b to 10e in [r12021].
However, I have omitted 10f, as it looks wrong to me: The sp-relative adressing mode of the STM8 only allows offsets up to 255, so the rule would create broken code for %2 > 255.
The regression tests pass, but few regression tests use local objects of total size > 255, which would probably be needed to trigger the issue.
P.S.: Technically, there is a similar problem with the 10a to 10e rules: addw wraps around at 0x10000, while (., x) doesn't. However, AFAIK no code generated by SDCC relies on this wraparound for pointer arithmetic. Maybe it is possible to write C code using uintptr_t, that gets broken by one of 10 a to 10e; if we find such a case, we'll have to deal with it then.
Last edit: Philipp Klaus Krause 2021-01-26
About 10f, my fault, this should fix it (added checks for range):
P.S.: Post edited by spth (Philipp Klaus Krause) to fix code formatting.
Last edit: Philipp Klaus Krause 2021-01-27
Thanks. Looks good , applied in [r12022].
SDCCpeeph.c:
Added new operator 'bp' for immdInRange.
See comments in ticket [patches:#361]
Added new function "optimizeCodeSize" to be used in peephole rules.
Attached, the new file and the diff with current version.
Related
Patches:
#361Last edit: Maarten Brock 2021-03-06
Could we have a more generic alternative to optimizeCodeSize? I'd like to also be able to check for other targets and for the opposite, e.g. applying a rule only if not optimizing for code speed. Maybe something like optimizeFor('code size'), optimizeNotFor('code speed')? Or maybe something like optimizationGoal('code size' '1'), optimizationGoal('code speed' '0')?
Last edit: Maarten Brock 2021-03-06
Any comments about the proposed "optimizeGoal" function?
See last post for details: [patches:#362]
https://sourceforge.net/p/sdcc/patches/362/#56f4
Related
Patches:
#362Last edit: Maarten Brock 2021-03-06
Prerequisites:
Some PACK1 & PACK2 rules make use of:
New operator 'bp' for immdInRange (updated SDCCpeeph.c)
New function "optimizeCodeSize" (updated SDCCpeeph.c)
Some rules need also the modified stm8/gen.c to work properly
Rules 8, 9 , 9a & 10:
Added missing notUsed('c').
Rules 10b & 10c:
Changed to be more generic, to match more cases (changes needed to stm8/gen.c)
See also ticket by Basil Hussain [feature-requests:#728]
#(%2 + 0) ->#(_%2)
Added missing notUsed('c').
Rules 10d, 10e & 10f:
Added new rules similar to 10b and 10c covering:
addw - ldw
addw - ld(%), a
ldw %,sp - addw - ld
Rules 18-X, 19-X, 20-X, 21-X (total 32):
Removed and replaced by new ones (PACK1).
Rules PACK 1
10 new rules to prepare different cases to match same rules
12 rules to handle bset, bres, bcpl in all its variants.
1 rule to optimize bitfield complement.
2 more rules to convert ldw-ld-op-ld where op is and, or & xor to ld-op-ld
Rule 22a:
Moved comment into corect place, was not matching because of that.
Rule 52 & 53:
Renamed to 52a & 53a respectively.
Rules 52b & 53b
Added similar rules to 52 & 53 for large memory model.
Rules j7-eq-ne & j7-ne-eq
Changed spaces to tabs like all other rules.
Rules PACK 2
2 rules to save size when restoring stack addw sp -> pop, popw (requires "optimizeCodeSize" function in SDCCpeeph.c)
2 rules for unneeded sub or cp with 0
4 rules to convert bitfield operations (and-or) to direct load and store
2 rule to convert ldw-ld into ld direct
1 rule to convert and to bcp when possible (enables/simplifies other rules)
2 rules for bitfield test (1 bit), changing and-dec to bcp
2 rules to convert shifts/swap + bcp to bcp
5 rules to simplify bit reversal and copy to carry
2 rules to change ld-and-sub-bccm to bcpl
2 rules to change ld-bcp-jreq to btjf
2 rules to chagne ld-bcp-jrne to btjt
1 rule to change ld-bcp-jreq to ld-jrpl
1 rule to change ld-bcpjrne to ld-jrmi
2 rules to change ld-srl-jrc to btjt (taken from ticket by Basil Hussain
https://sourceforge.net/p/sdcc/feature-requests/690/ )
2 rules to change ld-srl-jrnc to btjf (taken from ticket by Basil Hussain
https://sourceforge.net/p/sdcc/feature-requests/690/ )
1 rule to move ld a outside loop
3 rules for swap optimizations
1 rule to do operations directly on stack for: 'swap' 'clr' 'dec' 'inc' 'neg' 'cpl'
4 rules to remove ld into stack that will be immediately out of scope
3 rules to remove dead addw/incw
1 rule to remove redundant jr
Fixed comment about emulated two bytes skip instruction using cp/bcp
Attached, the new file and the diff with current version.
Related
Feature Requests: #728
Last edit: Maarten Brock 2021-03-06
The fixes for existing rules (8 to 10c, 22a), the port of ret rules to retf (52a, 53a) and the space harmonization (j7-ne-eq, j7-eq-ne) have been applied in [r12020]. Thanks.
Regression tests have been done using my updated stm8/peep.c:
[patches:#290]
Without that patch, the improvements may be a little lower.
Related
Patches: #290
Last edit: Maarten Brock 2021-03-06
An example in action:
Without optimization:
This is how it handles such a monstruosity step by step:
Lots of rules have been applied until reaching the final bcpl, this allows many variations of bits and compiler outputs.
Notice also, that tail call optimization normally disabled the whole optimization, because target label of the last "jp" was not found, resulting in "S4O_ABORT" in "scan4op" inside stm8/peep.c.
With the new stm8/peep.c this is treated as a return, returning S4O_TERM and allowing optimizations.
Last edit: Visenri 2021-01-25
Regarding the rules that replace load-shift-jrnc by btj?: I think these need an additional notUsedFrom condition to cover the case of a being used in the code at label %2.
Anyway, moslty due to me partially applying the patch, the situation has gotten a bit messy.
Would you mind preparing separate patches against current for your peephole packs?
After testing, I think most rules can go in, but with the 4.1.0 release coming soon, and me not having much time these days, I'll try to be very careful, and rather do it bit-by-bit insetad of applying a big patch at once.
When is the release scheduled?
I didn't find time to test all properly with the current version.
Maybe this weekend I'll have some time separate the patches and test them properly.
Would that fit the release schedule?
See https://sourceforge.net/p/sdcc/wiki/SDCC%204.1.0%20Release/.
Soft freeze (i.e. no more new features in trunk) is scheduled for Thursday next week. Even before that I'd be reluctant about big changes, unless necessary for bugfixes.
But work can always continue in branches.
You're right about the rules, but I think there are more rules affected and more conditions must be applied to be strictly correct:
notUsedFrom(%x 'a'), notUsedFrom(%x 'z'), notUsedFrom(%x 'n'), notUsedFrom(%x 'c')
for:
'and-dec-jrne' by 'bcp-jreq'
'and-dec-jreq' by 'bcp-jrne'
'ld-bcp-jrne' by 'btjf'
'ld-bcp-jrne' by 'btjt'
'ld-srl-jrc' by 'btjt'
'ld-srl-jrnc' by 'btjf'
notUsedFrom(%x 'z'), notUsedFrom(%x 'n')
for:
'ld-bcp-jreq' by 'ld-jrpl'
'ld-bcp-jrne' by 'ld-jrmi'
%x means the jump variable, of course.
About the separate patches:
Maybe I can prepare it, but it should be pretty straightforward, I would leave pack1 as one indivisible pack (because they are closely related) and pack2 could be splitted in several patches.
I've never used the condition NotUsedFrom, my fault, but I have to say that documentation is, to say the least, scarce .
So, without documentation, I have to look at source code or other examples.
After looking at the source code of SDCCpeeph.c, it seems that notUsed can accept multiple variables in one call (but no one used this functionality at least with STM8 rules).
Surprisingly, notUsedFrom has no loop to handle this condition.
So:
notUsed('a' 'c')
should work, but
notUsedFrom(%X 'a' 'c')
will show a "malformed" error.
I think it would be good to handle both cases the same way.
And if we use this feature would make rules much more readable.
I have updated the files (vs rev 12070) with all the changes requested and some more:
SDCCpeeph.c:
Modified "notUsedFrom": now can be used the same way as "notUsed" (multiple elements to check).
New function "optimizeGoal", accepts arguments like 'size', 'speed', '!size', '!speed' or combinations (except both 'speed' and 'size' for obvious reasons). I hope you like it!
New operator 'bp' for immdInRange as described previously.
Rules:
Pack1 & Pack2 remain almost as they were in my initial patch with some minor changes:
- Added fixes suggested using notUsedFrom as required (only 12 Pack2 rules affected).
- Changed all rules to use notUsedFrom & notUsed with multiple elements to check, much shorter and readable.
- Changed rules to use new function "optimizeGoal" (2 rules in Pack2).
Swap optimizations in pack2 may be useless if you merge the patches from: [patches:#363]
But until then, they're in.
I'll upload the files once I finish testing (I hope next week).
Related
Patches: #363
Last edit: Maarten Brock 2021-03-06
I can't find those updated files you mention. Did you forget an attachment?
As I said:
That means, this week, if I can't find more time, I will upload what I have right now.
But before that I just wanted to know your opinion about the new optimizeGoal as described:
Yes, I do like this design.
Can we put the word "code" in there, so it isn't just "size" and "speed"? After all we are optimizing for code size (and the option is called --opt-code-size.
We'll probably have one more optimizations option in the future, to optimize for RAM usage, possibly called --opt-data-size, so we'll want to avoid confusion between optimizing for code size vs. data size.
Finally, here you have the updated files.
All remains as stated in https://sourceforge.net/p/sdcc/patches/362/#56f4.
Additionally:
The updated version of OptimizeGoal function uses these strings:
'code-speed'
'code-size'
Added some more comments to the code.
Changed the NotUsedFrom function description to reflect the new functionality.
Regarding the rules packs:
Pack1 should be kept together.
Pack2 can be applied in steps if you want, just search for //---- and you'll see 4 sub-packs. It can be split even more, because some of those sub-packs have unrelated rules, but for me, as long as tests are passed and I see no obvious flaw, all should be good to go in.
Last edit: Visenri 2021-03-08
In [r12089] I've applied the notUsedFrom improvement and the optimizeGoal (As optimizeFor) to trunk.
For the bit position I want to have a look at both proposed approaches again before applying one.
For the peephole rules: They need to be numbered / named (like the existing ones). The peephole rule numbers help in two ways:
1) When a bug is reported for which --no-peep is a workaround, compiling with --fverbose-asm and looking at the number helps find the broken rule.
2) Running the regression tests and doing a grep for each peephole rule number helps to find rules that are not tested in regression tests (e.g. rules that have become redundant due to improvements in other parts of SDCC).
Over time, peephole numbers get a bit messy, so once in a while they all get renumbered. Still, any time a new peephole rule is introduced it should have a number.
About the peephole rules numbering. I was avoiding it on purpose for 3 reasons.
1) I didn't know if you wanted to renumber them yourself, maybe my numbering does not fit your taste.
2) I really don't feel comfortable with it to begin with, as you said, it gets messy over time, because you run out of numbers or want to add another related rule in between 10a and 10b.
3) I am already using --fverbose-asm and doing search over the asm files to find where the rules are being applied, but I use the whole rule text, for example: "peephole replaced 'xor-and' by 'cpl-and'."
If you want I can give them numbers, but for example, with PACK1, I get into trouble immediately, I have to start using numbers and letters because they are between rule number 17 and 20a, and there are 25 rules in total, how am I expected to assign numbers?.
Here you have various proposals:
1) Keep the long name to search for the rules, but make sure they are unique
This is what would identify the rule (I should change some of them to be unique):
; peephole replaced 'and' by 'bres' (compiler symbol).
2) Use short prefixes and suffixes related to the operation and then numbers, for example:
Same rule would turn into:
; peephole LAL-b1s replaced 'and' by 'bres' (compiler symbol).
L= LOAD
A= AND
L= LOAD
b= bit operation
1= version 1 of the rule (in case more match the same pattern).
s= symbol (symbol or literal with the same rule repeat a lot).
To me this is the best one, it is much easier to avoid conflicts / renumbering in the future.
3) Use short prefixes related to rule groups or more generic concepts:
; peephole BOP-1 replaced 'and' by 'bres' (compiler symbol).
BOP = bit operation
1= like current numbering schema.
4) Just keep using the messy numbers as up to now :D.
Maybe I'm making a mountain out of a molehill!