After seeing how many rules are needed for STM8 for some OR - XOR, and having a look at the 'asm' code generated, I thought it was time to review how that code was generated.
I found that many of the problems with bit operations, XORing and ORing are due to integer promotion causing unnecessary extra casts.
I have tested the attached patch with great results (no regression fails for STM8).
The idea is to reduce literals to char when possible (OR / XOR char with literal integral in range [0..255]).
I put it here instead of committing directly because I am not confident enough about the place where I did the optimization, please have a look at the patch.
Is that the right place and the right way to do this optimization?
With this sample code:
static volatile unsigned char testVar;
void testOrXorOptimization(void)
{
testVar ^= 8;
testVar |= 8;
}
This code is generated for STM8 (peephole optimization disabled):
_testOrXorOptimization:
; ..\src\cast.c: 14: testVar ^= 8;
; genCast
; genAssign
ld a, _testVar+0
clrw x
; genXor
xor a, #0x08
; genCast
; genAssign
ld _testVar+0, a
; ..\src\cast.c: 15: testVar |= 8;
; genAssign
ld a, _testVar+0
; genOr
or a, #0x08
ld _testVar+0, a
; genLabel
00101$:
You can see the casts result in:
With the attached patch this is the result (peephole optimization disabled):
_testOrXorOptimization:
; ..\src\cast.c: 14: testVar ^= 8;
; genXor
bcpl _testVar+0, #3
; ..\src\cast.c: 15: testVar |= 8;
; genOr
bset _testVar+0, #3
; genLabel
00101$:
No more unneeded casts and bit operations are generated directly, we don't even need peephole rules to optimize it!.
This is the difference in the abstract syntax tree:
..\src\test.c:14: ASSIGN(=) (0000000005369B70) type (volatile-unsigned-char fixed)
..\src\test.c:14: SYMBOL (L0:0 B0 testVar=0000000005369A30 @ 0000000005364900) type (volatile-unsigned-char fixed)
..\src\test.c:14: XOR (0000000005369AD0) type (int register)
..\src\test.c:14: SYMBOL (L0:0 B0 testVar=0000000005369850 @ 0000000005364900) type (volatile-unsigned-char fixed)
..\src\test.c:14: CONSTANT (0000000005369990) value = 8, 0x8, 8.000000 type (const-int literal)
..\src\test.c:15: ASSIGN(=) (0000000005369F30) type (volatile-unsigned-char fixed)
..\src\test.c:15: SYMBOL (L0:0 B0 testVar=0000000005369DF0 @ 0000000005364900) type (volatile-unsigned-char fixed)
..\src\test.c:15: OR (0000000005369E90) type (int register)
..\src\test.c:15: SYMBOL (L0:0 B0 testVar=0000000005369C10 @ 0000000005364900) type (volatile-unsigned-char fixed)
..\src\test.c:15: CONSTANT (0000000005369D50) value = 8, 0x8, 8.000000 type (const-int literal)
vs:
..\src\test.c:14: ASSIGN(=) (0000000005309B70) type (volatile-unsigned-char fixed)
..\src\test.c:14: SYMBOL (L0:0 B0 testVar=0000000005309A30 @ 0000000005304370) type (volatile-unsigned-char fixed)
..\src\test.c:14: XOR (0000000005309AD0) type (unsigned-char register)
..\src\test.c:14: SYMBOL (L0:0 B0 testVar=0000000005309850 @ 0000000005304370) type (volatile-unsigned-char fixed)
..\src\test.c:14: CONSTANT (000000000530A1B0) value = 8, 0x8, 8.000000 type (const-unsigned-char literal)
..\src\test.c:15: ASSIGN(=) (0000000005309F30) type (volatile-unsigned-char fixed)
..\src\test.c:15: SYMBOL (L0:0 B0 testVar=0000000005309DF0 @ 0000000005304370) type (volatile-unsigned-char fixed)
..\src\test.c:15: OR (0000000005309E90) type (unsigned-char register)
..\src\test.c:15: SYMBOL (L0:0 B0 testVar=0000000005309C10 @ 0000000005304370) type (volatile-unsigned-char fixed)
..\src\test.c:15: CONSTANT (000000000530A390) value = 8, 0x8, 8.000000 type (const-unsigned-char literal)
Even more interesting results can be seen in tests, for example 'bug-1929.c'
No patch applied, cast generates operations to finally throw away the result restoring 'a' from stack:
After patch:
I have not tested other ports, but benefits are also expected
If someone else gives me a thumbs up for this patch, I will do all the remaining tests and commit it if none fails.
Results with more complex expressions are even better (test: rotate2_andCase_0_rotateLeft_0_structVar_1_xorLiteral_1_size_8.asm).
Here, XOR is combined with SWAP:
And this was the result:
After the patch, perfect optimization:
I was expecting some improvement, but these cases are just shocking me.
More different cases can be seen in rotate2...xorLiteral_1_size_8.
The following RFE might be related:
[feature-requests:#531]
Related
Feature Requests: #531
Last edit: Maarten Brock 2021-06-19
SDCC used to have similar optimizations at the AST level; many have been removed since they caused bugs.
Here is an example of code that breaks with your patch:
According to the C standard, due to integer promotion in the usual arithmetic conversions,
testVar ^ 8andtestVar | 8are of type int, soiandjshould be set to 2. With the patch they get set to 1 instead. Using sizeof like that would be unusual, but possible in C90. With newer C standards the problem gets worse, as_Generic,typeofand, if it gets in,autowould be affected, too.Another disadvantage of doing such an optimization on AST is that it won't work on something like
(int)testVar ^ 8. While a programmer wouldn't write that literally and expect optimization, the (int) could e.g. come from argument promotion of an inline function.The functions in ctype.h would be an example. For those, a few years ago I introduced a similar optimization. It works on the iCode. That tends to be a bit more involved than doing it on the AST, but the advantage is that it applies to even more code and that it doesn't break
_Generic, etc.See optimizeOpWidth in SDCCopt.c for the details. Maybe this or / xor optimization can be done there, too?
I was aware of that limitation.
What didn't cross my mind was the sizeof operator!
I thought that as long as the result was the same, we don't care about the size, but those operators you mention are a problem.
I also agree that those use cases are unusual or do not make sense at all, but they are possible.
After having seen that, we should have tests for those operators. I will create some simple tests.
I will have a look at optimizeOpWidth in SDCCopt.c as you suggested.
I've been able to avoid the cast quite easily for the xor case with the following patch.
It was just a missing operand check for the '^' operator.
So the original example becomes:
Now comes the part where I am more or less lost, how to get from this:
To this:
Because now, the results in rotate2...xorLiteral_1_size_8 are back to its original form (less optimized)
Last edit: Visenri 2021-06-20
It seems to me that this is work for the backend or even the peephole optimizer. Not every target can perform these operations on any location. It very much depends on whether testVar is global (and if so in which memory space on e.g. mcs51) or on stack or in registers.
Maybe, but I think we should simplify the AST and iCode as much as possible to simplify and not repeat in each backend optimizations doable in a generic way.
I've applied your SDCCopt.c.patch in [r12488]
This is still a work in progress, but that part was correct for sure.
I have bad news, after writing some test, the current results (no patch applied) of sizeof operator seem wrong in many cases (or at least, I think so):
All test with literal values with some operation seem to fail:
And that is because "cheapestVal" function will generate a type of V_CHAR for all of them.
Some tests with symbols also fail.
Because the end result is of type char, and the expression is simplified to char type.
Only happens if the outermost operation is a right shift or a bitwise operation.
Test for 'host' do not have a single failure, other backends show exactly the same result (of course, the problem is in the AST).
Either we have to do a lot more work to make this work correctly or leave those cases as undefined behavior. (Or maybe I am missing something)
I haven't seen any explicit mention of use cases (for example in C99 specification: "ISO/IEC 9899:1999") of sizeof for:
.
You have to gather that from other text what would be the expected behavior for such an unusual use case:
Integer promotion must be applied, so result should be of type int, so sizeof must return sizeof(int) when used in such expressions.
But I am wondering:
Last edit: Visenri 2021-06-20
I am working on a patch that changes decoratetype to not reduce operands when the tree starts with a "SIZEOF" operator or other problematic operators.
So, it should be no problem to do optimizations in size for the rest of trees (the vast majority of real cases).
So far it seems promising, far less tests fail, I am checking what's going wrong with the rest.
Last edit: Visenri 2021-06-20
I have accomplished 0 failures (stm8 tested for now) and full optimization modifying the tree decoration process.
I am passing the value RESULT_TYPE_OTHER to resultType input of decorateType to indicate that the result type is unknown and it should use standard promotion rules without reducing types.
The first node using SIZEOF is the one that starts the process using RESULT_TYPE_OTHER, and as a result, setting resultTypeProp to the same value and one flag inside decorateType to avoid all problematic code paths.
Functions like:
valUnaryPM, valComplement, valNot, valMult, valDiv, valMod, valPlus, valMinus, valShift also get an extra parameter to avoid call to cheapestVal function.
Does anyone see a problem with this method?
Could this cause any issues in other parts of the code?
For now it seems to be working ok.
I want to write some more test to check all the operators, and test all other backends, I don't expect any major problem.
The results for now are these:
13 tests get better size and cycles for a total of 1385 saved bytes (1200 correspond to rotate2 tests).
Only 4 tests have an increased size, for a total of 14 bytes (3.5 average).
I will investigate those tests (but I am quite confident that they have to do with the RESULT_TYPE_OTHER used in other parts of the code).
If we want to avoid those cases, maybe I will need to add another parameter to indicate this cases and avoid confusion with normal RESULT_TYPE_OTHER use cases.
I did it this way, because it was less involved than modifying all calls to decorateType and also reviewing the parameters to each function using the resultTypeProp or resultType values.
Should be doable, just more time-consuming.
After confirming that the problem with those 4 cases was the use of RESULT_TYPE_OTHER (as expected), I added a boolean parameter to specifically disable size reduction for SIZEOF trees.
Now 11 tests get better size and cycles with a total reduction of 1398 bytes (1200 from rotate2 tests as before).
No tests with increased code size or cycles.
As soon as I test all the operators and backends I'll upload the patch here so other developers can have a look at it, if no one has objections I'll apply the patches.
What will happen when you take an expression with implicitly narrowed type and assign it to a variable of automatic type?
Last edit: Benedikt Freisen 2021-06-21
Does SDCC support auto variables (introduced in C++11)?
You meant something like this:
The SDCC documentation says nothing about this feature, and it gives me an error (as expected):
Compiles fine in gcc and executes tests correctly in host.
So, if this is what you meant, it's not applicable right now to SDCC.
Not a concern for me in the foreseeable future of a compiler targeting small micro-controllers (Maybe Philipp thinks otherwise).
What I can say right now is that this patch may benefit this kind of feature if we also include the auto initialization tree in the "disable size reduction" cases. It should just work.
SDCC does not yet support this use of auto. And while the wording is not yet in C23, it seems likely that it will be (see the N2735 proposal for details). So, like typeof, it is something SDCC will have to support in the future.
Wow, does
autoget a new meaning in C after 50 years?Unlike C++, the new meaning won't replace the old one, but be an additional one, i.e. both of the following declarations will be valid C:
Last edit: Maarten Brock 2021-06-23
Sorry for hijacking this thread.
But wasn't that already valid with a different meaning?
Yes, it had that meaning until C94, but from C99 to C17 it was a constraint violation.
After adding extra tests, only minor changes were needed to allow almost all operators to be OK.
Aside from adding the extra parameter to decorateType (and passing it to valXX functions) only minimal changes were needed in these operators:
So far, it passes all tests (160) involving many (almost all) operators combined with pure literals and literals with values.
The only tests that fail are boolean operators, I have to check them, just handle them as integers:
I attach the patch files with major changes, so some of you can have a look to see if I am missing something.
I also attach the preliminary tests (The vast majority of these cases fail with current trunk version).
DON'T commit, this is still a work in progress. Just in case someone wants to have a look
Results after testing other backends:
Regarding xor optimization:
Most (if not all) seem to benefit (size and cycles) from the OR-XOR optimization (similar to STM8).
pdk fails because it generates invalid code:
It was like this before the patch:
**
Philip, can you shed some light on this?
I'm not familiar with the instruction set, but I suppose doing an or with a destination other than "a" is not possible.
Could you fix it in the pdk/gen.c, or should I disable this optimization for pdk?
(you can see the generated iCode some posts ago).
I've added the patches with my changes, so, you can have a look.
If you want to test the OR-XOR optimization with pdk, you just have to copy the code after the comment:
Last edit: Visenri 2021-06-23