A couple of weeks ago I was trying to implement some operators not implemented for STM8.
I came to that after seeing that optimization for SWAP was not handled as stated in the documentation:
Nibble and Byte Swapping SDCC recognizes the following expressions: unsigned char i; unsigned int j; ... i = ((i << 4) | (i >> 4));
But after seeing the actual SDCC source code I realized that what the help doesn't say is that the compiler recognized those expressions, but it's up to each port to implement the optimized instructions for each special operator.
So I found that neither SWAP, RRC nor RLC were implemented for STM8.
It took some time to dig into the generator code to get an rough idea of how it works.
But I have some doubts.
I have working code for SWAP, RRC and RLC for 1,2 and 4 bytes (passes regression tests, but those do not cover all code flow of cases implemented).
Questions:
Is the variable allocator free to store variables bigger than 2 bytes splitted between X, Y and stack?, for example using combinations like:
XH:YL
XL:YL
XL:STACK
Will it ever call this function with mixes like this?:
Result in Y, left operand in YL:XL
Where the result ovewrites the left operand?
if so, code like this (ROTATE code for 2 bytes size) can fail because genMove may ovewrite the left operand?:
genRotate (const iCode *ic, asmop *result_aop, asmop *left_aop, int size, bool rotateLeft) if (size == 1) { ..... Size 1 code, I can put it here if you want, but I think it is ok, because variable can only be in one place. } else if (size == 2) { enum asminst rotateIns = (rotateLeft ? A_SLLW : A_SRLW); bool pushed_xy = false; bool use_reg_y = false; asmop * reg_aop = ASMOP_X; short reg_idx = X_IDX; if (aopInReg (result_aop, 0, Y_IDX)) use_reg_y = true; else if (!regDead (reg_idx, ic)) { push (reg_aop, 0, 2); pushed_xy = true; } if (use_reg_y) { reg_aop = ASMOP_Y; reg_idx = Y_IDX; } genMove_o (reg_aop, 0, left_aop, 0, 2, regDead (A_IDX, ic), !use_reg_y, use_reg_y); // Rotate emit3w (rotateIns, reg_aop, 0); enum asminst rotateIns = (rotateLeft ? A_SLLW : A_SRLW); bool pushed_xy = false; bool use_reg_y = false; asmop * reg_aop = ASMOP_X; short reg_idx = X_IDX; D (emit2 ("; genRotate2 default", "")); if (aopInReg (result_aop, 0, Y_IDX)) use_reg_y = true; else if (!regDead (reg_idx, ic)) { push (reg_aop, 0, 2); pushed_xy = true; } if (use_reg_y) { reg_aop = ASMOP_Y; reg_idx = Y_IDX; } genMove_o (reg_aop, 0, left_aop, 0, 2, regDead (A_IDX, ic), !use_reg_y, use_reg_y); // Rotate emit3w (rotateIns, reg_aop, 0); // Copy carry to the other side symbol *tlbl = (regalloc_dry_run ? 0 : newiTempLabel (NULL)); if (!regalloc_dry_run) emit2 ("jrnc", "!tlabel", labelKey2num (tlbl->key)); cost (2, 1); if (rotateLeft) emit3w (A_INCW, reg_aop, 0); else { emit2 ("addw", !use_reg_y ? "x, #0x8000" : "y, #0x8000"); cost ((reg_idx == X_IDX) ? 3 : 4, 2); } emitLabel (tlbl); genMove_o (result_aop, 0, reg_aop, 0, 2, regDead (A_IDX, ic), !use_reg_y, use_reg_y); if (pushed_xy) { pop (reg_aop, 0, 2); } } ......
The code by default will use X register, but if result will be in Y, it uses it.
If result is not in Y, it checks if X is dead, if not, it pushes it.
Next, the left operand is moved into the register used (no code is generated if both are the same).
After that the processing is done with the register, shift left and copy carry bit to the other side.
Finally a result is moved to the result operand (no code is generated if both are the same).
Does it look correct to you?
Can it fail if result is Y and left operand is in YL:XL (I don't think this condition will ever exist, because it doesn't make any sense).
Another question for testing: Is there any easy way to convince the allocator to store the left operand in a specific register (so I can write test cases to test all code paths used in this function).
Cases like these make much more sense:
Result in STACK
Left operand in XL:STACK
or
Result in STACK
Left operand in XL:YL
Do these cases happen? I think they should work with the code as it is.
But this will fail:
Result in STACK:XH
Left operand in XL:YL
Because XL will be considered not dead, X will be pushed.
X will be loaded with the left operand (but x_dead_global param will be true, possibly failing).
After that, the rotation will take place in X register.
And then X will be poped, discarting the result in X register!.
Last edit: Visenri 2021-02-04
Hello again.
I cleaned up the code, but my questions still remain unanswered:
1- Is the variable allocator free to store variables bigger than 2 bytes splitted in any possible combination between X, Y and stack?
For example using combinations like:
XH:YL
XL:YL
XL:STACK
If this is the case, more checks should to be done (at least to trigger an assertion).
2- Is there any easy way to convince the allocator to store the left operand in a specific register?
(so I can write test cases to test all code paths used in this function).
And now I have two more:
3-Is it redundant to do checks like these?:
!aopInReg (result_aop, 0, reg_idx) && !regDead (reg_idx, ic)
I mean, when result is the same operand as reg_idx, will regDead return always "true"?
4-Is my code so ugly that no one thinks it deserves any answers :D?
I have working code for SWAP, RRC and RLC for 1,2 and 4 bytes (passes regression tests, but those do not cover all code flow of cases implemented).
This is just a snippet of the code used for Rotation of 2 bytes.
The stm8 register allocator is free to do nearly anything. xh:yl, xl:stack, etc are surely possible.
However, you could disallow it from codegen for your cases. See lines 2204-2205 in gen.c for an example.
Not really. Doing some operations that are particularly fast via a specific register on the variable often works though.
A register that is overwritten by an iCode is considered dead at that iCode. The other reason for a register being dead is that nothing might be stored there.
I don't know, I haven't read the code yet. I was very busy and it didn't seem urgent (the urgent thing right now is the 4.1.0 release, which requires testing, finding and fixing bugs).
Last edit: Philipp Klaus Krause 2021-02-12
Thank you very much for the answers.
1- Pretty much confirms what I suspected. Just fake a super high cost in these cases hoping this will make the allocator think it's a bad idea, then add an assertion just in case it happens.
2- See below.
3- That was exactly my idea: So, I can remove some unneeded checks.
4- Excuse-me, It's just me being too impatient, sometimes I think I am talking alone here :D.
2- I can get easily the result in a register, but I haven't seen any way to get source operand in a register, at least for the operations tested (SWAP, RRC and RLC).
For example:
"source in stack" - "result in register"
gives:
I was hoping this one to be a "source in reg" to "result in reg", but nope, intermediate result is saved first in the stack, so we get exactly the same case here (from the genRLC point of view):
gives:
Last edit: Visenri 2021-02-13
I think it is the way SDCC generates iCode that prevents it from allocating intermediate results in a register (at least sometimes):
For example:
Note: peephole optimizer disabled to better see the code output.
Two unneeded ld's to and from stack.
If we combine operations in the same statement, we get optimal result:
Cosmic compiler has more or less the same behavior.
But IAR is able to combine operations in different statements:
Using the same code for SDCC (even with peephole-opt enabled) results in:
Unneeded ld's are still generated. One is removed by the peephole optimizer.
For your first test_combined_op, the register allocator should still allocate the intermediate result into a rather than on the stack. I'll try to find out why that doesn't happen sometime next week.
Looks like there is an issue with arguments. The register allocator never sees value, so it always stays on the stack where it was passed.
However, if you first copy value to some other local variable, everything looks fine:
The case for SWAP and rotate the problem is a little worse, because the operation is not recognized once you add the previous operation in the same statement:
As shown before, this is fully optimized:
This is almost optimal:
but this is not recognized as rotation:
I am not saying IAR is better here, because with my code for rotation and swap, SDCC can beat IAR, because IAR recognizes neither "rotation" nor "swap".
I am just talking about storage of intermediate results and its relation with combination of statements.
The interesting aspect in your last example is that one value + 1 is handled as an 8-bit type, while the other is handled as a 16-bit type.
This difference already happens in the AST, i.e. early in the frontend. Possibly some leftover optimization (IMO it would make more sense to do such optimizations later, on the iCode).
If hte return type is changed to unsigned int, both are considered to be the same, and value + 1 is calculated only once.
Casting the result of the +1 operation to unsigned char forces the +1 to be done only once.
But swap is still not recognized (no matter how many extra casts I add):
But with "unsigned int" version all works as expected, one increment and swap operation (and it also works with or without intermediate variable):
Could you post the current version of your patch here sometime?
I want to remove all the unneeded checks and add the code to disallow the values in split registers.
I'll retest all and then I'll post it here if you want, but I have also to update my current version to the newest one, so my diffs will but up to date.
I hope today I'll be able to upload something.
If you want to have a look, this is what I have right now:
This is for rotation:
And this is the calling function for both:
Called in genSTM8iCode
And modification to stm8/main.c
I also added more instructions, not implemented in
emit3cost:
A_EXG
emit3wcost:
A_EXGW
A_SWAPW
And changed places where this was hardcoded.
After writing some test cases I am getting failures because of operands splitted in registers even when I am using the trick suggested ( cost (1000, 1000) in forbidden cases).
Rotation with 4 bytes is the one triggering the failures ( assertions have been commented to see the end result):
I keep getting results like this:
; Left : a, xl, xh, (0x04, sp)
; Result : yh, a, xh, xl
Attached my "stm8/gen.c" file, the test cases code in "rotate2.c" and the results in "results-test-stm8.log".
Last edit: Visenri 2021-02-14
I tried changing the code to do the whole copy using only one call to genMove_o, but it results in an assertion failure in genMove_o ("Unimplemented."):
genMove_o (ASMOP_XY, 0, left_aop, 0, 4, regDead (A_IDX, ic), regDead (X_IDX, ic), regDead (Y_IDX, ic));
gen/stm8/rotate2/rotate2_size_32_andCase_1_xorLiteral_0_rotateLeft_-1.c:114: error 9: FATAL Compiler Internal Error in file 'gen.c' line number '2138' : Unimplemented.
Corresponding to these lines:
After more testing, this is the problematic case that triggers the assertion in genMove_o ("Unimplemented."):
The rotate function input asked for these operands:
The rotate function calls genMove_o with the closest combination of X and Y (to do the rotation), in this case Y:X:
The copy fails because it contains a permutation:
XH is already in place
A is copied into YH
the remaining YL-XL form a permutation, so none can be copied without overwriting the source data for the other.
As a proof of concept I've modified the code of genMove_o to check if any other reg can be used to mirror the value of YL, so the copy won't end in a dead lock.
It uses YH and the test compiles and passes the regression test.
The copy suceeds because it's no longer a permutation:
XH is already in place
YL is copied into YH (temp copy of YL, banned from being written until used)
XL is copied into YL
YH is copied into XL (YH write is enabled after being used)
A is copied into YH
The general concept would be:
If the copy is not doable, try it again using a free register as temporary storage (one free or used in the result and not involved in the permutation).
Last edit: Visenri 2021-02-16
For simple cases where A is not involved in the permutation (like this one) even an easier approach would be to use A to perform it.
I have tested something like this:
where soffset0 and soffset1 represent the offsets of the registers to be exchanged.
gives:
vs the mirror method explained previously:
Both examples pass the regression test.
Last edit: Visenri 2021-02-16
Right now I have a fully working version of this operands working with build 12014, I am compiling now build 12070, I hope it works without any issues.
My code for SWAP, RRC and RLC has been a hardtest for "genCopy" function.
After adding more cases to my test suite, I found other permutations making "genCopy" fail.
I have implemented a "simple" permutation algorithm processed before doing reg to reg copy, the order is as follows:
.
The algorithm is something like this:
The new check for permutations is only done if not in a dry run, to save processing time and to normally avoid use of permutation (99% of cases).
For now it only handles one permutation, I mean, just one backup must unlock the copy operation.
But it could be expanded easily wrapping it all in a loop trying multiple times until no more copy operations take place.
It only handles the backup using A, X or Y register.
For now this covers all cases seen so far.
Examples found in my tests and handled correctly by the algorithm:
If all regression test pass with 12070 I'll upload it here later.
Here you have the patches, including updated files and diffs.
I added also a new test for rotations/swap: rotate2.c.
gen.c
Summary of changes:
Implemented instruction cost
A_EXG
A_EXGW
A_SWAPW
Added asmop ASMOP_YX to be used for swap.
Added regBytesUsedByAop (may be simplified after seeing other code).
Added functions isAopAssignableRegPos and AopAssignRegToStackOverWritePos to be used in genCopy.
Added support for permutations in genCopy.
Added genSwap, genRotate and genRotateOperation functions (SWAP, RRC and RLC).
There is a lot of debug output, feel free to remove what is too much for you.
Comparison of results for rotate2 test
build 12070
rotate2 (f: 0, t:1620, c: 108, b: 303577, T: 638263)
after patches:
rotate2 (f: 0, t:1620, c: 108, b: 270660, T: 601200)
Last edit: Visenri 2021-03-03
I'll look at the actual functionality when I find time later.
For now, I've picked the exg / exg handling improvement and the rotate2.c test into the sdcc-next branch (to be merged to trunk after the 4.1.0 release).
If you have looked at the log files, you may have noticed that what I wrote about the results was really the opposite of what you get.
I have corrected the previous post.
Of course with the patches it uses less code and cycles, not more.
rotate2 test (stm8):
revision 12070
rotate2 (f: 0, t:1620, c: 108, b: 303577, T: 638263)
after patches:
rotate2 (f: 0, t:1620, c: 108, b: 270660, T: 601200)
I've done minor changes to the patch (only to genSwap function), here you have the updated gen.c compared against the next branch 12085.