Small Device C Compiler (SDCC) / Patches / #466 mcs51: improve peephole optimizations

Maarten Brock - 2023-11-16

+ if (opdat->regIdx1 >= B0_IDX && opdat->regIdx1 <= B7_IDX) + aln->regsRead = bitVectSetBit (aln->regsRead, BITS_IDX);

Why is there no regsWritten variant of this piece of code?

I know I introduced b0 - b7 in the past, but am now wondering if it could not be better handled by bits.0 - bits.7 directly. It would make this patch simpler I believe.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Oleg Endo - 2023-11-17
  
  Why is there no regsWritten variant of this piece of code?
  
  If a write to b0 implies a write to bits then a write to any bit b0-b7 implies a write to any other bit. This will prevent it from eliminating dead stores of unrelated bits. That's what I recall.
  
  I know I introduced b0 - b7 in the past, but am now wondering if it could not be better handled by bits.0 - bits.7 directly. It would make this patch simpler I believe.
  
  I thought b0-b7 are some kind of allocatable "special 1-bit registers"?
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Maarten Brock - 2023-11-18
    
    bits is a byte in area BIT_BANK which gets placed in the bit-addressable memory range 0x20-0x2F. b0-b7 are EQUs for bits[0] to bits[7]. They should maybe be renamed to bits.0 etc. They are special in the sense that the compiler uses them as bit registers and can/will push and pop them using bits when necessary, just like r0 to r7. This means that bits gets written using pop bits which should invalidate b0 to b7. Also bits gets written when passing bit parameters. First all passed bits are assigned to B.x and then B is assigned to bits.
    
    __bit foo(__bit a, __bit b); __bit bar(__bit b, __bit a) { foo(a, b); return a; }
    
    ;-------------------------------------------------------- ; overlayable bit register bank ;-------------------------------------------------------- .area BIT_BANK (REL,OVR,DATA) bits: .ds 1 b0 = bits[0] b1 = bits[1] b2 = bits[2] b3 = bits[3] b4 = bits[4] b5 = bits[5] b6 = bits[6] b7 = bits[7] ;------------------------------------------------------------ ;Allocation info for local variables in function 'bar' ;------------------------------------------------------------ ;a Allocated to registers b1 ;b Allocated to registers b0 ;------------------------------------------------------------ ; testbits.c:6: __bit bar(__bit b, __bit a) ; ----------------------------------------- ; function bar ; ----------------------------------------- _bar: ; testbits.c:8: foo(a, b); mov c,b1 mov b.0,c mov c,b0 mov b.1,c push bits mov bits,b ; write B to bits lcall _foo pop bits ; write bits from stack ; testbits.c:9: return a; mov c,b1 ; testbits.c:10: } ret
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2023-10-28

10: Applied in [r14395].

Related

Commit: [r14395]

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2023-10-28

11: At first sight, I don't see why you want to attach this information to the ic. It seems redundant: The ic would be a CALL or PCALL; the type of the function called is known there. And the calling convention already tells us how arguments and return values are passed, depending on the type. Other ports (stm8 and the z80-related ones) use the aopArg and aopRet (with stm8IsParmInCall, etc layered on top) helper functions to access this information conveniently.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Oleg Endo - 2023-10-29

8: Looks ok to me, but I didn't apply it yet, since it is apparently needed for 9 only.

That's right.

9: The idea is ok, but IMO the patch needs some minor changes: The MCS-51 SFR are a special way of doing I/O (separate I/O space that does not support indirect addressing). Not just some "special register". Some other architectures supported by SDCC have similar concepts in the hardware or the port, others don't.

Honestly, I did not consider how other targets/ports might interpret the naming of this new function. I can rework that patch to address your points,. However, it will also require fix-up and re-submission of some of the following patches. Alternatively, I can submit a follow up patch to be applied after the original patch series. The latter seems easier to handle. Please let me know how you want to handle it.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Philipp Klaus Krause - 2023-10-29
  
  Please modify the patch before it gets applied.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Oleg Endo - 2023-10-30
    
    Got it. I will then merge patch 8 and patch 9, OK?
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Philipp Klaus Krause - 2023-10-30
      
      ok.
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Oleg Endo - 2023-10-29

11: At first sight, I don't see why you want to attach this information to the ic. It seems redundant: The ic would be a CALL or PCALL; the type of the function called is known there.

I've tried to re-use the exising 'rUsed' bit vector, but it somehow conflicted with the way it's used in other places. It somehow ended up having the wrong information in some cases (and would generate wrong code). One particular case I remember strongly is function 'mcs51/gen.c (getFreePtr)'. It would somehow wrongly think that R0 / R1 is used by the call-insn and would require a workaround to keep using 'rUsed'.

I agree with you, ideally it shouldn't be necessary to add the 'rCallUsed', but it was the easiest way without turning everything else upside down. Also, I didn't (and perhaps still don't) fully understand how all the data structures are used and all the corner cases. So that was the easiest way to add the new functionality.

And the calling convention already tells us how arguments and return values are passed, depending on the type. Other ports (stm8 and the z80-related ones) use the aopArg and aopRet (with stm8IsParmInCall, etc layered on top) helper functions to access this information conveniently.

Having to check this over and over again in multiple places seems inefficient and difficult to maintain. At that time,I haven't looked at other ports. Looking at it now, I agree, 'aopArg' / 'aopRet' uses in z80 look tidy and seems a good way to isolate it. My 'rCallUsed' could be seen as a cache to for that. It's also absolutely necessary in order to support multiple calling conventions without going mad. The mcs51 port had several places with implicit assumptions on register usage around call insns. So my way to isolate it was to set the 'call used' bits when emitting the call insns. At least it puts the calling convention relevant parts into one place and becomes transparent to further optimizations....... well except for the special library calls which are emitted in several places and don't follow the usual call insn expansion and calling conventions.

I get your point and understand that it's easier to have all ports follow the same scheme of doing the same things. But right now I don't see an easy way to get mcs51 port over 'to the other side' without major refactoring. Any suggestions?

BTW, one thing that could also be improved is knowing which registers are preserved and clobbered by a call insn. This information is not there yet but could be also useful. I think it would allow optimizing some corner cases. I was actually going to add bit vectors for that, too later on.

Last edit: Oleg Endo 2023-10-29

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Philipp Klaus Krause - 2023-10-29
  
  Regarding preserved and clobbered registers:
  Information about that is implemented as an attribute of the called function. See the preserved_regs member in SDCCsymt.h. The z80 and related ports currently use that information in register allocation, code generation and the peephole optimizer. I never implemented it for hc08, s08, stm8 and pdk, since these targets have so few registers that nearly every function uses all registers, anyway. It is definitely something that would make sense to have for mcs51.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Philipp Klaus Krause - 2023-10-29
  
  Migrating mcs51 to aopArg, etc isn't the "easy way", but IMO, we'll need it later anyway, for the new register allocator and future calling convention experiments. So I'll look into that next week.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2023-10-29

Group: -->
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2023-10-29

15: The peep.c file already has the D macro, so I propose to reuse it for the new debug logs (modified patch attached). Otherwise we could apply this one once its prerequesites are in place.

Last edit: Philipp Klaus Krause 2023-10-29

0015-mcs51-notUsed.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Oleg Endo - 2023-10-30
  
  OK. I'll replace the patch in my local tree with your version.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Oleg Endo - 2023-10-31
    
    I've just tried that, but noticed that later on, I'm also adding dbglog_scan4op in patch 14 and it's used like that:
    
    dbglog_scan4op ( { port->peep.getRegsRead (*pl); port->peep.getRegsWritten (*pl); printf (" regs rd: "); bitVectPrint (stdout, port->peep.getRegsRead (*pl)); printf ("\n regs wr: "); bitVectPrint (stdout, port->peep.getRegsWritten (*pl)); printf ("\n regs call rd: "); if ((*pl)->ic) bitVectPrint (stdout, (*pl)->ic->rCallUsed); printf ("\n"); });
    
    While the D macro is already there, I'd be the first one to use it in this file...
    So actually I'd like to keep the separate dbglog_deadmove and not lump everything together into a single D. If everything is behind a single D, whenever we have to go back to debug something in this code, we have to go through and comment out unwanted logging and so on. Happened to me several times and it's an annoying thing. I find it more useful to have the logging macros already grouped in some way. It makes it easier to jump into the code.
    
    Last edit: Oleg Endo 2023-10-31
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2023-10-29

16: Could you split this into multiple patches, so we have one patch per "group" (i.e. a set of a few similar rules) of new rules? New peephole rules have a tendency to contain bugs and break something for someone. Splitting this into multiple patches (and commits) would help debug any issues that come up (often users bisect to track down a regression they found, then sdcc devs find out which rule caused it).

Also:

Does notUsed at this point work well enough that the old peephole rules 302 and 303 could be replaced by the following rule?

replace { mov %1,%2 } by { ; Peephole 302 removed dead mov r%1,%2. } if notVolatile(%2), notVolatile(%1), notUsed(%1)

Last edit: Philipp Klaus Krause 2023-10-29
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Oleg Endo - 2023-10-30
  
  What I have there instead at the moment is this:
  
  replace restart { mov ar%1,%2 } by { ; Peephole 303 mov ar%1,%2 removed } if notVolatile(%2), deadMove(%1) replace restart { mov %1,%2 } by { ; Peephole 304 mov %1,%2 removed } if notVolatile(%2), same(%1 'a' 'dpl' 'dph' 'b' 'r0' 'r1' 'r2' 'r3' 'r4' 'r5' 'r6' 'r7'), notUsed(%1) // pattern 304 catches most dead loads of immediates. // however direct addresses of variables might not always pass the notVolatile // it seems and will get stuck around. // FIXME: dptr not supported as register, but it should be. replace restart { mov %1,#%2 } by { ; peephole 304.1 } if same(%1 'a' 'dpl' 'dph' 'b' 'r0' 'r1' 'r2' 'r3' 'r4' 'r5' 'r6' 'r7'), notUsed(%1)
  
  I've tried to use your pattern instead of my current 303 and 304. It makes the function 'scan4op' complain (warning, not hard error) because '%1' is passed into 'scan4op' and it's expected to be a valid register name. Without any constraints, it will also get fed variable names and unsupported registers like dptr, sp, @r0, @r1, psw. Other than that it seems to make no difference in the generated code. It even makes it catch one more mysteriously sticky dead store "clr a" case.
  
  Since the original submission of this patch I've accumulated several more rounds of peephole pattern patches in my local version. Maybe I should re-submit all of them in a new patch set. I went through buggy peephole debugging myself with this one, so I clearly understand why you want to have them split into multiple patches. I'll have a look at that again once all the other prerequisites are in place.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Philipp Klaus Krause - 2023-11-09
    
    IMO, it would make sense for notUsed to not give an error if it gets something that it can't really handle. It could just err on the safe side, returning false for that case.
    That way, the peephole rules can be simpler, and when later, support is implemented for currently not-yet-supported registers, the peephole rules become more powerful automatically.
    P.S.: I mostly see a use for checking for dptr (there notUsed('dptr) should essentiall return return notUsed('dpl) && notUsed('dph)).
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Oleg Endo - 2023-11-09
      
      IMO, it would make sense for notUsed to not give an error if it gets something that it can't really handle. It could just err on the safe side, returning false for that case.
      
      That's what it already does with my patch series. It prints a warning-level thing returns "false" (so 'notUsed' will not eliminate the register) and goes on.
      
      The warnings are extremely useful when working on peephole patterns, because there are no compile time checks of the peephole pattern definitions. It's very easy to have a typo or even thinko in there, so any debugging hint from that machinery is highly useful.
      
      P.S.: I mostly see a use for checking for dptr (there notUsed('dptr) should essentiall return return notUsed('dpl) && notUsed('dph)).
      
      Yes, notUsed('dptr') might be implemented like that. But there aren't many insns that can use 'dptr' directly. I don't think I've seen a lot of dead load/store code of dptr itself.
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2023-10-29

17: Is this worth it? I understand that by doing the tail call optimization, we are gaining a bit in code size and speed at each optimized tail call. On the other hand, we are adding the additional trampoline for tail calls to each binary. This might be worth it for large programs with many banked tail calls, but there might also be cases where the overhead is more than what is saved at the calls. Do you have some numbers for this patch?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Oleg Endo - 2023-10-30
  
  Is this worth it?
  
  Well, I think it is!
  
  First, we have normal function calls being converted into tailcalls. Why should not the same be done for banked calls? Tailcalls have some interesting properties.
  
  The additional code size of the trampoline argument is kinda moot. Non-banked programs will not include those trampolines and helper functions at all, so there is no code size regression. As for banked code programs, those systems having > 64 KByte code space usually don't have code size issues and can easily handle the additional ~16 bytes for the whole program that is > 64 KByte.
  
  Since we can save one ljmp __sdcc_banked_ret for each tailcall, the addtional trampoline size is outweighted by just ~6 tailcalls in the whole program.
  
  Grepping through my local SDCC build folder for __sdcc_banked_jmp, I get 83 hits in "device/lib/huge".
  
  In my own software build, I'm getting 89 hits for __sdcc_banked_jmp and 1054 for __sdcc_banked_call, just to put it into perspective.
  
  Having properly working tailcalls will be interesting if we get to the calling conventions issues later. But even with the current calling convention, it allows for a flatter runtime / call stack, in particular when using 'noreturn' functions.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Philipp Klaus Krause - 2023-10-30
    
    Thanks. That data convinces me that we want tail call optimization for banked calls on mcs51.
    
    I'm not sure about that ic->tailcall stuff yet.
    
    Also, there probably still will be some banked programs with no optimizeable tail calls. I'd suggest to put the tailcall trampoline into a separate assembler file (so the linker will only link it into programs that actually use it). Also, I'd suggest the name in asm to start with three underscores (corresponding to a C identifier starting with two underscores), as this is somewhat cleaner wrt. identifiers reserved for the implementation.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Oleg Endo - 2023-10-30
      
      Also, there probably still will be some banked programs with no optimizeable tail calls. I'd suggest to put the tailcall trampoline into a separate assembler file (so the linker will only link it into programs that actually use it).
      
      I know what you mean. However...
      - Right now the other two existing functions __sdcc_banked_calland__sdcc_banked_ret are all in the same file, so I've added it there
      - The file lib/mcs51/crtbank.asm is just an example implementation. As far as I can see, it's not even referenced by anything in the device lib build. Because the bank switching implementation is so MCU / system dependent, it's expected to be provided by the user application / board support library.
      - In some cases it might be possible to actually reduce the comulated code size of __sdcc_banked_call and __sdcc_banked_jmp by having them in one asm file
      - Again, we're talking about saving 16 bytes in a program that is potentially > 64 KByte in size (if the program is < 64 KByte then why turn on code banking in the first plcace ...)
      
      Also, I'd suggest the name in asm to start with three underscores
      
      I just followed the existing __sdcc_banked_call and __sdcc_banked_ret. If using 3 underscores, then all of them should be renamed at once, IMO for sake of clarity and consistency. Otherwise it's just seems random (from user's point of view).
      
      I'm not sure about that ic->tailcall stuff yet.
      
      Me neither. But this information has to be there somehow, or else it's impossible to distinguish a regular 'jmp' from a tailcall 'jmp'. When expanding a 'noreturn' call, it can already be set directly. But when the peephole transformations kick in, we also need a way to set it afterwards.
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
      - Maarten Brock - 2023-11-16
        
        I had put __sdcc_banked_call and __sdcc_banked_ret in the same file because the one is very unlikely to appear without the other. This does not hold for __sdcc_banked_jmp. But I agree those few bytes of code space should not matter when code banking is used. And unless the user has a C8051F12x he/she has to create his/her own anyway.
        
        I see I named them with two underscores almost 20 years ago, but I now doubt if they should not lose those leading underscores altogether. These functions should not exist in the C namespace.
        
        If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

mcs51: improve peephole optimizations

The Small Device C Compiler (SDCC), targeting 8-bit architectures

Group

Searches

Help

#466 mcs51: improve peephole optimizations

Related

Discussion

Related