If I did not miscalculate, lonesha256 should have 676B of local variables.
const uint32_t K[64] makes up 256B, but this could be made global, since it's constant.
Without it there would still be 420B of local variables, so still enough to have the >128B situation.
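
To illustrate the idea, here is a minimal sketch of moving the table out of the function so it lives in static storage instead of the stack frame. This is not the actual lonesha256 code: the helper names (k_table_bytes, remaining_local_bytes, round_constant) are made up for illustration, only the first four SHA-256 round constants are spelled out, and the 676B figure is just my estimate from above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* File-scope constant table: placed in static storage (typically ROM/code
   space on embedded targets), not in the caller's stack frame.
   Only the first four SHA-256 round constants are listed; the remaining
   entries are zero-initialized to keep this sketch short. */
static const uint32_t K[64] = {
    0x428a2f98u, 0x71374491u, 0xb5c0fbcfu, 0xe9b5dba5u
};

/* Size of the table in bytes: 64 entries * 4 bytes each. */
size_t k_table_bytes(void)
{
    return sizeof K;
}

/* Bytes of locals left once the table is no longer a local variable.
   total_local_bytes is an estimate supplied by the caller (assumption). */
size_t remaining_local_bytes(size_t total_local_bytes)
{
    return total_local_bytes - sizeof K;
}

/* Hypothetical consumer: reads a round constant from the static table. */
uint32_t round_constant(unsigned i)
{
    assert(i < 64u);
    return K[i];
}
```

With the 676B estimate, remaining_local_bytes(676) gives 420, which matches the number above and is still well past 128B.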

gbz80 still fails test #4, which is not part of the regression test and already failed with sdcc 4.0.0, before this regression was introduced. That failure is suspected to be caused by overflowing WRAM as well, even though the Game Boy has twice as much working RAM as the DS80C390.

But I can't yet see how this function would overflow 4KiB, let alone 8KiB, of RAM.