Small Device C Compiler (SDCC) / Feature Requests / #464 Better, faster rand()

alvin - 2015-09-25

Philip you may want to have a look at this:

http://z88dk.cvs.sourceforge.net/viewvc/z88dk/z88dk/libsrc/_DEVELOPMENT/stdlib/z80/random/

We're using a Marsaglia xor generator (asm_random_uniform_xor_32.asm) for rand() which is very fast on the z80 and will be fast on any small uP. A link to Marsaglia's paper is included in the program source. This rand() is likely hundreds of times faster than sdcc's current LCG generator.

The random number generator is kept separate from rand() so that library code can generate random numbers without impacting on the program's seed. The two other random number generators there are mainly for people writing games as they return 8-bit random numbers and since this is not related to the C standard this may not be applicable to sdcc but the CMWC is a surprisingly compact, fast and very high quality generator.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2015-10-06

Here are two suggestions implemented in C:

unsigned long y=2463534242; int xor() { yˆ=(y<<13); y^=(y>>17); yˆ=(y<<5); return (y & 0x7fff); } unsigned y16 = 0x8000; unsigned xorshift16(void) { y16 ^= (y16 << 13); y16 ^= (y16 >> 9); return y16 ^= (y16 << 7); }

The first one uses the lower bits of a generator from Marsaglia's paper, the second one uses a cheaper generator of similar style. The first has a period of 2^32-1, while the second has just 2^16-1. The second one is faster. Code size for the mcs51, ds390, hc08 and s08 ports is smaller for the second one, while code size is about the same for both in the stm8, z80, z180, r2k, r3ka, tlcs90 ports.
Both generators are much faster than the current one, and don't introduce a dependency on long multiplication.

Philipp
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2017-02-08

status: open --> closed-fixed

assigned_to: Philipp Klaus Krause

Group: -->
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Philipp Klaus Krause - 2017-02-08

In #9838, I implemented an xorshift 32 with a = 10, b = 9, c = 25. rand() returns the lower 15 bits.

Philipp

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Hi Philip,

the article lists a few possible shift arguments. Among them are some which include byte-wise shifts (by 8, 16 or 24) which should be good for compact code.

Actually I had high hopes for the one with a shift of 24 (because that leaves only one byte to be handled for that shift), but actually it worked out worse than the 10 9 25 version you have chosen:)

So I tried some others and did a crude check for the code size covering some ports:

# Now in svn: 10 9 25 of the "Xorshift RNGs" George Marsaglia paper
sdcc> grep "= t " ./device/lib/rand.c 
        t ^= t >> 10;
        t ^= t << 9;
        t ^= t >> 25;
sdcc> find . -name rand.sym -exec grep -i cseg {} \;
   E CSEG                                       size     9B   flags   20
   5 CSEG                                       size   CE   flags   20
  17 CSEG                                       size     A0   flags   20
   F CSEG                                       size     9B   flags   20
   5 CSEG                                       size   D7   flags   20
  16 CSEG                                       size     C9   flags   20
  17 CSEG                                       size     9E   flags   20
  16 CSEG                                       size     E2   flags   20
  17 CSEG                                       size     7E   flags   20

sdcc> joe ./device/lib/rand.c
sdcc> # 8 7 23
sdcc> make -j 8
sdcc> find . -name rand.sym -exec grep -i cseg {} \;
   E CSEG                                       size     A1   flags   20
   5 CSEG                                       size   D5   flags   20
  17 CSEG                                       size     A5   flags   20
   F CSEG                                       size     A1   flags   20
   5 CSEG                                       size   DE   flags   20
  16 CSEG                                       size     D3   flags   20
  17 CSEG                                       size     A7   flags   20
  16 CSEG                                       size     EC   flags   20
  17 CSEG                                       size     87   flags   20

sdcc> # 8 9 23
   E CSEG                                       size     8E   flags   20
   5 CSEG                                       size   C6   flags   20
  17 CSEG                                       size     90   flags   20
   F CSEG                                       size     8E   flags   20
   5 CSEG                                       size   CF   flags   20
  16 CSEG                                       size     BB   flags   20
  17 CSEG                                       size     91   flags   20
  16 CSEG                                       size     D4   flags   20
  17 CSEG                                       size     71   flags   20

sdcc> # 5 27 8
   E CSEG                                       size     A7   flags   20
   5 CSEG                                       size   DE   flags   20
  17 CSEG                                       size     AB   flags   20
   F CSEG                                       size     A7   flags   20
   5 CSEG                                       size   E7   flags   20
  16 CSEG                                       size     D3   flags   20
  17 CSEG                                       size     AD   flags   20
  16 CSEG                                       size     EC   flags   20
  17 CSEG                                       size     8D   flags   20

sdcc> # 3 25 24
   E CSEG                                       size     A3   flags   20
   5 CSEG                                       size   D5   flags   20
  17 CSEG                                       size     A7   flags   20
   F CSEG                                       size     A3   flags   20
   5 CSEG                                       size   DE   flags   20
  16 CSEG                                       size     D1   flags   20
  17 CSEG                                       size     A8   flags   20
  16 CSEG                                       size     EA   flags   20
  17 CSEG                                       size     88   flags   20

sdcc> # 3 21 16
   E CSEG                                       size     B0   flags   20
   5 CSEG                                       size   E0   flags   20
  17 CSEG                                       size     B3   flags   20
   F CSEG                                       size     B0   flags   20
   5 CSEG                                       size   E9   flags   20
  16 CSEG                                       size     DE   flags   20
  17 CSEG                                       size     B5   flags   20
  16 CSEG                                       size     F7   flags   20
  17 CSEG                                       size     95   flags   20

# the above CSEG list is not complete: it does only include the following ports:
sdcc> find . -name rand.sym -exec grep -i cseg -l {} \;
./device/lib/ds390/rand.sym
./device/lib/hc08/rand.sym
./device/lib/medium/rand.sym
./device/lib/ds400/rand.sym
./device/lib/s08/rand.sym
./device/lib/small-stack-auto/rand.sym
./device/lib/large/rand.sym
./device/lib/large-stack-auto/rand.sym
./device/lib/small/rand.sym

This (for current SDCC code generation) would suggest 8 7 23 as sweet spot.
Thanks, that was fun (distracting me from Lexica on Android:)

Frieder Ferlemann - 2017-02-09

sorry, meant to say 8 9 23. (8 7 23 actually is worse)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Philipp Klaus Krause - 2017-02-10
  
  Thanks for the data. I only looked into code size for z80 (where 8 9 23 is also better) and stm8 (where it is worse).
  My goal was to allow generation and writing of efficient code.
  
  For the z80 I wanted all shifts to be by at least 8 (both 10 9 25 and 8 9 23 are ok there), so that with 7 registers no spilling would be needed when only keeping the nonzero byzes for the xor. While this would allow a great implementation in assembler, SDCC cannot generate such spill-free code yet.
  
  For speed, I wanted to minimize the number of byte shifts. I.e. the shifts of a byte by one bit) At least z80 and stm8 only do shifts for nonzero bytes (I didn't chek the other ports but assume they are the same). 10 9 25 needs 10 such instructions (2 * 3 for the 10, 1 * 3 for the 9, 1 * 1 for the 25). 8 9 23 needs 17 (0 * 3 for the 8, 1 * 3 for the 9, 7 * 2 for the 23). So 8 9 23 needs more shifts, which will impact speed, but not code size (since 14 of these shifts would be implemented as 2 shifts in a loop executed 7 times).
  
  Philipp
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Maarten Brock - 2017-02-11
    
    Are you saying that a shift by a literal 7 is not implemented as a shift by 1 in the opposite direction?
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
    - Philipp Klaus Krause - 2017-02-11
      
      Depends.
      
      A shift by 7 could be done as one rotate followed by one and with mask 0x01. But sometimes there are problems with rotate and and:
      
      On stm8, we do not have a plain rotate instruction. We only have a rotate through carry, and would thus need two rotate instructions. Instead of the and-mask we could then do a clear in between the two. But that already doesn't generalize to implementing a shift by 6: There we would need three rotations, and can't use clear in between them. We'd have to and-mask. But and is only available on the 8-bit register a, while shift and roate through carry are available for other registers and memory, too. On register a and on memory bytes we also have the 4-bit rotate instruction swap. There also is a multiplication instruction and sometimes multiplying by a poer of two is actually cheaper than doing shifts.
      
      On z80, we do have the desired rotate. We do not have a cheap clear instruction (except for register a, where we coulduse xor a,a). Again and only works on the 8-bit register a. Shifts and rotates are available for all 8-bit registers and for the 16-bit register hl¹. On a, we also have have the swap instruction, which is a 4-bit rotate.
      
      In the backends (stm8 and z80-related) there are two different code generation functions for each shift direction:
      
      genLeftShiftLiteral / genRightShiftLiteral: Generates efficent code using everything the processor has to offer (shift, roate, swap, mult, and-masking, etc). Handles only shifts with 8 or 16 bit results by literals.
      
      genLeftShift / genRightShift: Generates correct code using mostly shift instructions and moving byte around. Handles shifty by unknown amounts and shifts with 32-bit or wider results.
      
      Philipp
      
      Edit: hl has left shift and rotate only (via add hl, hl and adc hl, hl), but not right shift and rotate.
      
      Last edit: Philipp Klaus Krause 2017-02-11
      
      If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Better, faster rand()

The Small Device C Compiler (SDCC), targeting 8-bit architectures

Group

Searches

Help

#464 Better, faster rand()

Discussion