Practically Random / Discussion / Open Discussion: New RandGens by me, might be good

/**** 
Several 64-bit random-word generators by Warren D. Smith 2019.
Compile with  gcc -Wall -O3 -DNDEBUG 
****/

#include <stdio.h>     
#include <assert.h>

#include <inttypes.h>  

#define uint unsigned int
#define uint128 __uint128_t
#define uint64 uint64_t
#define uint32 uint32_t
#define uint16 uint16_t
#define uint8  uint8_t
#define XOR ^

#define U64(y)  UINT64_C(y)   //allows long constants like U64(0x12345678abcdef0) without overflow
//Also available:   INT8_C,  INT16_C,  INT32_C,  INT64_C,  signed
//                 UINT8_C, UINT16_C, UINT32_C, UINT64_C,  unsigned

#define U128(hi,lo) ((((uint128)U64(hi)) << 64) + U64(lo))  //allows 128-bit constants via 64-bit hi,lo halves

#define real32 float
#define real64 double

inline uint64 Rot64(uint64 x, uint h){ //rotates 64-bit word x left (toward MS end) h hops 
  assert(h<64); return( (x<<h) | x>>(64-h) );
}

// the 8 bytes of x are permuted into reversed order: 
inline uint64 Bswap64(uint64 x){ return( __builtin_bswap64(x) ); }

uint128 QCGstateS = 873;  //stores 128 bits of state
uint128 MWCstateZ = 552;  //stores 128 bits of state
uint128 MWCstateY = 1009; //stores 128 bits of state
uint64 Brent64state = 945; //stores 64 bits of state

/********* SMWC ("scrambled multiply with carry") generator of psu-random 64-bit words.
Period = 164112598795790450231783502974331912191
  = 13757*494063303*4902566406487*4925066547083
  = 0.48228 * 2^128  approximately.
Note this period has no small prime factors, enabling easy
combination with other random generators if desired.
Based on the Marsaglia recurrence (c*2^64+x)_new = A*x+c 
where A = 17793123614663798499 = 0xF6EDDFA7D0A66EE3,
and note A*2^64-1 = 328225197591580900463567005948663824383 = prime = P.
The period is (P-1)/2.  Outputs (x XOR c).
Each bit of the output word has full period by itself.
Runtime<2.230nanosec on my system (which is 2930 MHz intel core i7 iMac, so 6.5 clock cycles).
*********************************************/
uint64 SMWC64a(){
  uint64 c = MWCstateZ>>64;   //hi half
  uint64 x = MWCstateZ;       //lo half
  MWCstateZ = x;
  MWCstateZ *= U64( 0xF6EDDFA7D0A66EE3 ); 
  MWCstateZ += c;
  return( x XOR c );
}

/**** Same generator but using   A = f0f32cea32ee767b  instead, which yields
period=229*421*13999*302003823045413*392888402618533 = 0.470605*2^128: ****/
uint64 SMWC64b(){
  uint64 c = MWCstateY>>64;
  uint64 x = MWCstateY;
  MWCstateY = x;
  MWCstateY *= U64( 0xf0f32cea32ee767b ); 
  MWCstateY += c;
  return( x XOR c );
}

/********* A GF2-linear generator of psu-random 64-bit words.
Period = 2^64 - 1 = 18446744073709551615 = 3*5*17*257*641*65537*6700417
consists of every nonzero word (so exactly equidistributed aside from missing 0).  
Each bit of output word has full period by itself.
Runtime<1.722nanosec on my system (5 clock cycles).
***********************************/
uint64 Brent64(){
  Brent64state ^= Brent64state<<7;
  Brent64state ^= Brent64state>>9;
  return(Brent64state);
}

/***** PQCG ("permuted quadratic congruential") generator of psu-random 64-bit words.
x := (a*x+b)*x+c  mod  2^w  has  max period =2^w iff
a=even, b=odd, c=odd, and (b-a) mod 4 = 1.
For example if  b mod 4=3  (so last digit in hex is 3, 7, B, E) we may use  (2*x+b)*x+c.
So the following generator has period = 2^128
(and given that it has 128 bits of state, is exactly equidistrubuted)
and outputs 64-bits via a certain "scrambling" operation partly new and partly
inspired by Melissa O'Neill.
Each bit of the output word has period=2^128 by itself and the 64-bit
output words are exactly equidistributed.
This generator is quite "nonlinear."
Runtime<3.570nanosec on my system (10.5 clocks).
******************/
uint64 PQCG64(){
  uint64 x, y;
  y = QCGstateS>>64;  //hi half
  x = QCGstateS;      //lo half
#define b U64(0x56d59264c70b44de)
#define c U64(0x4907130a7beda5e1)
  QCGstateS *= (2*QCGstateS+b);  QCGstateS += c;
#undef b
#undef c
  return( Rot64( Bswap64(x) XOR y, y>>58 ) );
}

/********
I suspect that any of the generators above, except for Brent64, 
should be good enough to pass all the tests in such packages as PracRand and U01 supercrush.
But I have not verified that.

All 4 of the above generators have relatively-prime periods, hence can be combined
whereupon their periods multiply.  Since the last 3 generators all work via entirely different 
principles, combining them ought to increase the randomness.  
  ComboRandGenA thus has period = (2^64-1)*13757*494063303*4902566406487*4925066547083
           = 0.48228 * 2^192  approximately and should be highly random in all dimensions up to 3.
  ComboRandGenB combining all 4 generators has period = 2^128*(2^64-1)* 
       13757*494063303*4902566406487*4925066547083*229*421*13999*302003823045413*392888402618533
           = 0.2269652 * 2^448  approximately, should be highly random in all dimensions up to 7,
                   and probably is good enough even for cryptography.

It perhaps is possible to rewrite all these algorithms in a way that yields less latency on micro-parallel
processors. I have not attempted to optimize that.  Notice both these combo generators run in less time than
the sum of their component parts, so obviously the C compiler has exploited parallelism.
*******/

uint64 ComboRandGenA(){ return( Brent64() + SMWC64a() ); }  //Runtime<2.418nanosec on my system (7 clock cycles).

//Runtime<5.765nanosec on my system (17 clock cycles):
uint64 ComboRandGenB(){ return( SMWC64b() - ( SMWC64a() XOR ( Brent64() + PQCG64() ) ) ); }

void main(){   //driver to demonstrate usage
  uint i;
  for(i=0; i<16; i++){  ComboRandGenB(); }   //"warm up"
  printf("Here are 16 highly random 64-bit numbers:\n");
  for(i=0; i<16; i++){  printf("%016llX\n", ComboRandGenB() ); }
  printf("Here are 16 less random (but faster) 64-bit numbers:\n");
  for(i=0; i<16; i++){  printf("%016llX\n", ComboRandGenA() ); }
  printf("Here are 16 still less random (but still faster) 64-bit numbers:\n");
  for(i=0; i<16; i++){  printf("%016llX\n", SMWC64a() ); }
  //  for(i=999999999; i>0; i--){ ComboRandGenD(); }
}

(note: orz has editted your post to be more readable)

Last edit: 2019-02-13

Another generator (use WeylIncrem = U64(7640891563999686221) ):

uint64 XoroS[5] = {346, 925, 630, 787, 932};  //state

/****** This generator has period=(2^256-1)*2^64>0.999999*2^320 and has 320 bits of state.                                                                                                                                     
It uses only +, shifts, XORs, and bswap, i.e. is multiplication-free.                                                                                                                                                          
XoroS[0..3] are updated GF2-linearly while XoroS[4] is a Weyl generator.  The                                                                                                                                                  
output is scrambled using both word-addition, bswap, and variable-distance rotation                                                                                                                                            
to obliterate GF2-linear relations.  The Weyl component should obliterate "zeroland" issues.                                                                                                                                   
Runtime<1.374nanosec  on my 2390 MHz intel core i7 iMac system (4.03 clock cycles).                                                                                                                                            
This is the fastest 64-bit generator I know of with this great a period and that passes                        PracRand (if it does).                                                                                                                                                                     *********/
uint64 Xoro256(){
  uint64 r = Bswap64(XoroS[0] + XoroS[3]) + XoroS[4] + XoroS[2];
  const uint64 t = XoroS[1] << 17;
  XoroS[2] ^= XoroS[0];   XoroS[3] ^= XoroS[1];
  r = Rot64(r, XoroS[1]>>58);
  XoroS[4] += WeylIncrem;  XoroS[2] ^= t;
  XoroS[1] ^= XoroS[2];  XoroS[0] ^= XoroS[3];
  XoroS[3] = Rot64(XoroS[3], 45);
  return( r );
}

(note: orz has editted your post to be more readable)

Last edit: 2019-02-13

Warren D. Smith - 2019-02-13

Warning: this stupid sourceforge netpost gizmo is editing what I post to mess it up.
I think it is pretty obvious what I actually said, but it posts things different from what I say.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

- 2019-02-13

I think it is pretty obvious what I actually said, but it posts things different from what I say.

Not a safe assumption. Your multiplies-equal operations were coming out as simple assignment. I've editted your posts to probably what you actually meant.

My brief commentary:

I expect those (with the exception of Brent64) to pass output tests, and likely several of them will pass by significant margins. I don't think any of them are particularly fast for their quality level, though none are particularly slow for their quality level either (unless on hardware that doesn't suite them well).

I try to avoid of some of the operations you're using. byteswap64 and 64to128 multiplication, while useful, can have rather different speeds on different CPUs, even when restricting yourself to only mainstream 64 bit x86s, and that can get severe on non-64-bit CPUs or ones targetted at more specific niches. Variable shifts/rotates in output functions of PRNGs, on the other hand, I avoid becaue I feel it has the potential to make their PRNGs perform better in testing than in practice.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Warren D. Smith - 2019-02-13

thanks for the edits. I was actually amazed how fast Xoro256() was, only 4 clock cycles,
despite appearing to the naked eye to employ at least 14 operations. (In particular it was the same or faster than sfc64 on my system.) This I presume is due to a lot (indeed, more than I would have thought possible) of instruction level parallelism. It might be that is special to my machine & compiler (other machines & compilers will make it super slow?) - I have no idea.
Phenomena like this leave me basically clueless about how and how not to go for speed.

The purpose of the byteswap64 was so that addition would be GF2-nonlinear. That is, addition is GF2-linear (just XOR) in the least significant bit. If we add, byteswap, then do a second add, the result is not GF2-linear in any bit. The variable rotation also makes every output bit GF2-nonlinear by a different mechanism. I can address your criticism ("I feel it has the potential to make their PRNGs perform better in testing than in practice"), to some extent at least, if I alter my algorithm a bit to change the chronological order, making the variable-rotation happen before not after the output final add.
That alteration turns out not to hurt the speed, so I did it in my own code.

I like 64.64=128 multiply, but it must be admitted that (at least on my machine) my randgen that used it takes 7 clocks, not 4 clocks. It looks way faster & simpler on paper, but it is slower in reality.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

New RandGens by me, might be good

statistical tests & psuedo- random number generators (RNGs, PRNGs)

Forums

Help

New RandGens by me, might be good

New RandGens by me, might be good

statistical tests & psuedo- random number generators (RNGs, PRNGs)

Forums

Help

New RandGens by me, might be good document.SUBSCRIPTION_OPTIONS = { "thing": "topic", "subscribed": false, "url": "subscribe", "icon": { "css": "fa fa-envelope-o" } };

New RandGens by me, might be good