Menu

New RandGens by me, might be good

2019-02-12
2019-02-13
  • Warren D. Smith

    Warren D. Smith - 2019-02-12
    /**** 
    Several 64-bit random-word generators by Warren D. Smith 2019.
    Compile with  gcc -Wall -O3 -DNDEBUG 
    ****/
    
    #include <stdio.h>     
    #include <assert.h>
    
    #include <inttypes.h>  
    
    #define uint unsigned int
    #define uint128 __uint128_t
    #define uint64 uint64_t
    #define uint32 uint32_t
    #define uint16 uint16_t
    #define uint8  uint8_t
    #define XOR ^
    
    #define U64(y)  UINT64_C(y)   //allows long constants like U64(0x12345678abcdef0) without overflow
    //Also available:   INT8_C,  INT16_C,  INT32_C,  INT64_C,  signed
    //                 UINT8_C, UINT16_C, UINT32_C, UINT64_C,  unsigned
    
    #define U128(hi,lo) ((((uint128)U64(hi)) << 64) + U64(lo))  //allows 128-bit constants via 64-bit hi,lo halves
    
    #define real32 float
    #define real64 double
    
    inline uint64 Rot64(uint64 x, uint h){ //rotates 64-bit word x left (toward MS end) h hops 
      assert(h<64); return( (x<<h) | x>>(64-h) );
    }
    
    // the 8 bytes of x are permuted into reversed order: 
    inline uint64 Bswap64(uint64 x){ return( __builtin_bswap64(x) ); }
    
    uint128 QCGstateS = 873;  //stores 128 bits of state
    uint128 MWCstateZ = 552;  //stores 128 bits of state
    uint128 MWCstateY = 1009; //stores 128 bits of state
    uint64 Brent64state = 945; //stores 64 bits of state
    
    /********* SMWC ("scrambled multiply with carry") generator of psu-random 64-bit words.
    Period = 164112598795790450231783502974331912191
      = 13757*494063303*4902566406487*4925066547083
      = 0.48228 * 2^128  approximately.
    Note this period has no small prime factors, enabling easy
    combination with other random generators if desired.
    Based on the Marsaglia recurrence (c*2^64+x)_new = A*x+c 
    where A = 17793123614663798499 = 0xF6EDDFA7D0A66EE3,
    and note A*2^64-1 = 328225197591580900463567005948663824383 = prime = P.
    The period is (P-1)/2.  Outputs (x XOR c).
    Each bit of the output word has full period by itself.
    Runtime<2.230nanosec on my system (which is 2930 MHz intel core i7 iMac, so 6.5 clock cycles).
    *********************************************/
    uint64 SMWC64a(){
      uint64 c = MWCstateZ>>64;   //hi half
      uint64 x = MWCstateZ;       //lo half
      MWCstateZ = x;
      MWCstateZ *= U64( 0xF6EDDFA7D0A66EE3 ); 
      MWCstateZ += c;
      return( x XOR c );
    }
    
    /**** Same generator but using   A = f0f32cea32ee767b  instead, which yields
    period=229*421*13999*302003823045413*392888402618533 = 0.470605*2^128: ****/
    uint64 SMWC64b(){
      uint64 c = MWCstateY>>64;
      uint64 x = MWCstateY;
      MWCstateY = x;
      MWCstateY *= U64( 0xf0f32cea32ee767b ); 
      MWCstateY += c;
      return( x XOR c );
    }
    
    /********* A GF2-linear generator of psu-random 64-bit words.
    Period = 2^64 - 1 = 18446744073709551615 = 3*5*17*257*641*65537*6700417
    consists of every nonzero word (so exactly equidistributed aside from missing 0).  
    Each bit of output word has full period by itself.
    Runtime<1.722nanosec on my system (5 clock cycles).
    ***********************************/
    uint64 Brent64(){
      Brent64state ^= Brent64state<<7;
      Brent64state ^= Brent64state>>9;
      return(Brent64state);
    }
    
    /***** PQCG ("permuted quadratic congruential") generator of psu-random 64-bit words.
    x := (a*x+b)*x+c  mod  2^w  has  max period =2^w iff
    a=even, b=odd, c=odd, and (b-a) mod 4 = 1.
    For example if  b mod 4=3  (so last digit in hex is 3, 7, B, E) we may use  (2*x+b)*x+c.
    So the following generator has period = 2^128
    (and given that it has 128 bits of state, is exactly equidistrubuted)
    and outputs 64-bits via a certain "scrambling" operation partly new and partly
    inspired by Melissa O'Neill.
    Each bit of the output word has period=2^128 by itself and the 64-bit
    output words are exactly equidistributed.
    This generator is quite "nonlinear."
    Runtime<3.570nanosec on my system (10.5 clocks).
    ******************/
    uint64 PQCG64(){
      uint64 x, y;
      y = QCGstateS>>64;  //hi half
      x = QCGstateS;      //lo half
    #define b U64(0x56d59264c70b44de)
    #define c U64(0x4907130a7beda5e1)
      QCGstateS *= (2*QCGstateS+b);  QCGstateS += c;
    #undef b
    #undef c
      return( Rot64( Bswap64(x) XOR y, y>>58 ) );
    }
    
    /********
    I suspect that any of the generators above, except for Brent64, 
    should be good enough to pass all the tests in such packages as PracRand and U01 supercrush.
    But I have not verified that.
    
    All 4 of the above generators have relatively-prime periods, hence can be combined
    whereupon their periods multiply.  Since the last 3 generators all work via entirely different 
    principles, combining them ought to increase the randomness.  
      ComboRandGenA thus has period = (2^64-1)*13757*494063303*4902566406487*4925066547083
               = 0.48228 * 2^192  approximately and should be highly random in all dimensions up to 3.
      ComboRandGenB combining all 4 generators has period = 2^128*(2^64-1)* 
           13757*494063303*4902566406487*4925066547083*229*421*13999*302003823045413*392888402618533
               = 0.2269652 * 2^448  approximately, should be highly random in all dimensions up to 7,
                       and probably is good enough even for cryptography.
    
    It perhaps is possible to rewrite all these algorithms in a way that yields less latency on micro-parallel
    processors. I have not attempted to optimize that.  Notice both these combo generators run in less time than
    the sum of their component parts, so obviously the C compiler has exploited parallelism.
    *******/
    
    uint64 ComboRandGenA(){ return( Brent64() + SMWC64a() ); }  //Runtime<2.418nanosec on my system (7 clock cycles).
    
    //Runtime<5.765nanosec on my system (17 clock cycles):
    uint64 ComboRandGenB(){ return( SMWC64b() - ( SMWC64a() XOR ( Brent64() + PQCG64() ) ) ); }
    
    void main(){   //driver to demonstrate usage
      uint i;
      for(i=0; i<16; i++){  ComboRandGenB(); }   //"warm up"
      printf("Here are 16 highly random 64-bit numbers:\n");
      for(i=0; i<16; i++){  printf("%016llX\n", ComboRandGenB() ); }
      printf("Here are 16 less random (but faster) 64-bit numbers:\n");
      for(i=0; i<16; i++){  printf("%016llX\n", ComboRandGenA() ); }
      printf("Here are 16 still less random (but still faster) 64-bit numbers:\n");
      for(i=0; i<16; i++){  printf("%016llX\n", SMWC64a() ); }
      //  for(i=999999999; i>0; i--){ ComboRandGenD(); }
    }
    

    (note: orz has editted your post to be more readable)

     

    Last edit: 2019-02-13
    • Warren D. Smith

      Warren D. Smith - 2019-02-12

      Another generator (use WeylIncrem = U64(7640891563999686221) ):

      uint64 XoroS[5] = {346, 925, 630, 787, 932};  //state
      
      /****** This generator has period=(2^256-1)*2^64>0.999999*2^320 and has 320 bits of state.                                                                                                                                     
      It uses only +, shifts, XORs, and bswap, i.e. is multiplication-free.                                                                                                                                                          
      XoroS[0..3] are updated GF2-linearly while XoroS[4] is a Weyl generator.  The                                                                                                                                                  
      output is scrambled using both word-addition, bswap, and variable-distance rotation                                                                                                                                            
      to obliterate GF2-linear relations.  The Weyl component should obliterate "zeroland" issues.                                                                                                                                   
      Runtime<1.374nanosec  on my 2390 MHz intel core i7 iMac system (4.03 clock cycles).                                                                                                                                            
      This is the fastest 64-bit generator I know of with this great a period and that passes                        PracRand (if it does).                                                                                                                                                                     *********/
      uint64 Xoro256(){
        uint64 r = Bswap64(XoroS[0] + XoroS[3]) + XoroS[4] + XoroS[2];
        const uint64 t = XoroS[1] << 17;
        XoroS[2] ^= XoroS[0];   XoroS[3] ^= XoroS[1];
        r = Rot64(r, XoroS[1]>>58);
        XoroS[4] += WeylIncrem;  XoroS[2] ^= t;
        XoroS[1] ^= XoroS[2];  XoroS[0] ^= XoroS[3];
        XoroS[3] = Rot64(XoroS[3], 45);
        return( r );
      }
      

      (note: orz has editted your post to be more readable)

       

      Last edit: 2019-02-13
  • Warren D. Smith

    Warren D. Smith - 2019-02-13

    Warning: this stupid sourceforge netpost gizmo is editing what I post to mess it up.
    I think it is pretty obvious what I actually said, but it posts things different from what I say.

     
  • - 2019-02-13

    I think it is pretty obvious what I actually said, but it posts things different from what I say.

    Not a safe assumption. Your multiplies-equal operations were coming out as simple assignment. I've editted your posts to probably what you actually meant.

    My brief commentary:

    I expect those (with the exception of Brent64) to pass output tests, and likely several of them will pass by significant margins. I don't think any of them are particularly fast for their quality level, though none are particularly slow for their quality level either (unless on hardware that doesn't suite them well).

    I try to avoid of some of the operations you're using. byteswap64 and 64to128 multiplication, while useful, can have rather different speeds on different CPUs, even when restricting yourself to only mainstream 64 bit x86s, and that can get severe on non-64-bit CPUs or ones targetted at more specific niches. Variable shifts/rotates in output functions of PRNGs, on the other hand, I avoid becaue I feel it has the potential to make their PRNGs perform better in testing than in practice.

     
  • Warren D. Smith

    Warren D. Smith - 2019-02-13

    thanks for the edits. I was actually amazed how fast Xoro256() was, only 4 clock cycles,
    despite appearing to the naked eye to employ at least 14 operations. (In particular it was the same or faster than sfc64 on my system.) This I presume is due to a lot (indeed, more than I would have thought possible) of instruction level parallelism. It might be that is special to my machine & compiler (other machines & compilers will make it super slow?) - I have no idea.
    Phenomena like this leave me basically clueless about how and how not to go for speed.

    The purpose of the byteswap64 was so that addition would be GF2-nonlinear. That is, addition is GF2-linear (just XOR) in the least significant bit. If we add, byteswap, then do a second add, the result is not GF2-linear in any bit. The variable rotation also makes every output bit GF2-nonlinear by a different mechanism. I can address your criticism ("I feel it has the potential to make their PRNGs perform better in testing than in practice"), to some extent at least, if I alter my algorithm a bit to change the chronological order, making the variable-rotation happen before not after the output final add.
    That alteration turns out not to hurt the speed, so I did it in my own code.

    I like 64.64=128 multiply, but it must be admitted that (at least on my machine) my randgen that used it takes 7 clocks, not 4 clocks. It looks way faster & simpler on paper, but it is slower in reality.

     

Log in to post a comment.