Another generator (use WeylIncrem = U64(7640891563999686221) ):
uint64 XoroS[5] = {346, 925, 630, 787, 932}; //state
/****** This generator has period (2^256-1)*2^64 > 0.999999*2^320 and has 320 bits of state.
It uses only +, shifts, XORs, and bswap, i.e. it is multiplication-free.
XoroS[0..3] are updated GF2-linearly while XoroS[4] is a Weyl generator. The
output is scrambled using word-addition, bswap, and variable-distance rotation
to obliterate GF2-linear relations. The Weyl component should obliterate "zeroland" issues.
Runtime<1.374nanosec on my 2390 MHz Intel Core i7 iMac system (4.03 clock cycles).
This is the fastest 64-bit generator I know of that has a period this large and passes PractRand (if it does). *********/
uint64 Xoro256(){
    uint64 r = Bswap64(XoroS[0] + XoroS[3]) + XoroS[4] + XoroS[2];
    const uint64 t = XoroS[1] << 17;
    XoroS[2] ^= XoroS[0]; XoroS[3] ^= XoroS[1];
    r = Rot64(r, XoroS[1]>>58);
    XoroS[4] += WeylIncrem; XoroS[2] ^= t;
    XoroS[1] ^= XoroS[2]; XoroS[0] ^= XoroS[3];
    XoroS[3] = Rot64(XoroS[3], 45);
    return( r );
}
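A minimal self-contained sketch of the routine above follows, for anyone who wants to compile it in isolation. The `uint64` typedef, the `U64` macro, and the bodies of `Bswap64` and `Rot64` are assumptions, since the post does not show them; in particular, `Rot64` is assumed to rotate left.

```c
#include <stdint.h>

/* Assumed helper definitions; the post does not show them. */
typedef uint64_t uint64;
#define U64(x) UINT64_C(x)

static const uint64 WeylIncrem = U64(7640891563999686221);

/* Reverse the byte order of a 64-bit word (portable stand-in for bswap). */
static uint64 Bswap64(uint64 x) {
    x = ((x & U64(0x00FF00FF00FF00FF)) << 8)  | ((x >> 8)  & U64(0x00FF00FF00FF00FF));
    x = ((x & U64(0x0000FFFF0000FFFF)) << 16) | ((x >> 16) & U64(0x0000FFFF0000FFFF));
    return (x << 32) | (x >> 32);
}

/* Rotate left by k bits (assumed direction). Masking both shift counts
   keeps them in 0..63, so k == 0 is well defined. */
static uint64 Rot64(uint64 x, uint64 k) {
    return (x << (k & 63)) | (x >> ((64 - k) & 63));
}

/* State and update step copied from the post. */
static uint64 XoroS[5] = {346, 925, 630, 787, 932};

static uint64 Xoro256(void) {
    uint64 r = Bswap64(XoroS[0] + XoroS[3]) + XoroS[4] + XoroS[2];
    const uint64 t = XoroS[1] << 17;
    XoroS[2] ^= XoroS[0]; XoroS[3] ^= XoroS[1];
    r = Rot64(r, XoroS[1] >> 58);          /* variable rotation by the top 6 bits */
    XoroS[4] += WeylIncrem; XoroS[2] ^= t; /* Weyl step, then linear step */
    XoroS[1] ^= XoroS[2]; XoroS[0] ^= XoroS[3];
    XoroS[3] = Rot64(XoroS[3], 45);
    return r;
}
```

With the seed shown, the first call returns 0x6D0400000000061A: XoroS[1]>>58 is 0 for that seed, so the variable rotation is a no-op on the first output whichever direction Rot64 rotates.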
(note: orz has edited your post to be more readable)
Last edit: 2019-02-13
Warning: this stupid SourceForge netpost gizmo is editing what I post to mess it up.
I think it is pretty obvious what I actually said, but it posts things different from what I say.
2019-02-13
"I think it is pretty obvious what I actually said, but it posts things different from what I say."
Not a safe assumption. Your multiplies-equal (*=) operations were coming out as simple assignment. I've edited your posts to what you probably meant.
My brief commentary:
I expect those (with the exception of Brent64) to pass output tests, and likely several of them will pass by significant margins. I don't think any of them are particularly fast for their quality level, though none are particularly slow for their quality level either (unless on hardware that doesn't suit them well).
I try to avoid some of the operations you're using. byteswap64 and 64to128 multiplication, while useful, can have rather different speeds on different CPUs, even when restricting yourself to only mainstream 64-bit x86s, and that can get severe on non-64-bit CPUs or ones targeted at more specific niches. Variable shifts/rotates in output functions of PRNGs, on the other hand, I avoid because I feel it has the potential to make their PRNGs perform better in testing than in practice.
Thanks for the edits. I was actually amazed how fast Xoro256() was, only 4 clock cycles, despite appearing to the naked eye to employ at least 14 operations. (In particular it was the same speed as or faster than sfc64 on my system.) This I presume is due to a lot (indeed, more than I would have thought possible) of instruction-level parallelism. It might be that this is special to my machine & compiler (other machines & compilers will make it super slow?) - I have no idea.
Phenomena like this leave me basically clueless about how and how not to go for speed.
The purpose of the byteswap64 was to make the addition GF2-nonlinear. That is, addition is GF2-linear (just XOR) in the least significant bit. If we add, byteswap, then do a second add, the result is not GF2-linear in any bit. The variable rotation also makes every output bit GF2-nonlinear, by a different mechanism. I can address your criticism ("I feel it has the potential to make their PRNGs perform better in testing than in practice"), to some extent at least, if I alter my algorithm a bit to change the order of operations, so that the variable rotation happens before, not after, the final output add.
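The point about the least significant bit can be checked with a small sketch. The function names below are hypothetical, and the check only covers bit 0, not the stronger claim about every bit. The low bit of a+b always matches the low bit of a^b because no carry can reach bit 0. After a byteswap, bit 0 holds what was bit 56 of the sum, which does depend on carries from below.

```c
#include <stdint.h>

/* Reverse the byte order of a 64-bit word (portable stand-in for bswap). */
static uint64_t bswap64(uint64_t x) {
    x = ((x & UINT64_C(0x00FF00FF00FF00FF)) << 8)  | ((x >> 8)  & UINT64_C(0x00FF00FF00FF00FF));
    x = ((x & UINT64_C(0x0000FFFF0000FFFF)) << 16) | ((x >> 16) & UINT64_C(0x0000FFFF0000FFFF));
    return (x << 32) | (x >> 32);
}

/* Returns 1 when bit 0 of a+b equals bit 0 of a^b. This holds for all
   inputs, because no carry can propagate into the lowest bit. */
static int low_bit_of_sum_is_xor(uint64_t a, uint64_t b) {
    return ((a + b) & 1) == ((a ^ b) & 1);
}

/* Returns 1 when bit 0 of bswap(a+b)+c equals bit 0 of the GF(2)-linear
   stand-in bswap(a^b)^c. This can fail: bit 0 after the byteswap is
   bit 56 of a+b, which depends on carries out of the lower 56 bits. */
static int low_bit_after_bswap_is_xor(uint64_t a, uint64_t b, uint64_t c) {
    uint64_t with_add = bswap64(a + b) + c;
    uint64_t with_xor = bswap64(a ^ b) ^ c;
    return (with_add & 1) == (with_xor & 1);
}
```

For example, with a = b = 2^55 and c = 0, the sum is 2^56 while the XOR is 0, so the second function returns 0: the low output bit is no longer a XOR of input bits.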
That alteration turns out not to hurt the speed, so I did it in my own code.
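The altered code is not shown in the post, so the following is only one possible reading of "rotation before the final add", with hypothetical function names. It assumes that the add of XoroS[2] is the one moved after the rotation, and that the rotation is a left rotation; the state update is left out because the alteration concerns only the output.

```c
#include <stdint.h>

/* Reverse the byte order of a 64-bit word (portable stand-in for bswap). */
static uint64_t bswap64(uint64_t x) {
    x = ((x & UINT64_C(0x00FF00FF00FF00FF)) << 8)  | ((x >> 8)  & UINT64_C(0x00FF00FF00FF00FF));
    x = ((x & UINT64_C(0x0000FFFF0000FFFF)) << 16) | ((x >> 16) & UINT64_C(0x0000FFFF0000FFFF));
    return (x << 32) | (x >> 32);
}

/* Rotate left by k bits (assumed direction); safe for k == 0. */
static uint64_t rotl64(uint64_t x, uint64_t k) {
    return (x << (k & 63)) | (x >> ((64 - k) & 63));
}

/* Order used in the posted code: both adds first, rotation last. */
static uint64_t output_rotate_last(const uint64_t s[5]) {
    uint64_t r = bswap64(s[0] + s[3]) + s[4] + s[2];
    return rotl64(r, s[1] >> 58);
}

/* Hypothetical altered order: rotate first, then do the final add,
   so the returned word is the direct result of an addition. */
static uint64_t output_rotate_first(const uint64_t s[5]) {
    uint64_t r = bswap64(s[0] + s[3]) + s[4];
    return rotl64(r, s[1] >> 58) + s[2];
}
```

The two orders agree whenever the rotation distance s[1]>>58 is 0, as it is for the seed in the post, and differ in general.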
I like the 64x64=128 multiply, but it must be admitted that (at least on my machine) my randgen that used it takes 7 clocks, not 4 clocks. It looks way faster & simpler on paper, but it is slower in reality.
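To make the operation concrete, here is a portable sketch of the high half of a 64x64=128 multiply, built from four 32x32=64 products. The function name is hypothetical, and this is not the randgen mentioned above. Compilers such as GCC and Clang also offer `unsigned __int128` on 64-bit targets, which usually maps to a single hardware multiply; the fallback below shows roughly what a target without that support has to do instead.

```c
#include <stdint.h>

/* High 64 bits of the 128-bit product a*b. The low 64 bits are simply a*b. */
static uint64_t mul_hi64(uint64_t a, uint64_t b) {
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* contributes to bits 0..63   */
    uint64_t p1 = a_lo * b_hi;   /* contributes to bits 32..95  */
    uint64_t p2 = a_hi * b_lo;   /* contributes to bits 32..95  */
    uint64_t p3 = a_hi * b_hi;   /* contributes to bits 64..127 */

    /* Sum the pieces that land in bits 32..63; the overflow of this
       sum (at most 2 bits) is the carry into the high word. */
    uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu);

    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```

For instance, mul_hi64(2^64-1, 2^64-1) is 2^64-2, since (2^64-1)^2 = 2^128 - 2^65 + 1.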