Feature request: "-tlshow-every" for progress reports on a linear schedule
Compiled 0.96 on Windows 10 with MinGW-w64 13.0.0, and RNG_test.exe seemed to randomly fail with the message "error reading from file". It turns out that because MinGW does not define WIN32 (it defines _WIN32 instead), RNG_test.cpp will not set stdin to binary mode. So, if -DWIN32 is passed to g++, it works.
I uploaded a tarball for *nix, with a GNUmakefile in `tools/'. Also, GCC reports a few warnings, namely overflows, shift-count-overflows, and negative shift counts. They deserve to be fixed. Maybe it should use stdint.h, even if that requires putting some simplified version of it in `include/' on windoze.
I made a GNU Makefile. It will help with compiling on *nix. It doesn't build RNG_to_TestU01 (I didn't have the inspiration), but what is ready is already good.
No, I haven't made a git repository or anything for it. An automated merger could be difficult, as there were a number of cases where files were split in two, renamed, or otherwise had pieces moved around.
Is the source code version-controlled? A couple years ago I forked version pre0.95 and made many QoL changes, such as fixing warnings found from linting, and modernizing C++ code. I'd like to merge the 0.96 changes, and it would be convenient if git could use the source code history to help with the merge.
Your code was very helpful, some adapted versions of it are present in xbg_helpers.cpp at the moment. It's working, or was a few weeks ago, I might have broken something while removing my LFSRs from the recommended PRNG set. I just ended up reviewing my LFSR-based algorithms at the end of everything and deciding I couldn't rate even the largest of them higher than 1 star on quality even with output hashing, largely due to excessive interstate correlations. I don't understand why things like xoroshiro...
Hopefully I didn't screw anything up, it's been a while since I've done a release. There are a couple new PRNGs there - very small, very fast, very light-weight ones, but requiring fast multiplication. mrsf64 & mrsf32 are so small you can keep their state in just 2 registers. mrc64 & mrc32 & mrc16 are a little bigger and a little slower, but slightly higher quality. Notably I did not include new random access PRNGs like I had intended to. I did incorporate code for random access to LFSRs / GFSRs...
I've written about characteristic polynomials in the attachment.
Hello PractRand team, I’m submitting a new stream generator for external evaluation. The generator is an ARX-based design intended for high-throughput keystream generation. I’ve run extensive local testing across PractRand, Gjrand, Dieharder, NIST STS, and TestU01, including repeated independent runs and long stream tests up to the terabyte scale, with no observed anomalies outside expected statistical behavior. A summary of results and reproduction commands is included. This work is part of a broader...
v0.96 is up
rarns64::raw64() bug
Different results when 1 byte is skipped
Big bug in Practrand 0.94 and 0.95 ?
On further thought, now that I have that naming scheme, I may rename them to randi32, randi32_fast, randi64, and, if I add it, randi64_fast. I think I avoided those kinds of names to maximize the difference relative to raw8/raw16/raw32/raw64, but at this point that doesn't matter anymore. To be clear, randli_fast / randi64_fast could be added, I'd just have to #ifdef different codepaths for different compilers, and in rare cases (mostly 32 bit compilers) randli_fast aka randi64_fast would end up super slow...
"I think I heard that most pc processors always calculate the upper mult, even when not asked... not sure for what % that is true, but it does seem to be quite fast on most CPUs."

Sounds unlikely to me. All of this is limited by my understanding, but this is how I understand the general picture. Integer wide multiplication almost always involves a different opcode than regular low multiplication. That makes it easy to save power by not calculating the upper half for low integer multiplication, which is...
I think I heard that most pc processors always calculate the upper mult, even when not asked... not sure for what % that is true, but it does seem to be quite fast on most CPUs.

https://github.com/wangyi-fudan/wyhash

The mix constant is a prime number where each byte has 4 zero and 4 one bits. As I mentioned, I tried to tinker with it multiple times but I just end up destroying its speed.

https://tom-kaitchuck.medium.com/designing-a-new-prng-1c4ffd27124d

And this was my intro to MWC, but the math is a bit above...
Well, "slow" relative to the fastest decent simple PRNGs around. On my CPU. In the loops I measure performance in, which are different from those others use in benchmarks. In practice, anything over about 2.5 to 3.0 GB/s is difficult for me and I'm not quite sure why, past about twice that is absurdly difficult, and past 13 GB/s seems impossible - and multwistfast64 occasionally manages 10 GB/s. Though I am a bit surprised that wide multiplication is considered an appropriate operation for high speed...
I think you got how they work. They are still available on the link I posted. I wrote the C long ago; here is an LLM translation, it assumes __uint128_t:

#define WYR64_MIX 0x8bb84b93962eacc9ULL
#define WYR64_INC 0x2d358dccaa6c78a5ULL
#define ODD_PHI_64 0x9e3779b97f4a7c15ULL

typedef uint64_t WYR64;

WYR64 WYR64_init(uint64_t seed) {
    return (seed + WYR64_MIX) * WYR64_INC * ODD_PHI_64;
}

uint64_t WYR64_u64(WYR64 *r) {
    *r += WYR64_INC;
    __uint128_t mix = (__uint128_t)(*r) * (__uint128_t)(*r ^ WYR64_MIX);
    return...
First, updates on my last post: The first two multiplication-based PRNGs I listed, I'm not thrilled with. They have incredibly bad avalanche properties. The output tests may be decent, but with avalanche that bad I just don't trust the output test results. I tried some variations and found one that was both faster and had better avalanche behavior... except then when I went back and benchmarked it again later it was way slower. I don't know if I somehow screwed up the earlier benchmarks, or if my...
Sorry for the late reply! Interesting ones, hope you had fun. This is how fast they run on my M2 Pro (ns/out):

Mul1:   0.7171630859375
Mul2:   0.8544921875
Mul3:   1.129150390625
Mul4:   1.129150390625
WYRand: 0.3509521484375
MWCP:   0.5645751953125
SFC:    0.885009765625

WYRand:
r^ += WYR64_INC
mix := u128(r^) * u128(r^ ~ WYR64_MIX)
return u64(mix >> 64) ~ u64(mix)

MWCP:
mix := u128(r[1]) * u128(0xfeb344657c0af413) + u128(r[0])
r^ = {u64(mix >> 64), r[2], r[3], u64(mix)}
return u64(mix >> 64) + u64(mix)

Odin...
I've been looking at multiplication-based PRNGs again lately. Some I've found interesting:

Uint64 old = a + b;
a = b * 0x9e3779b97f4a7c15ull;
b = rotate64(old, 23);
return a;

That one is the fastest decent-ish PRNG I have in testing at the moment. Quality isn't particularly good - the 16 bit variant fails at a measly 8 megabytes, and the 32 bit variant fails at 64 GB. Ultimately, only the full 64 bit one seems like a quality I could possibly recommend, and even then I'd worry that I missed something...
Indeed seeding in 0.95 does use some uninitialized variables, though I would say the problem lies elsewhere, starting with line 252 (mask should have been initialized to 0, not 1), and continuing inside the loop where the code isn't fully adapted - it was supposed to start with an effective ITERATION_SIZE and INDIRECTION_SIZE of 1, use that to double the size of the initialized section of each array, then double the effective array sizes and repeat until both arrays are entirely populated, but I...
efiix8x48::seed initialisation problem
found another potential issue, I'll just write it here since there is a re-write: in void PractRand::RNGs::Raw::efiix8x48::seed(Uint64 s1, Uint64 s2, Uint64 s3, Uint64 s4) I would suggest replacing:

iteration_table[0] = 0;
indirection_table[0] = 0;

in lines 243 and 244 with:

for (unsigned long y = 0; y < ITERATION_SIZE; y++) { iteration_table[y] = 0; }
for (unsigned long y = 0; y < INDIRECTION_SIZE; y++) { indirection_table[y] = 0; }

this is because in the algo afterwards all the values are NOT initialised,...
New version of the rarns algorithm looks like:

enum { S1 = 3, S2 = 8, S3 = 31, OUTROT = 32 };
Uint64 rv = rotate64(xs1 + xs3, OUTROT);
xs1 ^= xs2;
xs2 ^= xs3;
xs3 ^= xs1;
xs2 = rotate64(xs2, S2);
xs3 = rotate64(xs3, S3);
xs1 ^= xs1 >> S1;
return rv + xs3;

Though the seeding functions and fast-forwarding/rewinding are still being worked on. Shifts are 2/11/26/16 at 32 bit, and 1/3/10/8 at 16 bit. Those are all maximum period xor-based generators with outputs hashed well enough to pass tests.
It is a bug. I'd say it will be fixed in the next release, but it's more accurate to say that all rarns code has been rewritten for the next release.
Possibly confusing 'rng_test -help' tlmax typo
isaac64x256::_advance_state bug
You are correct, thank you. The issue will be addressed for version 0.96. In Bob's original code the xor happens in the macro for 32 bit ISAAC, and in the macro call for 64 bit ISAAC, and apparently I somehow ended up doing both in my implementation of the 64 bit variant.
I've had a little time and energy in the last few months to work on PractRand. One of the results: In trying to have a really broad range of PRNGs tested with PractRand and other test suites for comparison, I noticed that PR0.95 was badly underperforming on non-power-of-2 combined LCGs - they were failing TestU01 SmallCrush in a few seconds while passing multiple days of PractRand testing with clean bills of health. This was due to the heavy use of birthday tests in TestU01. So I've gone...
PractRand's RNG_test does broadly the same sort of stuff as TestU01's SmallCrush/Crush/BigCrush. The default parameters are generally about right for most uses, except that you usually want to limit the amount of testing done. The command line parameter "-tlmax 40" will tell it to stop after 1 terabyte, which ends up using about the same amount of CPU time as BigCrush, assuming the RNG is fast.
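For reference, a typical invocation might look like this (the generator binary name is a placeholder; stdin64 tells RNG_test to treat its input as a stream of 64-bit words):

```sh
# "-tlmax 40" stops testing after 2^40 bytes (1 terabyte)
./my_generator | RNG_test stdin64 -tlmax 40
```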
Hi, I've tried and failed to get the TestU01 BigCrush test to work on my RNG. Does PractRand have an equivalent or similar test? Thanks.
Thank you very much for this detailed commentary, it is very much appreciated!
I'm glancing through https://vigna.di.unimi.it/ftp/papers/LXM.pdf right now. I expect overall output quality to be decent. While the component generators are weak, the mixing function is probably capable of thoroughly hiding their weakness. The large statespace and effective seedspace should be sufficient for splitting. Because each sub-PRNG supports random access, the overall PRNG can be random-access as well. I'd expect speed to suffer from having to run both sub-PRNGs and the mixing function,...
I don't know if this is of interest to anyone, but since version 17 (which came out in 2021) the Java programming language includes RNGs from the LXM family by Vigna. See https://dl.acm.org/doi/10.1145/3485525 and https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/random/package-summary.html Does anybody have an opinion on these generators? Are they good? Here are some benchmarks (for generating 64-bit floating point numbers) which include all RNGs included by standard in Java: Intel...
Mine is a left rotate too. The implementation code is listed (though the compiler hopefully turns that into a single opcode). My thinking was that if you skipped seeding, zeroed the state, and then looked at it through the first few calls, it should be apparent exactly which part of our algorithms differs. And yeah, the problem with using LEA like that is that it performs poorly on anything other than x86. I'm currently considering this variation: Uint64 tmp = a + b; a = rotate(a, 25); a ^= old; a +=...
I got it... I said first 5 outputs after the init, while you are showing the state after init... this is good then: d8 1c 84 05 51. As for why we are getting different test results, it's either that your new version of mod3n is different or that we feed the test differently... So, when is the new version coming out? Haha
Ah, you are using left rotate... then I am really confused, I will re-check. Sorry for taking your time like this haha.
Funny we posted at the same time... Maybe the rotates are different, I am using left rotate...
const Uint64 K = 0x92ec64765925a395ull;
a ^= b;
a = rotate(a, 13);
b = b + (b << 3) + Uint32(K);
a *= K;
return a;

This is more than twice as slow as MWC on my Arm Mac. But it is an interesting approach, thanks for sharing! I see you are also not a fan of LFSRs haha.

Uint16 state[5];
static Uint16 _bswap16(Uint16 v) {
    return (Uint16((v >> 0) & 15) << 12) | (Uint16((v >> 4) & 15) << 8) | (Uint16((v >> 8) & 15) << 4) | (Uint16((v >> 12) & 15) << 0);
}
Uint16 raw16() {
    Uint16 tmp = state[0] - state[1];...
"I wonder if your implementation is identical... Here are hex values for first 5 outputs:"

Is that for seed zero? That's not what I'm getting. Here's the 8-bit version's code I'm using:

Uint8 tmp = state[0] - state[1];
state[0] = state[1] + state[4];
state[1] = _bswap8(state[2]);
state[2] = state[3] ^ tmp;
state[3] = rotate8(tmp, 5);
state[4] += (0x9e3779b97f4a7c15ull >> (64 - 8 * sizeof(tmp))) | 1;
return state[0];

and the functions used:

static Uint8 rotate8(Uint8 v, int bits) { return (v << bits)...
32 bit version passed 32 TB, I stopped the test there. I haven't tried comparing our output to figure out why I'm getting mod3n failures with -++ while you're not. My implementation looks roughly like this:

Uint16 state[5];
static Uint16 _bswap16(Uint16 v) {
    return (Uint16((v >> 0) & 15) << 12) | (Uint16((v >> 4) & 15) << 8) | (Uint16((v >> 8) & 15) << 4) | (Uint16((v >> 12) & 15) << 0);
}
Uint16 raw16() {
    Uint16 tmp = state[0] - state[1];
    state[0] = state[1] + state[4];
    state[1] = _bswap16(state[2]);...
I forgot to emphasise to use the init from https://github.com/IgneousRed/PrngBoard/blob/main/FSR/FSR.odin

FSR64_init :: proc(seed: u64) -> FSR64 {
    rng := FSR64{seed, seed, seed, seed, seed}
    for _ in 0 ..< 15 do FSR64_u64(&rng)
    return rng
}
4-part byteswap on the 16 bit variant, huh? Okay, I'm switching to that, named it nibbleswap. Still fails mod3n at 64 MB if I don't put in an xor though, and the 16 bit version still fails around the 512 GB to 1 TB range.

"For u8 and u16 I used a custom instruction for swap, as u8 swap does nothing and for u16 it is same as rot8. I wonder if your implementation is identical... Here are hex values for first 5 outputs:
u8:  3a 97 c8 0a d4
u16: 4ddf d1f6 4b87 9ebc d254
u32: f6595109 0b321081 a9cce385 6cb7aa32..."
I made a chaotic counting PRNG with 5 states, 6 instructions, and uses the byteswap instruction. It may be just me... But it is soo beautiful :D

FSR64_u64 :: proc(r: ^FSR64) -> u64 {
    s := r[0] - r[1]
    r[0] = r[1] + r[4]
    r[1] = bits.byte_swap(r[2])
    r[2] = r[3] + s
    r[3] = bits.rotate_left64(s, 15)
    r[4] += lib.ODD_PHI_64
    return r[0]
}

Tests and more information: https://github.com/IgneousRed/PrngBoard/tree/main/FSR

This is not the final version. Not sure what rotate would be best, and not sure about add/sub/xor...
16 bit variants with xor fail at either 512 GB or 1 TB. 32 bit variants have passed 8 TB so far. I target mid-high end CPUs and byteswap exists on both x86 and Arm. It does indeed. But just because the instruction exists in the ISA doesn't mean that the engineers prioritized its performance, that varies from one design to the next. I generally think of it as safer to stick to the most popular opcodes, since I've seen a number of cases of less popular opcodes performing poorly on specific implementations,...
Also for 8bit and 16bit I used a byteswap emulation with 4 parts (as if those words are 32bit)
I target mid-high end CPUs and byteswap exists on both x86 and Arm. Having 2 different bit moves that preserve all bits is very handy. I was reluctant to use xor as it makes DC7-9 indep worse, though I am open to the idea. There are test results in the folder I linked. I did mention that +++ fails mod3n, are you using a different increment?

ODD_PHI_64 :: 0x9e3779b97f4a7c15
ODD_PHI_32 :: 0x9e3779b9
ODD_PHI_16 :: 0x9e37
ODD_PHI_8  :: 0x9f

You did not answer my question about going from 4 to 5 states. It...
Byte order reversal is indeed very nice for such purposes, probably slightly better than bit rotation. There are reasons why bit rotation is preferred anyway though. Bit rotation is much better supported by high-level languages. Partially because of that, CPU designers are more inclined to include optimized hardware implementations. On ISAs that don't support bit-rotation, it's easy to emulate efficiently (just an immediate left-shift, an immediate right-shift, and a bitwise or are sufficient to...
If a 4 word PRNG shows the (Extended+MaxFold) failures as:

8bit word  => 2^30
16bit word => 2^36
32bit word => 2^42
64bit word => 2^45+

is what I wrote. Is this a ModernSoftware™ moment? :D Maybe I am just dumb
Multiplication is... decent? I mean, it's fast, on a lot of CPUs, and tends to mix things better than add/xor. The class of CPUs it's fast on is smaller than the class of CPUs that the basics (add, xor, shift) are fast on, so a PRNG that uses multiplication is inherently targeting a smaller niche. Not much smaller these days, since fast multiplication is pretty common. And low multiplication alone is never enough, you inherently need shifts, and probably add or xor too, or maybe just decimated output...
What? If a 4 word PRNG shows the (Extended+MaxFold) failures as: 16bit word => 2^36 16 bits is 2 bytes. 4 times 2 bytes is 8 bytes of state. Is that not saying that failures occur after 2^36 bytes with a state size of 8 bytes?
From what I understand, your stance is that multiplication should be avoided since it is too slow? I would totally agree with that stance 10 years ago. But for the last several years every CPU can do a full multiply at decent speed. Given that it does a lot more mixing than an add, I think it is more than worth its weight. When choosing a PRNG for low-end CPUs, it should be avoided. But why not list such RNGs for people to have a choice, and read their pros and cons? MWC is a classic at this point, but adding...
Just now caught an error...

"And the fact that it fails at 2^36 bytes for an 8 byte state and 2^42 bytes for a 16 byte state would worry me."

That is not what I wrote; also, the 8bit version has a 32bit state, and failure after 2^30 is about as much as would be expected. The problem is that it scales badly.
The RNG I am currently making is indeed a CountingChaotic type and is almost twice as fast on AArch64 as SFC, without using multiplication, and should be easy to make SIMD. I will create another topic when my long tests confirm some things. Also I want to say thank you for creating this amazing software! Can't wait for the next version (been quite a while), hopefully the 0 generator won't make a fatal error, and the folded FPF test will stop failing randomly with data below 2^20 :D
Yeah, performance of high-end CPUs on short loops has been weird ever since... the early to mid 1990s I think? PPros. And these days it's not just high-end CPUs anymore. sfc is sort of optimized for x86. Not much really, but the value of the left-shift was chosen, at least in part, to work well with the LEA instruction on x86.
Probably not. I mean, if there are other PRNGs that seem comparable in most ways but are better quality, I'd tend to recommend the better quality ones for whatever niche they were competing in. And the fact that it fails at 2^36 bytes for an 8 byte state and 2^42 bytes for a 16 byte state would worry me. That implies that even if the 64-bit version passed very large tests, it might be because the tests couldn't keep track of correlations at the right distances or whatever - and there would be a...
On my Mac M2 Pro incrementing by random odd number (instead of 1) speeds up SFC by almost 10%. Found this to be true also on my PRNGs.
If a 4 word PRNG shows the (Extended+MaxFold) failures as:

8bit word  => 2^30
16bit word => 2^36
32bit word => 2^42
64bit word => 2^45+

it is not state efficient, but is that fine if it is fast? You surely won't name it "future-proof", but would you recommend the 64bit version? And why / why not? Also, from https://pracrand.sourceforge.net/ I get the sense that you hold 8bit PRNGs to the same standard of randomness as 64bit ones, why is that? From my perspective, if one uses 8bit on a micro-controller they probably...
PractRand targets normal integer sizes - 8, 16, 32, 64, and unknown. For a 24 bit source I would recommend either 8 (stdin8 / As8) or unknown (stdin / AsUnknown). Or better, running the tests twice, once on the lowest 16 bits and then again on the highest 16 bits. Eventually the command line will get powerful enough to allow you to specify that, but it's not there yet, so currently you'll have to use your own code to drop 8 bits out of every 24.
Or could you please add stdin24?
Well, I don't "have to" do it, I am just playing with this for fun. Brute force is a bit boring. So here's my implementation of finding the seeds from 10+ output values and a probably too terse description of how it works. Basically it iterates over the 17-19 MSBs of a, computes the corresponding MSBs of b and c from that (c requires us to explore multiple cases), and then moves around so we split the now "known" part of c so the lower part becomes the MSB of c and then checks that. That allows us to discard...
Going from the output to the internal state looks quite feasible. Running the algorithm backwards should be even easier. The seed is simply the internal state 12 steps before output starts. There's probably some intelligent way to compute the state algebraically from the output, but if I had to do it, I'd just brute force it - guess a random value of 'c', use the output to calculate later values of 'c' from the guessed initial value, use the calculated values of c to calculate what the 3rd & 4th...
Hi all. Wasn't sure if I should start a new thread, but I guess as it's about sfc64 this would reach people interested. I was playing around, first wondering how hard it would be to find the seeds from the output, and then instead looking for patterns in the output based on my plan for how to find the seeds from the output. The shortest code I found that shows statistical defects in sfc64, after maybe an hour today of looking into it, is this:

uint64_t v1 = 0, v0 = 0, c = 0, a = 0;
for (int i = 0; i<128;...
Expanded test error "gap product is zero"
I tried some tests on the code you posted. TestU01: 1/1/1. That is, it fails 1 subtest (out of 15) of SmallCrush, 1 (out of 144) of Crush, and 1 (out of 160) of BigCrush. So about 10-15 seconds to first failure. In every case, it was a birthday spacings test that found the bias. Also, some people like to run TestU01 on bit-backwards versions of PRNGs (because TestU01 is less stringent on the higher bits, completely skipping the highest IIRC). I also tried both bit-backwards and byte-backwards; bit-backwards...
I've seen some failures reported on other tests (I remember MaxOft) as well, but with p-values around 1e-4, so not really definitive. The birthday spacings is the one that fails hard. I'll keep an eye on other results. In my code, the 48 bit LCG portion is fixed, and the 32 bit hash sequence is generated by an algorithm that tries out random sequences of candidate instructions. Each generator is then tested by calling PractRand, which I have modified to return a simple quality score. The set of...
No, wait, I'm confused. I hadn't touched PractRand in a while and was accidentally testing with version 0.92. Testing with version 0.95, it does adequately on those files at default settings.

rng=RNG_file(d:\inc\out), seed=0
length= 1 gigabyte (2^30 bytes), time= 24.0 seconds
  Test Name            Raw       Processed     Evaluation
  mod3n(5):(0,9-0)     R= +74.0  p =  1.9e-40    FAIL !!!
  ...and 222 test result(s) without anomalies

rng=RNG_file(d:\inc\out1), seed=0
length= 32 megabytes (2^25 bytes), time= 4.6 seconds
  Test Name            Raw...
It is rather embarrassing that SmallCrush is detecting flaws that PractRand needs hours or more on. Perhaps I should have a small variant of the birthday spacings test incorporated into the base test set... I presume it's all birthday spacings tests it's failing? Let's see... for this variation at least, the PRNG state transition function looks like a 48 bit LCG with additive constant 69, multiplicative constant 0x010101010101, and modulus 0x1000000000000, plus an 8-bit accumulator used in a rather...
Possibly related: I noticed this issue while working on a project to create a good PRNG for 8 bit targets, using a genetic/hill climbing algorithm that uses PractRand to test candidates. I've seen it happen quite regularly that PractRand gives a pass for 128GB or even 1TB, but that the same PRNG output fails in SmallCrush/Crush, usually on the BirthdaySpacings test. The same thing happened with the output files above. The contrast between PractRand running for an hour or more and seeing no problems...
Quirky, but it's looking like correct behavior to me so far. I think the third nibble out of each 64 bits has some kind of detectable pattern, and removing the first byte phase shifts that to the first nibble out of each 64 bits, which is much easier for PractRand to notice.