-multithreaded causes failures when none should occur
statistical tests & pseudo-random number generators (RNGs, PRNGs)
Status: Beta
Brought to you by:
orz
I stumbled upon a bug which is probably a race condition. I made 4 batch files, all with the same command line (see picture), and one fails while the first 3 started do not. The failure is easy to reproduce if all of my cores are busy doing other things and I try to run PractRand with the -multithreaded command line.
I just ran 10 instances of "RNG_test sfc16 -multithreaded -tlmin 10 -tlmax 32 -seed 0 -e 1" on my 8-hardware-threads box. All produced identical output except the timings.
Unless you can reproduce that using a built-in generator, or are accusing the "stdin" generator itself of having a race condition, I suspect that your issue is simply that blake2b is using different seeds for your different runs.
Ah, then it is probably a bug dealing with stdin. I wrote/compiled blake2b.exe myself and tested it thoroughly; it matches the test vectors and it also outputs the same stream every time. It doesn't use seeds, it just hashes 0, then 1, then 2 ... in counter mode, every time.
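For illustration, the counter-mode construction described above can be sketched in Python with the standard hashlib module. This is not the poster's blake2b.exe - that is a separate C program - and the 8-byte little-endian counter encoding here is an assumption; the point is only that such a stream is fully deterministic, with no seed anywhere.

```python
import hashlib
import struct

def counter_mode_stream(n_blocks):
    """Yield one hash digest per counter value 0, 1, 2, ... (no seed involved)."""
    for counter in range(n_blocks):
        # Hash the counter's fixed-width encoding; identical input on every run.
        yield hashlib.blake2b(struct.pack("<Q", counter)).digest()

# Two independent runs produce byte-identical output.
run_a = b"".join(counter_mode_stream(4))
run_b = b"".join(counter_mode_stream(4))
assert run_a == run_b
```

Because the stream depends only on the counter, any two pipes of this generator into RNG_test stdin are guaranteed to feed it the same bytes.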
I just ran another test using sha2-512 in feedback mode and the same bug occurs.
I can confirm that I can reproduce this bug 100% of the time, in exactly the same spot, regardless of test input. I can also confirm the bug never happens when I do the same test but without the -multithreaded command line param.
sha2-512.exe | RNG_test.exe stdin works
sha2-512.exe | RNG_test.exe stdin -multithreaded will fail almost instantly
Last edit: Brett K 2019-09-09
I am not familiar with the interfaces or libraries that you are using, but... which line of that code seeds the generator? So far as I can tell, you are using uninitialized memory as the seed in both cases, which is undefined behavior that may vary depending upon many arcane factors, including whether or not PractRand is run with the -multithreaded flag.
I just tried running: "RNG_output sfc16 inf 0 | RNG_test stdin -tlmin 10 -tlmax 32 -multithreaded -e 2"
Five times simultaneously in different windows to check stdin. All outputs were identical again. I didn't manage to run as many instances this time, but it was enough total threads to make my machine rather unresponsive, and I was playing a (very laggy) multithreaded game in another window, so I figure that's probably enough to qualify as fully loaded.
I should also comment that you are using the word "failure" rather liberally. You might mean an error on PractRand's part, in which case that is reasonable terminology; but since we're discussing a tool that looks for PRNG failures, a different term would be better in this context. On the other hand, if you mean a PRNG failure, that has not happened in any of your screenshots. The printed lines you highlighted are what RNG_test refers to as "anomalies", and the evaluation given to each one is "unusual". Evaluations of "unusual" are quite common, even on perfect PRNGs - just as, if you flip coins a lot, some unusual patterns will show up now and then. RNG_test has a sequence of severity ratings it gives test results: "normal" (omitted by default), "normalish" (also omitted by default), "unusual", "mildly suspicious", "suspicious", "very suspicious", " VERY SUSPICIOUS", " FAIL", " FAIL !", etc, with increasing numbers of exclamation marks after "FAIL" indicating increasing severity. Generally, only result evaluations including the word "FAIL" should be called failures.
edit: Clarification: the standard RNG_test aims for is that on a perfect RNG, when it prints a results summary (RNG name, seed, test length, time, list of anomalies, and number of results without anomalies), on average about one in ten such summaries should have one or more "unusual" anomalies, about one in 100 should have one or more "mildly suspicious" anomalies, and so on with increasing rarity. On a perfect RNG, about one in 300 million results summaries should include the word "FAIL"... in theory. In practice it's much less than that at the extremes, due to some paranoia factors included in the p-value normalization process, though exactly how much rarer varies from subtest to subtest.
Last edit: 2019-09-09
.bss segment memory (globals and statics) is initialized to 0 at process creation, as per the C standard, so the memory is initialized. I have added explicit initialization anyway, but it made no difference in the bug occurring (see attached).
Blake2b is not seeded; it hashes the number 0, then the number 1, and so on. This is the while loop, commented for you:
SHA2-512 is doing something a bit different, but again there are no seeds here. The hash produces the same output 100% of the time; I have confirmed this by dumping it to a binary file and comparing the contents under the same circumstances.
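A feedback-mode construction along these lines can be sketched the same way as the counter-mode one. The starting state of the real sha2-512.exe isn't shown in the thread, so the all-zero initial block here is an assumption for illustration; what matters is that each output block depends only on the previous one, so the stream is identical on every run.

```python
import hashlib

def feedback_mode_stream(n_blocks):
    """Each block is the SHA-512 digest of the previous block (deterministic)."""
    state = b"\x00" * 64  # assumed fixed starting state, for illustration only
    blocks = []
    for _ in range(n_blocks):
        state = hashlib.sha512(state).digest()
        blocks.append(state)
    return b"".join(blocks)

# The stream is a pure function of the starting state: two runs always match.
assert feedback_mode_stream(3) == feedback_mode_stream(3)
```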
For example, in single-threaded mode, SHA2-512 produces the same "unusual" evaluation at the same spot every time - why would it change? It is in the attached screenshot. I can run it solo, or run 3 of them at once, and it always gets the same "unusual" evaluation in the same spot. So PractRand works perfectly in that context. The same input should cause PractRand to produce the same output, for the same reasons.
As soon as I add the -multithreaded option, it outputs different things for the same test with the same inputs, so we know it is not doing something correctly. You can also see that the buggy instance outputs smaller tests more frequently for no reason.
If you want, I can take a large binary file of white noise from Random.org and pipe it into your program to see if the bug still occurs. There are no seeds there, and no chance of different inputs to stdin since it's a file.
Last edit: Brett K 2019-09-09
Also, I'm looking at error3.png and trying to figure out what in it you think shouldn't happen. The stuff in blue just means that the test ran slower. The first test length shown is time-dependent by default, though you can override that with the -tlmin parameter. The default threshold is something like 1.7 seconds: the time is checked after every power-of-2 number of kilobytes, and once it exceeds that threshold, output starts printing. You can give -tlmin an integer value and it will treat that as a power-of-2 number of bytes, or you can put units on the end of a number: if the units are 's', as in "-tlmin 1.7s", it uses that as a number of seconds; if they're "kb", it uses that as a number of kilobytes; etc. It all looks like correct behavior to me. The -multithreaded command line option will sometimes change the timings, and thus which test results are shown first.
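As a rough illustration of the rules just described - this is a sketch, not PractRand's actual source - -tlmin value handling could look something like:

```python
def parse_tlmin(value, default_seconds=2.0):
    """Interpret a -tlmin style argument (illustrative sketch only).

    Returns ("bytes", n) or ("seconds", s)."""
    if value is None:
        return ("seconds", default_seconds)       # default: a time threshold
    if value.endswith("s"):
        return ("seconds", float(value[:-1]))     # e.g. "1.7s" -> 1.7 seconds
    if value.endswith("kb"):
        return ("bytes", int(value[:-2]) * 1024)  # e.g. "64kb" -> 65536 bytes
    return ("bytes", 2 ** int(value))             # bare integer = power-of-2 bytes

assert parse_tlmin("10") == ("bytes", 1024)       # "-tlmin 10" = 2^10 bytes
assert parse_tlmin("1.7s") == ("seconds", 1.7)
```

The bare-integer-as-exponent convention is why "-tlmin 10 -tlmax 32" earlier in the thread spans 2^10 through 2^32 bytes.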
edit: RNG_test prints successive sets of results for increasing test length for the obvious reasons - because it's really useful to have some results you can look at as soon as possible, and varying stringency of test, etc. However, RNG_test doesn't know the speed of the RNG being tested, not even a rough guess - at the low end it could generate less than a single kilobyte per second, at the high end it could generate more than 10 gigabytes per second. Both are circumstances that actually happen.
The intended use is for PractRand to generate preliminary results almost immediately (by a human's sense of time), and then iterate with increasing stringency almost indefinitely. However, while the time taken to test PRNG output is linear in data size, the time taken to generate results is not. It's not quite constant, and the exact time needed varies with settings and test length, but it's generally significant. This means that if PractRand started printing results summaries at the minimum possible point (1 kilobyte), most CPU time would be spent on that for the first half minute or so of execution. Further, for most fast PRNGs, the results of very short tests just don't matter - longer tests are far more likely to produce useful information - and printing out 15 or so useless results summaries would clutter up the output, delay useful output, and contribute nothing. So starting results summaries after a megabyte, or many megabytes, makes much more sense in the typical case.

However, exactly what point results summaries should start at varies from PRNG to PRNG. The desired characteristics are that it should be early enough that users get some information quickly, but not so early as to clutter the output or significantly delay useful output. There is no good way to compute an ideal starting point, but a rough heuristic is simply to start after an amount of time users are willing to sit waiting for results without switching their attention to something else - generally on the order of 2 or 3 seconds. Thus, RNG_test defaults to starting output after... it looks like I currently have it set to 2 seconds, but that's only checked at power-of-2 test lengths, and doesn't count the evaluation time, so at normal settings that's probably 2.2 to 5.0 seconds, at a guess.
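A minimal sketch of that scheduling decision, under the details above (time checked only at power-of-2 kilobyte boundaries, against a roughly 2-second threshold; the function name and signature are hypothetical):

```python
import time

def should_start_printing(kilobytes_tested, start_time, threshold_seconds=2.0):
    """Decide whether to begin printing result summaries (illustrative only).

    The check only fires at power-of-2 kilobyte boundaries, so the first
    summary appears somewhere between the threshold and roughly double it."""
    at_checkpoint = (kilobytes_tested & (kilobytes_tested - 1)) == 0
    return at_checkpoint and (time.monotonic() - start_time) >= threshold_seconds
```

Since the elapsed-time check depends on run speed, machine load (and the -multithreaded option) can change which test length is the first one printed - which is exactly the run-to-run difference seen in the screenshots.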
The disadvantage of this is that if you want to look at the results summary where the PRNG is only just barely failing then that will get hidden on very low quality PRNGs by this approach by default. But that's a very minor issue since you'll find out within a few seconds and can just restart with -tlmin 10 to see that data.
Last edit: 2019-09-09
My C compiler does not initialize all globals or statics to zero in release mode. I just tested this on my compiler with a static local integer, and its initial value came out as 17609856. There are certain declaration syntaxes that I have seen initialize things to zero in some circumstances, IIRC, but I cannot recall the specifics.
Nondeterministic behavior is nondeterministic. Trying to evaluate why it changes is generally a foolish endeavor. If you could reproduce this without using your own code, I'd be less skeptical. Piping output from RNG_output to RNG_test using stdin produces correct results in my testing, with or without the -multithreaded command line option. If you must use your own generator, sending its output to a file once and then sending that file to RNG_test stdin multiple times would be a way to test that.
There is definitely a bug. I used a file marked read-only and it still produces the bugged output. This post has the REGULAR output when single-threaded mode is used. Note that every time I run this test, the same output occurs:
So I should get this SAME output every single time. The next post (I am limited to 1 attachment per post) will be the SAME input file going into the process, but with the -multithreaded command line.
That is intended behavior. Which test results are shown is a function of run speed, which can vary from run to run. I explained that a bit a few posts up, in the one starting with "Also, I'm looking at error3.png".
To see more identical looking output, try adding the command line option "-tlmin 27".
I just performed yet another test; this time I ran just 1 instance of it (not CPU-loaded like before) and it "passes" the file. I re-ran the same test and the next time it found 1 unusual. So it looks like this is not even a problem with my CPU cores being loaded; it looks like -multithreaded is not performing the same tests as single-threaded.
Last edit: Brett K 2019-09-09
Does "-tlmin 27" fix the issue that was bothering you? If so, I'm going to mark this as not-a-bug, since minimum test lengths are intended to be a function of testing speed. I can describe the reasons why the default behavior is the default, if you want.
If there's some issue here besides varying minimum test lengths, let me know.
So far -tlmin 27 seems to have solved the inconsistencies. But it is after midnight so I cannot test further; I will look at it Monday if I have some free time.
This is intended behavior, not a bug.
Last edit: 2019-09-09