1.Hi, there is an obvious-seeming criticism of (& response improving) your sfc64 routine for creating a 64-bit psu-random.
enum {BARREL_SHIFT = 24, RSHIFT = 11, LSHIFT = 3};
Uint64tmp = a + b + counter++;
a = b ^ (b >> RSHIFT);
b = c + (c<< LSHIFT);c = ((c << BARREL_SHIFT) | (c >> (64-BARREL_SHIFT))) + tmp;
returntmp;
The "counter++" could be replaced by
"(counter+=UINT64_C(7640891575674082921))"
which presumably is the same speed but more randomness. Any odd increment could be used. You chose 1. The reason for delta=7640891575674082921 is that I once did a day-long search for good Weyl-generator increments. The continued fraction of delta/2^64 has unusually small sum of partial quotients 107 and unusually small maximum partial quotient 5. (Conceivably this would alter the best choice of BARREL, RSHIFT, LSHIFT. You didn't say how you chose them.)
I suspect that sfc64 is actually good enough for cryptography with security level about 2^128, PROVIDED user agrees to use only every Kth output word, for some large-enough K; probably K=1000 is safe.
--Warren D. Smith
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Uint64 tmp = a + b + counter++;
a = b ^ (b >> RSHIFT);
b = c + (c << LSHIFT);
it seems plausible the "a + b + counter"
would improve if replaced by "(a + counter) ^ a".
Again, same speed, but more presumed randomness because if usage of
+ and ^ are systematically alternated, then you get rid of associativity,
get more nonlinearity, etc.
My suggestions are not huge improvements, but seems obvious (to me anyhow) they are
improvements; and if you want to be empirical about it, it might be detectable to your tests if you made an artificially-shrunk mini-state space version of the routine.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
hmm, actually if we alter the internal logic of sfc64 further to
maximally-use the "alternate + and xor" principle, we get (what probably is
the best yet; the state is {a,b,c,w}):
uint64 WDSrand64(){
uint64 tmp = (b + w) ^ a;
a = (b << LSHIFT) - b;
b = c ^ (c >> RSHIFT);
w += INCREM;
c = ((c << BARREL) | (c >> (64-BARREL))) + tmp;
return( tmp );
}
where INCREM = UINT64_C(7640891575674082921)
is suggested, and I have no idea what the best values for
LSHIFT, RSHIFT, and BARREL now should be.
INCREM must be odd. The others must be in {1,2,...,63}.
If you want I can investigate and come up
with some explicit numerical suggestions. Email me at warren.wds at gmail.com.
Another possibility is w could be a linear congruential rather
than Weyl generator, still with period 2^64. That would have more
feed-in randomness, but would be a bit slower.
The period necessarily is a multiple of 2^64, and probably
usually over 2^250 for random initial state. If use only every Kth iterate then probably
crypto-secure if K large enough (1000?).
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
-
2018-08-20
counter++ could be any odd value at the same speed
More or less. On x86-64 the ++ can be done with INC, ADD(immediate), or LEA, whichever the compiler feels like (under some circumstances the LEA will be better). A 64 bit odd constant would require ADD(IP-relative) if I recall correctly. Which... wouldn't make much difference probably. On FPGAs or other exotic hardware I'm less confident on what the difference would be.
I suspect that sfc64 is actually good enough for cryptography with security level about 2^128, PROVIDED user agrees to use only every Kth output word, for some large-enough K; probably K=1000 is safe.
I'd expect K=100 to be adequate, but trying to prove that would be a nightmare and it would be pretty pointless anyway since skipping that many outputs makes it slower than other options.
lots of variations are better quality
...are they? Version 4 of SFC has been tested - hundreds if not thousands of CPU-hours on the algorithm as written, plus that much again on numerous variations of it (mostly reduced quality variations, since even sfc16 frequires 512 terabytes of output testing to show bias), plus hundreds of hours of human time on analysis. Any major changes that aren't self-evidently superior will need substantial testing.
Scaling the counter (making it a Weyl generator) would likely produce a tiny improvement in quality, which might or might not justify the extra complexity. Quick preliminary testing says: negligible difference in quality.
Changing "a + b + counter" to "(a + counter) ^ b"... I know I tested a few variations like that, can't remember about that specific one. Quick preliminary testing says: significant improvement? Huh. I'll try benchmarking it, and a few other tests, but that does seem worth thinking about.
Alternating addition and xor: On x86, the old left-shift was arranged that way to let it be done efficiently as an LEA, that construction doesn't support that. Also, I did test swapping the order of those two constructions like that, though not in conjunction that variation on the value of tmp. Quick preliminary testing says: small reduction in quality.
Last edit: 2018-08-22
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Oh, the left shift (x<<L)+x, or (x<<L)-x, I do not care which you use; I thought
perhaps the minus verson might be a teeny bit more random, but it did not
occur to me the plus version has the advantage it can be compiled with an "LEA" instruction.
That seems a good enough reason to stick with plus.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Note that although I used my own implementation of sfc64, it is byte-for-byte identical in output to the built-in sfc64 for the same seed, a randomly chosen 0xdf7e5fd0802274af in this case.
Prior to this point, I'd never seen anything fail the Pat5 tests and was mostly unaware of them. Interestingly sfc32 does not show this problem. It is currently fine out to 128 TB.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
-
2018-08-23
Interesting. Thanks for reporting it. I'd practically forgotten about that test.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
FWIW, I've ran Pat5 test using PractRand pre0.95 on sfc64 10 times up to 128 TB, and did not observe any failures. I've also used the 0xdf7e5fd0802274af seed from your run, and it also passed up to 128 TB.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
-
2018-09-07
Changing "a + b + counter" to "(a + counter) ^ b" produced better avalanche scores in my testing, but worse output behavior on PractRand standard (from 512 TB down to 64 TB needed to show bias). Not going to make that change for the moment, though it may be worth investigating more later.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Hi all
Wasnt sure i should start a new thread but i guess as its about sfc64 this would reach people interrested. I was playing around first wondering how hard it would be to find the seeds from the output and then instead looking for patterns in the output based on my plan on how to find the seeds from the output.
The shortest code i found that shows statistical defects in sfc64 i got after maybe an hour today of looking into it is this:
uint64_t v1 = 0, v0 = 0, c = 0, a = 0;
for (int i = 0; i<128; i++) {
v0 = v1; v1 = sfc64_get(&sfc64);
a = v1 - a;
printf("%c%3"PRIX64, i%8 ? ' ' : '\n', a - 9*c >> 52);
c = (c << 24 | c >> 40) + v0;
}
One can see the numbers are quite correlated within each column. I dont really think thats a problem as this test is really designed for sfc64 and not a general test for finding a statistical issue. And every (non secure and fast) PRNG likely has similar issues when one designs a test specifically for it. But it may be simply interresting in case this is not known already
As a check to ensure the code doesnt always do that. With another random source we get:
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
-
2024-01-10
Going from the output to the internal state looks quite feasible. Running the algorithm backwards should be even easier. The seed is simply the internal state 12 steps before output starts. There's probably some intelligent way to compute the state algebraically from the output, but if I had to do it, I'd just brute force it - guess a random value of 'c', use the output to calculate later values of 'c' from the guessed initial value, use the calculated values of c to calculate what the 3rd & 4th & 5th outputs should be, use the hamming distance between the predicted and observed outputs as a score, flip a random bit or two to my guessed initial value of 'c' and check if the score improved, revert if the score got worse, repeat until the score reaches zero. That should get you the full values of 'c' and 'counter' within a few billion cycles, at which point solving for the other two unknowns gets much much simpler.
Your code looks like a distinguishing attack against sfc64. But since sfc64 isn't intended for cryptographic use, distinguishing attacks doesn't really mean much. If you could do that against efiix or salsa or somesuch that would be something, but sfc, xsm, jsf, even arbee should all be highly vulnerable to that kind of thing.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Well, i dont "have to" do it, iam just playing with this for fun. Brute force is a bit boring. So heres my implementation of finding the seeds from 10+ output values and probably too terse description of how that works. But bascially it iterates over the 17-19 MSB of a compute the corresponding MSB of b and c from that (c requires us to explore multiple cases) and then move around so we split the now "known" part of c so the lower part becomes the MSB of c and then check that. That allows us to discard mismatching cases and ATM requires a few million steps to find the solution which is a fraction of a second for my testcase.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
On my Mac M2 Pro incrementing by random odd number (instead of 1) speeds up SFC by almost 10%.
Found this to be true also on my PRNGs.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
-
4 days ago
Yeah, performance of high-end CPUs on short loops has been weird ever since... the early to mid 1990s I think? PPros. And these days it's not just high-end CPUs anymore.
sfc is sort of optimized for x86. Not much really, but the value of the left-shift was chosen, at least in part, to work well with the LEA instruction on x86.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
1.Hi, there is an obvious-seeming criticism of (& response improving) your sfc64 routine for creating a 64-bit psu-random.
The "counter++" could be replaced by
"(counter+=UINT64_C(7640891575674082921))"
which presumably is the same speed but more randomness. Any odd increment could be used. You chose 1. The reason for delta=7640891575674082921 is that I once did a day-long search for good Weyl-generator increments. The continued fraction of delta/2^64 has unusually small sum of partial quotients 107 and unusually small maximum partial quotient 5. (Conceivably this would alter the best choice of BARREL, RSHIFT, LSHIFT. You didn't say how you chose them.)
--Warren D. Smith
Also, in
Uint64 tmp = a + b + counter++;
a = b ^ (b >> RSHIFT);
b = c + (c << LSHIFT);
it seems plausible the "a + b + counter"
would improve if replaced by "(a + counter) ^ a".
Again, same speed, but more presumed randomness because if usage of
+ and ^ are systematically alternated, then you get rid of associativity,
get more nonlinearity, etc.
My suggestions are not huge improvements, but seems obvious (to me anyhow) they are
improvements; and if you want to be empirical about it, it might be detectable to your tests if you made an artificially-shrunk mini-state space version of the routine.
Agggh!
I meant "(a + counter) ^ b".
hmm, actually if we alter the internal logic of sfc64 further to
maximally-use the "alternate + and xor" principle, we get (what probably is
the best yet; the state is {a,b,c,w}):
uint64 WDSrand64(){
uint64 tmp = (b + w) ^ a;
a = (b << LSHIFT) - b;
b = c ^ (c >> RSHIFT);
w += INCREM;
c = ((c << BARREL) | (c >> (64-BARREL))) + tmp;
return( tmp );
}
where INCREM = UINT64_C(7640891575674082921)
is suggested, and I have no idea what the best values for
LSHIFT, RSHIFT, and BARREL now should be.
INCREM must be odd. The others must be in {1,2,...,63}.
If you want I can investigate and come up
with some explicit numerical suggestions. Email me at warren.wds at gmail.com.
Another possibility is w could be a linear congruential rather
than Weyl generator, still with period 2^64. That would have more
feed-in randomness, but would be a bit slower.
The period necessarily is a multiple of 2^64, and probably
usually over 2^250 for random initial state. If use only every Kth iterate then probably
crypto-secure if K large enough (1000?).
More or less. On x86-64 the ++ can be done with INC, ADD(immediate), or LEA, whichever the compiler feels like (under some circumstances the LEA will be better). A 64 bit odd constant would require ADD(IP-relative) if I recall correctly. Which... wouldn't make much difference probably. On FPGAs or other exotic hardware I'm less confident on what the difference would be.
I'd expect K=100 to be adequate, but trying to prove that would be a nightmare and it would be pretty pointless anyway since skipping that many outputs makes it slower than other options.
...are they? Version 4 of SFC has been tested - hundreds if not thousands of CPU-hours on the algorithm as written, plus that much again on numerous variations of it (mostly reduced quality variations, since even sfc16 frequires 512 terabytes of output testing to show bias), plus hundreds of hours of human time on analysis. Any major changes that aren't self-evidently superior will need substantial testing.
Scaling the counter (making it a Weyl generator) would likely produce a tiny improvement in quality, which might or might not justify the extra complexity. Quick preliminary testing says: negligible difference in quality.
Changing "a + b + counter" to "(a + counter) ^ b"... I know I tested a few variations like that, can't remember about that specific one. Quick preliminary testing says: significant improvement? Huh. I'll try benchmarking it, and a few other tests, but that does seem worth thinking about.
Alternating addition and xor: On x86, the old left-shift was arranged that way to let it be done efficiently as an LEA, that construction doesn't support that. Also, I did test swapping the order of those two constructions like that, though not in conjunction that variation on the value of tmp. Quick preliminary testing says: small reduction in quality.
Last edit: 2018-08-22
Oh, the left shift (x<<L)+x, or (x<<L)-x, I do not care which you use; I thought
perhaps the minus verson might be a teeny bit more random, but it did not
occur to me the plus version has the advantage it can be compiled with an "LEA" instruction.
That seems a good enough reason to stick with plus.
FWIW, regarding
sfc64
, with 0.93, when I ran an extended test, I got:Note that although I used my own implementation of
sfc64
, it is byte-for-byte identical in output to the built-in sfc64 for the same seed, a randomly chosen0xdf7e5fd0802274af
in this case.Prior to this point, I'd never seen anything fail the Pat5 tests and was mostly unaware of them. Interestingly sfc32 does not show this problem. It is currently fine out to 128 TB.
Interesting. Thanks for reporting it. I'd practically forgotten about that test.
FWIW, I've ran Pat5 test using PractRand pre0.95 on
sfc64
10 times up to 128 TB, and did not observe any failures. I've also used the0xdf7e5fd0802274af
seed from your run, and it also passed up to 128 TB.Test output (and the patch I've used to only run the Pat5 test) is here: https://gist.github.com/flyingmutant/32a30cb09d130e1cd754ca3577f1d8b2
Changing "a + b + counter" to "(a + counter) ^ b" produced better avalanche scores in my testing, but worse output behavior on PractRand standard (from 512 TB down to 64 TB needed to show bias). Not going to make that change for the moment, though it may be worth investigating more later.
Hi all
Wasnt sure i should start a new thread but i guess as its about sfc64 this would reach people interrested. I was playing around first wondering how hard it would be to find the seeds from the output and then instead looking for patterns in the output based on my plan on how to find the seeds from the output.
The shortest code i found that shows statistical defects in sfc64 i got after maybe an hour today of looking into it is this:
Output of this is:
One can see the numbers are quite correlated within each column. I dont really think thats a problem as this test is really designed for sfc64 and not a general test for finding a statistical issue. And every (non secure and fast) PRNG likely has similar issues when one designs a test specifically for it. But it may be simply interresting in case this is not known already
As a check to ensure the code doesnt always do that. With another random source we get:
Going from the output to the internal state looks quite feasible. Running the algorithm backwards should be even easier. The seed is simply the internal state 12 steps before output starts. There's probably some intelligent way to compute the state algebraically from the output, but if I had to do it, I'd just brute force it - guess a random value of 'c', use the output to calculate later values of 'c' from the guessed initial value, use the calculated values of c to calculate what the 3rd & 4th & 5th outputs should be, use the hamming distance between the predicted and observed outputs as a score, flip a random bit or two to my guessed initial value of 'c' and check if the score improved, revert if the score got worse, repeat until the score reaches zero. That should get you the full values of 'c' and 'counter' within a few billion cycles, at which point solving for the other two unknowns gets much much simpler.
Your code looks like a distinguishing attack against sfc64. But since sfc64 isn't intended for cryptographic use, distinguishing attacks doesn't really mean much. If you could do that against efiix or salsa or somesuch that would be something, but sfc, xsm, jsf, even arbee should all be highly vulnerable to that kind of thing.
Well, i dont "have to" do it, iam just playing with this for fun. Brute force is a bit boring. So heres my implementation of finding the seeds from 10+ output values and probably too terse description of how that works. But bascially it iterates over the 17-19 MSB of a compute the corresponding MSB of b and c from that (c requires us to explore multiple cases) and then move around so we split the now "known" part of c so the lower part becomes the MSB of c and then check that. That allows us to discard mismatching cases and ATM requires a few million steps to find the solution which is a fraction of a second for my testcase.
On my Mac M2 Pro incrementing by random odd number (instead of 1) speeds up SFC by almost 10%.
Found this to be true also on my PRNGs.
Yeah, performance of high-end CPUs on short loops has been weird ever since... the early to mid 1990s I think? PPros. And these days it's not just high-end CPUs anymore.
sfc is sort of optimized for x86. Not much really, but the value of the left-shift was chosen, at least in part, to work well with the LEA instruction on x86.