Menu

RANLUXPP - a new fast version of RANLUX RNG

Jirka
2020-11-17
2021-06-23
  • Jirka

    Jirka - 2020-11-17

    Hello,

    I would like to ask your opinion on the new version of the RANLUX generator, called RANLUXPP. 

    There are two papers on it:
    A revision of the subtract-with-borrow random number generators
    https://arxiv.org/pdf/1705.03123.pdf

    Review of High-Quality Random Number Generators
    https://arxiv.org/pdf/1903.01247.pdf

    RANLUX is a well-established generator in particle physics for Monte Carlo simulations. It's theoretically backed up by the theory of dynamical systems. The original RANLUX's main drawback was that it was very very slow. The reason is that many values (typically 389) were computed but skipped. The new advancement is that RANLUX is equivalent to LCG with constant "a" having 576 bits x(i + 1) = a · x(i) mod m

    This can be efficiently computed using long arithmetic with the help of SSE or AVX2 instructions. More importantly, skipping can be easily realized in one step - one just needs to precompute a^skip mod m. With these improvements, RANLUX is running at speed 1.4 GiB/s on my laptop with Core i7-8650U CPU @ 1.90GHz. The original implementation is here 
    https://github.com/sibidanov/ranluxpp
    and my fork is here
    https://github.com/jirka-h/ranluxpp

    Sibidanov also notes that when using the original skip parameter 389, the value "a^skip mod m"  shows regularities and proposed to increase the skipping parameter to 2048. There is no runtime penalty for doing so. He calls this new version RANLUX++ or RANLUXPP.

    I have tested RANLUXPP using PractRand version 0.95:
    PractRand-RNG_test stdin -tf 2 -tlmax 32T -multithreaded
    and it has passed all the tests.

    qrng=RNG_stdin, seed=unknown
    length= 32 terabytes (2^45 bytes), time= 252950 seconds
    no anomalies in 976 test result(s)

    Let me summarize some of the features of RANLUXPP:
    Family: LCG with skipping
    Background: theory of dynamical systems
    Supports fast skipping (can be used to initialize parallel non-overlapping instances)
    Period: 10^171
    State vector: 72 bytes
    Speed: ~1.4 GiB/s with AVX2
    Passing PractRand tests (tested up to 16T)
    Used at CERN for the Monte Carlo simulations

    It looks like a very good RNG for Monte Carlo simulations to me. Please let me know what you think about it. 

    Thanks a lot
    Jirka

     
  • Lorenzo

    Lorenzo - 2020-11-19

    I can tell you my opinion on Ranlux++, although please note that I'm not at all an expert in the field.

    From a practical point of view, I think Ranlux++ has quite a few good things going for it.
    You already listed them, but to re-iterate:

    1) It is a LCG generator using a very large 576-bit constant and a lot of decimation (skipping) to get rid of short-range correlations. The theory behind it is well-studied, both in general terms and for the specific RANLUX choice of constants.
    2) It has been used for a long time in very high-level academic research, so one could argue that it has been experimentally well-tested and is trusted by a large community of scientists (make of this whatever you want).
    3) The assembly implementation by Sibidanov is very fast (although I haven't tested it myself).
    4) The period is of the order of 2^576 which is extremely long.
    5) By virtue of the linear structure of the generator, we can build jump functions to generate multiple non-overlapping sequences.

    This is all very good, and if you produced some Monte Carlo simulation results using Ranlux++ I don't think anyone in the world could object that you used a 'poor' generator. On the contrary, at least in several areas of physics, RanLux (with high luxury) is a gold standard of good generators. To paraphrase the famous saying about IBM, 'nobody ever got fired for choosing Ranlux'.

    Having ascertained that Ranlux++ is very good, the question is: are there better alternatives around?
    And the answer is: probably, yes.

    Some criticisms I can think of are:
    1) The original RanLux used specific constants so that it could be implemented in single-precision floating-point arithmetic (24 bits of mantissa) without having to explicitely compute the huge 576 bit multiplication. However, as is well known, the resulting original subtract-with-borrow generator (ie, Ranlux without decimation) is rather terrible. To improve its quality decimation was introduced, which made the original Ranlux extremely slow (because you throw away most of the number you generate). In fact, Sibibarov's paper showed that in the original RanLux, even at the highest levels of luxury (extreme decimation), there are clear non-random patterns in the resulting 576-bit multiplicative constant.
    Sibidanov has decided to forget about the special structure of RanLux and to treat it straightforwardly as a LCG, so we can use arbitrarily high decimation at no additional cost. However, at this point one might ask why we should use the constants used for the original RanLux, as probably there are better constants for huge LCG.

    2) Overall, Ranlux is rather complex, and the assembly implementation a further complication. I mean, many thanks to Sibibarov for providing it, it's good to have it, but it introduces the usual problems of portability and, also, comparatively few people are able to understand it (I don't), because assembly programming is a very specialised field.

    3) The period 2^576, and therefore the memory required to store the state and the multiplicative constant, may be considered excessive (although it's still much, much less than the Mersenne Twister).

    The thing is, today we there is a number of generators, such as Vigna's xoshiro256**, which take up only a few lines of standard C, occupy less memory, are fast, have a big-enough period, offer arbitrary skipping and should be adequate for practically any purpose. A couple of another generators based on modifications of LCGs are the bottom of this page http://pcg.di.unimi.it/pcg.php
    And, of course, many other good generators are included in PractRand or have been presented, sometimes based on cyphers (the generators in Random123, or the 'Tyche' generators by Neves and Araujo, which are based on the ChaCha mixing function, see eg https://link.springer.com/chapter/10.1007/978-3-642-55224-3_10 ).

    However all of these are comparatively new and haven't been validated in the field as much as RanLux, so at the end of the day you might still prefer RanLux for practical applications.

     
  • Jirka

    Jirka - 2020-11-29

    Hi Lorenzo,

    thanks for sharing your opinion! I do agree with your criticisms, especially with the point about the assembly implementation and resulting portability problems. The code could be probably rewritten using AVX intrinsic functions to improve readability, but it will not help the portability of the code. I see this as a general issue with the AVX instructions - they are powerful, but not easy to use and creating portability issues. Recently, Linux Torvald has criticized the new Intel's AVX-512 instructions quite harshly. 

    A couple of another generators based on modifications of LCGs are the bottom of this page http://pcg.di.unimi.it/pcg.php

    Thanks for pointing me to that page. I'm looking for a suitable PRNG to be used for the reference implementation of Aperiodic RNG (APRNG), on which I was working a while ago (check the paper here: https://kmlinux.fjfi.cvut.cz/~balkolub/papers/GeneratorsArxiv.pdf or search for this paper: "Statistical properties and implementation of aperiodic pseudorandom number generators"). I want to use the Covid situation to clean the code and publish it. The idea behind APRNG is to use the quasicrystals/aperiodic sequences (like cut and project sequences, or generally speaking Sturmian words like Fibonacci word https://en.wikipedia.org/wiki/Fibonacci_word) to interleave the output of several PRNGs (or several instances of the same PRNG). It's mathematically proven that such an approach improves some weaknesses of underlying PRNG (namely the nonexistence of lattice structure in any dimension). While any PRNG can be used for this approach, I'm looking for RNGs with a simple code. I'm still collecting the candidates and that page provided some useful information. 

    Thanks!
    Jirka

     
  • Lorenzo

    Lorenzo - 2021-06-08

    I wanted to report that a new C++ implementation of Ranlux++ has been very recently presented by Jonas Hahnfeld, Lorenzo Moneta, see https://arxiv.org/abs/2106.02504
    From what I can see from the paper their optimized C++ implementation is portable and can be about as fast as the x86 assembly one.
    They report a performance of about 9 ns/number both on an AMD Ryzen 9 3900 and on the Apple M1 using the Clang compiler (if I understand correctly "number" is a 64bit and they use only one core on each CPUs. These numbers should correspond to about 5.4 cycles/byte on the AMD and 4.0 cycles/byte on the Apple chip. These numbers look a bit slow with respect to the fast generator here http://prng.di.unimi.it/ , which are mostly below 3.0 cycles/byte (but one should run direct tests to compare).

     
  • Jirka

    Jirka - 2021-06-08

    Hi Lorenzo,

    thanks for the notification! I have found the portable implementation:

    https://github.com/root-project/root/blob/ee0ba46eadb889c8a05274e1b7f5ea68c590b2ea/math/mathcore/src/ranluxpp/mulmod.h

    Function multiply9x9 to multiply two 576 bit numbers (stored as 9 numbers of 64 bits each) is rewritten from assembly to portable code.

    I will try to extract the code to create a standalone portable version of Ranlux++ and compare it to other RNGs.

    Jirka

     
  • Jirka

    Jirka - 2021-06-22

    Hi Lorenzo,

    I have created a standalone implementation of Ranlux++ written in C. It's based on code from ROOT project.

    I have directly compared the speed against xoshiro256** generator. The code is here.

    RESULT: xoshiro256** is faster by factor of 3x than Ranlux++

    On Intel i7-8650U CPU, it takes 7.7 seconds to generate 1e9 64-bit numbers with Ranlux++ and only 2.6 seconds to generate 1e9 64-bit numbers with xoshiro256**.

    I hope this helps.
    Jirka

     
  • Lorenzo

    Lorenzo - 2021-06-23

    Hi Jirka, well done! A factor 3x in speed with respect to a very fast generator such as xoshiro256 is very probably more than acceptable in the vast majority of applications.
    Some of the original considerations made for RanLux++, although they are not at all damning, still apply: the implementation of the generator is fairly complex and the generator is not blazing fast. In any case, it's obviously good to have a fast implementation of such a famous generator.
    BTW, Vigna added on his page a couple of generators called GMWC128 and GMWC256 which should be of very good quality while being at the same time fairly simple (perhaps ~30 lines of code) and reasonably fast (about 2.5-3.0 times slower than xoshiro256
    , ie they should comparable to RanLux++).

     
  • Jirka

    Jirka - 2021-06-23

    Hi Lorenzo,

    I agree with your assessments!

    The speed of Ranlux++ is acceptable.  The real issue is a fairly complex code - to mitigate it, one can directly compare the output of Ranlux++ against the output of the original Ranlux generator. Here is reference implementation: https://github.com/sibidanov/ranluxpp/blob/master/tests/ranlux_test.cxx#L137

    BTW, Vigna added on his page a couple of generators called GMWC128 and GMWC256 which should be of very good quality while being at the same time fairly simple (perhaps ~30 lines of code) and reasonably fast (about 2.5-3.0 times slower than xoshiro256, ie they should comparable to RanLux++).

    Thanks for the heads up! These two RNGs look promising, I will check them up. For reference, generators are described here  under Congruential Generators (along with the links to the code).

    Jirka

     

Log in to post a comment.