
#5 RNG_test with multithreaded options reaches deadlock after processing 512 gigabytes

v1.0_(example)
closed-fixed
nobody
None
5
2014-11-26
2014-11-18
Jirka
No

Hi,

I'm experiencing the following bug in RNG_test. When running

RNG_test sfc64 -tlmax 128T -multithreaded

after 512 gigabytes have been processed, the RNG_test process stalls. It looks like a deadlock has been reached. ps -eLf shows that no threads are running, and strace -p <pid> shows a series of nanosleep calls:

nanosleep({0, 1000}, NULL) = 0
nanosleep({0, 1000}, NULL) = 0

It happens for every tested RNG, in both versions 0.90 and 0.91.

To reproduce it, please run

RNG_test sfc64 -tlmax 128T -multithreaded

After 512 gigabytes have been processed, you should see the system load drop to zero while RNG_test does no work.

Thanks
Jirka

Related

Bugs: #5

Discussion

  • - 2014-11-18

    Trying to reproduce now.

    Looking at the code, I'd say it's probably clean of race conditions. If a test crashed, that might cause it to wait indefinitely for the test to complete - but as far as I know, PractRand tests don't make a habit of crashing. If pthread_create or CreateThread failed, that would likewise leave it waiting indefinitely - I never got around to checking the return value on those. That's actually sounding like the most plausible explanation to me at the moment... if I'd heard about this before releasing 0.92, I would have tried to pass any pthread_create or CreateThread errors along to the user. The man pages / MSDN suggest a couple of possibilities for why those functions might start giving me errors after a while of execution - I think I'm not cleaning up after myself properly in some ways, though it wouldn't matter if I were using a proper pool of worker threads instead of a quick-and-dirty solution.

     
  • - 2014-11-18

    I can't reproduce it. On win32 anyway. Are you using pthreads or win32 threading? Anyway, I have some ideas that might fix it for 0.93.

     
  • Jirka

    Jirka - 2014-11-18

    Hi,

    I run RNG_test on Linux. I will try it with version 0.92 and let you know the outcome.

    Jirka

     
  • - 2014-11-18

    Okay, my best guess (I haven't finished a linux test run yet):

    What's happening is that I'm leaking resources in Threading::create_thread, which eventually causes it to refuse to launch more threads for me. Immediately after the lines:
    pthread_t thread;
    pthread_create(&thread, NULL, func, param);

    there should be a third line:
    pthread_detach(&thread);

    to let it know that it should clean up after the thread automatically when it terminates. The win32 code already does that correctly. It'll take me several hours to confirm that this is what's going on and that it fixes the problem, but there's a good chance that's it. I'll have it fixed in 0.93; you can patch your copy of 0.92 if you want - it's in tools/multithreading.h.

     
  • Jirka

    Jirka - 2014-11-18

    I will try to apply the patch.

    The vanilla 0.92 version exhibits the same bug - it happens again after 512 gigabytes of data have been processed.

    rng=sfc64, seed=0x6c08d0b3
    length= 512 gigabytes (2^39 bytes), time= 2485 seconds
    no anomalies in 276 test result(s)

    strace shows that RNG_test does nothing but call
    nanosleep({0, 1000}, NULL) = 0
    nanosleep({0, 1000}, NULL) = 0

     
  • Jirka

    Jirka - 2014-11-18

    Short update: the proposed patch should be

    pthread_detach(thread);

    and not

    pthread_detach(&thread);

    I have started the test and will report the results back.

    Thanks
    Jirka

     
  • Jirka

    Jirka - 2014-11-18

    I have tested the proposed patch, but unfortunately it did not help. I still experience the same behaviour: after processing 512 gigabytes of data, the program just calls nanosleep in a loop.

     
  • - 2014-11-19

    I can't reproduce your behavior. With the original code, on win32 it seemed to just work for me. On my (single-core) linux VM it did the first 512 GB in a little over 2 hours, and failed to reach 1 TB over the next 5 hours before I aborted it - but CPU usage remained at 100% until it was aborted, unlike your observed behavior. With the patch applied, it worked correctly for me on both win32 and linux.

    I'm not going to be up for any more testing today. If you want my next best idea: My linux VM doesn't seem to mind, but I just noticed that according to the docs pthread_mutex_destroy isn't supposed to be called on locked mutexes. So, on about line 17 of tools/MultithreadedTestManager.h where it says "delete this;", there should be a "lock.leave();" immediately before that.

     
  • Jirka

    Jirka - 2014-11-19

    Hi,

    I will try that.

    Do you have any debug messages that I could enable (for example, at compile time)? The fact that the bug manifests only after processing 0.5 TB of data makes it hard to debug. If you can propose some way to debug it, please let me know.

    Thanks
    Jirka

     
  • - 2014-11-20

    Unfortunately there isn't any debugging framework for that. I could write one, I suppose... I may have to if I can't figure out what's going on on your box.

    I assume that neither fix did anything for you? Here's an idea that might be easier to test, though far less definitive than the previous two: in MultithreadedTestManager.h, in the constructor's parameter list, the default value for max_buffer_amount_ is "1 << (27-10)"; try changing it to "1 << (10-10)". That's around line 95.

    That won't fix anything, but it might make your problem happen a lot faster - like within the first minute or two. If so, well, that would be informative in its own right, and it would make other changes easier to test. The change should force it to create threads far faster (one set for every kilobyte instead of one for every 128 megabytes), which should force any issue tied to the total number of threads (or total number of mutexes) to occur much sooner.

     
  • Jirka

    Jirka - 2014-11-21

    Hi,

    unfortunately, the "lock.leave()" patch has not helped either.

    With "1 << (10-10)", the problem occurs after only 128 MiB, as you predicted. I will try to debug it; let's see if I can find anything.

     
    • Jirka

      Jirka - 2014-11-21

      I see one strange behaviour. With all patches applied

      • pthread_detach
      • lock.leave
      • buffer_amount_ being set to "1 << (10-10)"

      and RNG_test (just RNG_test) compiled with the -O2 flag (instead of -O3)

      g++ -o RNG_test_O2 tools/RNG_test.cpp libPractRand.a -O2 -Iinclude -pthread -std=c++11
      

      the bug seems to be gone. When compiled with -O3 I still see the bug.

      Does it help you any further?

      I have now started new tests with buffer_amount_ set to "1 << (27-10)" and with the pthread_detach and lock.leave patches applied. RNG_test was compiled with -O2. I should have results by tomorrow.

       
      • - 2014-11-21

        Well, isn't that fun. So it's either a compiler bug, or something evil I'm doing that the compiler tolerates at lower optimization levels. If it's dependent upon the version of gcc used, that could explain why I can't reproduce the issue.

        Hm... something evil... the first thing that comes to mind is how my OS-independence layer hides the implementation - the array of Uint8s cast to a pthread_mutex_t is pretty evil. I wouldn't expect optimization to have trouble with it, but if I'm violating some obscure optimization-related constraint of the language, that would be my first guess.

        I don't know how to tell my virtualized CentOS 7 (a rebranded Red Hat) to change gcc versions, but just in case I figure it out, what version of gcc are you using? I should probably test it on MinGW too, but after my recent hard drive failure I no longer have a working MinGW install - another thing to fix sometime.


    • - 2014-11-21

      I'm running awfully low on things that could cause it to sleep indefinitely. I do have another bug, though: if I'm not mistaken, I have the two buffers mixed up in the memcpy call in multithread_prep_blocks - alt_buffer[0] should be buffer[0], and buffer[prefix_blocks + main_blocks - num_prefix_blocks] should be alt_buffer[prefix_blocks + main_blocks - num_prefix_blocks].

      There's a (very) remote chance that could cause a test to crash, which would in turn cause the main thread to wait forever. More likely it just causes a negligible error in the test results.

      That's in MultithreadedTestManager.h, around line 84 or so.


  • Jirka

    Jirka - 2014-11-21

    I have tested it on two different boxes and with the following two gcc versions:

    gcc (GCC) 4.7.2 20121109

    gcc (GCC) 4.8.2 20140120

    The behaviour was the same - with -O3 it exhibits the bug (falling into an infinite loop of nanosleep calls); with -O2 (and the other fixes applied) it runs fine.

    I will test it with gcc 4.9 tomorrow.

    Could you please send me the fixed MultithreadedTestManager.h? The latest memcpy patch is not as easy to apply as the previous ones.... You can attach the file directly below this form.

    Thanks
    Jirka

     
  • - 2014-11-23

    I'm running 4.8.2 20140120 here. So apparently it's not tied to the compiler. I've checked that I'm using -O3. It occurred to me that it could be an issue with the lack of multicore in my linux VM, as windows will switch mutex implementations depending upon the hardware, but switching it to multicore did not help either. And I tried reverting to an older codebase and applying only the first patch, in case something else I did caused it.

    No dice. I can't produce anything like what you're seeing once the first patch is applied. So... it's not tied to gcc version. Maybe pthreads version? I'm using NPTL 2.17 (identified via: getconf GNU_LIBPTHREAD_VERSION). Or glibc version? Hm... version number seems to be identical, 2.17 (identified via: ldd --version). To be clear, I'm building 64 bit binaries on linux, and a mix of 32 and 64 on windows.

    I don't see how this could have gone wrong, but just to make very sure, this is the corrected function from the first patch:

    void create_thread( THREADFUNC_RETURN_TYPE (THREADFUNC_CALLING_CONVENTION func)(void), void *param ) {
        pthread_t thread;
        pthread_create(&thread, NULL, func, param);
        pthread_detach(thread);
    }

    I'm kind of grasping at straws here. My next guess would be that my portability layer's dirty tricks are causing problems, and that the layer needs to be largely removed, and/or extra "volatile" keywords added, and/or the storage space declared at a larger alignment (I think my code should already force 64-bit alignment, but it may require up to 512-bit alignment; I can't tell). I'll add some ifdefed compiler-dependent code to improve alignment, for whatever that's worth - I know malloc/new won't do alignments above 64 bits on MSVC, regardless of type properties.

    I'll try to make the next version report all unexpected pthreads error codes. That may not actually help, considering that this is dependent upon optimization level and isn't reproducible here, but it's worth a try.

     
  • Jirka

    Jirka - 2014-11-24

    Hi,

    I have installed Fedora 21 Beta with gcc 4.9.2 20141101, and the problem is gone after applying the pthread_detach(thread) patch. I can no longer reproduce the problem even with the older compiler versions, so it seems I made some mistake during testing. It could also be that the problem no longer manifests with a newer pthread library version.

    In any case, I think that reporting unexpected pthreads error codes is a great idea and will definitely help in the future.

    During the testing on Fedora 21 I had one crash, but unfortunately core dumps were disabled and I was not able to reproduce it.

    Have you fixed MultithreadedTestManager.h regarding the memcpy? If so, could you please send me the fixed version?

    The original problem reported here seems to be fixed by the pthread_detach(thread) patch.

    Thank you very much for your help on this!
    Jirka

     
    • - 2014-11-24

      Whew, finally. The first patch was really the only one that had a high chance of fixing things; everything else combined was long shots.

      Yeah, I fixed which buffer it was copying where. The corrected code looks like:

      std::memcpy(
          &buffer[0],
          &alt_buffer[prefix_blocks + main_blocks - num_prefix_blocks],
          PractRand::Tests::TestBlock::SIZE * num_prefix_blocks
      );

      (at approximately line 62 of MultithreadedTestManager.h)

      To check whether you have the correct code enabled, do test runs of the following two command lines:

      RNG_test sfc64 -tlmax 1MB -a -seed 0
      and
      RNG_test sfc64 -tlmax 1MB -a -seed 0 -multithreaded

      They should produce identical output. In particular, look at the p-values for FPF-14+6/16:all and FPF-14+6/16:all2; those are probably the easiest place to spot divergence. I was aware they weren't always producing identical output, but the results had gotten close enough that it wasn't a priority to hunt down the last source or two of minor divergence.


  • - 2014-11-26
    • status: open --> closed-fixed
     

